This is an automated email from the ASF dual-hosted git repository.
lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 6e84d99037 GH-45028: [C++][Compute] Allow cast to reorder struct
fields (#45246)
6e84d99037 is described below
commit 6e84d990379a0fe20a0a89311aa864f40efded23
Author: Thomas Newton <[email protected]>
AuthorDate: Sun Apr 27 01:30:50 2025 +0100
GH-45028: [C++][Compute] Allow cast to reorder struct fields (#45246)
### Rationale for this change
When reading a parquet dataset where the physical schema has inconsistent
column order for top level columns Arrow can still read the table. However it
cannot handle similar inconsistency in the order of struct fields and raises
errors like
```
Traceback (most recent call last):
File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30,
in <module>
table_read = pq.read_table(
File
"/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py",
line 1843, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File
"/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py",
line 1485, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 562, in
pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3804, in
pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 155, in
pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong
order: Input fields: struct<sub_column0: int32, sub_column1: int32> output
fields: struct<sub_column1: int32, sub_column0: int32>
```
This issue is quite closely related to
https://github.com/apache/arrow/issues/44555
### What changes are included in this PR?
Change the implementation of `CastStruct::Exec` to be primarily based on
the column names rather than the column order. Each input field can still only
be used once and if there are many input fields with the same name they will be
used in the order of the input fields.
Alternatives I considered:
Implement this behaviour in the same place as the equivalent logic for top
level columns at
https://github.com/apache/arrow/blob/6252e9ceeb0f8544c14f79d895a37ac198131f88/cpp/src/arrow/compute/expression.cc#L669.
This would effect parquet scans without modifying cast behaviour.
I decided against this because I want this behaviour to work recursively
e.g. if there are nested structs or structs inside arrays of maps, etc.
Have a config option to switch between field name and field order based
matching. This would make things more explicit but there would be 2 code paths
to maintain instead of one.
IMO the logic I've implemented where each input can only be used once and
column order is maintained for duplicate names achieves what I want without
breaking any usecases that rely on column order and without too much
complexity. So I decided a config option was not necessary.
### Are these changes tested?
Yes. A few new assertions were added but mostly it was a case of adjusting
the expected behaviour on existing tests.
### Are there any user-facing changes?
Yes. Casts that require changing the struct field order will now succeed
without error.
* GitHub Issue: #45028
Authored-by: Thomas Newton <[email protected]>
Signed-off-by: David Li <[email protected]>
---
.../arrow/compute/kernels/scalar_cast_nested.cc | 46 +++-----
cpp/src/arrow/compute/kernels/scalar_cast_test.cc | 119 ++++++++++++---------
docs/source/cpp/compute.rst | 11 +-
3 files changed, 92 insertions(+), 84 deletions(-)
diff --git a/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc
b/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc
index 8e12e82357..3ab42d89b6 100644
--- a/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc
@@ -18,7 +18,7 @@
// Implementation of casting to (or between) list types
#include <limits>
-#include <set>
+#include <map>
#include <utility>
#include <vector>
@@ -338,41 +338,25 @@ struct CastStruct {
std::vector<int> fields_to_select(out_field_count, -1);
- std::set<std::string> all_in_field_names;
+ std::multimap<std::string, int> in_fields;
for (int in_field_index = 0; in_field_index < in_field_count;
++in_field_index) {
- all_in_field_names.insert(in_type.field(in_field_index)->name());
+ in_fields.insert({in_type.field(in_field_index)->name(),
in_field_index});
}
- for (int in_field_index = 0, out_field_index = 0;
- out_field_index < out_field_count;) {
+ for (int out_field_index = 0; out_field_index < out_field_count;
++out_field_index) {
const auto& out_field = out_type.field(out_field_index);
- if (in_field_index < in_field_count) {
- const auto& in_field = in_type.field(in_field_index);
- // If there are more in_fields check if they match the out_field.
- if (in_field->name() == out_field->name()) {
- // Found matching in_field and out_field.
- fields_to_select[out_field_index++] = in_field_index;
- // Using the same in_field for multiple out_fields is not allowed.
- in_field_index++;
- continue;
- }
- }
- if (all_in_field_names.count(out_field->name()) == 0 &&
out_field->nullable()) {
- // Didn't match current in_field, but we can fill with null.
- // Filling with null is only acceptable on nullable fields when there
- // is definitely no in_field with matching name.
-
- fields_to_select[out_field_index++] = kFillNullSentinel;
- } else if (in_field_index < in_field_count) {
- // Didn't match current in_field, and the we cannot fill with null, so
- // try next in_field.
- in_field_index++;
+
+ // Take the first field with matching name, if any. Extract it from the
map so it
+ // can't be reused.
+ auto maybe_in_field_index = in_fields.extract(out_field->name());
+ if (!maybe_in_field_index.empty()) {
+ fields_to_select[out_field_index] = maybe_in_field_index.mapped();
+ } else if (out_field->nullable()) {
+ fields_to_select[out_field_index] = kFillNullSentinel;
} else {
- // Didn't match current in_field, we cannot fill with null, and there
- // are no more in_fields to try, so fail.
- return Status::TypeError(
- "struct fields don't match or are in the wrong order: Input
fields: ",
- in_type.ToString(), " output fields: ", out_type.ToString());
+ return Status::TypeError("struct fields don't match: non-nullable out
field `",
+ out_field->name(), "` not found in in fields
",
+ in_type.ToString());
}
}
diff --git a/cpp/src/arrow/compute/kernels/scalar_cast_test.cc
b/cpp/src/arrow/compute/kernels/scalar_cast_test.cc
index 528e6acded..44b50b31f7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_cast_test.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_cast_test.cc
@@ -3825,37 +3825,34 @@ static void CheckStructToStructSubset(
// field does not exist
ASSERT_OK_AND_ASSIGN(auto dest6,
- StructArray::Make({a1, d1, nulls}, {"a", "d",
"f"}));
+ StructArray::Make({a2, d2, nulls}, {"a", "d",
"f"}));
CheckCast(src, dest6);
const auto dest7 = arrow::struct_(
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("d",
int16()),
std::make_shared<Field>("f", int64(), /*nullable=*/false)});
const auto options7 = CastOptions::Safe(dest7);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options7));
+ EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
+ ::testing::HasSubstr("struct fields
don't match"),
+ Cast(src, options7));
// fields in wrong order
- const auto dest8 = arrow::struct_({std::make_shared<Field>("a", int8()),
- std::make_shared<Field>("c", int16()),
- std::make_shared<Field>("b",
int64())});
- const auto options8 = CastOptions::Safe(dest8);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options8));
+ ASSERT_OK_AND_ASSIGN(auto dest8, StructArray::Make({a2, c2, b2}, {"a",
"c", "b"}));
+ CheckCast(src, dest8);
// duplicate missing field names
- const auto dest9 = arrow::struct_(
+ ASSERT_OK_AND_ASSIGN(auto dest9,
+ StructArray::Make({a2, c2, d2, nulls}, {"a", "c",
"d", "a"}));
+ CheckCast(src, dest9);
+
+ const auto dest10 = arrow::struct_(
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("c",
int16()),
- std::make_shared<Field>("d", int32()), std::make_shared<Field>("a",
int64())});
- const auto options9 = CastOptions::Safe(dest9);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options9));
+ std::make_shared<Field>("d", int32()),
+ std::make_shared<Field>("a", int64(), /*nullable=*/false)});
+ const auto options10 = CastOptions::Safe(dest10);
+ EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
+ ::testing::HasSubstr("struct fields
don't match"),
+ Cast(src, options10));
// duplicate present field names
ASSERT_OK_AND_ASSIGN(
@@ -3875,6 +3872,21 @@ static void CheckStructToStructSubset(
auto dest3_duplicate_field_names,
StructArray::Make({a2, b2, c2}, std::vector<std::string>{"a", "a",
"a"}));
CheckCast(src_duplicate_field_names, dest3_duplicate_field_names);
+
+ // more duplicate outputs than duplicate inputs
+ ASSERT_OK_AND_ASSIGN(auto dest4_duplicate_field_names,
+ StructArray::Make({a2, b2, c2, nulls}, {"a", "a",
"a", "a"}));
+ CheckCast(src_duplicate_field_names, dest4_duplicate_field_names);
+
+ const auto dest5_duplicate_field_names = arrow::struct_(
+ {std::make_shared<Field>("a", int8()), std::make_shared<Field>("a",
int8()),
+ std::make_shared<Field>("a", int8()),
+ std::make_shared<Field>("a", int8(), /*nullable=*/false)});
+ const auto options5_duplicate_field_names =
+ CastOptions::Safe(dest5_duplicate_field_names);
+ EXPECT_RAISES_WITH_MESSAGE_THAT(
+ TypeError, ::testing::HasSubstr("struct fields don't match"),
+ Cast(src_duplicate_field_names, options5_duplicate_field_names));
}
}
}
@@ -3941,37 +3953,36 @@ static void CheckStructToStructSubsetWithNulls(
// field does not exist
ASSERT_OK_AND_ASSIGN(
auto dest6_null,
- StructArray::Make({a1, d1, nulls}, {"a", "d", "f"}, null_bitmap));
+ StructArray::Make({a2, d2, nulls}, {"a", "d", "f"}, null_bitmap));
CheckCast(src_null, dest6_null);
const auto dest7_null = arrow::struct_(
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("d",
int16()),
std::make_shared<Field>("f", int64(), /*nullable=*/false)});
const auto options7_null = CastOptions::Safe(dest7_null);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src_null, options7_null));
+ EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
+ ::testing::HasSubstr("struct fields
don't match"),
+ Cast(src_null, options7_null));
// fields in wrong order
- const auto dest8_null = arrow::struct_({std::make_shared<Field>("a",
int8()),
- std::make_shared<Field>("c",
int16()),
- std::make_shared<Field>("b",
int64())});
- const auto options8_null = CastOptions::Safe(dest8_null);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src_null, options8_null));
+ ASSERT_OK_AND_ASSIGN(auto dest8_null,
+ StructArray::Make({a2, c2, b2}, {"a", "c", "b"},
null_bitmap));
+ CheckCast(src_null, dest8_null);
// duplicate missing field names
- const auto dest9_null = arrow::struct_(
+ ASSERT_OK_AND_ASSIGN(
+ auto dest9_null,
+ StructArray::Make({a2, c2, d2, nulls}, {"a", "c", "d", "a"},
null_bitmap));
+ CheckCast(src_null, dest9_null);
+
+ const auto dest10_null = arrow::struct_(
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("c",
int16()),
- std::make_shared<Field>("d", int32()), std::make_shared<Field>("a",
int64())});
- const auto options9_null = CastOptions::Safe(dest9_null);
- EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src_null, options9_null));
+ std::make_shared<Field>("d", int64()),
+ std::make_shared<Field>("a", int8(), /*nullable=*/false)});
+ const auto options10_null = CastOptions::Safe(dest10_null);
+ EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
+ ::testing::HasSubstr("struct fields
don't match"),
+ Cast(src_null, options10_null));
// duplicate present field values
ASSERT_OK_AND_ASSIGN(
@@ -3994,6 +4005,22 @@ static void CheckStructToStructSubsetWithNulls(
StructArray::Make({a2, b2, c2}, std::vector<std::string>{"a", "a",
"a"},
null_bitmap));
CheckCast(src_duplicate_field_names_null,
dest3_duplicate_field_names_null);
+
+ // more duplicate outputs than duplicate inputs
+ ASSERT_OK_AND_ASSIGN(
+ auto dest4_duplicate_field_names_null,
+ StructArray::Make({a2, b2, c2, nulls}, {"a", "a", "a", "a"},
null_bitmap));
+ CheckCast(src_duplicate_field_names_null,
dest4_duplicate_field_names_null);
+
+ const auto dest5_duplicate_field_names_null = arrow::struct_(
+ {std::make_shared<Field>("a", int8()), std::make_shared<Field>("a",
int8()),
+ std::make_shared<Field>("a", int8()),
+ std::make_shared<Field>("a", int8(), /*nullable=*/false)});
+ const auto options5_duplicate_field_names_null =
+ CastOptions::Safe(dest5_duplicate_field_names_null);
+ EXPECT_RAISES_WITH_MESSAGE_THAT(
+ TypeError, ::testing::HasSubstr("struct fields don't match"),
+ Cast(src_duplicate_field_names_null,
options5_duplicate_field_names_null));
}
}
}
@@ -4024,9 +4051,7 @@ TEST(Cast, StructToSameSizedButDifferentNamedStruct) {
const auto options2 = CastOptions::Safe(dest2);
EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options2));
+ TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src,
options2));
}
TEST(Cast, StructToBiggerStruct) {
@@ -4042,9 +4067,7 @@ TEST(Cast, StructToBiggerStruct) {
const auto options1 = CastOptions::Safe(dest1);
EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options1));
+ TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src,
options1));
const auto dest2 =
arrow::struct_({std::make_shared<Field>("a", int8()),
@@ -4053,9 +4076,7 @@ TEST(Cast, StructToBiggerStruct) {
const auto options2 = CastOptions::Safe(dest2);
EXPECT_RAISES_WITH_MESSAGE_THAT(
- TypeError,
- ::testing::HasSubstr("struct fields don't match or are in the wrong
order"),
- Cast(src, options2));
+ TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src,
options2));
}
TEST(Cast, StructToBiggerNullableStruct) {
diff --git a/docs/source/cpp/compute.rst b/docs/source/cpp/compute.rst
index 051ed7da4e..4d39ebbbc7 100644
--- a/docs/source/cpp/compute.rst
+++ b/docs/source/cpp/compute.rst
@@ -1488,10 +1488,13 @@ null input value is converted into a null output value.
cast from the input value type to the output value type (if a conversion
is available).
-* \(2) The field names of the output type must be the same or a subset of the
- field names of the input type; they also must have the same order. Casting to
- a subset of field names "selects" those fields such that each output field
- matches the data of the input field with the same name.
+* \(2) Fields are casted primarily by matching field names between the input
+ and output type. For duplicate field names, their relative order is
preserved,
+ with each input field used for one or zero output fields. This allows casting
+ to a subset or re-ordered field names. If a nullable field in the output type
+ has no matching field name in the input type, it will be filled with nulls.
+ Casting a field from nullable to not null is supported if the input data
+ contains zero nulls.
* \(3) The list offsets are unchanged, the list values are cast from the
input value type to the output value type (if a conversion is