scovich opened a new pull request, #8540: URL: https://github.com/apache/arrow-rs/pull/8540
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/8332

# Rationale for this change

Missing feature.

# What changes are included in this PR?

Add decimal unshredding support, which _should_ have been straightforward except:

1. The variant decimal types are not generic and do not implement any common trait that lets us generalize the logic easily. I added a custom trait in the unshredding module as a workaround, but we should probably look at adding something similar to arrow's `DecimalType` trait for the `VariantDecimalXX` types to implement (a rough sketch of such a trait appears at the end of this description).
2. The parquet reader seems to have a bug that forces 32- and 64-bit decimal columns to Decimal128 unless the reader specifically requests a narrower type. I think I fixed that bug, but naturally several existing parquet unit tests started to fail. I don't know yet whether those tests are simply expecting buggy behavior, or whether my fix is wrong/misguided/misplaced (a sketch of the intended precision-to-width mapping also appears at the end of this description).

<details>
<summary>Test failures caused by the parquet decimal fix</summary>

```
test arrow::array_reader::primitive_array::tests::test_primitive_array_reader_decimal_types ... FAILED
test arrow::arrow_reader::tests::test_arbitrary_decimal ... FAILED
test arrow::arrow_reader::tests::test_decimal ... FAILED
test arrow::arrow_reader::tests::test_read_decimal_file ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal128_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal256_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal64_dictionary ... FAILED
test arrow::schema::tests::test_arrow_schema_roundtrip ... FAILED
test arrow::schema::tests::test_column_desc_to_field ... FAILED
test arrow::schema::tests::test_decimal_fields ... FAILED
test statistics::test_decimal128 ... FAILED
test statistics::test_decimal64 ... FAILED
test statistics::test_decimal_256 ... FAILED
test statistics::test_data_page_stats_with_all_null_page ... FAILED
```

For example:

```
---- arrow::arrow_reader::tests::test_decimal stdout ----
thread 'arrow::arrow_reader::tests::test_decimal' panicked at parquet/src/arrow/arrow_reader/mod.rs:4570:9:
assertion `left == right` failed
  left: Schema { fields: [Field { name: "d1", data_type: Decimal64(9, 2) }, Field { name: "d2", data_type: Decimal64(10, 2) }, Field { name: "d3", data_type: Decimal64(18, 2) }], metadata: {} }
 right: Schema { fields: [Field { name: "d1", data_type: Decimal32(9, 2) }, Field { name: "d2", data_type: Decimal64(10, 2) }, Field { name: "d3", data_type: Decimal64(18, 2) }], metadata: {} }

---- arrow::arrow_reader::tests::test_read_decimal_file stdout ----
thread 'arrow::arrow_reader::tests::test_read_decimal_file' panicked at parquet/src/arrow/arrow_reader/mod.rs:2101:81:
called `Result::unwrap()` on an `Err` value: General("invalid data type for byte array reader - Decimal32(4, 2)")

---- arrow::arrow_writer::tests::arrow_writer_decimal stdout ----
thread 'arrow::arrow_writer::tests::arrow_writer_decimal' panicked at parquet/src/arrow/arrow_writer/mod.rs:2326:9:
assertion `left == right` failed
  left: Schema { fields: [Field { name: "a", data_type: Decimal128(5, 2) }], metadata: {} }
 right: Schema { fields: [Field { name: "a", data_type: Decimal32(5, 2) }], metadata: {} }
```

or

```
---- arrow::array_reader::primitive_array::tests::test_primitive_array_reader_decimal_types stdout ----
thread 'arrow::array_reader::primitive_array::tests::test_primitive_array_reader_decimal_types' panicked at parquet/src/arrow/array_reader/primitive_array.rs:916:13:
assertion `left == right` failed
  left: Decimal32(8, 2)
 right: Decimal128(8, 2)
```

or

```
---- arrow::arrow_reader::tests::test_arbitrary_decimal stdout ----
thread 'arrow::arrow_reader::tests::test_arbitrary_decimal' panicked at parquet/src/arrow/arrow_reader/mod.rs:4473:9:
assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "decimal_values_19_0", data_type: Decimal128(19, 0) }, Field { name: "decimal_values_12_0", data_type: Decimal128(12, 0) }, Field { name: "decimal_values_17_10", data_type: Decimal128(17, 10) }], metadata: {} }, columns: [PrimitiveArray<Decimal128(19, 0)> [ 1, 2, 3, 4, 5, 6, 7, 8, ], PrimitiveArray<Decimal128(12, 0)> [ 1, 2, 3, 4, 5, 6, 7, 8, ], PrimitiveArray<Decimal128(17, 10)> [ 1, 2, 3, 4, 5, 6, 7, 8, ]], row_count: 8 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "decimal_values_19_0", data_type: Decimal128(19, 0) }, Field { name: "decimal_values_12_0", data_type: Decimal64(12, 0) }, Field { name: "decimal_values_17_10", data_type: Decimal64(17, 10) }], metadata: {} }, columns: [PrimitiveArray<Decimal128(19, 0)> [ 1, 2, 3, 4, 5, 6, 7, 8, ], PrimitiveArray<Decimal64(12, 0)> [ 1, 2, 3, 4, 5, 6, 7, 8, ], PrimitiveArray<Decimal64(17, 10)> [ 1, 2, 3, 4, 5, 6, 7, 8, ]], row_count: 8 }
```

</details>

# Are these changes tested?

We typically require tests for all PRs in order to:

1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
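For reference, here is a minimal sketch of the kind of unifying trait that item 1 above is asking for, loosely modeled on arrow's `DecimalType`. Everything in it (`VariantDecimalType`, the `integer`/`scale` accessors, and the stand-in structs) is illustrative only; it is neither the workaround actually added in this PR nor the real `VariantDecimalXX` API:

```rust
/// Hypothetical unifying trait that the variant decimal wrappers could
/// implement so that unshredding logic can be written once, generically
/// over the decimal width.
trait VariantDecimalType {
    /// Native unscaled integer representation (i32, i64, or i128).
    type Native: Copy;

    /// Largest decimal precision representable at this integer width.
    const MAX_PRECISION: u8;

    /// Returns the unscaled integer value.
    fn integer(&self) -> Self::Native;

    /// Returns the number of fractional digits.
    fn scale(&self) -> u8;
}

// Illustrative stand-ins for the real `VariantDecimalXX` types.
struct VariantDecimal4 { integer: i32, scale: u8 }
struct VariantDecimal8 { integer: i64, scale: u8 }

impl VariantDecimalType for VariantDecimal4 {
    type Native = i32;
    const MAX_PRECISION: u8 = 9; // a 32-bit decimal holds at most 9 digits
    fn integer(&self) -> i32 { self.integer }
    fn scale(&self) -> u8 { self.scale }
}

impl VariantDecimalType for VariantDecimal8 {
    type Native = i64;
    const MAX_PRECISION: u8 = 18; // a 64-bit decimal holds at most 18 digits
    fn integer(&self) -> i64 { self.integer }
    fn scale(&self) -> u8 { self.scale }
}
```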
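And here is a rough illustration of the precision-to-width mapping that the fix described in item 2 appears to aim for, based on the expected values in the failing assertions above (precision 9 becomes Decimal32, 10-18 become Decimal64, 19 becomes Decimal128). The helper `narrowest_decimal` is hypothetical and is not the reader's actual code path:

```rust
use arrow_schema::DataType;

/// Picks the narrowest Arrow decimal type that can hold a Parquet decimal
/// with the given precision and scale, rather than widening everything
/// below 128 bits to Decimal128.
fn narrowest_decimal(precision: u8, scale: i8) -> DataType {
    match precision {
        0..=9 => DataType::Decimal32(precision, scale),
        10..=18 => DataType::Decimal64(precision, scale),
        19..=38 => DataType::Decimal128(precision, scale),
        _ => DataType::Decimal256(precision, scale),
    }
}
```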
