andygrove opened a new issue, #4351:
URL: https://github.com/apache/datafusion-comet/issues/4351
## Description
When a Parquet column is plain BINARY (no `DecimalLogicalTypeAnnotation`),
Spark's vectorized reader rejects reading it as a `DecimalType` via
`ParquetVectorUpdaterFactory.getUpdater`'s `BINARY` case (lines 199-205):
`canReadAsDecimal` and `canReadAsBinaryDecimal` both require the column to have
a `DecimalLogicalTypeAnnotation`. Without the annotation, Spark falls through
and throws `SchemaColumnConvertNotSupportedException`.
`native_datafusion`'s `schema_adapter.rs` currently allows this conversion
silently. The first rejection block (lines 599-619) lists `Decimal128(_, _) |
Decimal256(_, _)` in the allowed targets, intending to permit a "binary-encoded
decimal", but Arrow already exposes a Parquet BINARY column with
`DecimalLogicalTypeAnnotation` as `DataType::Decimal128`, never as
`DataType::Binary`. So when `physical_type == Binary` is observed in the
adapter, the source is unambiguously a non-decimal BINARY column.
Because the adapter accepts the conversion but DataFusion cannot actually cast
Binary to Decimal128, the uncast column reaches Arrow's `RecordBatch::try_new`
validation and fails with:
```
Invalid argument error: column types must match schema types, expected
Decimal128(37, 1) but found Binary at column index 0
```
which is surfaced to the JVM as a `CometNativeException` instead of the
Spark-equivalent `SchemaColumnConvertNotSupportedException`.
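The accept-then-fail behavior can be sketched with a minimal, self-contained stand-in; the enum `Target` and the function `binary_target_allowed` below are hypothetical simplifications of the real Arrow `DataType` match in `schema_adapter.rs`, not the actual Comet code:

```rust
// Hypothetical stand-in for Arrow's DataType and the adapter's
// allowed-target match; a sketch, not the real schema_adapter.rs code.
#[derive(Debug, PartialEq)]
enum Target {
    Utf8,
    Binary,
    Decimal128(u8, i8), // (precision, scale), as in Arrow
}

// Current behavior: Decimal128 is listed among the allowed targets for a
// plain-BINARY source column, so the mismatch is accepted silently here
// and only fails later, inside Arrow's RecordBatch::try_new validation.
fn binary_target_allowed(target: &Target) -> bool {
    matches!(target, Target::Utf8 | Target::Binary | Target::Decimal128(_, _))
}

fn main() {
    // BINARY -> DECIMAL(37, 1) is wrongly accepted by the adapter.
    assert!(binary_target_allowed(&Target::Decimal128(37, 1)));
}
```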
## Reproducer
`SPARK-34212 Parquet should read decimals correctly` in `ParquetQuerySuite`
(Spark 4.1.x):
```scala
val df = sql(
  s"SELECT 1 a, 123456 b, ${Int.MaxValue.toLong * 10} c, CAST('1.2' AS BINARY) d")
df.write.parquet(path.toString)
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
  Seq("a DECIMAL(3, 2)", "c DECIMAL(18, 1)", "d DECIMAL(37, 1)").foreach { schema =>
    val e = intercept[SparkException] {
      readParquet(schema, path).collect()
    }.getCause
    assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])
  }
}
```
The `d DECIMAL(37, 1)` iteration fails because `native_datafusion` doesn't
reject the BINARY → DECIMAL read.
## Fix
Drop `DataType::Decimal128(_, _) | DataType::Decimal256(_, _)` from the
allowed-targets match at `native/core/src/parquet/schema_adapter.rs:608-609`.
Add a regression test in the style of the existing
`parquet_string_read_as_int_errors` test.
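The proposed change can be sketched as follows, again with hypothetical stand-in types rather than the real Arrow `DataType` (the actual edit is to the match arms around `schema_adapter.rs:608-609`):

```rust
// Hypothetical stand-in for Arrow's DataType; a sketch of the fixed
// allowed-target match, not the real schema_adapter.rs code.
#[derive(Debug, PartialEq)]
enum Target {
    Utf8,
    Binary,
    Decimal128(u8, i8), // (precision, scale), as in Arrow
    Decimal256(u8, i8),
}

// After the fix: decimal targets are no longer allowed for a plain-BINARY
// source, so the adapter rejects the read up front, matching the behavior
// of Spark's vectorized reader (SchemaColumnConvertNotSupportedException).
fn binary_target_allowed(target: &Target) -> bool {
    matches!(target, Target::Utf8 | Target::Binary)
}

fn main() {
    assert!(binary_target_allowed(&Target::Binary));
    assert!(!binary_target_allowed(&Target::Decimal128(37, 1)));
    assert!(!binary_target_allowed(&Target::Decimal256(37, 1)));
}
```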
## Related
- #3720 (parent: native_datafusion silent schema-mismatch acceptance)
- #4297 / #4343 / #4344 (other rejection-class gaps already addressed)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]