andygrove opened a new issue, #4351:
URL: https://github.com/apache/datafusion-comet/issues/4351

   ## Description
   
   When a Parquet column is plain BINARY (no `DecimalLogicalTypeAnnotation`), 
Spark's vectorized reader rejects reading it as a `DecimalType` via 
`ParquetVectorUpdaterFactory.getUpdater`'s `BINARY` case (lines 199-205): 
`canReadAsDecimal` and `canReadAsBinaryDecimal` both require the column to have 
a `DecimalLogicalTypeAnnotation`. Without the annotation, Spark falls through 
and throws `SchemaColumnConvertNotSupportedException`.
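
   The annotation gate can be sketched with a toy model (Rust, with illustrative names and shapes; the real logic lives in Spark's Java `ParquetVectorUpdaterFactory` and is not reproduced here):

   ```rust
   // Toy model of Spark's canReadAsBinaryDecimal gate: reading a Parquet
   // BINARY column as a decimal requires a DecimalLogicalTypeAnnotation.
   // Names and types are illustrative, not Spark's actual API.
   #[derive(Debug)]
   enum LogicalType {
       Decimal { precision: u8, scale: i8 },
   }

   fn can_read_as_binary_decimal(annotation: Option<&LogicalType>) -> bool {
       matches!(annotation, Some(LogicalType::Decimal { .. }))
   }

   fn main() {
       // Annotated column: the decimal read is permitted.
       assert!(can_read_as_binary_decimal(Some(&LogicalType::Decimal {
           precision: 37,
           scale: 1,
       })));
       // Plain BINARY, no annotation: Spark falls through and throws
       // SchemaColumnConvertNotSupportedException.
       assert!(!can_read_as_binary_decimal(None));
   }
   ```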
   
   `native_datafusion`'s `schema_adapter.rs` currently allows this conversion 
silently. The first rejection block (lines 599-619) lists `Decimal128(_, _) | 
Decimal256(_, _)` in the allowed targets, intending to permit a "binary-encoded 
decimal", but Arrow already exposes a Parquet BINARY column with 
`DecimalLogicalTypeAnnotation` as `DataType::Decimal128`, never as 
`DataType::Binary`. So when `physical_type == Binary` is observed in the 
adapter, the source is unambiguously a non-decimal BINARY column.
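
   To illustrate (a toy Rust sketch with made-up names, not Comet's or Arrow's actual types): because Arrow resolves the decimal annotation before the adapter runs, an annotated column never reaches the adapter as `Binary`:

   ```rust
   // Toy model of how Arrow surfaces a Parquet BINARY column to the schema
   // adapter: with a DecimalLogicalTypeAnnotation it already arrives as
   // Decimal128; only a plain BINARY column arrives as Binary.
   #[derive(Debug, PartialEq)]
   enum DataType {
       Binary,
       Decimal128(u8, i8),
   }

   fn arrow_file_type(has_decimal_annotation: bool) -> DataType {
       if has_decimal_annotation {
           DataType::Decimal128(37, 1) // precision/scale taken from the annotation
       } else {
           DataType::Binary
       }
   }

   fn main() {
       // Consequently, the Decimal arms in the Binary allowed-targets match
       // can only ever admit genuinely non-decimal data.
       assert_eq!(arrow_file_type(true), DataType::Decimal128(37, 1));
       assert_eq!(arrow_file_type(false), DataType::Binary);
   }
   ```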
   
   When the cast falls through, DataFusion can't actually cast Binary to 
Decimal128; the column reaches Arrow's `RecordBatch::try_new` validation and 
fails with:
   
   ```
   Invalid argument error: column types must match schema types, expected Decimal128(37, 1) but found Binary at column index 0
   ```
   
   This error is surfaced to the JVM as a `CometNativeException` instead of the 
Spark-equivalent `SchemaColumnConvertNotSupportedException`.
   
   ## Reproducer
   
   `SPARK-34212 Parquet should read decimals correctly` in `ParquetQuerySuite` 
(Spark 4.1.x):
   
   ```scala
   val df = sql(s"SELECT 1 a, 123456 b, ${Int.MaxValue.toLong * 10} c, CAST('1.2' AS BINARY) d")
   df.write.parquet(path.toString)
   
   withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
     Seq("a DECIMAL(3, 2)", "c DECIMAL(18, 1)", "d DECIMAL(37, 1)").foreach { schema =>
       val e = intercept[SparkException] {
         readParquet(schema, path).collect()
       }.getCause
       assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])
     }
   }
   ```
   
   The `d DECIMAL(37, 1)` iteration fails because `native_datafusion` doesn't 
reject the BINARY → DECIMAL read.
   
   ## Fix
   
   Drop `DataType::Decimal128(_, _) | DataType::Decimal256(_, _)` from the 
allowed-targets match at `native/core/src/parquet/schema_adapter.rs:608-609`. 
Add a regression test in the `parquet_string_read_as_int_errors` style.
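
   Sketched with toy enums (illustrative names; the real match in `schema_adapter.rs` covers more target types than shown here), the change amounts to:

   ```rust
   // Toy model of the allowed-targets match for a Binary source column.
   #[derive(Debug, PartialEq)]
   enum DataType {
       Binary,
       Utf8,
       Decimal128(u8, i8),
       Decimal256(u8, i8),
   }

   // Before: Decimal targets are accepted for a Binary source, deferring
   // the failure to RecordBatch::try_new.
   fn binary_target_allowed_before(target: &DataType) -> bool {
       matches!(
           target,
           DataType::Binary
               | DataType::Utf8
               | DataType::Decimal128(_, _)
               | DataType::Decimal256(_, _)
       )
   }

   // After: with the Decimal arms dropped, Binary -> Decimal falls into
   // the rejection path and can surface a Spark-style schema error.
   fn binary_target_allowed_after(target: &DataType) -> bool {
       matches!(target, DataType::Binary | DataType::Utf8)
   }

   fn main() {
       let target = DataType::Decimal128(37, 1);
       assert!(binary_target_allowed_before(&target));
       assert!(!binary_target_allowed_after(&target));
   }
   ```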
   
   ## Related
   
   - #3720 (parent: native_datafusion silent schema-mismatch acceptance)
   - #4297 / #4343 / #4344 (other rejection-class gaps already addressed)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

