andygrove opened a new issue, #4343:
URL: https://github.com/apache/datafusion-comet/issues/4343

   ## Description
   
   `native_datafusion` silently accepts decimal-to-decimal Parquet reads where 
the requested read type narrows the precision or scale below what is needed to 
represent the file's values. Spark's vectorized reader rejects these 
conversions with `SchemaColumnConvertNotSupportedException` because the file 
values cannot be safely represented in the requested type. `native_datafusion` 
instead returns wrong (truncated/overflowed) values.
   
   This is the decimal-to-decimal counterpart to #4297 (primitive-to-primitive 
numeric/date conversions), and a generalisation of the specific case that was 
tracked in #4089.
   
   ## Affected tests (Spark 4.1.1, `dev/diffs/4.1.1.diff`)
   
   Currently tagged `IgnoreCometNativeDataFusion`, pointing at the umbrella issue #3720:
   
   - `ParquetQuerySuite` — `SPARK-34212 Parquet should read decimals correctly`
     Asserts `SchemaColumnConvertNotSupportedException` when reading e.g. 
`DECIMAL(18,2)` as `DECIMAL(3,0)`.
   - `ParquetTypeWideningSuite` — `parquet decimal precision change 
Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2)`
     Iterates precision pairs across INT32-, INT64-, and FIXED_LEN_BYTE_ARRAY-backed decimals; expects an error whenever the vectorized reader is enabled and `fromPrecision > toPrecision`.
   - `ParquetTypeWideningSuite` — `parquet decimal precision and scale change 
Decimal($fromPrecision, $fromScale) -> Decimal($toPrecision, $toScale)`
     Same idea but varies both precision and scale.
   
   The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under 
#3720 there as well.
   
   ## Reproduction
   
   ```scala
   import org.apache.comet.CometConf
   import org.apache.spark.sql.internal.SQLConf

   // Run inside a suite mixing in SQLTestUtils with SharedSparkSession, which
   // provides withSQLConf, withTempPath, and testImplicits (for toDF).
   import testImplicits._

   withSQLConf(
     CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
     SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
     withTempPath { dir =>
       val path = dir.getCanonicalPath
       Seq(BigDecimal("123.45")).toDF("d")
         .selectExpr("cast(d as decimal(10,2)) as d")
         .write.parquet(path)
       spark.read.schema("d decimal(5,0)").parquet(path).show()
       // Expected: SparkException(SchemaColumnConvertNotSupportedException)
       // Actual: silent wrong/truncated value
     }
   }
   ```
   
   `native_iceberg_compat` correctly throws `SparkException` for this case.
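
   For a quick cross-check, a minimal sketch of asserting the expected error in a
   test (same repro as above with only the scan impl swapped; this assumes the
   `CometConf.SCAN_NATIVE_ICEBERG_COMPAT` constant and ScalaTest's `intercept`):

   ```scala
   withSQLConf(
     CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_ICEBERG_COMPAT,
     SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
     withTempPath { dir =>
       val path = dir.getCanonicalPath
       Seq(BigDecimal("123.45")).toDF("d")
         .selectExpr("cast(d as decimal(10,2)) as d")
         .write.parquet(path)
       // The narrowing read should fail instead of silently truncating.
       intercept[org.apache.spark.SparkException] {
         spark.read.schema("d decimal(5,0)").parquet(path).collect()
       }
     }
   }
   ```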
   
   ## Suggested approach
   
   Same direction as #4297: extend the allowlist used by 
`replace_with_spark_cast` / the decimal branch of the schema adapter so that 
decimal-to-decimal coercions match Spark's `ParquetVectorUpdaterFactory` rules 
— only accept widening (or equal-scale precision widening) and reject 
everything else with `SparkError::ParquetSchemaConvert`.
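
   As a rough sketch of that rule (names below are illustrative, not existing
   Comet APIs, and Spark's exact conditions in `ParquetVectorUpdaterFactory`
   remain the source of truth), a decimal-to-decimal read is only safe when
   neither the scale nor the number of integral digits shrinks:

   ```scala
   // Hypothetical predicate sketching the widening-only rule: the requested
   // decimal(toPrecision, toScale) must be able to represent every value of
   // the file's decimal(fromPrecision, fromScale) without loss.
   def isAcceptedDecimalCoercion(
       fromPrecision: Int, fromScale: Int,
       toPrecision: Int, toScale: Int): Boolean = {
     val fromIntegralDigits = fromPrecision - fromScale
     val toIntegralDigits = toPrecision - toScale
     // Scale may only grow (no fractional digits dropped) and the integral
     // part may only grow (no overflow). Equal-scale precision widening is
     // the special case toScale == fromScale, toPrecision >= fromPrecision.
     toScale >= fromScale && toIntegralDigits >= fromIntegralDigits
   }
   ```

   Under such a predicate, the reproduction above (`decimal(10,2)` read as
   `decimal(5,0)`) fails both conditions, so the native scan would reject the
   conversion rather than truncate.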
   
   ## Parent issue
   
   Split from umbrella #3720 (and #4089, which fixed a single decimal narrowing 
case but did not unblock the broader test coverage).

