andygrove opened a new pull request, #4229:
URL: https://github.com/apache/datafusion-comet/pull/4229

   ## Which issue does this PR close?
   
   Closes #3720.
   
   ## Rationale for this change
   
   When `spark.comet.scan.impl=native_datafusion` and 
`COMET_SCHEMA_EVOLUTION_ENABLED` is false, several Spark SQL tests that expect a 
`SchemaColumnConvertNotSupportedException` on incompatible Parquet reads never 
see that error: DataFusion's reader is more permissive and silently coerces 
mismatched numeric types instead of failing.
   
   This PR makes the native_datafusion scan path reject the same numeric 
widening cases that Spark's vectorized reader rejects, and formats the 
resulting error so it matches Spark's `_LEGACY_ERROR_TEMP_2063` template 
byte-for-byte.
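   
   To make the intended gate concrete, here is a minimal, self-contained Rust sketch of the behavior described above. All names (`ParquetType`, `check_parquet_read`, `is_forbidden_widening`) are hypothetical illustrations, not the actual datafusion-comet API; the real check lives in the schema-adapter path.
   
   ```rust
   // Illustrative sketch only: models the gate described in this PR, where
   // numeric widening is rejected unless type promotion is allowed.
   
   /// Simplified stand-in for the Parquet/Spark types involved (assumption).
   #[derive(Clone, Copy, PartialEq, Debug)]
   enum ParquetType {
       Int32,
       Int64,
       Float,
       Double,
   }
   
   /// True if reading `file_type` as `requested_type` is one of the numeric
   /// widenings that Spark's vectorized reader refuses.
   fn is_forbidden_widening(file_type: ParquetType, requested_type: ParquetType) -> bool {
       use ParquetType::*;
       matches!(
           (file_type, requested_type),
           (Int32, Int64) | (Float, Double) | (Int32, Double)
       )
   }
   
   /// With type promotion disabled, an incompatible pair becomes an error
   /// instead of a silent cast.
   fn check_parquet_read(
       file_type: ParquetType,
       requested_type: ParquetType,
       allow_type_promotion: bool,
   ) -> Result<(), String> {
       if !allow_type_promotion && is_forbidden_widening(file_type, requested_type) {
           return Err(format!(
               "schema conversion not supported: {:?} -> {:?}",
               file_type, requested_type
           ));
       }
       Ok(())
   }
   
   fn main() {
       // Rejected when promotion is off...
       assert!(check_parquet_read(ParquetType::Int32, ParquetType::Int64, false).is_err());
       // ...but allowed when COMET_SCHEMA_EVOLUTION_ENABLED is true.
       assert!(check_parquet_read(ParquetType::Int32, ParquetType::Int64, true).is_ok());
       println!("ok");
   }
   ```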
   
   ## What changes are included in this PR?
   
   - Pass `COMET_SCHEMA_EVOLUTION_ENABLED` from JVM to native via protobuf 
(`allow_type_promotion` on the common scan options).
   - In `replace_with_spark_cast`, reject `INT32 -> INT64`, `FLOAT -> DOUBLE`, 
and `INT32 -> DOUBLE` when `allow_type_promotion` is false, raising 
`SparkError::ParquetSchemaConvert` (mirrors `TypeUtil.checkParquetType` in the 
JVM code).
   - Format the column name as `[name]` and emit Spark catalog names (`bigint`, 
`int`) plus Parquet primitive names (`INT32`, `INT64`) so the message matches 
Spark's vectorized reader output exactly.
   - Update `dev/diffs/3.4.3.diff`:
     - Remove `IgnoreCometNativeDataFusion` for `SPARK-35640: int as long` and 
`row group skipping doesn't overflow when reading into larger type` (now 
passing).
     - Repoint `SPARK-36182: can't read TimestampLTZ as TimestampNTZ` at #4219 
(out of scope here).
   - Revert an incidental `#[cfg(test)]` gate on `parquet/util/test_common` so 
the `parquet_read` benchmark builds.
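   
   The error-formatting rules in the third bullet can be sketched as follows. This is a hedged approximation only: it does not reproduce the exact wording of Spark's `_LEGACY_ERROR_TEMP_2063` template, just the pieces the PR calls out (column rendered as `[name]`, Spark catalog type names like `bigint`, and Parquet primitive names like `INT32`). `spark_catalog_name` and `format_convert_error` are invented names for illustration.
   
   ```rust
   // Tiny illustrative mapping from Arrow-style type names to Spark catalog
   // names (assumption; not Comet's real lookup table).
   fn spark_catalog_name(arrow_like: &str) -> &'static str {
       match arrow_like {
           "Int32" => "int",
           "Int64" => "bigint",
           "Float32" => "float",
           "Float64" => "double",
           _ => "unknown",
       }
   }
   
   // Formats the column as "[name]" and pairs the Spark catalog name of the
   // expected type with the Parquet primitive name of the found type.
   fn format_convert_error(column: &str, expected_arrow: &str, found_parquet: &str) -> String {
       format!(
           "column: [{}], expected: {}, found: {}",
           column,
           spark_catalog_name(expected_arrow),
           found_parquet
       )
   }
   
   fn main() {
       let msg = format_convert_error("id", "Int64", "INT32");
       assert_eq!(msg, "column: [id], expected: bigint, found: INT32");
       println!("{msg}");
   }
   ```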
   
   ## How are these changes tested?
   
   Verified locally against Apache Spark 3.4.3 with the regenerated diff and 
`ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt sql/testOnly`:
   
   - `ParquetIOSuite > SPARK-35640: int as long should throw schema 
incompatible error` passes.
   - `ParquetV1QuerySuite > row group skipping doesn't overflow when reading 
into larger type` passes.
   - `ParquetV1QuerySuite > SPARK-36182` and `ParquetV1QuerySuite > 
SPARK-34212` are correctly ignored under the kept tags.
   - The schema-adapter Rust code stays warning-free under `cargo clippy 
--all-targets --workspace -- -D warnings`, and the existing schema-adapter 
Rust tests still pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

