andygrove opened a new pull request, #3689: URL: https://github.com/apache/datafusion-comet/pull/3689
## Which issue does this PR close?

Closes #3311.

## Rationale for this change

When `spark.comet.schemaEvolution.enabled` is set to `false` (the default), the `native_datafusion` scan should reject Parquet files whose physical schema differs from the expected logical schema (e.g., int written as long). Previously, `native_datafusion` silently allowed schema widening, producing incorrect results or confusing errors instead of the expected `SchemaColumnConvertNotSupportedException`-style error that Spark produces.

## What changes are included in this PR?

**Runtime schema mismatch detection in native code:**

- Added `detect_schema_mismatch()` function in `schema_adapter.rs` that compares logical and physical schemas per-file at runtime
- Added `is_type_promotion()` recursive function to distinguish real type promotions (Int32→Int64) from adapter-handled differences (timestamp tz/unit, list/map/struct metadata, unsigned ints, FixedSizeBinary)
- The `schema_evolution_enabled` config flows from JVM through protobuf to `SparkParquetOptions`

**Spark-compatible error conversion:**

- Added `SchemaColumnConvertNotSupported` variant to `SparkError` enum
- Errors are emitted as `DataFusionError::External(SparkError)` so they flow through the JSON error path
- Added `SchemaColumnConvertNotSupported` handler in `ShimSparkErrorConverter` (all 3 Spark versions) that calls `QueryExecutionErrors.unsupportedSchemaColumnConvertError()`, producing the same `SparkException` with error class `_LEGACY_ERROR_TEMP_2063` that Spark natively produces

**Spark SQL test updates:**

- Unignored SPARK-35640 tests (`read binary as timestamp should throw schema incompatible error`, `int as long should throw schema incompatible error`) for `native_datafusion` scan since the enforcement now produces matching Spark errors

## How are these changes tested?
- Rust unit test `parquet_schema_mismatch_rejected_when_evolution_disabled` validates that type mismatches are rejected when schema evolution is disabled and allowed when enabled
- Existing `ParquetReadSuite` schema evolution tests validate end-to-end behavior
- Spark SQL tests (SPARK-35640) run in CI with Comet enabled to verify error compatibility
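
For readers unfamiliar with the check described above, the following is a minimal sketch of the idea, not the code in this PR. It assumes a simplified, hypothetical `DataType` enum standing in for Arrow's type, plain `(name, type)` pairs in place of real schemas, and a `String` error in place of `SparkError`. Only the function names `is_type_promotion` and `detect_schema_mismatch` and the `schema_evolution_enabled` flag come from the description; their signatures here are assumptions.

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType` (illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Binary,
    /// Timestamp with an optional time zone.
    Timestamp(Option<String>),
    List(Box<DataType>),
    Struct(Vec<(String, DataType)>),
}

/// Returns true when reading `physical` as `logical` needs a real type change,
/// as opposed to a difference this sketch treats as adapter-handled.
fn is_type_promotion(physical: &DataType, logical: &DataType) -> bool {
    use DataType::*;
    match (physical, logical) {
        // Time zone differences are treated as adapter-handled here.
        (Timestamp(_), Timestamp(_)) => false,
        // Recurse into list element types.
        (List(p), List(l)) => is_type_promotion(p, l),
        // Recurse into struct fields matched by name; fields missing from
        // the file are skipped in this sketch.
        (Struct(ps), Struct(ls)) => ls.iter().any(|(name, l)| {
            ps.iter()
                .find(|(n, _)| n == name)
                .map_or(false, |(_, p)| is_type_promotion(p, l))
        }),
        // Anything else counts as a promotion unless the types are identical.
        (p, l) => p != l,
    }
}

/// Compares a file's physical columns with the expected logical columns and
/// returns an error for the first promoted column when evolution is disabled.
fn detect_schema_mismatch(
    physical: &[(String, DataType)],
    logical: &[(String, DataType)],
    schema_evolution_enabled: bool,
) -> Result<(), String> {
    if schema_evolution_enabled {
        return Ok(());
    }
    for (name, l) in logical {
        if let Some((_, p)) = physical.iter().find(|(n, _)| n == name) {
            if is_type_promotion(p, l) {
                return Err(format!(
                    "column `{}`: file has {:?}, expected {:?}",
                    name, p, l
                ));
            }
        }
    }
    Ok(())
}
```

Under these assumptions, a file column stored as `Int32` but expected as `Int64` yields an `Err` when `schema_evolution_enabled` is `false` and `Ok(())` when it is `true`, while two timestamps that differ only in time zone pass in both cases.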
