schenksj opened a new issue, #4189: URL: https://github.com/apache/datafusion-comet/issues/4189
## Summary

Comet's native scan paths (`SCAN_NATIVE_DATAFUSION` and the new `SCAN_NATIVE_DELTA_COMPAT` in the delta-kernel-phase-1 work) read parquet columns by name. When the user enables Spark's parquet field-ID-based column resolution via `spark.sql.parquet.fieldId.read.enabled=true`, Spark's parquet reader matches columns by `parquet.field.id` metadata on each `StructField` rather than by name. DataFusion's parquet path does not honour that metadata, so columns are still resolved by name -- silently producing wrong results when names and IDs disagree.

## Repro (Delta column-mapping `id` mode)

The Delta `id` column-mapping mode relies on field-ID matching to decouple the table's logical column name from the parquet file's physical name. Tests that exercise the rename-detection semantics (e.g. `DeltaColumnMappingSuite` "column mapping batch scan should detect physical name changes" and "explicit id matching") expect nulls when a field's ID is changed in Delta metadata such that it no longer matches the file's stored ID. Vanilla Spark + Delta returns nulls; Comet returns the actual data because its by-name resolver finds the column whose name didn't change.

## Workaround

`nativeDataFusionScan` already declines when both `spark.sql.parquet.fieldId.read.enabled=true` and the `requiredSchema` has field-IDs (`ParquetUtils.hasFieldIds`). The same gate has now been mirrored in `nativeDeltaScan`. However, the check returns false for Delta because Delta's `HadoopFsRelation` strips the field-ID metadata from `requiredSchema` -- the IDs live on the snapshot's metadata, which the Comet rule doesn't consult. So the gate never fires for Delta column-mapping `id` mode.

## Proposed fix

Extend Comet's parquet-read path to honour `parquet.field.id` / `field_id` Arrow metadata for column resolution when the session's `PARQUET_FIELD_ID_READ_ENABLED` is true, mirroring Spark's `ParquetReadSupport.matchByName/matchByID` selection.
Track per-field IDs on `data_schema` and pass them through to the native parquet reader so the schema adapter prefers ID-match.

Filed against: branch delta-kernel-phase-1 (PR #3932)

Related Spark behavior: `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport`
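
To make the difference between by-name and by-ID resolution concrete, here is a minimal Rust sketch of an ID-preferring resolver. It is an illustration only, not Comet's or DataFusion's actual schema adapter: the `Field` type and the `resolve_columns` function are hypothetical names, the schema is flat, name matching is case-sensitive, and the case where a file carries no field IDs at all is not handled.

```rust
/// Simplified stand-in for a schema field. `field_id` models the
/// `parquet.field.id` metadata entry. Hypothetical type, not a Comet API.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub field_id: Option<i32>,
}

impl Field {
    pub fn new(name: &str, field_id: Option<i32>) -> Self {
        Field { name: name.to_string(), field_id }
    }
}

/// For each required field, return the index of the matching file column,
/// or `None` when the column should be materialised as all nulls.
pub fn resolve_columns(
    required: &[Field],
    file: &[Field],
    field_id_read_enabled: bool,
) -> Vec<Option<usize>> {
    required
        .iter()
        .map(|req| match req.field_id {
            // ID-based resolution: if no file column carries this ID the
            // result is `None`, even when a column with the same name exists.
            Some(id) if field_id_read_enabled => {
                file.iter().position(|f| f.field_id == Some(id))
            }
            // Fallback: by-name resolution, used when the flag is off or
            // the required field has no ID.
            _ => file.iter().position(|f| f.name == req.name),
        })
        .collect()
}
```

In the repro scenario, a required field `b` whose ID was changed to 99 against a file column `b` stored with ID 2 resolves to `None` (nulls) under ID matching, but to the file column under name matching, which is the behavior this issue describes for Comet today.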
