schenksj opened a new issue, #4189:
URL: https://github.com/apache/datafusion-comet/issues/4189

   ## Summary
   
   Comet's native scan paths (`SCAN_NATIVE_DATAFUSION` and the new
   `SCAN_NATIVE_DELTA_COMPAT` in the delta-kernel-phase-1 work) read parquet
   columns by name. When the user enables Spark's parquet field-ID-based
   column resolution via `spark.sql.parquet.fieldId.read.enabled=true`,
   Spark's parquet reader matches columns by `parquet.field.id` metadata
   on each `StructField` rather than by name. DataFusion's parquet path
   does not honour that metadata, so columns are still resolved by name --
   silently producing wrong results when names and IDs disagree.
   
   ## Repro (Delta column-mapping `id` mode)
   
   The Delta `id` column-mapping mode relies on field-ID matching to
   decouple the table's logical column name from the parquet file's
   physical name. Tests that exercise the rename-detection semantics
   (e.g. `DeltaColumnMappingSuite` "column mapping batch scan should
   detect physical name changes" and "explicit id matching") expect
   nulls when a field's ID is changed in Delta metadata such that it
   no longer matches the file's stored ID. Vanilla Spark + Delta returns
   nulls; Comet returns the actual data because its by-name resolver
   finds the column whose name didn't change.
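   The divergence above can be modeled in a few lines. This is a minimal sketch with simplified stand-in types (`FileColumn`, `resolve_by_name`, `resolve_by_id` are illustrative, not Comet's or DataFusion's actual code): once Delta metadata reassigns a field ID, strict ID matching finds nothing (and Spark fills the column with nulls), while by-name matching still finds the physically stored data.

   ```rust
   // Illustrative model only: a parquet leaf column carrying both a
   // physical name and a parquet field ID.
   #[derive(Debug, Clone)]
   struct FileColumn {
       name: &'static str,
       field_id: i32,
       values: Vec<i64>,
   }

   // Comet's current behavior: resolve by physical name.
   fn resolve_by_name<'a>(cols: &'a [FileColumn], name: &str) -> Option<&'a FileColumn> {
       cols.iter().find(|c| c.name == name)
   }

   // Spark's behavior with fieldId.read.enabled=true: resolve strictly by ID.
   fn resolve_by_id(cols: &[FileColumn], id: i32) -> Option<&FileColumn> {
       cols.iter().find(|c| c.field_id == id)
   }

   fn main() {
       // File written with one column named "a", parquet field ID 1.
       let file = vec![FileColumn { name: "a", field_id: 1, values: vec![10, 20] }];

       // Delta `id` column mapping later changes the expected ID to 2 while
       // the physical name stays "a". ID matching fails -> Spark reads nulls.
       assert!(resolve_by_id(&file, 2).is_none());

       // The by-name resolver still finds the stored data -> wrong (non-null)
       // results from Comet.
       let hit = resolve_by_name(&file, "a").unwrap();
       assert_eq!(hit.values, vec![10, 20]);
       println!("by-id: no match; by-name: matched {:?}", hit.name);
   }
   ```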
   
   ## Workaround
   
   `nativeDataFusionScan` already declines when both
   `spark.sql.parquet.fieldId.read.enabled=true` and the `requiredSchema`
   carries field IDs (`ParquetUtils.hasFieldIds`). The same gate has now
   been mirrored in `nativeDeltaScan`. However, `hasFieldIds` returns
   false for Delta because Delta's `HadoopFsRelation` strips the field-ID
   metadata from `requiredSchema`; the IDs live on the Delta snapshot's
   metadata, which the Comet rule never consults. So the gate never fires
   for Delta column-mapping `id` mode.

   
   ## Proposed fix
   
   Extend Comet's parquet-read path to honour `parquet.field.id` /
   `field_id` Arrow metadata for column resolution when the session's
   `PARQUET_FIELD_ID_READ_ENABLED` is true, mirroring Spark's
   `ParquetReadSupport.matchByName/matchByID` selection. Track per-field
   IDs on `data_schema` and pass them through to the native parquet
   reader so the schema adapter prefers ID-match.
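   A hedged sketch of that resolution order, assuming simplified stand-in types (`SimpleField` and `match_column` are illustrative; the real change would live in the native reader's schema adaptation). Arrow's convention is to carry parquet field IDs in field metadata under the key `PARQUET:field_id` (arrow-rs exposes this as `parquet::arrow::PARQUET_FIELD_ID_META_KEY`):

   ```rust
   use std::collections::HashMap;

   // Arrow metadata key under which parquet field IDs travel.
   const FIELD_ID_KEY: &str = "PARQUET:field_id";

   // Simplified stand-in for an Arrow field; not Comet's actual types.
   #[derive(Clone)]
   struct SimpleField {
       name: String,
       metadata: HashMap<String, String>,
   }

   fn field_id(f: &SimpleField) -> Option<i32> {
       f.metadata.get(FIELD_ID_KEY).and_then(|v| v.parse().ok())
   }

   /// Pick the file-schema column index for `required`. When ID reading is
   /// enabled and the required field carries an ID, match strictly by ID
   /// (no match -> None, and the reader fills the column with nulls, per
   /// Spark's semantics); otherwise fall back to by-name matching.
   fn match_column(
       file: &[SimpleField],
       required: &SimpleField,
       id_read_enabled: bool,
   ) -> Option<usize> {
       if id_read_enabled {
           if let Some(id) = field_id(required) {
               return file.iter().position(|f| field_id(f) == Some(id));
           }
       }
       file.iter().position(|f| f.name == required.name)
   }

   fn main() {
       let mk = |name: &str, id: i32| SimpleField {
           name: name.to_string(),
           metadata: HashMap::from([(FIELD_ID_KEY.to_string(), id.to_string())]),
       };
       let file = vec![mk("a", 1)];

       // ID mismatch: strict ID matching yields no column even though the
       // name still matches; disabling ID reads falls back to by-name.
       assert_eq!(match_column(&file, &mk("a", 2), true), None);
       assert_eq!(match_column(&file, &mk("a", 2), false), Some(0));
       println!("ok");
   }
   ```

   Falling back to by-name only when the required field has no ID keeps today's behavior for schemas without field-ID metadata.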
   
   Filed against: branch delta-kernel-phase-1 (PR #3932)
   Related Spark behavior: `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
