malinjawi commented on issue #12195: URL: https://github.com/apache/gluten/issues/12195#issuecomment-4586598822
Thanks @felipepessoto for raising this issue!

I took a closer look, and I think this is a valid Gluten/Velox Delta gap.

The fallback seems to happen before we reach the normal Delta scan offload path. Delta CDF reads are exposed to Spark as `CDCReader.DeltaCDFRelation`, so Gluten does not initially see the usual Delta file scan shape that `OffloadDeltaScan` handles.

The useful detail is that Delta itself eventually expands the CDF read through `CDCReader.changesToBatchDF(...)`, producing normal file scans over CDF-aware indexes such as `TahoeChangeFileIndex`, `CdcAddFileIndex`, and `TahoeRemoveFileIndex`, with `DeltaParquetFileFormat(isCDCRead = true)`. The CDF metadata columns `_commit_version`, `_commit_timestamp`, and `_change_type` are mostly represented through Delta’s generated/partition-style columns, while the CDC parquet files carry the CDC row data.

So I do not think we need to start by adding a completely new native CDF reader in Velox. A cleaner approach may be to add a Gluten Delta planner strategy that recognizes `CDCReader.DeltaCDFRelation`, expands it using Delta’s internal CDF batch plan, and then lets the existing Gluten/Velox Delta scan offload rules apply normally.

A good fix should probably include:

- A planner strategy for `DeltaCDFRelation`.
- Version-specific Delta helper shims, since the CDF APIs differ between Delta versions.
- A regression test covering `table_changes(...)` with insert, update preimage/postimage, and delete rows.
- A plan assertion that the resulting CDF query contains `DeltaScanTransformer` instead of falling back fully to Spark.

I have put together a prototype in this direction, and it compiles for the Spark 3.5 / Delta 3.3 profile. I have not yet run the full Velox runtime test, because that needs the native Gluten/Velox library setup, but the issue itself looks real and actionable.

cc: @zhztheplayer @zhouyuan Any other ideas?
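To make the ordering argument concrete, here is a toy model in plain Python. It is not Spark, Delta, or Gluten code: every class and function name is a hypothetical stand-in, and the union shape of the expanded plan is an assumption made only for illustration. It shows why an offload rule that matches file scans never fires on an opaque CDF relation, and why expanding that relation first would let the same rule apply unchanged.

```python
from dataclasses import dataclass

# Toy stand-ins for plan nodes. None of these are real Spark/Delta/Gluten classes.

@dataclass(frozen=True)
class CdfRelation:
    """Plays the role of CDCReader.DeltaCDFRelation (opaque to the offload rule)."""
    table: str
    start_version: int

@dataclass(frozen=True)
class FileScan:
    """Plays the role of a regular Delta file scan over one CDF-aware index."""
    index: str
    is_cdc_read: bool

@dataclass(frozen=True)
class NativeScan:
    """Plays the role of DeltaScanTransformer."""
    index: str

@dataclass(frozen=True)
class UnionAll:
    """Assumed shape of the expanded CDF plan: a union of file scans."""
    children: tuple

# Index names taken from the comment above; used here only as labels.
CDF_INDEXES = ("TahoeChangeFileIndex", "CdcAddFileIndex", "TahoeRemoveFileIndex")

def expand_cdf(plan):
    """Stand-in for the proposed strategy: turn the CDF relation into file scans."""
    if isinstance(plan, CdfRelation):
        return UnionAll(tuple(FileScan(ix, True) for ix in CDF_INDEXES))
    if isinstance(plan, UnionAll):
        return UnionAll(tuple(expand_cdf(c) for c in plan.children))
    return plan

def offload_scans(plan):
    """Stand-in for the existing offload rule: it only matches FileScan nodes."""
    if isinstance(plan, FileScan):
        return NativeScan(plan.index)
    if isinstance(plan, UnionAll):
        return UnionAll(tuple(offload_scans(c) for c in plan.children))
    return plan  # anything else, including CdfRelation, stays as a fallback

def count_nodes(plan, node_type):
    """Count nodes of a given type, mirroring a plan assertion in a test."""
    own = 1 if isinstance(plan, node_type) else 0
    children = plan.children if isinstance(plan, UnionAll) else ()
    return own + sum(count_nodes(c, node_type) for c in children)

query = CdfRelation("events", start_version=0)
without_strategy = offload_scans(query)            # rule never sees a FileScan
with_strategy = offload_scans(expand_cdf(query))   # expansion runs first

print(count_nodes(without_strategy, NativeScan))
print(count_nodes(with_strategy, NativeScan))
```

In this model the native-scan count plays the same role as the proposed plan assertion: a real regression test would instead inspect the executed plan of a `table_changes(...)` query for `DeltaScanTransformer`.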
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
