Re: [I] [VL][DELTA] Support reading Delta Lake Change Data Feed (CDF) without falling back to vanilla Spark [gluten]

via GitHub Sun, 31 May 2026 04:50:26 -0700


malinjawi commented on issue #12195:
URL: https://github.com/apache/gluten/issues/12195#issuecomment-4586598822


   Thanks @felipepessoto for raising this issue!
   
   I took a closer look, and I think this is a genuine gap in Gluten/Velox 
Delta support.
   
   The fallback seems to happen before we reach the normal Delta scan offload 
path. Delta CDF reads are exposed to Spark as `CDCReader.DeltaCDFRelation`, so 
Gluten does not initially see the usual Delta file scan shape that 
`OffloadDeltaScan` handles.
   
   The useful detail is that Delta itself eventually expands the CDF read 
through `CDCReader.changesToBatchDF(...)`, producing normal file scans over 
CDF-aware indexes such as `TahoeChangeFileIndex`, `CdcAddFileIndex`, and 
`TahoeRemoveFileIndex`, with `DeltaParquetFileFormat(isCDCRead = true)`. The 
CDF metadata columns like `_commit_version`, `_commit_timestamp`, and 
`_change_type` are mostly represented through Delta’s generated/partition-style 
columns, while CDC parquet files carry the CDC row data.
   
   So I do not think we need to start by adding a completely new native CDF 
reader in Velox. A cleaner approach may be to add a Gluten Delta planner 
strategy that recognizes `CDCReader.DeltaCDFRelation`, expands it using Delta’s 
internal CDF batch plan, and then lets the existing Gluten/Velox Delta scan 
offload rules apply normally.
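
The ordering argument above can be illustrated with a small toy model. This is only a sketch in plain Python, not Gluten or Delta code: `Node`, `transform_up`, `expand_cdf`, and `offload_scan` are hypothetical stand-ins, and the `Union` over three file scans is an assumed simplification of what Delta's CDF batch plan looks like. It shows why a scan offload rule cannot match while the plan still contains an opaque CDF relation, and why it matches once that relation has been expanded.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Node:
    """Toy plan node (hypothetical; not a Spark or Gluten class)."""
    kind: str
    detail: str = ""
    children: Tuple["Node", ...] = ()


def transform_up(plan: Node, rule: Callable[[Node], Node]) -> Node:
    """Apply `rule` bottom-up: children first, then the node itself."""
    new_children = tuple(transform_up(c, rule) for c in plan.children)
    return rule(Node(plan.kind, plan.detail, new_children))


def expand_cdf(node: Node) -> Node:
    """Stand-in for the proposed strategy: replace the opaque CDF relation
    with ordinary file scans over the CDF-aware indexes named in the comment.
    The Union shape is an assumption made for illustration."""
    if node.kind != "DeltaCDFRelation":
        return node
    indexes = ("TahoeChangeFileIndex", "CdcAddFileIndex", "TahoeRemoveFileIndex")
    return Node("Union", children=tuple(Node("FileScan", i) for i in indexes))


def offload_scan(node: Node) -> Node:
    """Stand-in for the existing offload rule: it only matches file scans."""
    if node.kind == "FileScan":
        return Node("DeltaScanTransformer", node.detail)
    return node


def kinds(plan: Node) -> List[str]:
    """Pre-order list of node kinds, convenient for plan assertions."""
    out = [plan.kind]
    for child in plan.children:
        out.extend(kinds(child))
    return out


query = Node("Project", children=(Node("DeltaCDFRelation", "table_changes"),))

# Offload alone finds nothing to match, so the query would fall back.
without_expansion = transform_up(query, offload_scan)

# Expanding first exposes file scans that the offload rule can rewrite.
with_expansion = transform_up(transform_up(query, expand_cdf), offload_scan)
```

In this model the offload rule itself is unchanged; only the extra expansion pass is new, which is the point of the proposal.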
   
   A good fix should probably include:
   
   - A planner strategy for `DeltaCDFRelation`.
   - Version-specific Delta helper shims, since the CDF APIs differ between 
Delta versions.
   - A regression test covering `table_changes(...)` with insert, update 
preimage/postimage, and delete rows.
   - A plan assertion that the resulting CDF query contains 
`DeltaScanTransformer` instead of falling back fully to Spark.
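
As a rough sketch of what the regression test in the list above would need to check, the following plain-Python model derives the change rows expected from a sequence of commits and provides a string-based plan assertion. Nothing here is Gluten or Delta API: `expected_changes` and `assert_offloaded` are hypothetical helpers, the row layout is illustrative, and the commit numbering assumes one operation per commit starting at version 1. The `_change_type` values are the ones Delta uses for CDF rows.

```python
from typing import Dict, List, Optional, Tuple

# (id, value, _change_type, _commit_version); layout is illustrative only.
ChangeRow = Tuple[int, Optional[str], str, int]
Commit = Tuple[str, int, Optional[str]]  # (operation, id, new value)


def expected_changes(commits: List[Commit]) -> List[ChangeRow]:
    """Build the change feed a CDF read should return for `commits`.

    Assumes one operation per commit, numbered from version 1.
    """
    table: Dict[int, Optional[str]] = {}
    feed: List[ChangeRow] = []
    for version, (op, key, value) in enumerate(commits, start=1):
        if op == "insert":
            table[key] = value
            feed.append((key, value, "insert", version))
        elif op == "update":
            # An update yields two rows: the old image, then the new image.
            feed.append((key, table[key], "update_preimage", version))
            table[key] = value
            feed.append((key, value, "update_postimage", version))
        elif op == "delete":
            feed.append((key, table.pop(key), "delete", version))
        else:
            raise ValueError(f"unknown operation: {op}")
    return feed


def assert_offloaded(plan_text: str, marker: str = "DeltaScanTransformer") -> None:
    """Fail if the executed plan text does not mention the offloaded scan."""
    if marker not in plan_text:
        raise AssertionError(f"expected {marker} in plan, got:\n{plan_text}")


feed = expected_changes([
    ("insert", 1, "a"),
    ("update", 1, "b"),
    ("delete", 1, None),
])
```

A real test would compare the `table_changes(...)` result against rows like these and run the plan assertion on the executed plan, so that a silent fallback to vanilla Spark fails the test even when the results are correct.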
   
   I have put together a prototype in this direction, and it compiles for the 
Spark 3.5 / Delta 3.3 profile. I have not yet validated the full Velox runtime 
test, because that needs the native Gluten/Velox library setup, but the issue 
itself looks real and actionable.
   
   
   cc: @zhztheplayer @zhouyuan Any other ideas?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


