virajjasani opened a new pull request, #2493:
URL: https://github.com/apache/phoenix/pull/2493

   Jira: PHOENIX-7878
   
   ### What changes were proposed in this pull request?
   CDC perf improvement - skip redundant cell versions on data table scans
   
   ### Why are the changes needed?
   When a CDC query runs with pre, post, and/or change scopes, it scans the 
data table to reconstruct each change event (the change image plus the 
pre-image, and for the consumer path the full data-row state). Today that data 
table scan is a raw, all-versions scan, so for every data row we read back 
every version of every column - even though, for a given batch of changes, we 
only need two cells per column per change: the cell at the change timestamp, 
and the most recent cell just below it (the pre-image). On rows that are 
updated frequently this means we read, transfer, and process far more cells 
than the event reconstruction actually uses, which adds CPU, memory, and 
network overhead to CDC reads.
   
   The purpose of this Jira is to add a new CDCVersionFilter, in addition to 
SkipScanFilter, on the data table scans. For each row, the filter is given the 
set of change timestamps from the current batch and keeps only the cells that 
matter: the cell at each change timestamp, the first cell below each change 
timestamp (the pre-image), and all DeleteFamily markers (needed for deletion 
tracking). All other cells are skipped to avoid redundant data transfer.
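
The selection rule described above can be illustrated with a minimal, self-contained sketch. This is not the CDCVersionFilter code from the patch and it does not use the HBase Filter API: `SimpleCell` and `select` are hypothetical names introduced only for illustration, and the sketch assumes the versions of each column arrive newest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class CdcVersionSelectionSketch {

    /** Hypothetical, simplified stand-in for an HBase cell (illustration only). */
    static final class SimpleCell {
        final String column;
        final long timestamp;
        final boolean deleteFamily;

        SimpleCell(String column, long timestamp, boolean deleteFamily) {
            this.column = column;
            this.timestamp = timestamp;
            this.deleteFamily = deleteFamily;
        }

        @Override
        public String toString() {
            return (deleteFamily ? "DeleteFamily" : column) + "@" + timestamp;
        }
    }

    /**
     * Keeps, for one row: every DeleteFamily marker, every cell whose timestamp
     * is a change timestamp, and the first cell below each change timestamp.
     * Assumes the versions of each column are listed newest first.
     */
    static List<SimpleCell> select(List<SimpleCell> rowCells, TreeSet<Long> changeTimestamps) {
        List<SimpleCell> kept = new ArrayList<>();
        // Timestamp of the previously seen (newer) version of each column.
        Map<String, Long> newerVersionByColumn = new HashMap<>();
        for (SimpleCell cell : rowCells) {
            if (cell.deleteFamily) {
                kept.add(cell); // always needed for deletion tracking
                continue;
            }
            Long newer = newerVersionByColumn.put(cell.column, cell.timestamp);
            // Smallest change timestamp strictly above this cell, if any.
            Long changeAbove = changeTimestamps.higher(cell.timestamp);
            boolean atChange = changeTimestamps.contains(cell.timestamp);
            // Pre-image: no version of this column lies between this cell and changeAbove.
            boolean preImage = changeAbove != null && (newer == null || newer >= changeAbove);
            if (atChange || preImage) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleCell> row = Arrays.asList(
            new SimpleCell("", 25L, true),
            new SimpleCell("A", 50L, false), new SimpleCell("A", 40L, false),
            new SimpleCell("A", 30L, false), new SimpleCell("A", 20L, false),
            new SimpleCell("A", 10L, false),
            new SimpleCell("B", 45L, false), new SimpleCell("B", 35L, false),
            new SimpleCell("B", 5L, false));
        TreeSet<Long> changes = new TreeSet<>(Arrays.asList(40L));
        System.out.println(select(row, changes));
    }
}
```

With a single change timestamp of 40, the sketch keeps 4 of the 9 cells: the DeleteFamily marker, column A at 40 (the change) and at 30 (its pre-image), and column B at 35 (the pre-image for a column that was not written at 40).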
   
   ### Does this PR introduce _any_ user-facing change?
   This is a performance improvement.
   
   ### How was this patch tested?
   Unit tests (UT) and integration tests (IT).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Claude Opus 4.8

