[ 
https://issues.apache.org/jira/browse/PHOENIX-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned PHOENIX-7878:
-------------------------------------

    Assignee: Viraj Jasani

> CDC perf improvement - skip redundant cell versions on data table scans
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-7878
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7878
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>
> When a CDC query runs with pre, post, and/or change scopes, it scans the data 
> table to reconstruct each change event (the change image plus the pre-image, 
> and for the consumer path the full data-row state). Today that data table 
> scan is a raw, all-versions scan, so for every data row we read back _every_ 
> version of _every_ column - even though, for a given batch of changes, we 
> only need two cells per column per change: the cell at the change timestamp, 
> and the most recent cell just below it (the pre-image). On rows that are 
> updated frequently this means we read, transfer, and process far more cells 
> than the event reconstruction actually uses, which adds CPU, memory, and 
> network overhead to CDC reads.
> The purpose of this Jira is to add a new CDCVersionFilter, in addition to 
> the SkipScanFilter, on the data table scans. For each row, the filter is 
> given the set of change timestamps from the current batch and keeps only the 
> cells that matter: the cell at each change timestamp, the first cell below 
> each change timestamp (the pre-image), and all DeleteFamily markers (needed 
> for deletion tracking). All other cells are skipped to avoid redundant data 
> transfer.
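
The selection rule described above can be sketched for the versions of a single column as follows. This is only an illustration under stated assumptions, not the actual CDCVersionFilter: CellVersion and keptTimestamps are hypothetical names, the cells are assumed to arrive newest-first as in a raw all-versions scan, and DeleteFamily markers are modeled as a simple flag rather than as HBase cells.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class CdcVersionSelectionSketch {

    /** Hypothetical, simplified stand-in for one cell version; not an HBase type. */
    public static final class CellVersion {
        public final long timestamp;
        public final boolean deleteFamily;

        public CellVersion(long timestamp, boolean deleteFamily) {
            this.timestamp = timestamp;
            this.deleteFamily = deleteFamily;
        }
    }

    /**
     * Returns the timestamps of the cells that would be kept.
     * Assumption: newestFirst is sorted by descending timestamp.
     */
    public static List<Long> keptTimestamps(List<CellVersion> newestFirst,
                                            NavigableSet<Long> changeTimestamps) {
        List<Long> kept = new ArrayList<>();
        Long newerTs = null; // timestamp of the previous (newer) non-marker cell
        for (CellVersion cell : newestFirst) {
            if (cell.deleteFamily) {
                // DeleteFamily markers are always kept for deletion tracking.
                kept.add(cell.timestamp);
                continue;
            }
            boolean atChange = changeTimestamps.contains(cell.timestamp);
            // Smallest change timestamp strictly above this cell, if any.
            Long nextChange = changeTimestamps.higher(cell.timestamp);
            // The cell is a pre-image if no newer cell lies below that change timestamp.
            boolean preImage = nextChange != null
                    && (newerTs == null || newerTs >= nextChange);
            if (atChange || preImage) {
                kept.add(cell.timestamp);
            }
            newerTs = cell.timestamp;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<CellVersion> cells = Arrays.asList(
                new CellVersion(100L, false), new CellVersion(90L, false),
                new CellVersion(80L, false), new CellVersion(75L, true),
                new CellVersion(70L, false), new CellVersion(60L, false),
                new CellVersion(50L, false));
        NavigableSet<Long> changes = new TreeSet<>(Arrays.asList(90L, 60L));
        System.out.println(keptTimestamps(cells, changes));
    }
}
```

In this example the versions at 100 and 70 are skipped: neither sits at a change timestamp, and neither is the first version below one.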



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
