[ 
https://issues.apache.org/jira/browse/PHOENIX-7456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved PHOENIX-7456.
-----------------------------------
      Assignee: Viraj Jasani
    Resolution: Resolved

> Change Data Capture for Phoenix Stream
> --------------------------------------
>
>                 Key: PHOENIX-7456
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7456
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>
> Apache Phoenix provides Change Data Capture (CDC) with PHOENIX-7001. The CDC 
> design in Phoenix leverages the write-optimized Uncovered Index as well as 
> Max Lookback features. The changes are captured in the time-ordered event of 
> row level modifications with an option to provide pre-image and post-image 
> with every modification.
> Since the CDC uses an uncovered index on PHOENIX_ROW_TIMESTAMP(), the CDC 
> consumer needs to provide time duration for which the records are expected to 
> be read from the CDC Index. In this approach, any event of HBase table region 
> split does not have any impact on the consumer queries as the index 
> regionserver performs the scans on multiple data table regions depending on 
> how many regions have been involved with the table data modifications in the 
> given time range. Therefore, the consumer does not have the option to consume 
> only the given table region (partition) specific change events. It is 
> specifically important for the cloud native applications that consume Change 
> Stream records to be able to identify how much compute units (memory, CPU, IO 
> etc) needs to be allocated according to the number of data table regions 
> involved for the given time range. As the region size grows beyond a certain 
> limit, HBase considers splitting the given region based on the split policy. 
> The default split policy is based on region size growth. For instance, 
> regions are not allowed to grow beyond 10 GB by default.
> The proposed solution requires a new framework introducing the Streaming 
> concepts for Phoenix CDC. The solution needs to provide one active stream for 
> the given table on which the CDC is enabled by the client or consumer.
>  
> *Change Stream:*
> Phoenix Stream captures a time-ordered sequence of row-level modifications in 
> any table and stores this information in a log for up to TTL window (24 hour 
> by default). Client applications can access this log and view the changes 
> with an optional support of how the data appeared before and after the row is 
> modified, in near-real time.
> *Stream Partitions:*
> Stream records are organized into groups, or partitions. Each partition acts 
> as a container for multiple stream records and contains information required 
> for accessing and iterating through these records. The stream records within 
> a partition are removed automatically after the TTL window.
>  
> *Note:* PHOENIX-7425 introduces partition id while generating the index 
> mutations for the CDC.
>  
> Detailed design doc: 
> [https://docs.google.com/document/d/157KXONp4ZoWmhUBsy-lK3LNAanCsuJ3cb7qYTmozgZo/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to