[
https://issues.apache.org/jira/browse/PHOENIX-7456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani resolved PHOENIX-7456.
-----------------------------------
Assignee: Viraj Jasani
Resolution: Resolved
> Change Data Capture for Phoenix Stream
> --------------------------------------
>
> Key: PHOENIX-7456
> URL: https://issues.apache.org/jira/browse/PHOENIX-7456
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> Apache Phoenix provides Change Data Capture (CDC) with PHOENIX-7001. The CDC
> design in Phoenix leverages the write-optimized Uncovered Index as well as
> Max Lookback features. The changes are captured in the time-ordered event of
> row level modifications with an option to provide pre-image and post-image
> with every modification.
> Since the CDC uses an uncovered index on PHOENIX_ROW_TIMESTAMP(), the CDC
> consumer needs to provide time duration for which the records are expected to
> be read from the CDC Index. In this approach, any event of HBase table region
> split does not have any impact on the consumer queries as the index
> regionserver performs the scans on multiple data table regions depending on
> how many regions have been involved with the table data modifications in the
> given time range. Therefore, the consumer does not have the option to consume
> only the given table region (partition) specific change events. It is
> specifically important for the cloud native applications that consume Change
> Stream records to be able to identify how much compute units (memory, CPU, IO
> etc) needs to be allocated according to the number of data table regions
> involved for the given time range. As the region size grows beyond a certain
> limit, HBase considers splitting the given region based on the split policy.
> The default split policy is based on region size growth. For instance,
> regions are not allowed to grow beyond 10 GB by default.
> The proposed solution requires a new framework introducing the Streaming
> concepts for Phoenix CDC. The solution needs to provide one active stream for
> the given table on which the CDC is enabled by the client or consumer.
>
> *Change Stream:*
> Phoenix Stream captures a time-ordered sequence of row-level modifications in
> any table and stores this information in a log for up to TTL window (24 hour
> by default). Client applications can access this log and view the changes
> with an optional support of how the data appeared before and after the row is
> modified, in near-real time.
> *Stream Partitions:*
> Stream records are organized into groups, or partitions. Each partition acts
> as a container for multiple stream records and contains information required
> for accessing and iterating through these records. The stream records within
> a partition are removed automatically after the TTL window.
>
> *Note:* PHOENIX-7425 introduces partition id while generating the index
> mutations for the CDC.
>
> Detailed design doc:
> [https://docs.google.com/document/d/157KXONp4ZoWmhUBsy-lK3LNAanCsuJ3cb7qYTmozgZo/edit?usp=sharing]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)