[ 
https://issues.apache.org/jira/browse/PHOENIX-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated PHOENIX-7653:
----------------------------------
    Description: 
The purpose of this Jira is to extend the Change Data Capture (CDC) 
capabilities to generate CDC events when rows expire due to Time-To-Live (TTL) 
settings (literal or conditional) during major compaction. The implementation 
ensures that applications consuming CDC streams receive notifications when 
data is automatically removed from tables, providing additional visibility 
into system-initiated deletions.

The proposed new event_type: *ttl_delete*

Example of TTL expired CDC event, assuming the row had two columns c1 and c2 
with values "v1" and "v2" respectively:
{code:json}
{
  "event_type": "ttl_delete",
  "pre_image": {
    "c1": "v1",
    "c2": "v2"
  },
  "post_image": {}
} {code}
 

*High level Design steps:*
 * Identify the event that causes the row expiration: conditional TTL, or rows 
expired by literal TTL/max lookback.
 * Capture the complete row image at expiration. The image needs to be 
inserted directly into the CDC index: if the expired row's pre-image is not 
provided upfront, the CDC index cannot recover it after the major compaction, 
because the data table row no longer exists once it is expired by the 
compaction. CompactionScanner needs to store the exact CDC JSON structure as 
encoded bytes, which the scanner can later return directly to the client when 
requested.
 * CDCGlobalIndexRegionScanner needs to check for the existence of the special 
CF:CQ; if found, its value can be returned directly as the value of the "CDC 
JSON" column.
 * For a single CF, CompactionScanner needs to perform only one direct 
mutation to the CDC index.
 * For multiple CFs, CompactionScanner might perform multiple mutations to the 
CDC index. Therefore, it should use checkAndMutate to ensure the mutation 
happens only if the row does not already exist. If the row has already been 
inserted and another CF's compaction tries to put recent row values, it can 
update the existing pre-image instead.
 * To distinguish identical PHOENIX_ROW_TIMESTAMP() values for the CDC index 
while multiple CF compactions are taking place, CompactionScanner needs to use 
compactionTime as the timestamp value in the CDC index rowkey, updating the 
rowkey before performing the mutation.
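
As a sketch of the pre-encoding step described above (class and method names here are illustrative, not the actual Phoenix implementation), the ttl_delete event can be serialized once at compaction time and stored as bytes in the CDC index:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: Phoenix's real CDC encoding lives in the
// server-side compaction path; this just shows the shape of a pre-encoded
// ttl_delete event with a full pre-image and an empty post-image.
public class TtlDeleteEvent {

    // Build the CDC JSON for an expired row.
    static String buildJson(Map<String, String> preImage) {
        StringBuilder sb =
            new StringBuilder("{\"event_type\":\"ttl_delete\",\"pre_image\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : preImage.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append("},\"post_image\":{}}").toString();
    }

    // The encoded bytes are what would be written into the CDC index cell,
    // so the scanner can return them verbatim without re-reading the data row.
    static byte[] encode(Map<String, String> preImage) {
        return buildJson(preImage).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("c1", "v1");
        row.put("c2", "v2");
        System.out.println(buildJson(row));
    }
}
{code}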
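
The multi-CF merge semantics above can be modeled as follows (illustrative only: in HBase terms the first CF's compaction would insert the row guarded by checkAndMutate, and later CF compactions would fold their columns into the stored pre-image; here that is simulated with plain maps):
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of the checkAndMutate-style merge across CF compactions.
public class PreImageMerge {

    // existing == null models "the CDC index row does not exist yet".
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> cfColumns) {
        Map<String, String> out =
            existing == null ? new LinkedHashMap<>() : new LinkedHashMap<>(existing);
        out.putAll(cfColumns); // a later CF compaction updates/extends the pre-image
        return out;
    }

    public static void main(String[] args) {
        // First CF compaction seeds the pre-image; second CF merges into it.
        Map<String, String> first = merge(null, Map.of("a:c1", "v1"));
        Map<String, String> second = merge(first, Map.of("b:c2", "v2"));
        System.out.println(second);
    }
}
{code}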
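
The rowkey-timestamp step could look roughly like this. This assumes a hypothetical CDC index rowkey layout that ends with an 8-byte big-endian timestamp slot; the real Phoenix rowkey encoding differs, but the idea of overwriting the timestamp with compactionTime so that concurrent per-CF compactions land on the same CDC index row is the same:
{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: overwrite the trailing timestamp slot of a rowkey
// with compactionTime before performing the CDC index mutation.
public class CdcRowKey {

    static byte[] withTimestamp(byte[] rowKey, long compactionTime) {
        byte[] out = Arrays.copyOf(rowKey, rowKey.length);
        ByteBuffer.wrap(out, out.length - Long.BYTES, Long.BYTES)
                  .putLong(compactionTime);
        return out;
    }

    public static void main(String[] args) {
        byte[] key = new byte[12]; // hypothetical: 4-byte prefix + 8-byte timestamp
        byte[] updated = CdcRowKey.withTimestamp(key, 1700000000000L);
        System.out.println(ByteBuffer.wrap(updated, 4, 8).getLong());
    }
}
{code}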



> New CDC Event for TTL expired rows
> ----------------------------------
>
>                 Key: PHOENIX-7653
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7653
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
