[ https://issues.apache.org/jira/browse/PHOENIX-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated PHOENIX-7653:
----------------------------------
Description:

The purpose of this Jira is to extend the Change Data Capture (CDC) capabilities to generate CDC events when rows expire due to Time-To-Live (TTL) settings (literal or conditional) during major compaction. The implementation ensures that applications consuming CDC streams are notified when data is automatically removed from tables, providing additional visibility into system-initiated deletions.

The proposed new event_type: *ttl_delete*

Example of a TTL expired CDC event, assuming the row had two columns c1 and c2 with values "v1" and "v2" respectively:
{code:java}
{
  "event_type": "ttl_delete",
  "pre_image": {
    "c1": "v1",
    "c2": "v2"
  },
  "post_image": {}
}
{code}

*High level design steps:*
* Identify the event that causes the row expiration: conditional_ttl, or maxlookback/TTL expired rows.
* Capture the complete row image at expiration time. The image needs to be inserted directly into the CDC index: if the expired row's pre-image is not provided upfront, the CDC index cannot retrieve it after the major compaction, because the data table row no longer exists once the compaction expires it. CompactionScanner needs to write the exact CDC JSON structure as encoded bytes, which the scanner can later return directly to the client when requested.
* CDCGlobalIndexRegionScanner needs to check for the existence of the special CF:CQ; if found, its value can be returned directly as the value of the "CDC JSON" column (see the scanner-side sketch after this list).
* For a single CF, CompactionScanner needs to perform the mutation to the CDC index directly, exactly once.
* For multiple CFs, CompactionScanner might perform multiple mutations to the CDC index. It should therefore use checkAndMutate to ensure the mutation happens only if the row does not already exist. If the row has already been inserted and another CF's compaction tries to put more recent column values, it can update the existing pre-image instead (see the compaction-side sketch after this list).
* In order to disambiguate the same PHOENIX_ROW_TIMESTAMP() value for the CDC index while multiple CF compactions are taking place, CompactionScanner needs to provide compactionTime as the timestamp value in the CDC index rowkey, updating the rowkey before performing the mutation.
* Introduce retries in case of HTable mutation failures.
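Below is a rough compaction-side sketch of the checkAndMutate flow described above, assuming HBase 2.4+ client APIs. The class, method, and column names (TtlDeleteCdcEventSketch, writeTtlDeleteEvent, CDC_EVENT_CF, CDC_JSON_CQ) are hypothetical placeholders rather than Phoenix internals, Jackson stands in for whatever JSON encoding Phoenix actually uses, and the CDC index rowkey is assumed to have already been rewritten with compactionTime by the caller:
{code:java}
// Rough sketch only. writeTtlDeleteEvent, CDC_EVENT_CF and CDC_JSON_CQ are
// hypothetical names, not existing Phoenix internals; the CDC index rowkey is
// assumed to have been rewritten by the caller to carry compactionTime.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.CheckAndMutate;
import org.apache.hadoop.hbase.client.CheckAndMutateResult;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import com.fasterxml.jackson.databind.ObjectMapper;

public class TtlDeleteCdcEventSketch {

  private static final byte[] CDC_EVENT_CF = Bytes.toBytes("0");        // placeholder CF
  private static final byte[] CDC_JSON_CQ  = Bytes.toBytes("CDC_JSON"); // placeholder CQ
  private static final ObjectMapper MAPPER = new ObjectMapper();
  private static final int MAX_RETRIES = 3;

  /**
   * Called from the compaction path once a data row is found to be expired.
   * Writes the pre-built "ttl_delete" CDC JSON into the CDC index so that the
   * event stays readable after the data table row has been purged.
   */
  static void writeTtlDeleteEvent(Table cdcIndexTable, byte[] cdcIndexRowKey,
      Map<String, Object> preImage, long compactionTime) throws IOException {
    Map<String, Object> event = new HashMap<>();
    event.put("event_type", "ttl_delete");
    event.put("pre_image", preImage);
    event.put("post_image", new HashMap<>());
    byte[] cdcJson = MAPPER.writeValueAsBytes(event);

    // compactionTime doubles as the cell timestamp so retries stay idempotent.
    Put put = new Put(cdcIndexRowKey);
    put.addColumn(CDC_EVENT_CF, CDC_JSON_CQ, compactionTime, cdcJson);

    IOException lastFailure = null;
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
      try {
        // Only the first column family compaction inserts the row; later CF
        // compactions see ifNotExists fail and update the existing pre-image.
        CheckAndMutateResult result = cdcIndexTable.checkAndMutate(
            CheckAndMutate.newBuilder(cdcIndexRowKey)
                .ifNotExists(CDC_EVENT_CF, CDC_JSON_CQ)
                .build(put));
        if (!result.isSuccess()) {
          // Row already written by another CF's compaction: merge the newer
          // column values into the stored pre-image (read-modify-write elided
          // here) and overwrite the JSON cell.
          cdcIndexTable.put(put);
        }
        return;
      } catch (IOException e) {
        lastFailure = e; // simple bounded retry for HTable mutation failures
      }
    }
    throw lastFailure;
  }
}
{code}
The ifNotExists guard lets the first CF compaction create the CDC index row while later CF compactions fall through and update the stored pre-image; bounding the retry loop keeps a flaky CDC index write from stalling the compaction indefinitely.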
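And a minimal scanner-side sketch of the lookup CDCGlobalIndexRegionScanner could perform, using the same placeholder CF:CQ names as above (the actual projection of the "CDC JSON" column would go through Phoenix's existing CDC result construction):
{code:java}
// Rough sketch only. The CF/CQ constants are the same placeholders as in the
// compaction-side sketch above, not existing Phoenix internals.
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.util.Bytes;

public final class PrebuiltCdcJsonLookup {

  private static final byte[] CDC_EVENT_CF = Bytes.toBytes("0");
  private static final byte[] CDC_JSON_CQ  = Bytes.toBytes("CDC_JSON");

  /**
   * Returns the CDC JSON that the compaction wrote for a TTL expired row, or
   * null when no such cell exists and the CDC image has to be assembled from
   * the data table as usual.
   */
  static byte[] findPrebuiltCdcJson(List<Cell> indexRowCells) {
    for (Cell cell : indexRowCells) {
      if (CellUtil.matchingFamily(cell, CDC_EVENT_CF)
          && CellUtil.matchingQualifier(cell, CDC_JSON_CQ)) {
        // Served directly as the value of the "CDC JSON" column.
        return CellUtil.cloneValue(cell);
      }
    }
    return null;
  }
}
{code}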
> New CDC Event for TTL expired rows
> ----------------------------------
>
> Key: PHOENIX-7653
> URL: https://issues.apache.org/jira/browse/PHOENIX-7653
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 5.3.0
--
This message was sent by Atlassian Jira
(v8.20.10#820010)