virajjasani opened a new pull request, #2209:
URL: https://github.com/apache/phoenix/pull/2209
Jira: PHOENIX-7653
The purpose of this PR is to extend the Change Data Capture (CDC)
capabilities to generate CDC events when rows expire due to Time-To-Live (TTL)
settings (literal or conditional) during major compaction. The
implementation ensures that applications consuming CDC streams are
notified when data is automatically removed from tables, providing
additional visibility into system-initiated deletions.
The proposed new event_type: ttl_delete
Example of TTL expired CDC event, assuming the row had two columns c1 and c2
with values "v1" and "v2" respectively:
```
{
  "event_type": "ttl_delete",
  "pre_image": {
    "c1": "v1",
    "c2": "v2"
  },
  "post_image": {}
}
```
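As a rough illustration of the event shape above, the following Java sketch assembles a `ttl_delete` payload from a row's pre-image. The class and method names (`TtlDeleteEventSketch`, `toJson`, `quote`) are hypothetical and not part of this PR, and the escaping only handles quotes and backslashes; the actual change may build and encode the JSON differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not Phoenix code: builds the ttl_delete event as a JSON string.
final class TtlDeleteEventSketch {

    // Minimal escaping for this sketch: backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // pre_image carries the expired row's columns; post_image is always empty.
    static String toJson(Map<String, String> preImage) {
        StringJoiner cols = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : preImage.entrySet()) {
            cols.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return "{\"event_type\":\"ttl_delete\",\"pre_image\":" + cols
                + ",\"post_image\":{}}";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the column order stable in the output.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("c1", "v1");
        row.put("c2", "v2");
        System.out.println(toJson(row));
    }
}
```

A consumer that already handles other CDC event types could branch on `event_type` and treat `ttl_delete` like a delete whose `pre_image` is populated and whose `post_image` is empty.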
High-level design steps:
- Identify the event which causes the row expiration: conditional_ttl,
maxlookback/ttl expired rows
- Capture the complete row image at expiration. The image needs to be
inserted directly into the CDC index. If the expired row's pre-image is not
provided upfront, the CDC index cannot scan it after the major compaction,
because the data table row no longer exists once the major compaction has
expired it. CompactionScanner needs to send the exact CDC JSON structure with
encoded bytes, which the scanner can later return directly to the client
when requested.
- CDCGlobalIndexRegionScanner needs to check for the existence of the
special CF:CQ; if it is found, its value can be returned directly as the value
of the "CDC JSON" column.
- For a single CF, CompactionScanner needs to mutate the CDC index directly,
and only once.
- For multiple CFs, CompactionScanner might perform multiple mutations to the
CDC index. Therefore, it should use checkAndMutate to ensure the insert happens
only if the row does not already exist. If the row has already been inserted
and another CF's compaction tries to put more recent row values, it can update
the existing pre-image.
- In order to distinguish the same PHOENIX_ROW_TIMESTAMP() value for the CDC
index while multiple CF compactions are taking place, CompactionScanner needs
to provide compactionTime as the timestamp value in the CDC index rowkey by
updating the rowkey before performing the mutation.
- Introduce some retries in case of HTable mutation failures.
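
The multi-CF and retry steps above can be sketched with an in-memory stand-in for the CDC index. This is only a model under stated assumptions: `CdcIndexSketch`, `recordExpiredColumns`, and `withRetries` are hypothetical names, a `ConcurrentHashMap` replaces the real index table, and its atomic `merge` plays the role that checkAndMutate would play against HBase. How the index rowkey is derived (including the use of compactionTime) is left to the caller here.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory model, not Phoenix or HBase code.
final class CdcIndexSketch {

    // Stand-in for the CDC index: index rowkey -> pre-image columns.
    private final ConcurrentHashMap<String, Map<String, String>> index =
            new ConcurrentHashMap<>();

    // Insert the pre-image if the index row is absent; otherwise fold the
    // columns from another CF's compaction into the existing pre-image.
    // ConcurrentHashMap.merge performs this atomically per key.
    void recordExpiredColumns(String indexRowKey, Map<String, String> cfColumns) {
        index.merge(indexRowKey, new TreeMap<>(cfColumns), (existing, incoming) -> {
            Map<String, String> merged = new TreeMap<>(existing);
            merged.putAll(incoming);
            return merged;
        });
    }

    Map<String, String> preImage(String indexRowKey) {
        return index.get(indexRowKey);
    }

    // Simple bounded retry without backoff; rethrows the last failure.
    static <T> T withRetries(int maxAttempts, Supplier<T> op) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        CdcIndexSketch sketch = new CdcIndexSketch();
        // Two column-family compactions report columns for the same index row.
        withRetries(3, () -> { sketch.recordExpiredColumns("row1", Map.of("c1", "v1")); return null; });
        withRetries(3, () -> { sketch.recordExpiredColumns("row1", Map.of("c2", "v2")); return null; });
        System.out.println(sketch.preImage("row1"));
    }
}
```

A real implementation would likely also need backoff between attempts and a decision on what to do when retries are exhausted during compaction; neither is modelled in this sketch.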