[GitHub] [hudi] xushiyan commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

GitBox Tue, 06 Sep 2022 14:31:38 -0700


xushiyan commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r964188068



##########
rfc/rfc-51/rfc-51.md:
##########
@@ -148,20 +152,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is `hoodie.table.cdc.supplemental.logging`. See the description of this config above.
+#### Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+There are two design choices for when to persist CDC data in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest point; however, with the Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing time.
+
+Write-on-compaction can always persist CDC info and standardizes the implementation logic across engines;
+however, it adds some delay to the CDC query results. Depending on the business requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.
 
-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the corresponding CDC data is also persisted.
+
+- For Spark
+  - inserts are written to base files: the CDC data `op=I` will be persisted
+  - updates/deletes that are written to log files are compacted into base files: the CDC data `op=U|D` will be persisted
+- For Flink
+  - inserts/updates/deletes that are written to log files are compacted into base files: the CDC data `op=I|U|D` will be
+    persisted
+
+In summary, we propose that CDC data be persisted synchronously upon base file generation. It is therefore
+write-on-indexing for Spark inserts (non-bucket index) and write-on-compaction for everything else.
+
+Note that it may also be necessary to provide capabilities for asynchronously persisting CDC data, in the form of a
+separate table service like `ChangeTrackingService`, which can be scheduled to fine-tune the CDC-persisting timings.

Review Comment:
   updated.
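
As a rough illustration of the semantics summarized in the diff above (this is not code from the PR or from Hudi; `PersistTiming` and `cdc_persist_timing` are hypothetical names), the timing rule could be sketched in Python as:

```python
from enum import Enum


class PersistTiming(Enum):
    """Hypothetical labels for the two design choices named in the RFC text."""
    WRITE_ON_INDEXING = "write-on-indexing"
    WRITE_ON_COMPACTION = "write-on-compaction"


def cdc_persist_timing(engine: str, op: str, bucket_index: bool = False) -> PersistTiming:
    """Return when CDC data for one change would be persisted in a MOR table.

    Assumption: this only mirrors the summary sentence of the RFC text;
    it does not model Hudi's actual writer or compaction code paths.
    """
    if op not in ("I", "U", "D"):
        raise ValueError(f"unknown CDC op: {op!r}")
    # Spark inserts are written to base files directly, so `op=I` is known
    # at indexing time, except when bucket indexing is used.
    if engine.lower() == "spark" and op == "I" and not bucket_index:
        return PersistTiming.WRITE_ON_INDEXING
    # Everything else (Spark updates/deletes, all Flink changes, bucket index)
    # reaches base files only through compaction.
    return PersistTiming.WRITE_ON_COMPACTION
```

For example, `cdc_persist_timing("flink", "I")` falls through to write-on-compaction, which matches the Flink bullet in the diff.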



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
