[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

GitBox Thu, 01 Sep 2022 00:44:01 -0700


prasannarajaperumal commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r960323342



##########
rfc/rfc-51/rfc-51.md:
##########
@@ -215,18 +245,31 @@ Note:
 
 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark 
entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work 
to get the change data. The following illustration explains the difference when 
this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute 
the changed data. The following illustrates the difference.
 
 ![](read_cdc_log_file.jpg)
 
 #### COW table
 
-Reading COW table in CDC query mode is equivalent to reading a simplified MOR 
table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO 
mode.
 
 #### MOR table
 
-According to the design of the writing part, only the cases where writing mor 
tables will write out the base file (which call the `HoodieMergeHandle` and 
it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size 
reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon 
base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log 
files and the corresponding base files (
+  current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC 
results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC 
data and the corresponding base files (
+  current and previous file slices).

Review Comment:
   There are indexing choices that result in eventual CDC stream freshness, so 
we could defer the implementation of constructing CDC on read to a later point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
