danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r966599692
########## rfc/rfc-51/rfc-51.md:
##########
@@ -215,18 +245,31 @@ Note:

 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work to get the change data. The following illustration explains the difference when this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute the changed data. The following illustrates the difference.

 ![](read_cdc_log_file.jpg)

 #### COW table

-Reading COW table in CDC query mode is equivalent to reading a simplified MOR table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO mode.

 #### MOR table

-According to the design of the writing part, only the cases where writing mor tables will write out the base file (which call the `HoodieMergeHandle` and it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log files and the corresponding base files (current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC data and the corresponding base files (current and previous file slices).
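To make the two proposed query modes concrete, here is a sketch (not part of the PR) of how they might be invoked from Spark. Note that `hoodie.datasource.query.incremental.type` and its `snapshot`/`read_optimized` values are proposals in this RFC, and the table path and begin-instant value are hypothetical; the option names in a released Hudi version may differ.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdc-query-sketch")
  .getOrCreate()

// Fresher, compute-heavier CDC results: results are computed in-flight by
// merging log files with the current and previous base file slices.
val snapshotCdc = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.query.incremental.type", "snapshot") // proposed in this RFC
  .option("hoodie.datasource.read.begin.instanttime", "20220901000000") // hypothetical instant
  .load("/tmp/hudi/cdc_table") // hypothetical table path

// Cheaper, higher-latency CDC results: only persisted CDC data and the
// corresponding base file slices are read.
val readOptimizedCdc = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.query.incremental.type", "read_optimized") // proposed in this RFC
  .option("hoodie.datasource.read.begin.instanttime", "20220901000000")
  .load("/tmp/hudi/cdc_table")
```

The two reads differ only in `hoodie.datasource.query.incremental.type`, which matches the RFC's framing of the choice as a freshness-versus-compute-cost trade-off rather than a different API.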
Review Comment:
> can you pls clarify what cases exactly represent writer anomalies

I'm a little worried about the correctness of the CDC semantics before this is widely used in production, e.g. around deletion, merging, and data sequencing.

> 2nd step to support read_optimized

We can call it an optimization, but it should not be exposed as `read_optimized`, which conflicts a little with the current read-optimized view. WDYT?