danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r966599692
########## rfc/rfc-51/rfc-51.md:
##########
@@ -215,18 +245,31 @@ Note:

 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work to get the change data. The following illustration explains the difference when this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute the changed data. The following illustrates the difference.

 ![](read_cdc_log_file.jpg)

 #### COW table

-Reading COW table in CDC query mode is equivalent to reading a simplified MOR table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO mode.

 #### MOR table

-According to the design of the writing part, only the cases where writing mor tables will write out the base file (which call the `HoodieMergeHandle` and it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log files and the corresponding base files (current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC data and the corresponding base files (current and previous file slices).
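To make the two proposed query modes concrete, here is a sketch (not part of the PR) of how they might be invoked from Spark. Note that `hoodie.datasource.query.incremental.type` and its `snapshot`/`read_optimized` values are proposals in this RFC, and the table path and begin-instant value are hypothetical; the option names in a released Hudi version may differ.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdc-query-sketch")
  .getOrCreate()

// Fresher, compute-heavier CDC results: results are computed in-flight by
// merging log files with the current and previous base file slices.
val snapshotCdc = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.query.incremental.type", "snapshot") // proposed in this RFC
  .option("hoodie.datasource.read.begin.instanttime", "20220901000000") // hypothetical instant
  .load("/tmp/hudi/cdc_table") // hypothetical table path

// Cheaper, higher-latency CDC results: only persisted CDC data and the
// corresponding base file slices are read.
val readOptimizedCdc = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.query.incremental.type", "read_optimized") // proposed in this RFC
  .option("hoodie.datasource.read.begin.instanttime", "20220901000000")
  .load("/tmp/hudi/cdc_table")
```

The two reads differ only in `hoodie.datasource.query.incremental.type`, which matches the RFC's framing of the choice as a freshness-versus-compute-cost trade-off rather than a different API.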
Review Comment:
> can you pls clarify what cases exactly represent writer anomalies

I'm a little worried about the correctness of the CDC semantics before this is widely used in production, e.g. around deletion, merging, and data sequencing.

> 2nd step to support read_optimized

We can call it an optimization, but it should not be exposed as `read_optimized`, which conflicts a little with the current read-optimized view. WDYT?