prasannarajaperumal commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r960323342
##########
rfc/rfc-51/rfc-51.md:
##########
@@ -215,18 +245,31 @@ Note:
 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work to get the change data. The following illustration explains the difference when this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute the changed data. The following illustrates the difference.

 ![](read_cdc_log_file.jpg)

 #### COW table

-Reading COW table in CDC query mode is equivalent to reading a simplified MOR table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO mode.

 #### MOR table

-According to the design of the writing part, only the cases where writing mor tables will write out the base file (which call the `HoodieMergeHandle` and it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log files and the corresponding base files (
+  current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC data and the corresponding base files (
+  current and previous file slices).
Review Comment:
   Some indexing choices result in eventual freshness of the CDC stream, and we could defer the implementation of constructing CDC on read to a later point.