[GitHub] [hudi] xushiyan commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

GitBox Tue, 06 Sep 2022 15:03:24 -0700


xushiyan commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r964213596



##########
rfc/rfc-51/rfc-51.md:
##########
@@ -64,71 +65,74 @@ We follow the debezium output format: four columns as shown below
 
 Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.datasource.query.incremental.format=cdc` and `hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for incremental query. |
+| hoodie.datasource.query.incremental.format | `latest_state` | `latest_state` (current incremental query behavior) returns the latest records' values. Set to `cdc` to return the full CDC results. |
+| hoodie.datasource.read.start.timestamp | - | requried. |
+| hoodie.datasource.read.end.timestamp | - | optional. |

Review Comment:
   These are not new configs; we're referring to existing configs here:
   
   - https://hudi.apache.org/docs/configurations#hoodiedatasourcereadbegininstanttime
   - https://hudi.apache.org/docs/configurations#hoodiedatasourcereadendinstanttime (the docs should be updated to say optional instead of required)
   
   I'll also update the config keys to use "begin.instanttime" and "end.instanttime".
   
   As for whether the begin instant time should have a default value, I'm also inclined to keep it required and leave the user behavior unchanged. Setting a default, such as the latest time, might improve the UX, but that's a separate topic from this RFC.
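
For illustration only, here is a minimal sketch of how a reader might assemble the options for a CDC query using the existing instant-time keys linked above. `cdc_read_options` is a hypothetical helper, not part of Hudi, and treating the begin instant time as required simply mirrors the behavior favored in this comment.

```python
from typing import Dict, Optional


def cdc_read_options(begin_instant: str,
                     end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build read options for an incremental query in CDC format.

    Hypothetical helper for illustration; only the option keys come from
    the RFC and the existing Hudi configuration docs.
    """
    # Begin instant time stays required, keeping user behavior unchanged.
    if not begin_instant:
        raise ValueError("begin instant time is required")

    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.query.incremental.format": "cdc",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    # End instant time is optional, so it is only set when provided.
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts
```

The resulting dict could then be passed to a Spark reader, for example `spark.read.format("hudi").options(**opts).load(base_path)`, where `base_path` is assumed to be the table location.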
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
