[GitHub] [hudi] xushiyan commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

GitBox Sun, 14 Aug 2022 00:37:58 -0700


xushiyan commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r945243884



##########
rfc/rfc-51/rfc-51.md:
##########
@@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key                                                 | default  | description                                                                                                                                      |
+|-----------------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.table.cdc.enabled                            | `false`  | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly.                    |
+| hoodie.table.cdc.supplemental.logging               | `false`  | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false`  | If `true`, persist `after` as well.                                                                                                              |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enabled=true` and `hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key                                    | default    | description                          |
+|----------------------------------------|------------|--------------------------------------|
+| hoodie.table.cdc.enabled               | `false`    | set to `true` for CDC queries        |
+| hoodie.datasource.query.type           | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | -          | required.                            |
+| hoodie.datasource.read.end.timestamp   | -          | optional.                            |
 
-![](points.jpg)
+### Logical File Types
 
-### Config Definitions
+We define 4 logical file types for the CDC scenario.
 
-Define a new config:
+- CDC_LOG_File: a file consisting of CDC blocks that hold the changed data related to one commit.
+  - For COW tables, this file type refers to newly written log files alongside 
base files. The log files in this case only contain CDC info.
+  - For MOR tables, this file type refers to the typical log files in MOR 
tables. CDC info will be persisted as log blocks in the log files.
+- ADD_BASE_File: a normal base file for a specified instant and a specified 
file group. All the data in this file are new-incoming. For example, we first 
write data to a new file group. So we can load this file, treat each record in 
this as the value of `after`, and the value of `op` of each record is `i`.
+- REMOVE_BASE_FILE: a normal base file for a specified instant and a specified 
file group, but this file is empty. A file like this will be generated when we 
delete all the data in a file group. So we need to find the previous version of 
the file group, load it, treat each record in this as the value of `before`, 
and the value of `op` of each record is `d`.
+- REPLACED_FILE_GROUP: a file group that is replaced entirely, as with `DELETE_PARTITION` and `INSERT_OVERWRITE` operations. We load this file group, treat each record as the value of `before`, and the value of `op` of each record is `d`.
 
-| key | default | description |
-| --- | --- | --- |
-| hoodie.table.cdc.enabled | false | `true` represents the table to be used 
for CDC queries and will write cdc data if needed. |
-| hoodie.table.cdc.supplemental.logging | true | If true, persist all the 
required information about the change data, including 'before' and 'after'. 
Otherwise, just persist the 'op' and the record key. |
+Note:
 
-Other existing config that can be reused in cdc mode is as following:
-Define another query mode named `cdc`, which is similar to `snapshpt`, 
`read_optimized` and `incremental`.
-When read in cdc mode, set `hoodie.datasource.query.type` to `cdc`.
+**`CDC_LOG_File` is a new file type and written out for CDC**. 
`ADD_BASE_File`, `REMOVE_BASE_FILE`, and `REPLACED_FILE_GROUP` represent the 
existing data files in the CDC scenario. 
 
-| key | default  | description |
-| --- |---| --- |
-| hoodie.datasource.query.type | snapshot | set to cdc, enable the cdc quey 
mode |
-| hoodie.datasource.read.start.timestamp | -        | requried. |
-| hoodie.datasource.read.end.timestamp | -        | optional. |
+For example:
+- An `INSERT` operation may create a list of new data files. These files will be treated as ADD_BASE_FILE;
+- A `DELETE_PARTITION` operation will replace a list of file slices. For each of these, we get the CDC data in the `REPLACED_FILE_GROUP` way.
 
+## When `supplemental.logging=false`
 
-### CDC File Types
+In this mode, we minimize the additional storage for CDC information.
 
-Here we define 5 cdc file types in CDC scenario.
+- When writing, the logging process is similar to the one described in section "When `supplemental.logging=true`", except that only the change type `op` and the record key are persisted.
+- When reading, the changed info will be inferred on the fly, which costs more computation power.
 
-- CDC_LOG_File: a file consists of CDC Blocks with the changing data related 
to one commit.
-  - when `hoodie.table.cdc.supplemental.logging` is true, it keeps all the 
fields about the change data, including `op`, `ts_ms`, `before` and `after`. 
When query hudi table in cdc query mode, load this file and return directly.
-  - when `hoodie.table.cdc.supplemental.logging` is false, it just keeps the 
`op` and the key of the changing record. When query hudi table in cdc query 
mode, we need to load the previous version and the current one of the touched 
file slice to extract the other info like `before` and `after` on the fly.
-- ADD_BASE_File: a normal base file for a specified instant and a specified 
file group. All the data in this file are new-incoming. For example, we first 
write data to a new file group. So we can load this file, treat each record in 
this as the value of `after`, and the value of `op` of each record is `i`.
-- REMOVE_BASE_FILE: a normal base file for a specified instant and a specified 
file group, but this file is empty. A file like this will be generated when we 
delete all the data in a file group. So we need to find the previous version of 
the file group, load it, treat each record in this as the value of `before`, 
and the value of `op` of each record is `d`.
-- MOR_LOG_FILE: a normal log file. For this type, we need to load the previous 
version of file slice, and merge each record in the log file with this data 
loaded separately to determine how the record has changed, and get the value of 
`before` and `after`.
-- REPLACED_FILE_GROUP: a file group that be replaced totally, like 
`DELETE_PARTITION` and `INSERT_OVERWRITE` operations. We load this file group, 
treat all the records as the value of `before`, and the value of `op` of each 
record is `d`.
+Detailed inference algorithms are illustrated in this [design 
document](https://docs.google.com/document/d/1vb6EwTqGE0XBpZxWH6grUP2erCjQYb1dS4K24Dk3vL8/).

Review Comment:
   Yes, the original algorithm in the doc assumes that no CDC data (key and op)
is logged, so it won't be as optimized as when the key and op are logged. It
should be updated accordingly, at least in this PR, to align with the case of
logging key and op.
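
To make the difference concrete, here is a minimal Python sketch of what inference could look like once key and op are logged. It is only an illustration under stated assumptions: the logged CDC data is modeled as a list of `(op, record key)` pairs, and the previous and current versions of the touched file slice are modeled as dicts keyed by record key. The function name `infer_cdc_records`, the dict layout, and the `u` op code (borrowed from the Debezium convention) are hypothetical and are not taken from the RFC, the design document, or Hudi's code.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical in-memory stand-in for a record; not a Hudi type.
Record = Dict[str, object]


def infer_cdc_records(
    logged_changes: Iterable[Tuple[str, str]],
    previous_slice: Dict[str, Record],
    current_slice: Dict[str, Record],
) -> List[Dict[str, Optional[object]]]:
    """Rebuild `before`/`after` for each logged (op, record key) pair.

    Because op and key are already logged, only the touched keys are looked
    up; the two file slice versions are not diffed record by record.
    """
    cdc_records = []
    for op, key in logged_changes:
        # An insert ("i") has no prior image; a delete ("d") has no new image.
        # Any other op is treated as an update here (assumption).
        before = None if op == "i" else previous_slice.get(key)
        after = None if op == "d" else current_slice.get(key)
        cdc_records.append(
            {"op": op, "key": key, "before": before, "after": after}
        )
    return cdc_records


# Usage with made-up data: one update, one delete, one insert.
previous = {"k1": {"id": "k1", "price": 10}, "k2": {"id": "k2", "price": 20}}
current = {"k1": {"id": "k1", "price": 11}, "k3": {"id": "k3", "price": 30}}
logged = [("u", "k1"), ("d", "k2"), ("i", "k3")]
changes = infer_cdc_records(logged, previous, current)
```

Without the logged key and op, the reader would instead have to compare the two file slice versions in full to discover which records changed and how, which is the less optimized case the original algorithm covers.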



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
