[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-26 Thread GitBox


alexeykudinkin commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r980381575


##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +155,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+There are 2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest; however, in the case of the Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing time.
+
+Write-on-compaction can always persist CDC info and achieve standardization of implementation logic across engines;
+however, some delays are added to the CDC query results. Based on the business requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.
 
-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the 
corresponding CDC data is also persisted.

Review Comment:
   @xushiyan can you elaborate on "on-the-fly inference"? I don't think I saw it mentioned anywhere in the RFC.
   
   I still don't think I fully understand how we're going to tackle the issue of CDC records being deferred until the compactor runs. What if the compactor isn't even set up to run for the table?
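
For illustration only, here is a minimal Spark DataSource sketch of the "schedule more frequent compaction" mitigation mentioned in the hunk above. It assumes the `hoodie.table.cdc.enabled` switch proposed in the RFC together with Hudi's existing inline-compaction write options; the table name, schema, and path are placeholders, and this is a sketch under those assumptions, not part of the RFC text.

```scala
// Sketch: write to a CDC-enabled MOR table with inline compaction after every
// delta commit, so CDCBlock materialization is not deferred behind an
// unscheduled compactor. `hoodie.table.cdc.enabled` is the switch proposed in
// the RFC; the other options are standard Hudi write configs.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("cdc-mor-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", 1000L), (2, "b", 1000L)).toDF("uuid", "name", "ts")

df.write.format("hudi").
  option("hoodie.table.name", "hudi_cdc_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.cdc.enabled", "true").              // RFC: master switch for CDC
  option("hoodie.compact.inline", "true").                 // compact as part of the write
  option("hoodie.compact.inline.max.delta.commits", "1").  // trigger after every delta commit
  mode(SaveMode.Append).
  save("/tmp/hudi_cdc_table")
```

With write-on-compaction semantics, a setup like the above materializes CDC data as part of each write because compaction runs inline; if compaction is neither inline nor scheduled asynchronously, the reviewer's question above still stands.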






[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r971371531


##
rfc/rfc-51/rfc-51.md:
##
@@ -62,73 +63,79 @@ We follow the debezium output format: four columns as shown 
below
 - u: represent `update`; when `op` is `u`, neither `before` nor `after` is null;
 - d: represent `delete`; when `op` is `d`, `after` is always null;
 
-Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
+**Note**
 
-## Goals
+* In case the same record undergoes a sequence of operations like insert -> delete -> insert, CDC data should be produced to reflect the exact sequence of changes.
+* The illustration above ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+## Design Goals
 
-## Implementation
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines

Review Comment:
   Let's also explicitly call out that:
   
   - For CDC-enabled tables, the performance of non-CDC queries should not be affected



##
rfc/rfc-51/rfc-51.md:
##
@@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|---|---|---|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enabled=true` and `hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifying code paths
+| key | default | description |
+|---|---|---|
+| hoodie.table.cdc.enabled | `false` | set to `true` for CDC queries |
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | - | required. |
+| hoodie.datasource.read.end.timestamp | - | optional. |
 
-![](points.jpg)
+### Logical File Types
 
-### Config Definitions
+We define 4 logical file types for the CDC scenario.

Review Comment:
   Agree w/ @xushiyan proposal, let's simplify this -- having a table mapping 
an action on the Data table into the action on CDC log makes message much more 
clear.
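
For reference, a minimal read-side sketch of the CDC query described in the hunk above, using the option names exactly as proposed in this revision of the RFC (`hoodie.datasource.read.start.timestamp` / `end.timestamp`; later revisions rename them). The table path and timestamps are placeholders.

```scala
// Sketch: incremental query against a CDC-enabled table, with option names as
// written in the quoted table above (this revision of the RFC). With
// `hoodie.table.cdc.enabled=true` on the table, the result is expected to
// follow the debezium-style op/before/after layout described earlier.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdc-read-sketch").getOrCreate()

val changes = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").              // switch to incremental
  option("hoodie.datasource.read.start.timestamp", "20220101000000"). // required
  option("hoodie.datasource.read.end.timestamp", "20220102000000").   // optional
  load("/tmp/hudi_cdc_table")

changes.show(false)
```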
   



##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +155,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+There are 2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest; however, in the case of the Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing time.
+
+Write-on-compaction can always persist CDC info and achieve standardization of
implementation logic across engines;

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-13 Thread GitBox


alexeykudinkin commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r970277127


##
rfc/rfc-51/rfc-51.md:
##
@@ -62,73 +63,79 @@ We follow the debezium output format: four columns as shown 
below
 - u: represent `update`; when `op` is `u`, neither `before` nor `after` is null;
 - d: represent `delete`; when `op` is `d`, `after` is always null;
 
-Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
+**Note**
 
-## Goals
+* In case the same record undergoes a sequence of operations like insert -> delete -> insert, CDC data should be produced to reflect the exact sequence of changes.
+* The illustration above ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+## Design Goals
 
-## Implementation
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines
 
-### CDC Architecture
+## Configurations
 
-![](arch.jpg)
+| key | default | description |
+|---|---|---|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifying code paths
+To perform CDC queries, users need to set 
`hoodie.datasource.query.incremental.format=cdc` and 
`hoodie.datasource.query.type=incremental`.
 
-![](points.jpg)
+| key | default | description |
+|---|---|---|
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for incremental query. |
+| hoodie.datasource.query.incremental.format | `latest_state` | `latest_state` (current incremental query behavior) returns the latest records' values. Set to `cdc` to return the full CDC results. |
+| hoodie.datasource.read.begin.instanttime | - | required. |
+| hoodie.datasource.read.end.instanttime | - | optional. |
 
-### Config Definitions
+### Logical File Types
 
-Define a new config:
-
-| key | default | description |
-| --- | --- | --- |
-| hoodie.table.cdc.enabled | false | `true` represents the table to be used for CDC queries and will write cdc data if needed. |
-| hoodie.table.cdc.supplemental.logging | true | If true, persist all the required information about the change data, including 'before' and 'after'. Otherwise, just persist the 'op' and the record key. |
-
-Other existing configs that can be reused in cdc mode are as follows:
-Define another query mode named `cdc`, which is similar to `snapshot`,
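
For reference, a minimal read-side sketch using the renamed options listed in the table above (`hoodie.datasource.query.incremental.format=cdc` plus begin/end instant times). The table path and instant times are placeholders; this is illustrative only and not part of the RFC text.

```scala
// Sketch: incremental query returning full CDC results, using the option names
// from the table quoted above in this revision of the RFC.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdc-format-read-sketch").getOrCreate()

val cdcRows = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").          // full CDC results
  option("hoodie.datasource.read.begin.instanttime", "20220101000000"). // required
  option("hoodie.datasource.read.end.instanttime", "20220102000000").   // optional
  load("/tmp/hudi_cdc_table")

cdcRows.show(false)
```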