GitHub user vinothchandar added a comment to the discussion: RLI support for Flink streaming

@danny0405 @geserdugarov @HuangZhenQiu I am sketching an approach here, to seed 
further discussion. Please take this forward. 

**Assumptions**:

- Flink implements a version of the Chandy-Lamport distributed checkpointing algorithm, such that all operators are synchronized by the checkpoint barrier to process the same data.
- The Hudi write operators (StreamOp for data writing, IndexOp for index writing) flush all records before the checkpoint barrier to storage and return the files produced to the coordinator.
- The Flink coordinator waits for all operators to finish returning checkpointed data (i.e., the files produced), and then proceeds to write `FILES` in the MT, commit the MT timeline, and commit the data table timeline, in that order (see the sketch after this list).
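
To make the ordering in the last bullet concrete, here is a minimal sketch of what the per-checkpoint commit sequence could look like. All names here (`WriteStatus`, `MetadataTableWriter`, `DataTableWriter`, `onCheckpointComplete`) are illustrative placeholders, not the actual Hudi Flink coordinator API.

```java
import java.util.List;

public class CheckpointCommitSketch {

  /** Results each write operator reports once it has flushed records up to the barrier. */
  record WriteStatus(String operatorId, List<String> filesProduced) {}

  interface MetadataTableWriter {
    void writeFilesPartition(List<WriteStatus> statuses); // FILES (and index) partitions in the MT
    void commitTimeline(String instant);                  // commit the MT timeline
  }

  interface DataTableWriter {
    void commitTimeline(String instant, List<WriteStatus> statuses); // commit the data table timeline
  }

  /** Invoked by the coordinator only after ALL operators have returned their checkpointed data. */
  static void onCheckpointComplete(String instant,
                                   List<WriteStatus> allStatuses,
                                   MetadataTableWriter mtWriter,
                                   DataTableWriter dtWriter) {
    // 1. Write FILES (and other MT partitions) for this checkpoint.
    mtWriter.writeFilesPartition(allStatuses);
    // 2. Commit the metadata table timeline first...
    mtWriter.commitTimeline(instant);
    // 3. ...and only then commit the data table timeline, so the data commit
    //    never becomes visible before its MT entries exist.
    dtWriter.commitTimeline(instant, allStatuses);
  }
}
```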

**DAG**: 

The main difference is that, instead of special-casing the RLI write, we do it after the `StreamWrite Op`, keeping all MT writes consistently in the same operator. This is how Spark works in 1.1, so we need really strong technical reasons to deviate from it.

<img width="2275" height="616" alt="image" src="https://github.com/user-attachments/assets/5a058b6f-e478-46c8-87e9-55ff8e60e174" />
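
For illustration, a rough Flink DataStream wiring of the proposed ordering might look like the sketch below. The `map` stages are hypothetical stand-ins for the real Hudi operators (BucketAssignor, StreamWrite, index write); only the operator order is the point here, not the actual pipeline builder.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RliDagSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<String> records = env.fromElements("r1", "r2", "r3");

    records
        // Placeholder for the BucketAssignor Op: route each record to a file group by key.
        .keyBy(r -> r)
        .map(r -> "bucket-assigned:" + r)
        // Placeholder for the StreamWrite Op: in the real pipeline this flushes
        // data files before the checkpoint barrier.
        .map(r -> "data-file-written:" + r)
        // Placeholder for the index write (RLI/SI) step: it runs AFTER StreamWrite,
        // so all MT writes live in one operator instead of a special-cased RLI path.
        .map(r -> "index-updated:" + r)
        .print();

    env.execute("rli-dag-sketch");
  }
}
```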


Regarding the comments on performance and the slowest operator: it's understandable that if there are a lot of SIs to be updated, the checkpoint will take proportionally longer. But this design will still perform similarly if only the RLI is enabled.

Once we align on this, let's update the top-level discussion description. Then we can move on to discussing the caching design for the BucketAssignor Op.

GitHub link: 
https://github.com/apache/hudi/discussions/17452#discussioncomment-15236110
