GitHub user vinothchandar added a comment to the discussion: RLI support for Flink streaming
@danny0405 @geserdugarov @HuangZhenQiu I am sketching an approach here to seed further discussion. Please take this forward.

**Assumptions**:

- Flink implements a version of Chandy-Lamport distributed checkpointing, such that all operators are synchronized by the checkpoint barrier and process the same data.
- The Hudi write operators (StreamOp for data writing, IndexOp for index writing) flush all records received before the checkpoint barrier to storage and return the files produced to the coordinator.
- The Flink coordinator waits for all operators to finish returning checkpointed data (i.e. the files produced), and then proceeds to write `FILES` in the MT, commit the MT timeline, and commit the data table timeline.

**DAG**: The main difference is that instead of special-casing the RLI write, we do it after the `StreamWrite Op`, keeping all MT writes consistently in the same operator. This is how Spark works in 1.1, so we need really strong technical reasons to deviate from it.

<img width="2275" height="616" alt="image" src="https://github.com/user-attachments/assets/5a058b6f-e478-46c8-87e9-55ff8e60e174" />

On the comments about performance and the slowest operator: it's understandable that if there are a lot of SIs to be updated, the checkpoint will take proportionally longer. But this design will still have similar performance if only RLI is enabled.

Once we align on this, let's update the top-level discussion description. We can then move on to discussing the caching design for the BucketAssignor Op. Rough sketches of the assumed commit ordering and the operator wiring are attached at the end of this comment.

GitHub link: https://github.com/apache/hudi/discussions/17452#discussioncomment-15236110
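
A minimal sketch of the commit ordering assumed above: the coordinator collects the flushed files from every write subtask for a checkpoint, then writes the MT partitions (FILES, RLI, SIs), commits the MT timeline, and only then commits the data table timeline. All names here (`WriteResult`, `MetadataTableWriter`, `DataTableTimeline`) are hypothetical placeholders, not the actual Hudi Flink coordinator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only: models the assumed ordering where the coordinator
 * waits for every write operator to flush its files for a checkpoint, then
 * commits the metadata table before the data table.
 */
public class CheckpointCommitSketch {

  private final Map<Long, List<WriteResult>> pendingResults = new ConcurrentHashMap<>();
  private final int numWriteTasks;
  private final MetadataTableWriter mtWriter;    // writes FILES, RLI, SI partitions
  private final DataTableTimeline dataTimeline;  // data table commit

  public CheckpointCommitSketch(int numWriteTasks,
                                MetadataTableWriter mtWriter,
                                DataTableTimeline dataTimeline) {
    this.numWriteTasks = numWriteTasks;
    this.mtWriter = mtWriter;
    this.dataTimeline = dataTimeline;
  }

  /** Each write subtask reports the files it flushed for the checkpoint barrier. */
  public synchronized void onWriteResult(long checkpointId, WriteResult result) {
    pendingResults.computeIfAbsent(checkpointId, k -> new ArrayList<>()).add(result);
    if (pendingResults.get(checkpointId).size() == numWriteTasks) {
      commit(checkpointId, pendingResults.remove(checkpointId));
    }
  }

  /** Assumed ordering: MT partitions -> MT timeline commit -> data table timeline commit. */
  private void commit(long checkpointId, List<WriteResult> results) {
    mtWriter.writeFilesAndIndexPartitions(results);  // FILES, RLI, secondary indexes
    mtWriter.commitTimeline(checkpointId);           // metadata table commit
    dataTimeline.commit(checkpointId, results);      // data table commit last
  }
}

// Hypothetical collaborator types, defined only to keep the sketch self-contained.
interface WriteResult {}
interface MetadataTableWriter {
  void writeFilesAndIndexPartitions(List<WriteResult> results);
  void commitTimeline(long checkpointId);
}
interface DataTableTimeline {
  void commit(long checkpointId, List<WriteResult> results);
}
```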
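And a rough wiring sketch of the DAG idea using plain Flink `DataStream` operators: all MT/index writes sit in one operator placed after the data write, rather than a special-cased RLI path. The function classes below are placeholders, not the actual Hudi Flink operators.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/** Operator-ordering sketch only; record payloads are modeled as plain strings. */
public class RliDagSketch {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000);  // checkpoint barrier drives the flush/commit cycle

    DataStream<String> records = env.fromElements("key-1:payload-a", "key-2:payload-b");

    records
        .keyBy(r -> r.split(":")[0])        // shuffle by record key for bucket assignment
        .map(new BucketAssignFn())          // stand-in for the BucketAssignor Op
        .keyBy(r -> r)                      // shuffle by assigned file group
        .map(new StreamWriteFn())           // stand-in for the StreamWrite Op (data files)
        .map(new MetadataIndexWriteFn())    // one operator for all MT writes: FILES, RLI, SIs
        .print();

    env.execute("rli-dag-sketch");
  }

  /** Placeholder: tags each record with a file group, standing in for bucket assignment. */
  static class BucketAssignFn implements MapFunction<String, String> {
    @Override
    public String map(String record) {
      return "file-group-0|" + record;
    }
  }

  /** Placeholder: pretends to write the data file and emits a write-status string. */
  static class StreamWriteFn implements MapFunction<String, String> {
    @Override
    public String map(String record) {
      return "write-status(" + record + ")";
    }
  }

  /** Placeholder: pretends to update FILES, RLI and secondary index partitions in one place. */
  static class MetadataIndexWriteFn implements MapFunction<String, String> {
    @Override
    public String map(String writeStatus) {
      return "mt-update(" + writeStatus + ")";
    }
  }
}
```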
