GitHub user vinothchandar added a comment to the discussion: RLI support for Flink streaming
I am still only focusing on the RLI pieces and the write path (not caching, compaction, or SI yet).

> In BucketAssigner operator, the RLI index metadata would be utilized as the
> index state backend,

Are you basically saying that instead of the `state index` you'll look up the RLI? It's not clear when reading.

> In StreamWrite operator, the index items are buffered first and write to the
> MDT after the data items are flushed(triggered by Flink checkpoint),

So this happens in the same operator that writes the data files, i.e. the `Stream Write op`?

> Then the index items are shuffled by record keys with the same hashing
> algorithm of the MDT

Your diagram says "shuffled by record key", which is different. Can you clarify: is it `shuffled by record key`, `hash(record_key) % num_rli_shards`, or partitioned by bucket (update/insert)?

I see a basic conflict here.

- Each write operator task is either updating or inserting (based on the BucketAssigner?) into a file group. So all updates/inserts to a file group should be in one task, right?
- But the RLI update will be redistributed based on `hash(record_key) % num_rli_shards`?

So these need to be done in separate operator stages, right?

For anything we propose around RLI writes, I want to understand how we will write one log file per RLI file group (shard) for each commit (we cannot have a lot of small files in the RLI).

I thought you would do something like this (it still does not work for SI):

`BucketAssigner: tag record as I/U/D` => `shuffle by hash(record key)%num_rli_shards` => `write to RLI; pass on RLI files written` => `shuffle by bucket/filegroup` => `perform write handle, merge/append/create`

When the job then checkpoints, you know which RLI files and which data files were written. You commit them into the MDT (the RLI files) and the DT (the data files), respectively.

Note that the above does not work with positional updates/deletes, since we don't know the position ahead of time. I want to first understand your proposal.
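To make the conflict concrete, here is a minimal sketch (not Hudi or Flink code) of the two groupings. The names `rli_shard`, `group_by_rli_shard`, `shards_touched_per_file_group`, and `NUM_RLI_SHARDS` are hypothetical, and `crc32` is only a stand-in for whatever hashing the MDT actually applies to record keys.

```python
import zlib
from collections import defaultdict

NUM_RLI_SHARDS = 4  # hypothetical shard count, for illustration only


def rli_shard(record_key, num_shards=NUM_RLI_SHARDS):
    # crc32 is a stand-in hash (an assumption); a real job would have to
    # reuse the hashing the MDT applies to record keys.
    return zlib.crc32(record_key.encode("utf-8")) % num_shards


def group_by_rli_shard(index_entries):
    """Group (record_key, file_group) entries the way an RLI-side shuffle would."""
    shards = defaultdict(list)
    for record_key, file_group in index_entries:
        shards[rli_shard(record_key)].append((record_key, file_group))
    return dict(shards)


def shards_touched_per_file_group(index_entries):
    """For each file group (one write task), collect the RLI shards its keys map to."""
    touched = defaultdict(set)
    for record_key, file_group in index_entries:
        touched[file_group].add(rli_shard(record_key))
    return dict(touched)


# Two file groups, each receiving 100 tagged records.
entries = [(f"key-{i}", f"fg-{i // 100}") for i in range(200)]

for file_group, shards in sorted(shards_touched_per_file_group(entries).items()):
    print(file_group, "-> RLI shards", sorted(shards))
```

If each write task wrote its own RLI entries, one file group's task would touch several shards, so every shard would receive a log file from every task. Grouping by shard first gives each shard a single writer per commit, which is why I expect two separate shuffle stages.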
I am not very sure that this is the direction we should take.

GitHub link: https://github.com/apache/hudi/discussions/17452#discussioncomment-15204446
