GitHub user vinothchandar added a comment to the discussion: RLI support for 
Flink streaming

I am still focusing only on the RLI pieces and the write path (not caching,
compaction, or SI yet).

> In BucketAssigner operator, the RLI index metadata would be utilized as the 
> index state backend,
  
Are you basically saying that instead of the `state index` you'll look up the
RLI? It's not clear from reading this.

> In StreamWrite operator, the index items are buffered first and write to the 
> MDT after the data items are flushed(triggered by Flink checkpoint),

So this happens in the same operator that writes the data files, the
`Stream Write op`?

> Then the index items are shuffled by record keys with the same hashing 
> algorithm of the MDT 

Your diagram says "shuffled by record key", which is different. Can you
clarify: is it `shuffled by record key`,
`hash(record_key) % num_rli_shards`, or partitioned by bucket (update/insert)?

I see a basic conflict here. 

- Each write operator task is either updating or inserting into a file group
(based on the BucketAssigner?). So all updates/inserts to a file group should
be in one task, right?
- But the RLI update will be redistributed based on
`hash(record_key) % num_rli_shards`? So these need to be done in separate
operator stages, right?
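
To make the conflict concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the shard count, the `rli_shard` helper, and the use of CRC32 as the hash are all hypothetical stand-ins, not Hudi's actual hashing or API. It shows that record keys routed to a single file group generally spread over several RLI shards, so the two partitionings cannot be served by the same task layout.

```python
import zlib
from collections import defaultdict

NUM_RLI_SHARDS = 4  # assumed shard count, for illustration only


def rli_shard(record_key, num_shards=NUM_RLI_SHARDS):
    # Stand-in for hash(record_key) % num_rli_shards.
    # CRC32 is an assumption here, NOT the hash Hudi uses.
    return zlib.crc32(record_key.encode("utf-8")) % num_shards


def shards_touched_per_file_group(assignments):
    # assignments: record key -> file group, as a bucket assigner might emit.
    touched = defaultdict(set)
    for record_key, file_group in assignments.items():
        touched[file_group].add(rli_shard(record_key))
    return dict(touched)


# Hypothetical case: every record is assigned to the same file group,
# i.e. handled by one write task.
assignments = {f"key-{i}": "fg-0" for i in range(100)}
touched = shards_touched_per_file_group(assignments)
print(touched)  # one file group, but more than one RLI shard
```

The one write task owning `fg-0` would need to update several RLI shards, which is why the RLI write appears to need its own shuffle stage.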

For anything we propose around RLI writes, I want to understand how we will
write one log file per RLI file group (shard) for each commit (we cannot
have a lot of small files in RLI).
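
The small-files concern can be sketched by counting writers per shard. This is a rough Python model, not Hudi code: `shard_of`, `count_rli_log_files`, and the CRC32 hash are hypothetical, and it assumes each distinct (shard, writing task) pair produces one log file per commit.

```python
import zlib


def shard_of(record_key, num_shards):
    # Assumed stand-in for hash(record_key) % num_rli_shards.
    return zlib.crc32(record_key.encode("utf-8")) % num_shards


def count_rli_log_files(index_entries, num_shards, shuffle_by_shard):
    # index_entries: list of (write_task_id, record_key) pairs.
    writers = set()
    for write_task_id, record_key in index_entries:
        shard = shard_of(record_key, num_shards)
        if shuffle_by_shard:
            # After a shuffle, exactly one task owns each shard.
            writers.add(shard)
        else:
            # Without a shuffle, every write task writes its own file per shard.
            writers.add((shard, write_task_id))
    return len(writers)


# Four hypothetical write tasks, 50 index entries each.
entries = [(task, f"key-{task}-{i}") for task in range(4) for i in range(50)]
with_shuffle = count_rli_log_files(entries, num_shards=4, shuffle_by_shard=True)
without_shuffle = count_rli_log_files(entries, num_shards=4, shuffle_by_shard=False)
print(with_shuffle, without_shuffle)
```

In this model the shuffled layout is bounded by the number of shards, while the unshuffled layout grows with shards times write parallelism.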


I thought you would do something like this (it still does not work for SI):

`BucketAssigner: tag record as I/U/D` =>
`shuffle by hash(record_key) % num_rli_shards` =>
`write to RLI; pass on RLI files written` =>
`shuffle by bucket/filegroup` =>
`perform write handle, merge/append/create`

When the pipeline then checkpoints, you know which RLI files and which data
files were written. You commit both, respectively, into MT, MT files and DT.
Note that the above does not work with positional updates/deletes, since we
don't know the position ahead of time.
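
The staged flow above can be sketched as a toy, single-process Python simulation. Everything here is an assumption made for illustration: the function names, the `fg-new` file group for inserts, the CRC32 hash, and the shape of the returned metadata are hypothetical and do not reflect Hudi or Flink APIs. Deletes are omitted for brevity, and the index entries carry only a file group, never a position, which mirrors the limitation noted above.

```python
import zlib
from collections import defaultdict

NUM_RLI_SHARDS = 2  # assumed shard count


def tag_records(records, existing_index):
    # Stage 1: tag each record as insert (I) or update (U).
    tagged = []
    for key, payload in records:
        if key in existing_index:
            tagged.append((key, payload, "U", existing_index[key]))
        else:
            tagged.append((key, payload, "I", "fg-new"))  # hypothetical new file group
    return tagged


def shuffle(items, key_fn):
    # Models a network shuffle: group items by partitioning key.
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def run_until_checkpoint(records, existing_index):
    tagged = tag_records(records, existing_index)
    # Stage 2: shuffle by hash(record_key) % num_rli_shards.
    by_shard = shuffle(
        tagged, lambda t: zlib.crc32(t[0].encode("utf-8")) % NUM_RLI_SHARDS
    )
    # Stage 3: one RLI log file per shard; entries hold key -> file group only.
    rli_files = {
        shard: [(key, fg) for key, _, _, fg in items]
        for shard, items in by_shard.items()
    }
    # Stage 4: shuffle by file group; Stage 5: write the data files.
    by_file_group = shuffle(tagged, lambda t: t[3])
    data_files = {
        fg: [(key, payload) for key, payload, _, _ in items]
        for fg, items in by_file_group.items()
    }
    # At checkpoint, both sets of written files are known and can be committed.
    return {"rli_files": rli_files, "data_files": data_files}


result = run_until_checkpoint([("a", 1), ("b", 2), ("c", 3)], {"a": "fg-1"})
print(result)
```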

I want to first understand your proposal. I am not yet sure this is the
direction we should go.





GitHub link: 
https://github.com/apache/hudi/discussions/17452#discussioncomment-15204446
