GitHub user cshuo edited a comment on the discussion: Dynamic Bucket Index For 
Flink streaming

> With plain partitioned RLI, clustering can merge small file groups, split 
> large ones, re-sort data — and simply update the RLI. The layout remains 
> fully optimizable over time, which seems strictly more flexible.

The doc is a little stale; I will update it soon. What you describe here is in 
fact the direction we have chosen for the proposal: plain partitioned RLI with 
the common file naming convention (not bucket-index style). 
The original motivation for the dynamic bucket index abstraction was the high 
memory efficiency of the RLI cache. For example, we can store 
`hash value of record key` -> `bucket id`, so that only 1 GB of memory is 
required for 100 million keys. However, since we also have to support RLI 
streaming writes, that solution does not work: the RLI cache must always store 
the complete record key to determine whether a key already exists.
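
To illustrate why a hash-only cache cannot answer "does this key already 
exist?", here is a minimal sketch. All names (`HashToBucketCache`, 
`FullKeyCache`, `find_colliding_key`) are hypothetical and are not Hudi APIs, 
and the hash is deliberately truncated to a few bits so that a collision is 
easy to produce.

```python
import hashlib


def key_hash(record_key: str, bits: int) -> int:
    """Truncated hash of a record key (narrow on purpose so collisions are easy to find)."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


class HashToBucketCache:
    """Compact cache: stores hash(record key) -> bucket id, never the key itself."""

    def __init__(self, bits: int):
        self.bits = bits
        self.buckets = {}

    def put(self, record_key: str, bucket_id: int) -> None:
        self.buckets[key_hash(record_key, self.bits)] = bucket_id

    def maybe_contains(self, record_key: str) -> bool:
        # Also True for any unseen key whose hash collides with a stored one.
        return key_hash(record_key, self.bits) in self.buckets


class FullKeyCache:
    """Cache keyed by the complete record key, so existence checks are exact."""

    def __init__(self):
        self.buckets = {}

    def put(self, record_key: str, bucket_id: int) -> None:
        self.buckets[record_key] = bucket_id

    def contains(self, record_key: str) -> bool:
        return record_key in self.buckets


def find_colliding_key(existing: str, bits: int, limit: int = 1_000_000) -> str:
    """Search for a different key with the same truncated hash as `existing`."""
    target = key_hash(existing, bits)
    for i in range(limit):
        candidate = f"probe-{i}"
        if candidate != existing and key_hash(candidate, bits) == target:
            return candidate
    raise RuntimeError("no collision found within limit")
```

With `bits=8`, a key that was never written but collides with a stored one is 
reported as present by `HashToBucketCache.maybe_contains`, while 
`FullKeyCache.contains` correctly reports it as new. A wider hash makes this 
rarer but does not remove it, which is the reason the full key has to be kept.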

> But for dimension table workloads, where updates arrive across all partitions 
> randomly and continuously, most partitions stay hot. In that scenario, the 
> cache effectively needs to hold key → bucket mappings for the entire table in 
> memory, and partition-level eviction provides little relief. How would this 
> design handle such workloads without running into memory pressure?

Good point. For this workload, partition-level eviction is not expected to 
provide much benefit if most partitions are continuously hot. The design relies 
on a few controls here:
1. The per-partition cache is not a pure in-memory cache; it can spill, like 
`ExternalSpillableMap`. We enforce a total heap budget through a config option; 
once the in-memory portion exceeds the limit, entries can spill to local 
disk/RocksDB. So in the worst case we bound heap usage and trade lookup latency 
for memory safety.
2. The cache is not a table-wide map. Each assigner only caches record keys 
that belong to its own key-group range, so the key -> bucket mapping is sharded 
by the bucket-assign parallelism instead of being duplicated on every task.
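
The two controls above can be sketched roughly as follows. This is an 
illustration under stated assumptions, not the proposed implementation: 
`SpillableKeyCache`, `assigner_for`, and `run_demo` are hypothetical names, the 
budget is counted in entries rather than bytes, `crc32` is a stand-in hash, and 
a `shelve` file stands in for local disk/RocksDB.

```python
import os
import shelve
import tempfile
import zlib


def assigner_for(record_key: str, max_parallelism: int, parallelism: int) -> int:
    """Map a key to a key group, then to the assigner subtask owning that group.

    crc32 is a stand-in hash chosen for this sketch only.
    """
    key_group = zlib.crc32(record_key.encode("utf-8")) % max_parallelism
    return key_group * parallelism // max_parallelism


class SpillableKeyCache:
    """Keeps at most `max_in_memory` entries on heap; the rest go to a local file."""

    def __init__(self, max_in_memory: int, spill_path: str):
        self.max_in_memory = max_in_memory
        self.memory = {}
        self.disk = shelve.open(spill_path)
        self.spilled = 0  # number of distinct keys currently stored on disk

    def put(self, record_key: str, location: str) -> None:
        if record_key in self.memory:
            self.memory[record_key] = location
        elif record_key in self.disk:
            self.disk[record_key] = location
        elif len(self.memory) < self.max_in_memory:
            self.memory[record_key] = location
        else:
            # Heap budget exhausted: pay a slower lookup later instead of growing heap.
            self.disk[record_key] = location
            self.spilled += 1

    def get(self, record_key: str):
        if record_key in self.memory:
            return self.memory[record_key]
        return self.disk.get(record_key)

    def close(self) -> None:
        self.disk.close()


def run_demo(num_keys: int = 100, parallelism: int = 4, max_parallelism: int = 128):
    """Route keys to per-assigner caches and report heap vs. spilled entry counts."""
    with tempfile.TemporaryDirectory() as spill_dir:
        caches = [
            SpillableKeyCache(2, os.path.join(spill_dir, f"assigner-{i}"))
            for i in range(parallelism)
        ]
        try:
            for n in range(num_keys):
                key = f"key-{n}"
                owner = assigner_for(key, max_parallelism, parallelism)
                caches[owner].put(key, f"fg-{n % 7}")
            in_memory = [len(c.memory) for c in caches]
            spilled = [c.spilled for c in caches]
            all_found = all(
                caches[assigner_for(f"key-{n}", max_parallelism, parallelism)]
                .get(f"key-{n}") == f"fg-{n % 7}"
                for n in range(num_keys)
            )
        finally:
            for c in caches:
                c.close()
    return in_memory, spilled, all_found
```

Each key lives in exactly one assigner's cache, and the in-memory part of every 
cache never exceeds its budget; everything beyond that is still found, only via 
the slower spill path.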

So for a dimension-table style workload where almost all partitions are hot, 
this proposal would not magically avoid maintaining a large working set; it 
bounds heap and spills/loads as needed. Operators would need to size the total 
cache, local spill storage, and assigner parallelism according to the update 
working set. We can also document this as a trade-off/limitation.
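
As a rough aid for that sizing exercise, the back-of-the-envelope estimate can 
be written down as below. The function name, the parameters, and the 
assumptions (keys spread evenly across assigners, a fixed average entry size, a 
heap budget expressed per assigner) are all hypothetical simplifications, not 
config options of the proposal.

```python
def size_assigner_cache(working_set_keys: int, bytes_per_entry: int,
                        parallelism: int, heap_budget_bytes: int) -> dict:
    """Estimate heap-resident vs. spilled entries for one assigner.

    Assumes keys are spread evenly across assigners and every cache entry
    costs `bytes_per_entry` bytes; both are simplifications for this sketch.
    """
    # Integer ceiling division: keys owned by the most loaded assigner.
    keys_per_assigner = (working_set_keys + parallelism - 1) // parallelism
    heap_entries = min(keys_per_assigner, heap_budget_bytes // bytes_per_entry)
    spilled_entries = keys_per_assigner - heap_entries
    return {
        "keys_per_assigner": keys_per_assigner,
        "heap_entries": heap_entries,
        "spilled_entries": spilled_entries,
        "spill_bytes": spilled_entries * bytes_per_entry,
    }


# Example with made-up numbers: 100 million hot keys, 64 bytes per entry,
# 8 assigners, 256 MiB of heap per assigner.
estimate = size_assigner_cache(100_000_000, 64, 8, 256 * 1024 * 1024)
```

Raising the assigner parallelism shrinks `keys_per_assigner`, while raising the 
heap budget moves entries from the spill path back into memory; local spill 
storage has to cover whatever remains.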


GitHub link: 
https://github.com/apache/hudi/discussions/18514#discussioncomment-16739740
