GitHub user cshuo edited a comment on the discussion: Dynamic Bucket Index For Flink streaming
> With plain partitioned RLI, clustering can merge small file groups, split
> large ones, re-sort data, and simply update the RLI. The layout remains
> fully optimizable over time, which seems strictly more flexible.

The doc is a little stale; I will update it soon. Actually, what you describe here is the direction we have chosen for the proposal: plain partitioned RLI and the common file naming convention (not bucket-index style).

The motivation for the original dynamic bucket index abstraction was the memory efficiency of the RLI cache. For example, if we store `hash value of record key` -> `bucket id`, only 1 GB of memory is required for 100 million keys. However, since we also have to support RLI streaming writes, that solution does not work: the RLI cache must always store the complete record key to determine whether a key already exists.

> But for dimension table workloads, where updates arrive across all partitions
> randomly and continuously, most partitions stay hot. In that scenario, the
> cache effectively needs to hold key → bucket mappings for the entire table in
> memory, and partition-level eviction provides little relief. How would this
> design handle such workloads without running into memory pressure?

Good point. For this workload, partition-level eviction is not expected to provide much benefit if most partitions are continuously hot. The design relies on a few controls here:

1. The per-partition cache is not a pure in-memory cache; it can spill, like ExternalSpillableMap. We enforce a total heap budget through a config option; once the in-memory portion exceeds the limit, entries can spill to local disk/RocksDB. So in the worst case we bound heap usage and trade lookup latency for memory safety.
2. The cache is not a table-wide map. Each assigner only caches record keys that belong to its own key-group range, so the key -> bucket mapping is sharded by the bucket-assign parallelism instead of being duplicated on every task.

So for a dimension-table style workload where almost all partitions are hot, this proposal would not magically avoid maintaining a large working set; it bounds heap usage and spills/loads entries as needed. Operators would need to size the total cache, local spill storage, and assigner parallelism according to the update working set. We can also document this as a trade-off/limitation.

GitHub link: https://github.com/apache/hudi/discussions/18514#discussioncomment-16739740

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
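
For readers who want a concrete picture of the two controls described in the reply (a bounded in-memory portion that spills, and sharding of keys across assigners), here is a minimal Python sketch. It is not the proposal's implementation. `SpillableKeyCache`, `assigner_for_key`, and `max_in_memory_entries` are hypothetical names; the budget is counted in entries rather than heap bytes; a plain dict stands in for local disk/RocksDB; and CRC32 stands in for whatever hash the real key-group assignment uses.

```python
import zlib
from collections import OrderedDict


def assigner_for_key(record_key: str, max_parallelism: int, parallelism: int) -> int:
    """Map a record key to a key group, then to the assigner subtask owning it.

    Assumption: CRC32 is only a deterministic stand-in hash for illustration.
    """
    key_group = zlib.crc32(record_key.encode("utf-8")) % max_parallelism
    return key_group * parallelism // max_parallelism


class SpillableKeyCache:
    """Hypothetical per-assigner cache: bounded in-memory part plus a spill store."""

    def __init__(self, max_in_memory_entries: int):
        # Assumption: the budget is an entry count, not a heap size in bytes.
        self.max_in_memory_entries = max_in_memory_entries
        self.memory = OrderedDict()  # hot entries, least recently used first
        self.spilled = {}            # stand-in for local disk / RocksDB

    def put(self, record_key: str, location: str) -> None:
        # Keep exactly one copy of each key across both tiers.
        self.spilled.pop(record_key, None)
        self.memory[record_key] = location
        self.memory.move_to_end(record_key)
        # Spill the coldest entries once the in-memory budget is exceeded.
        while len(self.memory) > self.max_in_memory_entries:
            cold_key, cold_location = self.memory.popitem(last=False)
            self.spilled[cold_key] = cold_location

    def get(self, record_key: str):
        if record_key in self.memory:
            self.memory.move_to_end(record_key)
            return self.memory[record_key]
        if record_key in self.spilled:
            # Slower path: load from the spill store and promote to memory.
            location = self.spilled[record_key]
            self.put(record_key, location)
            return location
        return None  # key has not been seen by this assigner

    def __len__(self) -> int:
        return len(self.memory) + len(self.spilled)
```

In this sketch, the full record key is stored in both tiers, so an existence check never depends on a hash alone, and a hot working set larger than the budget costs extra spill lookups rather than extra heap.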
