GitHub user codope added a comment to the discussion: RocksDB as The Replica of
MDT/RLI
This is a very interesting proposal! I had a few questions to understand the
motivation and mechanics of the proposal:
1. What is the motivation behind the proposal? Is it just to cut down cloud
storage I/O and log scan overhead (and hence latency per write request)? I can
see how a local, materialized index helps, especially for streaming writes, but
were there other concerns? Is this just for Flink, or is Spark also being
considered? If you could update the discussion with a short paragraph on
motivation/background, that would be very helpful.
2. On the concurrent upsert consistency discussion above, re-bootstrapping from
MDT on failover is correct for crash recovery. However, what happens in case of
the following scenario with NBCC enabled?
- Writer A starts, bootstraps RocksDB from MDT at instant T1
- Writer B starts, bootstraps RocksDB from MDT at instant T1
- Writer A commits records {k1, k2} --> MDT updated at T2
- Writer B is still running with stale RocksDB (knows nothing about k1, k2)
- Writer B receives an upsert for k1 --> RocksDB says "not found" --> Will it
then tag the record as INSERT instead of UPDATE, leading to a duplicate record
in a different file group?
The question is whether NBCC's existing mechanisms are sufficient, or whether
the stale RocksDB creates scenarios that require additional handling. A related
question: how does the RocksDB replica handle file group replacement from
concurrent compaction/clustering?
3. On the storage overhead, the 2x number is for the initial static size
(compression difference), right? During active writes, especially under heavy
load, it could be higher due to unmerged L0 files and multiple levels of SST
files, so total disk usage could peak at `compression * compaction` overhead.
It would be nice to benchmark the prototype under active writes. Also, on
memory overhead, wouldn't the RocksDB block cache compete with the JVM heap
used by Hudi's own processing?
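As a back-of-envelope illustration of that peak, all numbers below are assumptions for the sake of the arithmetic (not measurements from the prototype):

```python
# Rough peak local disk usage: base RLI size on cloud storage, times the
# "2x" static compression difference from the proposal, times a transient
# space-amplification factor for unmerged L0 files / overlapping SST levels.
rli_size_gb        = 60    # assumption: ~1B records * ~60 bytes/record
compression_factor = 2.0   # the static "2x" number from the proposal
space_amp          = 1.5   # assumed transient compaction overhead

peak_gb = rli_size_gb * compression_factor * space_amp
print(f"peak local disk: ~{peak_gb:.0f} GB")   # ~180 GB
```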
4. Bootstrap latency could be a concern for large tables. IIRC, RLI is about
50-70 bytes per record. Even for a table with 1B records, assuming about
500 MB/s effective cloud read throughput, that's about 2 minutes. Is there a
way to persist the RocksDB state across restarts?
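For reference, the arithmetic behind that estimate (note it works out to ~2 minutes only if the throughput is ~500 MB/s, i.e. bytes rather than bits):

```python
# Bootstrap-latency estimate: 1B records at 50-70 bytes each, read from
# cloud storage at an assumed ~500 MB/s effective throughput.
records = 1_000_000_000
throughput_bytes_per_s = 500e6   # ~500 MB/s (assumption)

for bytes_per_record in (50, 70):
    total_gb = records * bytes_per_record / 1e9
    seconds = records * bytes_per_record / throughput_bytes_per_s
    print(f"{bytes_per_record} B/record -> {total_gb:.0f} GB, ~{seconds / 60:.1f} min")
```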
5. How does this coexist with, or replace, the existing `RecordIndexCache`?
Should updates be buffered in memory and only flushed to RocksDB on commit
success? (This is essentially what `RecordIndexCache` does with
checkpoint-based eviction, right?)
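A minimal sketch of the buffer-then-flush-on-commit pattern the question describes (hypothetical names, with a dict standing in for RocksDB; this is not the actual `RecordIndexCache` API):

```python
# Index updates stage in an in-memory buffer and only reach the durable
# store once the commit succeeds; a failed commit discards the buffer,
# so the durable state never sees uncommitted locations.

class BufferedIndex:
    def __init__(self):
        self.durable = {}   # stands in for RocksDB
        self.pending = {}   # in-memory buffer for the in-flight commit

    def stage(self, key, location):
        self.pending[key] = location

    def on_commit_success(self):
        self.durable.update(self.pending)   # flush, akin to checkpoint-based eviction
        self.pending.clear()

    def on_commit_failure(self):
        self.pending.clear()                # discard; durable state stays clean

idx = BufferedIndex()
idx.stage("k1", "fg-1")
assert "k1" not in idx.durable       # not visible before commit
idx.on_commit_success()
assert idx.durable["k1"] == "fg-1"   # visible only after commit success
```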
6. Looking at the new write flow diagram, the third shuffle by RLI/SI shard ids
is new. Is this shuffle strictly necessary? Could the `IndexWrite Op` be
co-located with the `StreamWrite Op` to avoid one shuffle?
GitHub link:
https://github.com/apache/hudi/discussions/18296#discussioncomment-16152659
----
This is an automatically sent email for [email protected].