GitHub user codope added a comment to the discussion: RocksDB as The Replica of
MDT/RLI
This is a very interesting proposal! I had a few questions to understand the
motivation and mechanics of the proposal:
1. What is the motivation behind the proposal? Is it just to cut down cloud
storage I/O and log scan overhead (and hence latency per write request)? I can
see how a local, materialized index helps, especially for streaming writes, but
were there other concerns? Is this just for Flink, or is Spark also being
considered? If you could update the discussion with a short paragraph on
motivation/background, that would be very helpful.
2. On the concurrent upsert consistency discussion above, re-bootstrapping from
MDT on failover is correct for crash recovery. However, what happens in case of
the following scenario with NBCC enabled?
- Writer A starts, bootstraps RocksDB from MDT at instant T1
- Writer B starts, bootstraps RocksDB from MDT at instant T1
- Writer A commits records {k1, k2} --> MDT updated at T2
- Writer B is still running with stale RocksDB (knows nothing about k1, k2)
- Writer B receives an upsert for k1 --> RocksDB says "not found" --> Will it
then tag the record as INSERT instead of UPDATE, leading to a duplicate record
in a different file group?
The question is whether NBCC's existing mechanisms are sufficient, or whether
the stale RocksDB creates scenarios that require additional handling. A related
question: how does the RocksDB replica handle file group replacement from
concurrent compaction/clustering?
3. On the storage overhead, the 2x number is for the initial static size
(compression difference), right? During active writes, especially under heavy
load, it could be higher due to unmerged L0 files and multiple levels of SST
files, so total disk usage could peak at `compression * compaction` overhead.
It would be nice to benchmark the prototype under active writes. Also, on
memory overhead, wouldn't the RocksDB block cache compete with the JVM heap
used by Hudi's own processing?
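As a back-of-envelope illustration of that peak, all numbers below are assumptions for the sake of the arithmetic (not measurements from the prototype):

```python
# Rough peak local disk usage: base RLI size on cloud storage, times the
# "2x" static compression difference from the proposal, times a transient
# space-amplification factor for unmerged L0 files / overlapping SST levels.
rli_size_gb        = 60    # assumption: ~1B records * ~60 bytes/record
compression_factor = 2.0   # the static "2x" number from the proposal
space_amp          = 1.5   # assumed transient compaction overhead

peak_gb = rli_size_gb * compression_factor * space_amp
print(f"peak local disk: ~{peak_gb:.0f} GB")   # ~180 GB
```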
4. Bootstrap latency could be a concern for large tables. IIRC, RLI is about
50-70 bytes per record. Even for a table with 1B records, assuming about
500 MB/s effective cloud read throughput, that's about 2 minutes. Is there a
way to persist the RocksDB state across restarts?
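For reference, the arithmetic behind that estimate (note it works out to ~2 minutes only if the throughput is ~500 MB/s, i.e. bytes rather than bits):

```python
# Bootstrap-latency estimate: 1B records at 50-70 bytes each, read from
# cloud storage at an assumed ~500 MB/s effective throughput.
records = 1_000_000_000
throughput_bytes_per_s = 500e6   # ~500 MB/s (assumption)

for bytes_per_record in (50, 70):
    total_gb = records * bytes_per_record / 1e9
    seconds = records * bytes_per_record / throughput_bytes_per_s
    print(f"{bytes_per_record} B/record -> {total_gb:.0f} GB, ~{seconds / 60:.1f} min")
```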
5. How does this coexist with, or replace, the existing `RecordIndexCache`?
Should updates be buffered in memory and only flushed to RocksDB on commit
success? (This is essentially what `RecordIndexCache` does with
checkpoint-based eviction, right?)
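A minimal sketch of the buffer-then-flush-on-commit pattern the question describes (hypothetical names, with a dict standing in for RocksDB; this is not the actual `RecordIndexCache` API):

```python
# Index updates stage in an in-memory buffer and only reach the durable
# store once the commit succeeds; a failed commit discards the buffer,
# so the durable state never sees uncommitted locations.

class BufferedIndex:
    def __init__(self):
        self.durable = {}   # stands in for RocksDB
        self.pending = {}   # in-memory buffer for the in-flight commit

    def stage(self, key, location):
        self.pending[key] = location

    def on_commit_success(self):
        self.durable.update(self.pending)   # flush, akin to checkpoint-based eviction
        self.pending.clear()

    def on_commit_failure(self):
        self.pending.clear()                # discard; durable state stays clean

idx = BufferedIndex()
idx.stage("k1", "fg-1")
assert "k1" not in idx.durable       # not visible before commit
idx.on_commit_success()
assert idx.durable["k1"] == "fg-1"   # visible only after commit success
```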
6. Looking at the new write flow diagram, the third shuffle by RLI/SI shard ids
is new. Is this shuffle strictly necessary? Could the `IndexWrite Op` be
co-located with the `StreamWrite Op` to avoid one shuffle?
GitHub link:
https://github.com/apache/hudi/discussions/18296#discussioncomment-16152659
----
This is an automatically sent email for [email protected].