GitHub user danny0405 edited a comment on the discussion: RocksDB as The 
Replica of MDT/RLI

I'm assuming you already read the [Flink RLI 
RFC](https://github.com/apache/hudi/pull/17610).

1. We did some local benchmarking with the MDT (plus a hotspot cache in memory) as the index backend, and it turned out to average ~50ms of access latency per record query, which does not scale well for streaming. A local fast-lookup replica is therefore necessary, and we have two solutions on the table:

    1. use a RocksDB replica as a mirror image of the MDT, and always replicate the index payloads from the MDT to the local RocksDB to ensure performance;
    2. still use the MDT, but introduce more caches: a data block cache, local index metadata, a bloom filter cache, a local file cache on SSD (secondary cache), and a fallback to remote MDT queries.

The second option would take a lot of effort, so we deem it a long-term solution; for now, we prefer option 1.
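As a rough illustration of option 1 (all names here are hypothetical, and in-memory maps stand in for RocksDB and the remote MDT), the lookup path would check the local replica first and only fall back to the remote MDT on a miss:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of option 1: a local fast-lookup replica that mirrors the MDT.
// HashMap stands in for RocksDB; in the real design the replica is kept
// in sync by replicating index payloads from the MDT.
class LocalIndexReplica {
    private final Map<String, String> local = new HashMap<>();   // stand-in for local RocksDB
    private final Map<String, String> remoteMdt;                 // stand-in for the remote MDT

    LocalIndexReplica(Map<String, String> remoteMdt) {
        this.remoteMdt = remoteMdt;
    }

    // Replication path: mirror an index payload from the MDT into the replica.
    void replicate(String recordKey, String location) {
        local.put(recordKey, location);
    }

    // Lookup path: a local hit avoids the ~50ms remote MDT round trip;
    // a miss falls back to the MDT and back-fills the replica.
    Optional<String> lookup(String recordKey) {
        String hit = local.get(recordKey);
        if (hit != null) {
            return Optional.of(hit);
        }
        String remote = remoteMdt.get(recordKey);
        if (remote != null) {
            local.put(recordKey, remote);  // back-fill for subsequent lookups
        }
        return Optional.ofNullable(remote);
    }
}
```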

2. NBCC only works with the simple bucket index, so it is not very related here.

3. The MDT does not clean legacy files immediately on each compaction, so the estimation still makes sense; it depends on the configured compaction frequency (both the cleaning and compaction strategies are adjustable). The RocksDB block cache utilizes off-heap memory, I think. But yes, there could be resource contention, just like with the Flink-state index, which has already proven good performance in production. And we will definitely run a lot more benchmarks for long-running jobs; that is a collaboration with the Uber team.
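A back-of-envelope sizing sketch for the local replica, with every number hypothetical; the real footprint depends on the configured cleaning/compaction strategy, since uncleaned legacy files inflate it between compactions:

```java
// Hypothetical back-of-envelope sizing for the local index replica.
long records = 100_000_000L;       // records tracked in the index (assumed)
long bytesPerEntry = 64L;          // key + file-group location payload (assumed)
double legacyOverhead = 1.5;       // uncleaned legacy files between compactions (assumed)

long estimatedBytes = (long) (records * bytesPerEntry * legacyOverhead);
System.out.println(estimatedBytes / (1024 * 1024 * 1024) + " GiB");  // prints "8 GiB" under these assumptions
```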

4. The bootstrap operator parallelism is scalable; the bottleneck might be the full-scan load time of a single MDT file group. As long as that time is less than the checkpoint timeout, it works well, and the bootstrap only happens once per job restart. Why not persist RocksDB instances across job recoveries: a) we don't have a good way to maintain consistency between RocksDB and the MDT, given the complexities of DT and MDT consistency; b) the local RocksDB storage lives inside the local container, which is released once the task is killed (and we cannot use remote storage, which would kill the performance).
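A sketch of the bootstrap flow described above (hypothetical names; maps stand in for RocksDB and the MDT file groups). Each subtask scans only its assigned MDT file groups, so total bootstrap time is bounded by the slowest single full scan, and the whole step simply re-runs on restart since the local store dies with the container:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Bootstrap sketch: load the record-key -> location index from assigned
// MDT file groups into the local store. Parallelism is per file group;
// the bottleneck is the full-scan time of a single file group.
class IndexBootstrap {
    // One MDT file group, modeled as a batch of (key, location) entries.
    record FileGroup(List<Map.Entry<String, String>> entries) {}

    static Map<String, String> bootstrap(List<FileGroup> assignedFileGroups) {
        Map<String, String> local = new HashMap<>(); // stand-in for RocksDB
        for (FileGroup fg : assignedFileGroups) {
            // Full scan of one file group; in Flink each subtask only
            // receives the file groups assigned to it.
            for (Map.Entry<String, String> e : fg.entries()) {
                local.put(e.getKey(), e.getValue());
            }
        }
        return local;
    }
}
```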

5. RocksDB is updated per record in the `BucketAssign` operator, and RocksDB itself has a memtable to manage the buffering and flushing of the index records. Each time a task fails over, the bootstrap is retriggered to ensure integrity.
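To make the per-record write path concrete, here is a minimal memtable-style sketch (hypothetical; RocksDB itself provides this buffering and flushing, so this is only an analogy with plain maps):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-record index updates with a memtable-like write buffer,
// analogous to what RocksDB does internally for the BucketAssign operator.
class BufferedIndexWriter {
    private final Map<String, String> memtable = new HashMap<>(); // in-memory write buffer
    private final Map<String, String> sstable = new HashMap<>();  // stand-in for flushed files
    private final int flushThreshold;

    BufferedIndexWriter(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Called once per record as locations are assigned.
    void put(String recordKey, String location) {
        memtable.put(recordKey, location);
        if (memtable.size() >= flushThreshold) {
            flush();  // RocksDB flushes the memtable automatically when full
        }
    }

    void flush() {
        sstable.putAll(memtable);
        memtable.clear();
    }

    // Reads check the buffer first, then the flushed data.
    String get(String recordKey) {
        String v = memtable.get(recordKey);
        return v != null ? v : sstable.get(recordKey);
    }
}
```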

6. We want writes to scale independently, and we don't want a single write task writing to every file group, so as to limit small files.





GitHub link: 
https://github.com/apache/hudi/discussions/18296#discussioncomment-16154189
