sanastas commented on issue #5698: Oak: New Concurrent Key-Value Map URL: https://github.com/apache/incubator-druid/issues/5698#issuecomment-503995370 @jihoonson please take a look on the following proposal draft! PROPOSAL ### Motivation The following are some imperfections of current Druid IncrementalIndex: 1. No possibility to take the Incremental Index data off-heap. Off-heap memory allows working with bigger memory chunks in RAM without additional performance degradation. 2. Big on-heap memory foot-print. A lot of objects vs serialization. 3. Frequent and slow segment merges. 4. Low multi-thread scalability of queries ### Proposed changes Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time analytics, which has all its keys and values kept in the off-heap memory. Oak is faster and scales better with the number of CPU cores than popular NavigableMap implementations, e.g., Doug Lea’s ConcurrentSkipListMap (Java’s default). We suggest to integrate Oak-based Incremental Index as an alternative to currently existing Druid’s Incremental Index. Because Oak is naturally built for off-heap memory allocation, allows usage of much bigger RAM, has greater concurrency support, and should show better performance results. ### Rationale A discussion of why this particular solution is the best one. One good way to approach this is to discuss other alternative solutions that you considered and decided against. This should also include a discussion of any specific benefits or drawbacks you are aware of. Let’s follow the initial motivation points: 1. Taking data off-heap allows: - Working with bigger data without paying the cost of JVM GC slowdown - Oak allows writing the data off-heap (even with additional copy to off-heap) while increasing the ingestion speed significantly 2. When working with IncrementalIndex Druid benchmarks and ingesting (for example) 2GB of data, we have encountered that we need to give 10GB more (!) in order to see proper results. Please see the IncrementalIngestionBenchmark results below. OakMap serialize the keys and the values (no object overheads). OakMap metadata is modest and it mostly comes as arrays (again no object overheads). Additionally, OakMap can estimate its off-heap usage precisely. 3. Segment merge in Druid is quite expensive. Segments are merged after (not big) IncrementalIndexes are flushed to disk and then need to be combined. Oak allows to keep more data in-memory giving the same (and sometimes much better) performance. Therefore less flushes to disk (less files requiring merge) may appear. As a future work to OakMaps can be merged while still in-memory. 4. Experimenting with IncrementalIndexReads are still working in progress. However, comparing the multithreading scalability of OakMap with ConcurrentSkipListMap we see that OakMap scales better. So we believe to see that scalability also in Oak-based IncrementalIndex. ### Operational impact The suggested OakIncrementalIndex should live alongside the current solution, giving the ability to switch to OakIncrementalIndex when preferable. As OakIncrementalIndex doesn’t affect the on disk layout, we do not foresee any specific operational impact. ### Test plan (optional) The plan is to add a specific unit-test for OakIncrementalIndex and to pass all existing tests for cluster based on OakIncrementalIndex. ### Future work (optional) As StringIndexer also takes a significant part of IncrementalIndex we might take also this map off-heap. ### IncrementalIngestionBenchmark results In attached files please see the results of IncrementalIngestionBenchmark when 2 million vs 3 million of rows are inserted. It is done with native Druid’s incremental index and with newly suggested OakIncrementalIndex (data taken off-heap). Druid’s incremental index always gets 12GB on heap memory. OakIncrementalIndex always gets 8GB on-heap and 4GB off-heap (in total 12GB RAM). Please note that the rows come prepared before the benchmark measurement and actually already take big chunk of on-heap memory. The graphs show throughput (number of operations in seconds) so the bigger the better. In all results OakIncrementalIndex show better performance. When moving from 2M rows to 3M rows (using same memory limits) Oak shows about 10% performance degradation while IncrementalIndex shows about 60% degradation. [Ingestion Throughput 2M rows (2.5GB data) ingested.pdf](https://github.com/apache/incubator-druid/files/3310244/Ingestion.Throughput.2M.rows.2.5GB.data.ingested.pdf) [Ingestion Throughput 3M rows (3.8GB data) ingested.pdf](https://github.com/apache/incubator-druid/files/3310245/Ingestion.Throughput.3M.rows.3.8GB.data.ingested.pdf)
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
