[GitHub] [incubator-druid] sanastas commented on issue #5698: Oak: New Concurrent Key-Value Map

GitBox Thu, 20 Jun 2019 04:52:23 -0700

sanastas commented on issue #5698: Oak: New Concurrent Key-Value Map
URL:
https://github.com/apache/incubator-druid/issues/5698#issuecomment-503995370

@jihoonson please take a look at the following proposal draft!

PROPOSAL

### Motivation

The current Druid IncrementalIndex has the following shortcomings:
1. The IncrementalIndex data cannot be taken off-heap. Off-heap
memory allows working with bigger memory chunks in RAM without additional
performance degradation.
2. Large on-heap memory footprint: data is held as many objects rather than
in serialized form.
3. Frequent and slow segment merges.
4. Low multi-thread scalability of queries.

### Proposed changes

Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time
analytics that keeps all its keys and values in off-heap memory. Oak
is faster and scales better with the number of CPU cores than popular
NavigableMap implementations, e.g., Doug Lea’s ConcurrentSkipListMap (the
JDK’s standard concurrent NavigableMap).
We suggest integrating an Oak-based Incremental Index as an alternative to
Druid’s existing Incremental Index, because Oak is naturally built
for off-heap memory allocation, allows the use of much more RAM, has greater
concurrency support, and should show better performance results.
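
To make the off-heap idea concrete, here is a minimal sketch of storing serialized values in a direct (off-heap) buffer and tracking exactly how many off-heap bytes are in use. It is not Oak's API or implementation: the class `OffHeapValueStoreSketch` and its methods are hypothetical, keys stay on-heap for brevity (Oak keeps keys off-heap as well), and coarse `synchronized` methods stand in for Oak's own concurrency scheme.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration only; this is NOT Oak's API or implementation.
public class OffHeapValueStoreSketch {
    // One pre-allocated off-heap block; values are appended as [length][bytes].
    private final ByteBuffer arena;
    // Simplification: the ordered index (key -> offset in the arena) stays on-heap.
    private final ConcurrentSkipListMap<Long, Integer> index = new ConcurrentSkipListMap<>();

    public OffHeapValueStoreSketch(int capacityBytes) {
        this.arena = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Serializes the value and appends it to the arena.
    // Throws java.nio.BufferOverflowException if the arena is full.
    public synchronized void put(long key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int offset = arena.position();
        arena.putInt(bytes.length);
        arena.put(bytes);
        index.put(key, offset);
    }

    // Deserializes the value for the key, or returns null if the key is absent.
    public synchronized String get(long key) {
        Integer offset = index.get(key);
        if (offset == null) {
            return null;
        }
        // A duplicate shares the memory but has its own position.
        ByteBuffer view = arena.duplicate();
        view.position(offset);
        byte[] bytes = new byte[view.getInt()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Exact number of off-heap bytes consumed so far.
    public synchronized int usedBytes() {
        return arena.position();
    }
}
```

Because each value lives in the buffer as raw bytes, there is no per-value Java object for the garbage collector to trace, and the used space is simply the write position of the buffer.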

### Rationale


Let’s follow the initial motivation points:

1. Taking data off-heap has two benefits:
- It allows working with bigger data without paying the cost of JVM GC slowdown
- Oak allows writing the data off-heap (even with an additional copy to
off-heap) while still increasing the ingestion speed significantly

2. When running the Druid IncrementalIndex benchmarks and ingesting (for
example) 2GB of data, we found that we needed to give 10GB more memory (!) in
order to see proper results. Please see the IncrementalIngestionBenchmark
results below. OakMap serializes the keys and the values (no object overhead).
OakMap metadata is modest and is mostly kept in arrays (again, no object
overhead). Additionally, OakMap can estimate its off-heap usage precisely.
3. Segment merge in Druid is quite expensive. Segments are merged after
(relatively small) IncrementalIndexes are flushed to disk and then need to be
combined. Oak allows keeping more data in memory with the same (and sometimes
much better) performance. Therefore, fewer flushes to disk (and fewer files
requiring a merge) may occur. As future work, OakMaps could be merged while
still in memory.
4. Experiments with IncrementalIndex reads are still in progress.
However, comparing the multithreading scalability of OakMap with
ConcurrentSkipListMap, we see that OakMap scales better, so we expect to see
the same scalability in an Oak-based IncrementalIndex.

### Operational impact

The suggested OakIncrementalIndex should live alongside the current
solution, giving the ability to switch to OakIncrementalIndex when preferable.
As OakIncrementalIndex doesn’t affect the on-disk layout, we do not foresee any
specific operational impact.
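
One way the two indexes could live side by side is behind a common interface, with the implementation chosen by a configuration value. The sketch below only illustrates that pattern under stated assumptions: `RowIndex`, `OnHeapRowIndex`, `OakBackedRowIndex`, and the `"onheap"`/`"oak"` type strings are all hypothetical and are not Druid or Oak names, and the Oak-backed class is a placeholder that does not use Oak.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: none of these names are Druid or Oak APIs.
public class IndexSelectionSketch {

    /** Minimal stand-in for what an ingestion path needs from an index. */
    public interface RowIndex {
        void add(long timestamp, String row);
        int size();
    }

    /** Stand-in for the existing on-heap index. */
    public static final class OnHeapRowIndex implements RowIndex {
        private final ConcurrentSkipListMap<Long, String> rows = new ConcurrentSkipListMap<>();
        @Override public void add(long timestamp, String row) { rows.put(timestamp, row); }
        @Override public int size() { return rows.size(); }
    }

    /** Placeholder for an Oak-backed index; a real one would delegate to an OakMap. */
    public static final class OakBackedRowIndex implements RowIndex {
        private final ConcurrentSkipListMap<Long, String> rows = new ConcurrentSkipListMap<>();
        @Override public void add(long timestamp, String row) { rows.put(timestamp, row); }
        @Override public int size() { return rows.size(); }
    }

    /** Picks an implementation from a (hypothetical) configuration string. */
    public static RowIndex create(String indexType) {
        switch (indexType) {
            case "onheap":
                return new OnHeapRowIndex();
            case "oak":
                return new OakBackedRowIndex();
            default:
                throw new IllegalArgumentException("Unknown index type: " + indexType);
        }
    }
}
```

Callers depend only on the interface, so switching between implementations is a configuration change rather than a code change, and the on-disk format is untouched.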

### Test plan (optional)

The plan is to add a dedicated unit test for OakIncrementalIndex and to pass
all existing tests on a cluster based on OakIncrementalIndex.

### Future work (optional)

As StringIndexer also takes up a significant part of IncrementalIndex, we might
also take this map off-heap.

### IncrementalIngestionBenchmark results

The attached files show the results of IncrementalIngestionBenchmark
when 2 million vs. 3 million rows are inserted. The benchmark was run with
Druid’s native incremental index and with the newly suggested
OakIncrementalIndex (data taken off-heap). Druid’s incremental index always
gets 12GB of on-heap memory. OakIncrementalIndex always gets 8GB on-heap and
4GB off-heap (12GB of RAM in total). Please note that the rows are prepared
before the benchmark measurement and already take up a big chunk of on-heap
memory. The graphs show throughput (number of operations per second), so
bigger is better. In all results OakIncrementalIndex shows better performance.
When moving from 2M rows to 3M rows (using the same memory limits), Oak shows
about 10% performance degradation while IncrementalIndex shows about 60%
degradation.

[Ingestion Throughput 2M rows (2.5GB data) ingested.pdf](https://github.com/apache/incubator-druid/files/3310244/Ingestion.Throughput.2M.rows.2.5GB.data.ingested.pdf)
[Ingestion Throughput 3M rows (3.8GB data) ingested.pdf](https://github.com/apache/incubator-druid/files/3310245/Ingestion.Throughput.3M.rows.3.8GB.data.ingested.pdf)


