Hi Druid devs,
I am testing Druid for our specific count-distinct estimation use case. The data
was ingested via the Hadoop indexer.
Simplified, it has the following schema:
timestamp    key    country    theta-sketch<id>    event-counter
So there are two dimensions, one counter metric, and one theta sketch metric.
Data granularity is DAY.
The data source is 150-200 GB per day in deep storage.
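For reference, the metricsSpec was roughly as follows (a hedged reconstruction;
the names here are illustrative, not our real column names):

  "metricsSpec": [
    { "type": "longSum",     "name": "event-counter",   "fieldName": "events" },
    { "type": "thetaSketch", "name": "theta-sketch-id", "fieldName": "id" }
  ]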

I was doing some test runs with our small test cluster (4 Historical nodes, each
with 8 CPUs, 64 GB RAM, 500 GB SSD). I admit that with this RAM-to-SSD ratio and
number of nodes it is not going to be fast. The question, though, is about theta
sketch performance compared to counter aggregation. The difference is an order
of magnitude. E.g., a GroupBy query for a single key, aggregated over 7 days:
event-counters - 30 seconds.
theta-sketches - 7 minutes.
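The query was roughly of the following shape (a hedged reconstruction; the
dataSource, interval, and filter values are placeholders):

{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2018-01-01/2018-01-08"],
  "dimensions": ["country"],
  "filter": { "type": "selector", "dimension": "key", "value": "<some key>" },
  "aggregations": [
    { "type": "longSum",     "name": "events",  "fieldName": "event-counter" },
    { "type": "thetaSketch", "name": "uniques", "fieldName": "theta-sketch-id" }
  ]
}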

Theta sketch aggregation implies more work than summing up numbers, of course.
But the Theta Sketch documentation says that the union operation is very fast.
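For reference, here is what a plain union looks like with the library API alone
(a minimal standalone example, no Druid involved):

import com.yahoo.sketches.theta.Sketch;
import com.yahoo.sketches.theta.Sketches;
import com.yahoo.sketches.theta.Union;
import com.yahoo.sketches.theta.UpdateSketch;

public class UnionDemo {
  public static void main(String[] args) {
    UpdateSketch s1 = UpdateSketch.builder().build();
    UpdateSketch s2 = UpdateSketch.builder().build();
    for (long i = 0; i < 1000; i++) {
      s1.update(i);        // ids 0..999
      s2.update(i + 500);  // ids 500..1499, overlapping
    }
    // The union itself is essentially a merge of two small hash tables.
    Union union = Sketches.setOperationBuilder().buildUnion();
    union.update(s1);
    union.update(s2);
    Sketch result = union.getResult();
    System.out.println(result.getEstimate()); // ~1500
  }
}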

I did some profiling of one of the Historical nodes. Most of the CPU time is
spent in
io.druid.query.aggregation.datasketches.theta.SketchObjectStrategy.fromByteBuffer(ByteBuffer, int),
which I think moves Sketch objects from off-heap memory to the managed heap.
To be precise, the time is spent in the sketch library methods
com.yahoo.memory.WritableMemoryImpl.region
com.yahoo.memory.Memory.wrap
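My reading of this (an assumption based on the profile, not the actual Druid
code) is that something like the following happens for every single row:

import java.nio.ByteBuffer;
import com.yahoo.memory.Memory;
import com.yahoo.sketches.theta.Sketch;

class PerRowRead {
  // Hypothetical helper mirroring what I think fromByteBuffer does;
  // position and sketchSize stand in for the column reader's state.
  static Sketch readSketch(ByteBuffer byteBuffer, long position, long sketchSize) {
    Memory mem = Memory.wrap(byteBuffer);             // wrap the column's ByteBuffer
    Memory region = mem.region(position, sketchSize); // a new Memory object per sketch
    return Sketch.wrap(region);                       // wraps without copying, but doing
                                                      // this once per row adds up
  }
}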

I do not think anything is wrong with this code, except for why it is called so
many times.
Which leads to the main question: I do not really understand how a theta sketch
is stored in a columnar database. Assuming it is stored the same way as a
counter, that means that for every combination of "key" and "country" (the
dimensions from above) there is a theta sketch structure to be stored. In our
case the cardinality of "key" is quite high, hence so many Sketch structure
accesses in Java. That looks extremely inefficient. Again, this is just an
assumption; please excuse me if I am wrong here.

If you continue thinking in this direction, in terms of performance it would
make sense to store one theta sketch per dimension value, so that instead of
cardinality(key) * cardinality(country) entries there would be
cardinality(key) + cardinality(country) sketches. In that case it looks like an
index rather than a part of the columnar storage itself.
Queries over the two dimensions are easy, as there is only one INTERSECTION to
be done. It all looks like a natural thing to do for sketches, as there would be
a win in both storage and query performance.
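For example, a (key, country) query against such an index would reduce to
something like this (sketchForKey and sketchForCountry are hypothetical
per-value sketches from the index described above):

import com.yahoo.sketches.theta.Intersection;
import com.yahoo.sketches.theta.Sketch;
import com.yahoo.sketches.theta.Sketches;

class TwoDimensionQuery {
  // One intersection instead of merging one sketch per matching row.
  static double estimateDistinctIds(Sketch sketchForKey, Sketch sketchForCountry) {
    Intersection intersection = Sketches.setOperationBuilder().buildIntersection();
    intersection.update(sketchForKey);     // ids seen for this key value
    intersection.update(sketchForCountry); // ids seen for this country value
    return intersection.getResult().getEstimate();
  }
}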
My question is whether I am right or wrong in my assumptions. If my
understanding is not correct and sketches are already stored in an optimal way,
could someone give advice on speeding up computations on a single Historical
node? Otherwise, I wanted to ask whether there has been any attempt or
discussion to use sketches in the way I described.
Thanks in advance.
