[GitHub] leerho edited a comment on issue #6743: IncrementalIndex generally overestimates theta sketch size

GitBox Fri, 21 Dec 2018 12:34:44 -0800

leerho edited a comment on issue #6743: IncrementalIndex generally
overestimates theta sketch size
URL:
https://github.com/apache/incubator-druid/issues/6743#issuecomment-449490568

@gianm

> The granddaddy of Druid sketches, hyperUnique, doesn't have a very high
max memory footprint, so it was less of an issue back then.

I think you are referring to the [HyperLogLog (HLL) sketch
implementation](https://github.com/apache/incubator-druid/blob/master/hll/src/main/java/org/apache/druid/hll/HyperLogLogCollector.java)
that was developed for Druid some years ago. There are several things you
need to understand about that implementation.
- The implementation is flawed and has been from the beginning. It can
produce errors that can be 7 times worse than they would be had it been
implemented correctly. These errors show up during merge operations
and are documented in [characterization
studies](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)
we performed this past year. I strongly advise that this Druid HLL sketch be
deprecated and removed from Druid. The unsuspecting user will have no warning
whatsoever when this sketch is producing incorrect results.
- The HLL implementation in the DataSketches library is not only much more
accurate, but it is faster as well and is approximately the same size. We can
work with you to create a PR that would possibly redirect references to the old
Druid HLL sketch to our implementation that fixes these problems.
- HLL sketches are small in size because of the nature of the underlying HLL
algorithm. If users were to choose the DataSketches HLL implementation it also
would be small in size. However, the HLL algorithm is not designed to allow
Intersections. If the use-case for counting uniques only requires merging and
does not require Intersection operations then the HLL algorithm is a reasonable
choice (although there is now an algorithm that is superior to HLL in terms of
accuracy per space, and we can provide that as an option to your users as
well.) The Theta Sketch from the DS library is a larger sketch (and the one
that I was modeling earlier in this thread), but it is designed from the outset
to enable set Intersection and set difference operations. Because set
expressions are so powerful from an analysis point-of-view, many users choose
the Theta Sketch in spite of its larger size. This has also been our
experience at Yahoo.
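
To make the merge-versus-intersection distinction concrete, here is a minimal toy sketch in the spirit of a theta (K-minimum-values) sketch. It is illustrative only and is not the DataSketches API: the class `ToyThetaSketch`, its methods, and the hash mixer are all assumptions made for this example. The point is that a sketch which retains actual hash values can find the hashes two sketches have in common, whereas HLL registers only keep per-bucket maxima and have nothing to intersect.

```java
import java.util.TreeSet;

/** Toy theta-style sketch (hypothetical, for illustration; not the DataSketches API). */
final class ToyThetaSketch {
    private final int k;                     // nominal number of retained hashes
    private long theta = Long.MAX_VALUE;     // every retained hash is < theta
    private final TreeSet<Long> hashes = new TreeSet<>();

    ToyThetaSketch(int k) { this.k = k; }

    /** SplitMix64-style mixer, shifted into the non-negative long range. */
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return (x ^ (x >>> 31)) >>> 1;
    }

    void update(long item) { insertHash(hash(item)); }

    private void insertHash(long h) {
        if (h >= theta) return;              // outside the sampled region
        hashes.add(h);
        if (hashes.size() > k) {
            theta = hashes.pollLast();       // drop the largest hash; it becomes the new threshold
        }
    }

    /** Retained count divided by the sampled fraction of the hash space. */
    double estimate() {
        return hashes.size() / ((double) theta / Long.MAX_VALUE);
    }

    /** Merge: keep hashes from either side that fall below the smaller theta. */
    static ToyThetaSketch union(ToyThetaSketch a, ToyThetaSketch b) {
        ToyThetaSketch r = new ToyThetaSketch(Math.min(a.k, b.k));
        r.theta = Math.min(a.theta, b.theta);
        for (long h : a.hashes) r.insertHash(h);
        for (long h : b.hashes) r.insertHash(h);
        return r;
    }

    /** Intersection: keep hashes present on both sides and below the smaller theta. */
    static ToyThetaSketch intersect(ToyThetaSketch a, ToyThetaSketch b) {
        ToyThetaSketch r = new ToyThetaSketch(Math.min(a.k, b.k));
        r.theta = Math.min(a.theta, b.theta);
        for (long h : a.hashes) {
            if (h < r.theta && b.hashes.contains(h)) r.hashes.add(h);
        }
        return r;
    }

    public static void main(String[] args) {
        ToyThetaSketch a = new ToyThetaSketch(1024);
        ToyThetaSketch b = new ToyThetaSketch(1024);
        for (long i = 0; i < 100; i++) a.update(i);
        for (long i = 50; i < 150; i++) b.update(i);
        System.out.println("union estimate:        " + union(a, b).estimate());
        System.out.println("intersection estimate: " + intersect(a, b).estimate());
    }
}
```

The retained hash set is also why such a sketch is larger than an HLL sketch: it stores up to `k` full hash values rather than a few bits per register.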
***
I'm not sure I follow your proposal and may be missing something.
Nonetheless, one of the challenges of just carving out a chunk of the
processing buffer to be managed by the individual BufferAggregators is that
truly gaining back the "unused memory" would require memory management
sophistication similar to the design of a malloc(), free() and a garbage
collector, which is non-trivial. Imagine each sketch with an internal
hash-table. As individual sketches grow they need to obtain a larger chunk of
memory for a bigger hash-table, move their data into the larger chunk and free
the previous chunk. If those freed chunks don't get reused, then we will not
achieve the optimum memory savings. Allowing the sketches to allocate and free
memory directly from the operating system allows us to leverage the already
existing and very sophisticated malloc() and free() of the underlying
C-libraries and the OS itself.
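
A minimal sketch of the grow-move-free cycle described above, assuming a fixed buffer with a bump pointer and exact-size free lists. `ToyChunkPool` and its methods are hypothetical names for this example and are not Druid or DataSketches code. It shows that the buffer's high-water mark only stops climbing when a freed chunk happens to match a later request, which is the reuse problem a real allocator solves with far more machinery (splitting, coalescing, size classes).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Toy allocator over a fixed-size buffer (hypothetical, for illustration only). */
final class ToyChunkPool {
    private final int capacity;
    private int bumpOffset = 0;   // first byte of the buffer that has never been handed out
    private final Map<Integer, ArrayDeque<Integer>> freeBySize = new HashMap<>();

    ToyChunkPool(int capacity) { this.capacity = capacity; }

    /** Returns the offset of a chunk of exactly {@code size} bytes, or -1 if none fits. */
    int allocate(int size) {
        ArrayDeque<Integer> free = freeBySize.get(size);
        if (free != null && !free.isEmpty()) {
            return free.pop();                       // reuse a freed chunk of the same size
        }
        if (bumpOffset + size > capacity) return -1; // no coalescing, so this can fail early
        int offset = bumpOffset;
        bumpOffset += size;
        return offset;
    }

    void free(int offset, int size) {
        freeBySize.computeIfAbsent(size, s -> new ArrayDeque<>()).push(offset);
    }

    /** A sketch outgrows its hash table: take a chunk twice as large, release the old one. */
    int grow(int oldOffset, int oldSize) {
        int newOffset = allocate(oldSize * 2);
        if (newOffset >= 0) {
            // A real implementation would copy the table contents here.
            free(oldOffset, oldSize);
        }
        return newOffset;
    }

    int highWaterMark() { return bumpOffset; }

    public static void main(String[] args) {
        ToyChunkPool pool = new ToyChunkPool(1024);
        int a = pool.allocate(64);
        a = pool.grow(a, 64);
        a = pool.grow(a, 128);
        System.out.println("after sketch A grows to 256 bytes, high-water mark = " + pool.highWaterMark());
        int b = pool.allocate(64);   // lands in the chunk A released
        System.out.println("sketch B placed at offset " + b + ", high-water mark = " + pool.highWaterMark());
    }
}
```

In this toy, a single sketch growing from 64 to 256 bytes consumes 448 bytes of buffer while only 256 are live; the other 192 are recovered only if another sketch later asks for exactly 64 or 128 bytes.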

BTW, there is already a mechanism in the JVM to track usage of off-heap
memory used by DirectByteBuffers. We use this same mechanism to track
allocations we make against off-heap memory even though we don't use
DirectByteBuffers. So the JVM knows and tracks this usage already. Druid
could also monitor these same JVM counters if it wanted or needed to.
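
As a rough illustration of how such counters could be read, the standard `java.lang.management.BufferPoolMXBean` exposes the JVM's "direct" buffer pool statistics. The class name `DirectPoolProbe` is an assumption for this example, and the sketch only demonstrates the counters moving for an ordinary direct `ByteBuffer`; whether a particular library's off-heap allocations are reported through this pool depends on how that library reserves memory.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Reads the JVM's "direct" buffer pool counters (class name is hypothetical). */
final class DirectPoolProbe {

    /** Returns the bytes the JVM reports as used by the direct pool, or -1 if unavailable. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB of off-heap memory
        long after = directMemoryUsed();
        System.out.println("direct pool before: " + before + " bytes");
        System.out.println("direct pool after:  " + after + " bytes");
        System.out.println("buffer capacity:    " + buf.capacity() + " bytes");
    }
}
```

A monitor in Druid could poll the same bean periodically rather than trying to account for each sketch's allocation itself.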



