leerho edited a comment on issue #6743: IncrementalIndex generally 
overestimates theta sketch size
URL: 
https://github.com/apache/incubator-druid/issues/6743#issuecomment-449490568
 
 
   @gianm 
   
   > The granddaddy of Druid sketches, hyperUnique, doesn't have a very high 
max memory footprint, so it was less of an issue back then.
   
   I think you are referring to the [HyperLogLog (HLL) sketch 
implementation](https://github.com/apache/incubator-druid/blob/master/hll/src/main/java/org/apache/druid/hll/HyperLogLogCollector.java)
 that was developed for Druid some years ago.   There are several things you 
need to understand about that implementation.
   - The implementation is flawed and has been from the beginning.  It can 
produce errors that can be 7 times larger than they would be had it been 
implemented correctly.  These errors show up during merge operations and are 
documented in [characterization 
studies](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)
 we performed this past year.  I strongly advise that this Druid HLL sketch be 
deprecated and removed from Druid.  The unsuspecting user will have no warning 
whatsoever when this sketch is producing incorrect results. 
   - The HLL implementation in the DataSketches library is not only much more 
accurate, but it is faster as well and is approximately the same size.  We can 
work with you to create a PR that would possibly redirect references to the old 
Druid HLL sketch to our implementation that fixes these problems.
   - HLL sketches are small in size because of the nature of the underlying HLL 
algorithm.  If users were to choose the DataSketches HLL implementation it also 
would be small in size.  However, the HLL algorithm is not designed to allow 
Intersections.  If the use-case for counting uniques only requires merging and 
does not require Intersection operations then the HLL algorithm is a reasonable 
choice (although there is now an algorithm that is superior to HLL in terms of 
accuracy per space, and we can provide that as an option to your users as 
well.)  The Theta Sketch from the DS library is a larger sketch (and the one 
that I was modeling earlier in this thread), but it is designed from the outset 
to enable set Intersection and set difference operations.  Because set 
expressions are so powerful from an analysis point-of-view, many users choose 
the Theta Sketch in spite of its larger size.  This has also been our 
experience at Yahoo.
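   
   To make the merge-versus-intersection point concrete, here is a toy 
"k minimum values" sketch in Python.  It is only an illustration of the idea 
behind theta-style sketches, not the DataSketches implementation; the names 
`ToyThetaSketch`, `K` and `hash01` are made up for this example.

```python
import hashlib

K = 2048  # hypothetical nominal number of retained hashes


def hash01(item):
    """Deterministically map an item to a float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


class ToyThetaSketch:
    """Keeps every hash below a threshold theta; estimate = retained / theta."""

    def __init__(self, theta, hashes):
        self.theta = theta
        self.hashes = set(hashes)

    @classmethod
    def of(cls, items, k=K):
        hs = sorted({hash01(x) for x in items})
        if len(hs) <= k:
            return cls(1.0, hs)      # nothing discarded yet: the count is exact
        return cls(hs[k], hs[:k])    # theta is the (k+1)-th smallest hash

    def estimate(self):
        return len(self.hashes) / self.theta

    def union(self, other, k=K):
        theta = min(self.theta, other.theta)
        hs = sorted(x for x in self.hashes | other.hashes if x < theta)
        if len(hs) > k:              # trim back to k entries and lower theta
            theta, hs = hs[k], hs[:k]
        return ToyThetaSketch(theta, hs)

    def intersect(self, other):
        theta = min(self.theta, other.theta)
        common = self.hashes & other.hashes
        return ToyThetaSketch(theta, (x for x in common if x < theta))

    def a_not_b(self, other):
        theta = min(self.theta, other.theta)
        only_a = self.hashes - other.hashes
        return ToyThetaSketch(theta, (x for x in only_a if x < theta))


if __name__ == "__main__":
    a = ToyThetaSketch.of(range(0, 20000))
    b = ToyThetaSketch.of(range(10000, 30000))
    print(round(a.union(b).estimate()))      # true union size is 30000
    print(round(a.intersect(b).estimate()))  # true intersection size is 10000
    print(round(a.a_not_b(b).estimate()))    # true difference size is 10000
```

   Because the retained hashes are a uniform sample of the input below theta, 
intersecting or subtracting the retained sets still yields a valid sample of 
the result set.  HLL registers keep only per-bucket maxima, which is why that 
representation supports merging but not these operations.  The price is that 
the toy sketch stores up to `K` full hash values.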
   ***
   I'm not sure I follow your proposal and may be missing something. 
Nonetheless, one of the challenges of just carving out a chunk of the 
processing buffer to be managed by the individual BufferAggregators is that 
truly gaining back the "unused memory" would require memory management 
sophistication similar to the design of a malloc(), free() and a garbage 
collector, which is non-trivial.  Imagine each sketch with an internal 
hash-table. As individual sketches grow they need to obtain a larger chunk of 
memory for a bigger hash-table, move their data into the larger chunk and free 
the previous chunk.  If those freed chunks don't get reused, then we will not 
achieve the optimum memory savings.  Allowing the sketches to allocate and free 
memory directly from the operating system allows us to leverage the already 
existing and very sophisticated malloc() and free() of the underlying 
C-libraries and the OS itself.  
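   
   A rough way to see the cost of freed chunks that never get reused is the 
toy simulation below.  It is only a sketch under simplifying assumptions: 
hash tables double from `start` to `final` units, sketches grow one after 
another, and a freed chunk can only be reused for a request of exactly the 
same size.  The function `buffer_consumed` and its parameters are 
hypothetical and do not correspond to any Druid or DataSketches API.

```python
from collections import Counter


def buffer_consumed(num_sketches, start, final, reuse):
    """Units carved out of a fixed buffer while sketches grow by doubling.

    reuse=False models a simple bump allocator that abandons freed chunks;
    reuse=True models an exact-size free list.
    """
    free_chunks = Counter()  # chunk size -> number of freed chunks available
    consumed = 0
    for _ in range(num_sketches):
        current = None
        size = start
        while size <= final:
            # Obtain the larger chunk first, since data must be moved into it.
            if reuse and free_chunks[size] > 0:
                free_chunks[size] -= 1
            else:
                consumed += size
            # Only then release the previous, smaller chunk.
            if current is not None:
                free_chunks[current] += 1
            current = size
            size *= 2
    return consumed


if __name__ == "__main__":
    live = 100 * 1024  # what 100 fully grown sketches actually occupy
    print(live)                                         # 102400
    print(buffer_consumed(100, 16, 1024, reuse=False))  # 203200
    print(buffer_consumed(100, 16, 1024, reuse=True))   # 103408
```

   Without reuse the buffer consumption is nearly twice the live data.  The 
free list recovers almost all of that here, but only because this growth 
pattern happens to be favorable; handling arbitrary interleavings and sizes 
is the part that amounts to reimplementing malloc() and free().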
   
   BTW, there is already a mechanism in the JVM to track usage of off-heap 
memory used by DirectByteBuffers.  We use this same mechanism to track 
allocations we make against off-heap memory even though we don't use 
DirectByteBuffers.  So the JVM knows and tracks this usage already.  Druid 
could also monitor these same JVM counters if it wanted or needed to. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
