leerho commented on issue #446:
URL: 
https://github.com/apache/datasketches-java/issues/446#issuecomment-1581761804

   If your data is in Hive and you are willing to make two passes over your 
data, you could use KLL on the first pass to establish the histogram boundaries 
you are interested in.  On the second pass, feed an array of HLL sketches, one 
per histogram range, so that each sketch does the distinct count for the items 
filtered into its range.  This is a little clumsy, but would provide reliable 
accuracy bounds based on the HLL configuration.  It also avoids the kind of 
approximation-of-approximations issue @jmalkin mentioned.  
   
   Of course, the resulting histogram boundaries are also approximations, but 
at least you would have independent control of the accuracy of the boundaries 
and the accuracy of the NDV of each of the bins separately :)
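
   A rough illustration of the two-pass idea described above, in case it helps. 
To keep it runnable without any dependencies, plain Java collections stand in 
for the sketches: an exact sort plays the role of the KLL sketch and a HashSet 
per bin plays the role of each HLL sketch.  The class name, method names and 
sample data are made up for illustration and are not part of the DataSketches 
API; in a real job the split points would come from a KLL quantile query and 
each set would be replaced by an HLL sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassHistogramNdv {

    // Pass 1: choose numBins - 1 split points at equally spaced ranks.
    // Stand-in: an exact sort. A KLL sketch would give approximate split
    // points from one streaming pass with bounded memory.
    static double[] splitPoints(double[] values, int numBins) {
        double[] sorted = values.clone();   // do not mutate the caller's array
        Arrays.sort(sorted);
        double[] splits = new double[numBins - 1];
        for (int i = 1; i < numBins; i++) {
            int idx = (int) Math.floor((double) i / numBins * sorted.length);
            splits[i - 1] = sorted[Math.min(idx, sorted.length - 1)];
        }
        return splits;
    }

    // Bin i covers [splits[i-1], splits[i]); the first and last bins are open-ended.
    static int binOf(double value, double[] splits) {
        int bin = 0;
        while (bin < splits.length && value >= splits[bin]) {
            bin++;
        }
        return bin;
    }

    // Pass 2: route each key to the counter of the bin its value falls in.
    // Stand-in: one HashSet per bin. With HLL there would be one sketch per
    // bin, each updated with the key and queried for its estimate at the end.
    static long[] distinctPerBin(double[] values, String[] keys, double[] splits) {
        List<Set<String>> perBin = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            perBin.add(new HashSet<>());
        }
        for (int i = 0; i < values.length; i++) {
            perBin.get(binOf(values[i], splits)).add(keys[i]);
        }
        long[] ndv = new long[perBin.size()];
        for (int i = 0; i < ndv.length; i++) {
            ndv[i] = perBin.get(i).size();
        }
        return ndv;
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4, 5, 6, 7, 8};
        String[] keys = {"a", "a", "b", "a", "c", "d", "e", "c"};
        double[] splits = splitPoints(values, 2);
        long[] ndv = distinctPerBin(values, keys, splits);
        System.out.println("splits = " + Arrays.toString(splits));
        System.out.println("ndv per bin = " + Arrays.toString(ndv));
    }
}
```

   Because the boundaries are fixed before the second pass starts, the error of 
each per-bin count depends only on that bin's own counter, which is the 
independent control mentioned above.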
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

