leerho commented on issue #446: URL: https://github.com/apache/datasketches-java/issues/446#issuecomment-1581761804
If your data is in Hive and you can afford two passes over it, you could use a KLL sketch on the first pass to establish the histogram boundaries you are interested in. On the second pass, feed an array of HLL sketches, one per histogram range, so that each sketch counts distinct values only for the rows that fall in its range. This is a little clumsy, but it provides reliable accuracy bounds based on the HLL configuration, and it avoids the approximation-of-approximations issue @jmalkin mentioned. Of course, the resulting histogram boundaries are also approximations, but at least you would have independent control over the accuracy of the boundaries and the accuracy of the NDV (number of distinct values) of each bin :)
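
To make the two-pass idea concrete, here is a rough sketch in plain Python. It is not DataSketches code: an exact sorted list stands in for the KLL sketch and a plain `set` stands in for each HLL sketch, so only the control flow is illustrated. The function names, the `(value, item)` row layout, and the equal-frequency choice of boundaries are assumptions made for this example.

```python
from bisect import bisect_right


def bin_boundaries(values, num_bins):
    """Pass 1: pick equal-frequency split points.

    Exact stand-in for querying a KLL sketch for quantiles
    (hypothetical helper, not a DataSketches API).
    """
    ordered = sorted(values)
    n = len(ordered)
    # One split point at each rank fraction i/num_bins, i = 1..num_bins-1.
    return [ordered[min(n - 1, (i * n) // num_bins)] for i in range(1, num_bins)]


def distinct_per_bin(rows, boundaries):
    """Pass 2: keep one distinct counter per histogram bin.

    Each set is an exact stand-in for one HLL sketch. A row goes to the
    bin whose range contains its value (bins are closed on the left).
    """
    counters = [set() for _ in range(len(boundaries) + 1)]
    for value, item in rows:
        counters[bisect_right(boundaries, value)].add(item)
    return [len(c) for c in counters]


# Assumed row layout: (value used for binning, item to count distinctly).
rows = [(1.0, "a"), (2.0, "b"), (2.0, "b"), (3.0, "c"),
        (10.0, "d"), (11.0, "a"), (12.0, "e"), (12.0, "e")]

boundaries = bin_boundaries([value for value, _ in rows], 4)
print(boundaries)                          # [2.0, 10.0, 12.0]
print(distinct_per_bin(rows, boundaries))  # [1, 2, 2, 1]
```

With real sketches, the first pass would yield approximate boundaries and each per-bin count would carry the error bounds of its own HLL configuration, which is the independent control described above.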
