Hi all,
I recently learned about the DataSketch project, which is brilliant,
but some questions came up when I started preparing to use it.
I want to get the number of distinct values (NDV) for a range query in
my project. After studying the KMV algorithm as described in the
introduction of the DataSketch project, we propose an adjusted KMV
algorithm to solve it.
The original KMV stores only the K minimum hash values and then
computes the NDV from their average density. So what if we also store
the original values whose hash values are among the K minimum hash
values? Then we can estimate the NDV of the range query as
> ndv_in_the_range = (ndv_in_range_for_k_minimum / k) * total_ndv
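
To make the idea concrete, here is a minimal sketch in plain Java. It
does not use any DataSketch API: the class RangeKmvSketch, its method
names, and the SplitMix64-style hash are all hypothetical stand-ins,
and values are assumed to be longs for simplicity.

```java
import java.util.TreeMap;

/** Hypothetical KMV variant that keeps the original value next to each retained hash. */
class RangeKmvSketch {
    private final int k;
    // hash -> original value, ordered by hash so the largest retained hash is lastKey()
    private final TreeMap<Long, Long> retained = new TreeMap<>();

    RangeKmvSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be >= 2");
        this.k = k;
    }

    // SplitMix64-style mixer, used only as a stand-in hash (an assumption).
    private static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x = x ^ (x >>> 31);
        return x >>> 1; // drop the sign bit so natural ordering works
    }

    void update(long value) {
        long h = hash(value);
        if (retained.size() < k) {
            retained.put(h, value);
        } else if (h < retained.lastKey() && !retained.containsKey(h)) {
            // New hash is among the k minimum: insert it and evict the largest.
            retained.put(h, value);
            retained.pollLastEntry();
        }
    }

    /** total_ndv: exact below capacity, otherwise (k - 1) / (k-th minimum as a fraction). */
    double estimateTotal() {
        if (retained.size() < k) return retained.size();
        double theta = retained.lastKey() / (double) Long.MAX_VALUE;
        return (k - 1) / theta;
    }

    /** ndv_in_the_range = (ndv_in_range_for_k_minimum / k) * total_ndv */
    double estimateInRange(long lo, long hi) {
        if (retained.isEmpty()) return 0.0;
        long inRange = retained.values().stream()
                .filter(v -> v >= lo && v <= hi)
                .count();
        return (inRange / (double) retained.size()) * estimateTotal();
    }
}
```

Calling update(v) for every value and then estimateInRange(lo, hi)
gives the proposed estimate. The range scan here is linear in k; a
second index ordered by original value would make it faster, at the
cost of extra memory.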
So if the idea works and the sketch does not already implement it,
could you give some advice on how to implement it in this project?
(P.S. I would prefer the Java version.)
Thanks for your help in advance!