Hi Lee,
Thanks for your reply at first.
As for what I proposed, let's take a look for a simple example
Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in
the below array
Input: V: [10,8,7,7,1,4,5,6,8,9]
First, if we wanna get the distinct value for the whole data through
KMV, we should get the hash value for each value in the array V and keep
the smallest K hashes。
In this example, let K =3。
V: [10 , 8 , 7 , 7 , 1 , 4 , 5 ,6 ,8 ,9 ]
hash: [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
the smallest 3 hash : [0.12, 0.2, 0.21]
(Of course, the real hash function will be more random and evenly
distributed)
Secondly, we estimate the distinct value for the whole data,
NDV = (3-1)/0.21 ~= 9.5
Then,if we wanna get the distinct value of that satisfy (vi > 5 && vi <
7) , we can realize it through storing extra 3 origin value for each
smallest hash:
smallest 3 hash: [0.12, 0.2, 0.21]
smallest 3 values: [9 , 10 , 6]
the value satisfy (vi > 5 && vi < 7) is [6]
At last, we can estimate the distinct value of that (vi > 5 && vi < 7):
NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
in which 1.3 represents the fraction of the smallest 3 hash, that satisfy
(vi > 5 && vi < 7) .
leerho <[email protected]> 于2021年5月20日周四 上午7:50写道:
> Yiifeger,
> I'm sorry, but I'm having difficulty in understanding your question or
> what you are trying to do.
>
> In our library, all of our Theta Sketches are derivatives of the KMV
> sketch as explained on our website. And, on our website under *Sketch
> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory* you
> will find several documents that explain the math behind theta sketches. Of
> particular relevance would be the original published paper on the Theta
> Sketch Framework <https://arxiv.org/abs/1510.01455v2>. A simplified
> development of the math can be found in the short paper Theta Sketch
> Equations
> <https://github.com/apache/datasketches-website/blob/master/docs/pdf/ThetaSketchEquations.pdf>,
> also from the website.
>
> I hope this helps,
>
> Lee.
>
> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]> wrote:
>
>> Hi all,
>> I recently learned about the DataSketch project that is so
>> brilliant, but questions occurred when prepared to utilize it.
>> I want to get the count of distinct values for a range query in my
>> project. After some study about the KMV algorithm according to the
>> introduction in DataSketch project, we propose an adjusted KMV algorithm
>> to solve it.
>> In origin KMV, it only stores K hash_values and then computes the
>> NDV through the average density. So what if we store extra origin values
>> for which hash_value contained by the k -Minimum hash_values ? So we can
>> estimate the distinct value of the range query through
>>
>>> * ndv_in_the_range = ( ndv_in_range_for_k_minimum / k) *
>>> total_ndv*
>>
>>
>> So if the idea works and the Sketch does not implement it, could you
>> give some advice
>> on how to implement it in this project (P.s prefer the java version).
>> Thanks for your help in advance!
>>
>>
--
Best Regards
Yifei Wu