Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Jon Malkin Thu, 20 May 2021 09:56:12 -0700

Echoing what Will said, you should use a tuple sketch to store the raw
values. You want to avoid any situation where you're storing hash values
yourself.


  jon

On Thu, May 20, 2021 at 7:09 AM Will Lauer <[email protected]>
wrote:

> If you want to do something like this, you should be looking at tuple
> sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but
> allow you to associate extra data with each entry. You can then apply
> operations like filtering based on the tuple value, getting the estimate
> for only those values that satisfy your tuple predicate. We use tuple
> sketches to perform a very similar calculation to the one you are
> describing.
>
> Will
>
> <http://www.verizonmedia.com>
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
>    <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
>
>
>
> On Thu, May 20, 2021 at 6:27 AM yiifeger wu <[email protected]> wrote:
>
>> Hi Lee,
>>    Thanks for your reply at first.
>>    As for what I proposed, let's take a look for a simple example
>>
>>    Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in
>> the below array
>>    Input: V: [10,8,7,7,1,4,5,6,8,9]
>>
>>    First, if we wanna get the distinct value for the whole data through
>> KMV, we should get the hash value for each value in the array V and keep
>> the smallest K hashes。
>>    In this example， let K =3。
>>    V:           [10 , 8   ,  7 ,  7 , 1   , 4   , 5    ,6    ,8   ,9 ]
>>    hash:         [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ]
>>    the smallest 3 hash : [0.12, 0.2, 0.21]
>>    (Of course, the real hash function will be more random and evenly
>> distributed)
>>
>>    Secondly, we estimate the distinct value for the whole data,
>>     NDV = (3-1)/0.21 ~= 9.5
>>
>>    Then，if we wanna get the distinct value of that satisfy (vi > 5 && vi
>> < 7) , we can realize it through storing extra 3 origin value for each
>> smallest hash：
>>     smallest 3 hash:   [0.12, 0.2, 0.21]
>>     smallest 3 values: [9   , 10 , 6]
>>     the value satisfy (vi > 5 && vi < 7) is [6]
>>
>>    At last, we can estimate the distinct value of that (vi > 5 && vi < 7):
>>     NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16
>>   in which 1.3 represents the fraction of the smallest 3 hash, that
>> satisfy (vi > 5 && vi < 7) .
>>
>>
>> leerho <[email protected]> 于2021年5月20日周四 上午7:50写道：
>>
>>> Yiifeger,
>>> I'm sorry, but I'm having difficulty in understanding your question or
>>> what you are trying to do.
>>>
>>> In our library, all of our Theta Sketches are derivatives of the KMV
>>> sketch as explained on our website.  And, on our website under *Sketch
>>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory*
>>> you will find several documents that explain the math behind theta
>>> sketches. Of particular relevance would be the original published paper on
>>> the Theta Sketch Framework
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>.
>>> A simplified development of the math can be found in the short paper Theta
>>> Sketch Equations
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>,
>>> also from the website.
>>>
>>> I hope this helps,
>>>
>>> Lee.
>>>
>>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>      I recently learned about the DataSketch project that is so
>>>> brilliant, but questions occurred when prepared to utilize it.
>>>>      I want to get the count of distinct values for a range query in
>>>> my project. After some study about the KMV algorithm according to the
>>>> introduction in DataSketch project, we propose an adjusted KMV
>>>> algorithm to solve it.
>>>>       In origin KMV, it only stores K  hash_values and then computes
>>>> the NDV through the average density. So what if we store extra origin
>>>> values for which hash_value contained by the k -Minimum hash_values ?  So
>>>> we can estimate the distinct value of the range query through
>>>>
>>>>>           *  ndv_in_the_range = ( ndv_in_range_for_k_minimum / k)  *
>>>>> total_ndv*
>>>>
>>>>
>>>>     So if the idea works and the Sketch does not  implement it, could
>>>> you give some advice
>>>> on how to implement it in this project (P.s prefer the java version).
>>>>      Thanks for your help in advance!
>>>>
>>>>
>>
>> --
>> Best Regards
>> Yifei Wu
>>
>

Re: [E] Re: Proposal to get the NDV of a Range query through KMV

Reply via email to