Echoing what Will said, you should use a tuple sketch to store the raw values. You want to avoid any situation where you're storing hash values yourself.
jon On Thu, May 20, 2021 at 7:09 AM Will Lauer <[email protected]> wrote: > If you want to do something like this, you should be looking at tuple > sketches. Tuple sketches use the same KVM algorithm as Theta sketches, but > allow you to associate extra data with each entry. You can then apply > operations like filtering based on the tuple value, getting the estimate > for only those values that satisfy your tuple predicate. We use tuple > sketches to perform a very similar calculation to the one you are > describing. > > Will > > <http://www.verizonmedia.com> > > Will Lauer > > Senior Principal Architect, Audience & Advertising Reporting > Data Platforms & Systems Engineering > > M 508 561 6427 > 1908 S. First St > Champaign, IL 61822 > > <http://www.facebook.com/verizonmedia> <http://twitter.com/verizonmedia> > <https://www.linkedin.com/company/verizon-media/> > <http://www.instagram.com/verizonmedia> > > > > On Thu, May 20, 2021 at 6:27 AM yiifeger wu <[email protected]> wrote: > >> Hi Lee, >> Thanks for your reply at first. >> As for what I proposed, let's take a look for a simple example >> >> Question: get the distinct value of that satisfy (vi > 5 && vi < 7) in >> the below array >> Input: V: [10,8,7,7,1,4,5,6,8,9] >> >> First, if we wanna get the distinct value for the whole data through >> KMV, we should get the hash value for each value in the array V and keep >> the smallest K hashes。 >> In this example, let K =3。 >> V: [10 , 8 , 7 , 7 , 1 , 4 , 5 ,6 ,8 ,9 ] >> hash: [0.2, 0.6, 0.3, 0.3, 0.25, 0.34, 0.21 ,0.21 ,0.23,0.12 ] >> the smallest 3 hash : [0.12, 0.2, 0.21] >> (Of course, the real hash function will be more random and evenly >> distributed) >> >> Secondly, we estimate the distinct value for the whole data, >> NDV = (3-1)/0.21 ~= 9.5 >> >> Then,if we wanna get the distinct value of that satisfy (vi > 5 && vi >> < 7) , we can realize it through storing extra 3 origin value for each >> smallest hash: >> smallest 3 hash: [0.12, 0.2, 0.21] >> smallest 3 values: [9 , 10 , 6] >> the value satisfy (vi > 5 && vi < 7) is [6] >> >> At last, we can estimate the distinct value of that (vi > 5 && vi < 7): >> NDV(vi > 5 && vi < 7) = 1/3 * 9.5 ~= 3.16 >> in which 1.3 represents the fraction of the smallest 3 hash, that >> satisfy (vi > 5 && vi < 7) . >> >> >> leerho <[email protected]> 于2021年5月20日周四 上午7:50写道: >> >>> Yiifeger, >>> I'm sorry, but I'm having difficulty in understanding your question or >>> what you are trying to do. >>> >>> In our library, all of our Theta Sketches are derivatives of the KMV >>> sketch as explained on our website. And, on our website under *Sketch >>> Families / Distinct Counting / Theta Sketches / Theta Sketch Theory* >>> you will find several documents that explain the math behind theta >>> sketches. Of particular relevance would be the original published paper on >>> the Theta Sketch Framework >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1510.01455v2&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=0-Tqpi8xjR_17PMhGfORmBlIQZ70P5jrjd8F2ENuFP8&e=>. >>> A simplified development of the math can be found in the short paper Theta >>> Sketch Equations >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dwebsite_blob_master_docs_pdf_ThetaSketchEquations.pdf&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=DWIYW6tKNxOsHfyk_oyTGrxFkinjW3XwCwQTdSCGc7A&s=WSAs5sqgluz5ZyA5pEe4doacA9Dcoc0BPD8aRgdhkmY&e=>, >>> also from the website. >>> >>> I hope this helps, >>> >>> Lee. >>> >>> On Wed, May 19, 2021 at 9:39 AM yiifeger wu <[email protected]> >>> wrote: >>> >>>> Hi all, >>>> I recently learned about the DataSketch project that is so >>>> brilliant, but questions occurred when prepared to utilize it. >>>> I want to get the count of distinct values for a range query in >>>> my project. After some study about the KMV algorithm according to the >>>> introduction in DataSketch project, we propose an adjusted KMV >>>> algorithm to solve it. >>>> In origin KMV, it only stores K hash_values and then computes >>>> the NDV through the average density. So what if we store extra origin >>>> values for which hash_value contained by the k -Minimum hash_values ? So >>>> we can estimate the distinct value of the range query through >>>> >>>>> * ndv_in_the_range = ( ndv_in_range_for_k_minimum / k) * >>>>> total_ndv* >>>> >>>> >>>> So if the idea works and the Sketch does not implement it, could >>>> you give some advice >>>> on how to implement it in this project (P.s prefer the java version). >>>> Thanks for your help in advance! >>>> >>>> >> >> -- >> Best Regards >> Yifei Wu >> >
