Hi Matt,
What version of PostgreSQL and DataSketches are you using?
Could you give some numbers? How many sketches? How long does the union
take?

The graph you are referring to was based on performance in Druid I believe.
So it may or may not be transferable to PostgreSQL. We did not do a
large-scale test in PostgreSQL.

Also we have a performance improvement in the works, which is supposed to
avoid some cost of deserialization of Theta sketches. It might speed things
up 10-15% according to some preliminary testing.



On Fri, Jul 9, 2021 at 10:32 AM Matthew Farkas <[email protected]> wrote:

> Hi Will,
>
> Thanks for the quick response! For your questions:
>
> 1. Yup, looking at Theta sketches for set operations.
> 2. So I'm creating the initial sketches in dataflow like so, with K=4096
> (so lgK=12) right now:
>     UpdateSketch userSketch = UpdateSketch.builder().build(K);
>     userSketch.update(requestValue.userId())
>     // pass to PG using
>     ByteString.copyFrom(userSketch.compact().toByteArray());
> 3. By "sketch size", do you mean the number of uniques in each sketch? If
> so, there's a good bit of variance in sketch size, as I'm segmenting (by
> dimensions like demo, geo, etc.) users and saving a sketch for each segment.
> 4. I do not know the proportion that are in direct vs. estimation.
> (Admittedly, I'm not familiar with the differences there, will check it
> out.) Is this explicitly set? Or maybe determined based on K & sketch size.
>
> One thing I found interesting was that doing a
> `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved
> query time (70s to 6s), and produced the exact same results. I expected the
> results to be the same, since lgK=12 when originally creating the sketches,
> but I'm not sure why that would improve query time.
>
> Thanks again!
>
>
>
> *Matthew Z. Farkas*
>
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
>
> m: (770) 337-2709
> e: [email protected]
>
>
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=zHLsL8UzcCcVZJGnwJ_cAY9tZt12_0GAe-aetSX7hRs&e=>
>
>
> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer <[email protected]>
> wrote:
>
>> Welcome Matt!
>>
>> One of the others is probably best qualified to answer your question, but
>> I'll chime in early with a couple of questions. The performance of merging
>> depends on many factors, including type of sketch and sketch size. I'm
>> assuming from the link you posted that you are dealing with Theta sketches,
>> for count unique operations. Can you confirm that? If so, what's the logK
>> you are using? What is the sketch size? Do you happen to know what
>> proportion of your sketches are in estimation mode vs exact mode?
>>
>> Will
>>
>> <http://www.verizonmedia.com>
>>
>> Will Lauer
>>
>> Senior Principal Architect, Audience & Advertising Reporting
>> Data Platforms & Systems Engineering
>>
>> M 508 561 6427
>> 1908 S. First St
>> Champaign, IL 61822
>>
>>
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=jRrfF2nGEDNEOSN9u2TMIRbAao3Qya1dLiv0QLMNIrw&e=>
>>
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=R7lAUjJWXf1nxnzQVpYAnTkOe0Nj7JensDwaKj9B-r0&e=>
>>
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=l_zRh61jHy17fBuu9BQPIqxm4y9-HZCwKEtwhH8Qnos&e=>
>>
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=L5CKzXaeysdQ8JJq0pCGb3V6CM43b-vd-9vUK5qEgk8&e=>
>>
>>
>>
>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> My name is Matt and I'm a data engineer at Spotify. I'm testing out
>>> trying Data Sketches with Postgres, and running into some
>>> performance issues. I'm seeing merge times much slower than what I'm seeing
>>> in the docs here
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__datasketches.apache.org_docs_Theta_ThetaMergeSpeed.html&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=R8BDffIXwyiZ46IUKowhz2-gQqGfpM3u-KkwplE4Ing&e=>
>>>  (millions
>>> of sketches/sec).
>>>
>>> In my case, I've pre-computed many sketches, inserted then into PG, then
>>> I'm running queries in PG and doing the merging there. My hunch is that
>>> there's something wrong with my Postgres configs, which I've tried tweaking
>>> extensively but haven't been able to improve query time.
>>>
>>> My question is if anyone knows what type of performance can be expected
>>> in Postgres and if anyone has any examples/tips in general from their
>>> implementations.
>>>
>>> Also, this is my first message to this list, so please let me know if I
>>> should be directing it anywhere else!
>>>
>>> Thanks!!
>>> Matt
>>>
>>>
>>>
>>> *Matthew Z. Farkas*
>>>
>>> Data Science @ Spotify
>>> MS Northwestern University, BS Georgia Tech
>>>
>>> m: (770) 337-2709
>>> e: [email protected]
>>>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=WBAi_Zz2AI6QpCCX6AsWbHRrBwTG4JtAMLfzxzllOU4&e=>
>>>
>>

Reply via email to