Matt,

In my production case, I'm building sketches using Java in an ETL pipeline
and then loading them into a Druid datamart, which aggregates them together
when it receives queries. Queries might aggregate anywhere from several
hundred sketches up to many millions (the average is probably in the
hundreds of thousands), depending on the time frame involved in the query
and the particular dimensions selected. The majority of our queries (95%+)
return in less than 10 seconds. This is running on a cluster with between
150 and 200 nodes.

We are investigating implementing this in an alternative database, but
haven't gotten that database working in a performant way yet (due to some
problems with the database's API, not with sketches). We are working with
the vendor to find workarounds.

Will

<http://www.verizonmedia.com>

Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

<http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
<https://www.linkedin.com/company/verizon-media/>
<http://www.instagram.com/verizonmedia>



On Fri, Jul 9, 2021 at 2:57 PM Matthew Farkas <[email protected]> wrote:

> Hi,
>
> I'm running PG 13.3 and pg-datasketches 1.3.0 (I built from master after
> running into this issue
> <https://github.com/apache/datasketches-postgresql/issues/34>
> ).
>
> So some rough numbers: I have a week-hour table with 168 user_id sketches
> (all in estimation mode, not exact), and unioning those 168 sketches takes
> 21ms.
> - 13k sketches takes 1-2s
> - 13m sketches was taking ~2min yesterday (I must have updated a config
> that hurt this, though; now I'm cancelling the query after 9 mins)
>
> Will-
> Thanks for the background. So you're combining the sketches in Java. Are
> you retrieving them from a db? Also, how many sketches are you typically
> merging?
>
>
>
> *Matthew Z. Farkas*
>
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
>
> m: (770) 337-2709
> e: [email protected]
>
>
> <https://www.linkedin.com/in/matthewzfarkas>
>
>
> On Fri, Jul 9, 2021 at 1:53 PM Alexander Saydakov
> <[email protected]> wrote:
>
>> Hi Matt,
>> What version of PostgreSQL and DataSketches are you using?
>> Could you give some numbers? How many sketches? How long does the union
>> take?
>>
>> The graph you are referring to was based on performance in Druid I
>> believe. So it may or may not be transferable to PostgreSQL. We did not do
>> a large-scale test in PostgreSQL.
>>
>> Also, we have a performance improvement in the works that should avoid
>> some of the cost of deserializing Theta sketches. It might speed things
>> up 10-15% according to preliminary testing.
>>
>>
>>
>> On Fri, Jul 9, 2021 at 10:32 AM Matthew Farkas <[email protected]>
>> wrote:
>>
>>> Hi Will,
>>>
>>> Thanks for the quick response! For your questions:
>>>
>>> 1. Yup, looking at Theta sketches for set operations.
>>> 2. So I'm creating the initial sketches in dataflow like so, with K=4096
>>> (so lgK=12) right now:
>>>     UpdateSketch userSketch = UpdateSketch.builder().build(K);
>>>     userSketch.update(requestValue.userId());
>>>     // pass to PG using
>>>     ByteString.copyFrom(userSketch.compact().toByteArray());
>>> 3. By "sketch size", do you mean the number of uniques in each sketch?
>>> If so, there's a good bit of variance in sketch size, as I'm segmenting (by
>>> dimensions like demo, geo, etc.) users and saving a sketch for each segment.
>>> 4. I do not know the proportion that are in exact vs. estimation mode.
>>> (Admittedly, I'm not familiar with the difference there; I'll check it
>>> out.) Is this explicitly set, or determined based on K and sketch size?
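
Regarding exact vs. estimation mode: I can't speak to the DataSketches
internals, but a self-contained KMV-style toy sketch can illustrate the
usual mechanics. Everything below (the class name, the hash mixer, the
estimator details) is made up for illustration and is not the real library:
a sketch of nominal size K stays exact until it has seen more than K
distinct hashes, then keeps only the K smallest and estimates from the
K-th one.

```java
import java.util.TreeSet;

// Toy KMV-style distinct-count sketch -- an illustration only, NOT the
// DataSketches implementation. With nominal size k, the sketch is exact
// until more than k distinct hashes have been seen; after that it keeps
// the k smallest hashes and estimates (k - 1) / theta, where theta is
// the k-th smallest hash mapped into [0, 1).
public class ToySketch {
    final int k;
    final TreeSet<Double> smallest = new TreeSet<>();

    ToySketch(int k) { this.k = k; }

    // Map an item to [0, 1); a real sketch uses a strong 64-bit hash.
    static double hash(long item) {
        long h = item * 0x9E3779B97F4A7C15L; // multiplicative mix, toy only
        h ^= (h >>> 32);
        return (double) (h >>> 11) / (1L << 53);
    }

    void update(long item) {
        smallest.add(hash(item));
        if (smallest.size() > k) smallest.pollLast(); // drop the largest
    }

    boolean isEstimationMode() { return smallest.size() >= k; }

    double estimate() {
        if (!isEstimationMode()) return smallest.size(); // exact mode
        double theta = smallest.last();   // k-th smallest hash
        return (k - 1) / theta;          // KMV estimator
    }

    public static void main(String[] args) {
        ToySketch s = new ToySketch(4096);        // k = 4096, i.e. lgK = 12
        for (long i = 0; i < 1000; i++) s.update(i);
        System.out.println(s.isEstimationMode()); // false: 1000 < 4096
        System.out.println(s.estimate());         // 1000.0, still exact
        for (long i = 1000; i < 1_000_000; i++) s.update(i);
        System.out.println(s.isEstimationMode()); // true: estimating now
    }
}
```

So to the "explicitly set?" part of the question: in this toy, at least,
the mode is not set explicitly; it falls out of K and the number of
distinct values fed in.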
>>>
>>> One thing I found interesting: doing a
>>> `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved
>>> query time (70s to 6s) and produced exactly the same results. I expected
>>> the results to match, since lgK=12 when originally creating the sketches,
>>> but I'm not sure why that would improve query time.
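
On why a union with a smaller lgK might be faster: I don't know the
datasketches-postgresql internals, but one plausible explanation can be
sketched with a toy union (the class and method names below are made up,
not the real extension). A theta-style union with a smaller target lgK
retains fewer hashes, and because sketch entries are ordered, everything
above the union's current threshold can be skipped without ever being
examined.

```java
import java.util.TreeSet;

// Toy theta-style union with an explicit target size -- an illustration
// only, NOT the datasketches-postgresql implementation. Each input
// "sketch" is a sorted set of hash values in [0, 1); the union keeps only
// the unionK smallest values seen across all inputs.
public class ToyUnion {
    final int unionK;
    final TreeSet<Double> acc = new TreeSet<>();

    ToyUnion(int lgK) { this.unionK = 1 << lgK; }

    void update(TreeSet<Double> sketch) {
        for (double h : sketch) {
            if (acc.size() >= unionK && h >= acc.last())
                break; // input is sorted: everything after h is larger too
            acc.add(h);
            if (acc.size() > unionK) acc.pollLast();
        }
    }

    double estimate() {
        if (acc.size() < unionK) return acc.size(); // still exact
        return (unionK - 1) / acc.last();           // KMV-style estimate
    }

    // Build a toy input sketch over the items [lo, hi) with nominal size k.
    static TreeSet<Double> makeSketch(long lo, long hi, int k) {
        TreeSet<Double> s = new TreeSet<>();
        for (long i = lo; i < hi; i++) {
            long h = i * 0x9E3779B97F4A7C15L; // toy multiplicative hash
            h ^= (h >>> 32);
            s.add((double) (h >>> 11) / (1L << 53));
            if (s.size() > k) s.pollLast();
        }
        return s;
    }

    public static void main(String[] args) {
        // 168 hourly sketches built at lgK = 12, merged at union lgK = 10,
        // 1000 distinct users per hour, all distinct across hours.
        ToyUnion u = new ToyUnion(10);
        for (int hour = 0; hour < 168; hour++)
            u.update(makeSketch(hour * 1000L, (hour + 1) * 1000L, 1 << 12));
        System.out.println(u.estimate()); // close to the true count, 168000
    }
}
```

In this toy, the lower union lgK trades a little accuracy for speed;
whether the real extension returns bit-identical results at a lower union
lgK depends on internals I'm not sure about.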
>>>
>>> Thanks again!
>>>
>>>
>>>
>>> *Matthew Z. Farkas*
>>>
>>> Data Science @ Spotify
>>> MS Northwestern University, BS Georgia Tech
>>>
>>> m: (770) 337-2709
>>> e: [email protected]
>>>
>>>
>>> <https://www.linkedin.com/in/matthewzfarkas>
>>>
>>>
>>> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer
>>> <[email protected]> wrote:
>>>
>>>> Welcome Matt!
>>>>
>>>> One of the others is probably best qualified to answer your question,
>>>> but I'll chime in early with a couple of questions. The performance of
>>>> merging depends on many factors, including the type of sketch and the
>>>> sketch size. I'm assuming from the link you posted that you are dealing
>>>> with Theta sketches, for count-unique operations. Can you confirm that?
>>>> If so, what logK are you using? What is the sketch size? Do you happen
>>>> to know what proportion of your sketches are in estimation mode vs.
>>>> exact mode?
>>>>
>>>> Will
>>>>
>>>> <http://www.verizonmedia.com>
>>>>
>>>> Will Lauer
>>>>
>>>> Senior Principal Architect, Audience & Advertising Reporting
>>>> Data Platforms & Systems Engineering
>>>>
>>>> M 508 561 6427
>>>> 1908 S. First St
>>>> Champaign, IL 61822
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My name is Matt and I'm a data engineer at Spotify. I'm trying out
>>>>> DataSketches with Postgres and running into some performance issues.
>>>>> I'm seeing merge times much slower than those in the docs here
>>>>> <https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html>
>>>>> (millions of sketches/sec).
>>>>>
>>>>> In my case, I've pre-computed many sketches and inserted them into PG,
>>>>> and now I'm running queries in PG and doing the merging there. My hunch
>>>>> is that there's something wrong with my Postgres configs, which I've
>>>>> tried tweaking extensively without being able to improve query time.
>>>>>
>>>>> My question is whether anyone knows what kind of performance to expect
>>>>> in Postgres, and whether anyone has examples or general tips from
>>>>> their implementations.
>>>>>
>>>>> Also, this is my first message to this list, so please let me know if
>>>>> I should be directing it anywhere else!
>>>>>
>>>>> Thanks!!
>>>>> Matt
>>>>>
>>>>>
>>>>>
>>>>> *Matthew Z. Farkas*
>>>>>
>>>>> Data Science @ Spotify
>>>>> MS Northwestern University, BS Georgia Tech
>>>>>
>>>>> m: (770) 337-2709
>>>>> e: [email protected]
>>>>>
>>>>>
>>>>> <https://www.linkedin.com/in/matthewzfarkas>
>>>>>
>>>>
