Everyone is using all sorts of engines: Druid, Spark, Hive, Pig, Splice
Machine and so on. I would love to know if there is any Greenplum
installation using Datasketches.

On Mon, Jul 12, 2021 at 10:00 AM Matthew Farkas <[email protected]> wrote:

> Ah, thanks, Alexander.
>
> That makes sense, I started digging into cpu usage, and noticed that
> queries can only use one cpu in my single-host case.
>
> So sounds like to use datasketches at this scale, everyone is currently
> using druid (if no one is using greenplum)?
>
> [image: image.png]
>
>
>
> *Matthew Z. Farkas*
>
> Data Science @ Spotify
> MS Northwestern University, BS Georgia Tech
>
> m: (770) 337-2709
> e: [email protected]
>
>
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=qLTR_K9NKiNg1ePOz3nolQUm9_f6BH9WjB1R7pW7kVc&s=ZsFGcOKd7oVEW5UuRd3MBcbNWbVXkyvL-1uFVgqSr9Y&e=>
>
>
> On Mon, Jul 12, 2021 at 12:36 PM Alexander Saydakov
> <[email protected]> wrote:
>
>> Matt,
>> I assume you are running a single-host PostgreSQL. If so, your numbers
>> don't look too bad I would say. You may want to consider the distributed
>> variant, which is Greenplum. However I am not aware of any deployment of
>> our extension in such environments.
>>
>> On Fri, Jul 9, 2021 at 2:41 PM Will Lauer <[email protected]>
>> wrote:
>>
>>> Matt,
>>>
>>> In my production case, I'm building sketches using java in an ETL
>>> pipeline and then loading them into a Druid datamart, which aggregates them
>>> together when it receives queries. Queries might aggregate several hundred
>>> sketches all the way to many millions (the average number is probably in
>>> the 100's of thousands), depending on the time frame involved in the query
>>> and the particular dimensions selected. The majority of our queries (95%+)
>>> return in less than 10 seconds. This is running on a cluster with between
>>> 150 and 200 nodes.
>>>
>>> We are investigating implementing this in an alternative database, but
>>> haven't gotten that database working in a performant way yet (due to some
>>> problems with the databases' API, not due to sketches), but are working
>>> with the vendor to find some workarounds.
>>>
>>> Will
>>>
>>> <http://www.verizonmedia.com>
>>>
>>> Will Lauer
>>>
>>> Senior Principal Architect, Audience & Advertising Reporting
>>> Data Platforms & Systems Engineering
>>>
>>> M 508 561 6427
>>> 1908 S. First St
>>> Champaign, IL 61822
>>>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=JkYL6uq0qoDR1Cvko3w9WWpX6sPJ5r64kDiNY_i0Stk&s=57XsolCiQCmaWB6pOS1IQ3j3GHdH3P95fd1GxvaPJ2M&e=>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=JkYL6uq0qoDR1Cvko3w9WWpX6sPJ5r64kDiNY_i0Stk&s=sYwbOS3PJaMGg7HGlH8AtHTxJrjAr-zbzNptyihuvDM&e=>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=JkYL6uq0qoDR1Cvko3w9WWpX6sPJ5r64kDiNY_i0Stk&s=KqW7eRDxcvjFALxVALwlain6zSytoHqDJLipg3rSunM&e=>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=JkYL6uq0qoDR1Cvko3w9WWpX6sPJ5r64kDiNY_i0Stk&s=trPaP60q5pMtDCgPGneuxL0UMfJ8DnavcbkMHNiHj9Y&e=>
>>>
>>>
>>>
>>> On Fri, Jul 9, 2021 at 2:57 PM Matthew Farkas <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running PG 13.3 and pg-datasketches 1.3.0 (I built from master
>>>> after running into this issue
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dpostgresql_issues_34&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=4tX6hAxcgLT0zeFgrAKVZ-oxngSqXLrUVy9rYDZIPZE&s=bEI9ZIoMM-58NW0wMXeJ0Ben3Mg0BYk2FamasN9e75A&e=>
>>>> ).
>>>>
>>>> So some rough numbers- I have a week-hour table with 168 user_id
>>>> sketches, all would be estimates and not exact, and that is taking 21ms for
>>>> unioning those 168 sketches.
>>>> - 13k sketches is taking 1-2s
>>>> - 13m sketches was taking ~2min yesterday (I must have updated a config
>>>> that hurt this, though, I'm cancelling the query after 9mins now)
>>>>
>>>> Will-
>>>> Thanks for the background. So you're combining the sketches in Java-
>>>> are you retrieving them from a db? Also, how many sketches are you
>>>> typically merging?
>>>>
>>>>
>>>>
>>>> *Matthew Z. Farkas*
>>>>
>>>> Data Science @ Spotify
>>>> MS Northwestern University, BS Georgia Tech
>>>>
>>>> m: (770) 337-2709
>>>> e: [email protected]
>>>>
>>>>
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=4tX6hAxcgLT0zeFgrAKVZ-oxngSqXLrUVy9rYDZIPZE&s=wR4KZ0n2kgAyu0WCCxyxdMddHWTfnUSaY9H4r9fjJ2U&e=>
>>>>
>>>>
>>>> On Fri, Jul 9, 2021 at 1:53 PM Alexander Saydakov
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Matt,
>>>>> What version of PostgreSQL and DataSketches are you using?
>>>>> Could you give some numbers? How many sketches? How long does the
>>>>> union take?
>>>>>
>>>>> The graph you are referring to was based on performance in Druid I
>>>>> believe. So it may or may not be transferable to PostgreSQL. We did not do
>>>>> a large-scale test in PostgreSQL.
>>>>>
>>>>> Also we have a performance improvement in the works, which is supposed
>>>>> to avoid some cost of deserialization of Theta sketches. It might speed
>>>>> things up 10-15% according to some preliminary testing.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 9, 2021 at 10:32 AM Matthew Farkas <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Will,
>>>>>>
>>>>>> Thanks for the quick response! For your questions:
>>>>>>
>>>>>> 1. Yup, looking at Theta sketches for set operations.
>>>>>> 2. So I'm creating the initial sketches in dataflow like so, with
>>>>>> K=4096 (so lgK=12) right now:
>>>>>>     UpdateSketch userSketch = UpdateSketch.builder().build(K);
>>>>>>     userSketch.update(requestValue.userId())
>>>>>>     // pass to PG using
>>>>>>     ByteString.copyFrom(userSketch.compact().toByteArray());
>>>>>> 3. By "sketch size", do you mean the number of uniques in each
>>>>>> sketch? If so, there's a good bit of variance in sketch size, as I'm
>>>>>> segmenting (by dimensions like demo, geo, etc.) users and saving a sketch
>>>>>> for each segment.
>>>>>> 4. I do not know the proportion that are in direct vs. estimation.
>>>>>> (Admittedly, I'm not familiar with the differences there, will check it
>>>>>> out.) Is this explicitly set? Or maybe determined based on K & sketch 
>>>>>> size.
>>>>>>
>>>>>> One thing I found interesting was that doing a
>>>>>> `THETA_SKETCH_UNION(user_id_sketch, 10)` on all sketches vastly improved
>>>>>> query time (70s to 6s), and produced the exact same results. I expected 
>>>>>> the
>>>>>> results to be the same, since lgK=12 when originally creating the 
>>>>>> sketches,
>>>>>> but I'm not sure why that would improve query time.
>>>>>>
>>>>>> Thanks again!
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Matthew Z. Farkas*
>>>>>>
>>>>>> Data Science @ Spotify
>>>>>> MS Northwestern University, BS Georgia Tech
>>>>>>
>>>>>> m: (770) 337-2709
>>>>>> e: [email protected]
>>>>>>
>>>>>>
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=zHLsL8UzcCcVZJGnwJ_cAY9tZt12_0GAe-aetSX7hRs&e=>
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 9, 2021 at 1:13 PM Will Lauer
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Welcome Matt!
>>>>>>>
>>>>>>> One of the others is probably best qualified to answer your
>>>>>>> question, but I'll chime in early with a couple of questions. The
>>>>>>> performance of merging depends on many factors, including type of sketch
>>>>>>> and sketch size. I'm assuming from the link you posted that you are 
>>>>>>> dealing
>>>>>>> with Theta sketches, for count unique operations. Can you confirm that? 
>>>>>>> If
>>>>>>> so, what's the logK you are using? What is the sketch size? Do you 
>>>>>>> happen
>>>>>>> to know what proportion of your sketches are in estimation mode vs exact
>>>>>>> mode?
>>>>>>>
>>>>>>> Will
>>>>>>>
>>>>>>> <http://www.verizonmedia.com>
>>>>>>>
>>>>>>> Will Lauer
>>>>>>>
>>>>>>> Senior Principal Architect, Audience & Advertising Reporting
>>>>>>> Data Platforms & Systems Engineering
>>>>>>>
>>>>>>> M 508 561 6427
>>>>>>> 1908 S. First St
>>>>>>> Champaign, IL 61822
>>>>>>>
>>>>>>>
>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=jRrfF2nGEDNEOSN9u2TMIRbAao3Qya1dLiv0QLMNIrw&e=>
>>>>>>>
>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=R7lAUjJWXf1nxnzQVpYAnTkOe0Nj7JensDwaKj9B-r0&e=>
>>>>>>>
>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=l_zRh61jHy17fBuu9BQPIqxm4y9-HZCwKEtwhH8Qnos&e=>
>>>>>>>
>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=3trc9dYkJzjsSQRfnDur7ImwclKqOBk4r-JAAZZewII&s=L5CKzXaeysdQ8JJq0pCGb3V6CM43b-vd-9vUK5qEgk8&e=>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 9, 2021 at 12:02 PM Matthew Farkas <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My name is Matt and I'm a data engineer at Spotify. I'm testing out
>>>>>>>> trying Data Sketches with Postgres, and running into some
>>>>>>>> performance issues. I'm seeing merge times much slower than what I'm 
>>>>>>>> seeing
>>>>>>>> in the docs here
>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__datasketches.apache.org_docs_Theta_ThetaMergeSpeed.html&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=R8BDffIXwyiZ46IUKowhz2-gQqGfpM3u-KkwplE4Ing&e=>
>>>>>>>>  (millions
>>>>>>>> of sketches/sec).
>>>>>>>>
>>>>>>>> In my case, I've pre-computed many sketches, inserted then into PG,
>>>>>>>> then I'm running queries in PG and doing the merging there. My hunch is
>>>>>>>> that there's something wrong with my Postgres configs, which I've tried
>>>>>>>> tweaking extensively but haven't been able to improve query time.
>>>>>>>>
>>>>>>>> My question is if anyone knows what type of performance can be
>>>>>>>> expected in Postgres and if anyone has any examples/tips in general 
>>>>>>>> from
>>>>>>>> their implementations.
>>>>>>>>
>>>>>>>> Also, this is my first message to this list, so please let me know
>>>>>>>> if I should be directing it anywhere else!
>>>>>>>>
>>>>>>>> Thanks!!
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Matthew Z. Farkas*
>>>>>>>>
>>>>>>>> Data Science @ Spotify
>>>>>>>> MS Northwestern University, BS Georgia Tech
>>>>>>>>
>>>>>>>> m: (770) 337-2709
>>>>>>>> e: [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_in_matthewzfarkas&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=vGHo2vqhE2ZeS_hHdb4Y3eoJ4WjVKhEg5Xld1w9ptEQ&m=wfXanJfFTJqpoX0hDe-0GzEkE5YndUaxQMI4dCAQM3c&s=WBAi_Zz2AI6QpCCX6AsWbHRrBwTG4JtAMLfzxzllOU4&e=>
>>>>>>>>
>>>>>>>

Reply via email to