Oh and also I want to say that this is awesome and I've wanted to integrate
with Datasketches for a long time but was saving it for a newcomer since it
is (hopefully) mostly wrapping them in CombineFns. Thanks for doing this! I
don't see the different HLL implementations as redundant at all - I view
each of them not so much as functionality but as a linkage with another
project / maintainership. So the one that we maintain is the least good,
the Zetasketch one is about linking to Google/GCP/BigQuery, and the Apache
Datasketches one is about linking with that very active project.

Kenn
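
[Illustration, not part of the original message.] To make the "mostly
wrapping them in CombineFns" remark above concrete, here is a minimal,
self-contained sketch of that shape. Everything in it is an assumption made
for illustration: SimpleCombineFn is a hypothetical stand-in that mirrors
the four core methods of Beam's Combine.CombineFn (so the example compiles
without Beam on the classpath), and ToyHll is a toy HyperLogLog-style
structure, not the Datasketches, Zetasketch, or existing Beam
implementation. A real integration would extend Combine.CombineFn and
delegate to the library's own sketch and union types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the core methods of Beam's Combine.CombineFn.
interface SimpleCombineFn<InputT, AccumT, OutputT> {
  AccumT createAccumulator();
  AccumT addInput(AccumT accumulator, InputT input);
  AccumT mergeAccumulators(Iterable<AccumT> accumulators);
  OutputT extractOutput(AccumT accumulator);
}

// Toy HyperLogLog-style sketch, for illustration only.
final class ToyHll {
  static final int P = 10;      // number of index bits
  static final int M = 1 << P;  // 1024 registers
  final byte[] registers = new byte[M];

  // splitmix64 finalizer, used here as a deterministic 64-bit hash.
  static long hash(long x) {
    x += 0x9E3779B97F4A7C15L;
    x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
    x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
    return x ^ (x >>> 31);
  }

  void add(long value) {
    long h = hash(value);
    int index = (int) (h >>> (64 - P));  // top P bits select a register
    long rest = h << P;                   // remaining bits, left-aligned
    int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
    if (rank > registers[index]) {
      registers[index] = (byte) rank;
    }
  }

  // Merging is an element-wise max, which is what makes the sketch combinable.
  void merge(ToyHll other) {
    for (int i = 0; i < M; i++) {
      registers[i] = (byte) Math.max(registers[i], other.registers[i]);
    }
  }

  double estimate() {
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += Math.pow(2, -r);
      if (r == 0) {
        zeros++;
      }
    }
    double alpha = 0.7213 / (1 + 1.079 / M);
    double raw = alpha * M * M / sum;
    // Small-range correction (linear counting) while empty registers remain.
    if (raw <= 2.5 * M && zeros > 0) {
      return M * Math.log((double) M / zeros);
    }
    return raw;
  }
}

// The "wrapper": each CombineFn method delegates to one sketch operation.
final class ToyHllCombineFn implements SimpleCombineFn<Long, ToyHll, Long> {
  public ToyHll createAccumulator() {
    return new ToyHll();
  }

  public ToyHll addInput(ToyHll accumulator, Long input) {
    accumulator.add(input);
    return accumulator;
  }

  public ToyHll mergeAccumulators(Iterable<ToyHll> accumulators) {
    ToyHll merged = createAccumulator();
    for (ToyHll accumulator : accumulators) {
      merged.merge(accumulator);
    }
    return merged;
  }

  public Long extractOutput(ToyHll accumulator) {
    return Math.round(accumulator.estimate());
  }
}

final class SketchCombineDemo {
  public static void main(String[] args) {
    ToyHllCombineFn fn = new ToyHllCombineFn();
    ToyHll left = fn.createAccumulator();
    ToyHll right = fn.createAccumulator();
    ToyHll all = fn.createAccumulator();
    for (long i = 0; i < 50_000; i++) {
      fn.addInput(i % 2 == 0 ? left : right, i);
      fn.addInput(all, i);
    }
    ToyHll merged = fn.mergeAccumulators(List.of(left, right));
    // Partial sketches merge into the same state as a single sketch over all input.
    System.out.println(Arrays.equals(merged.registers, all.registers));
    System.out.println(fn.extractOutput(merged));
  }
}
```

The sketch leaves out the accumulator coder. A real CombineFn also needs
one, and that is where the serialized-form compatibility question raised
elsewhere in this thread would come in.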

On Fri, Jan 20, 2023 at 9:54 AM Kenneth Knowles <k...@apache.org> wrote:

> My take: it is useful to isolate dependencies. So, packages that are based
> on specific other projects like Apache Datasketches benefit from being in
> their own isolated module in Beam, separate from the Zetasketch-based
> package.
>
> Having a generalized "sketching" package that abstracts away the details
> so that we can swap out implementations should be a third thing independent
> of the others IMO and could have some sort of plugin architecture. It is
> overengineering to do so at this point. And like Byron brought up, a key
> aspect of sketches is their serialized form being compatible so the user
> really needs to know exactly what implementation they are using.
>
> Kenn
>
> On Wed, Jan 18, 2023 at 12:22 PM Byron Ellis via dev <dev@beam.apache.org>
> wrote:
>
>> Another enhancement/modification to the sketching library might be to
>> introduce generic encodings for at least the major sketches (HLL, Bloom,
>> Count-Min) that can translate into the major implementations. Talking with
>> Kenn it sounds like zetasketch has the side benefit of using an encoding
>> compatible with BigQuery, but in general I think it would be a nice thing
>> to let users store the sketch payload in, say, files that they could then
>> be confident would still be mergeable even if the underlying implementation
>> of that sketch changed.
>>
>> On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis <byronel...@google.com>
>> wrote:
>>
>>> Thanks Luke, my plan was to mostly add ones that didn't already exist.
>>> I'd also add that there are other techniques (Max-Gumbel Reservoir Sampling
>>> for example) that aren't in any common library so far as I know that I
>>> happen to know how to implement which might bias towards the general
>>> "sketching" library as you say. I generally agree that the implementation
>>> used should be a detail and not something relevant to users.
>>>
>>> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I would suggest adding it to the existing package(s) (either
>>>> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
>>>> replacing existing sketches or adding new ones) since we shouldn't expose
>>>> the sketching libraries' API surface. We should make the API take all the
>>>> relevant parameters since this allows us to move between variants and
>>>> choose the best sketching library.
>>>>
>>>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I believe that when zetasketch was added, it was also noticeably more
>>>>> efficient than other sketch implementations. However this was a number of
>>>>> years ago, and I don't know whether it still has an advantage or not.
>>>>>
>>>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I was looking at adding at least a couple of the sketches from the
>>>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>>>> folks had a preference for adding to the existing "sketching" extension
>>>>>> vs splitting it out into its own extension?
>>>>>>
>>>>>> The reason I ask is that there's some overlap (which already exists
>>>>>> in zetasketch) between the sketches available in Datasketches vs Beam
>>>>>> today, particularly HyperLogLog which would have 3 implementations if we
>>>>>> were to add all of them.
>>>>>>
>>>>>> I don't really have a strong opinion, though personally I'd probably
>>>>>> lean towards a single sketching extension (zetasketch being something of
>>>>>> a special case as it exists for format compatibility as far as I can tell).
>>>>>> But I could see how that could be confusing if you had the Apache
>>>>>> Datasketches implementation and the existing implementation derived from
>>>>>> the clearspring implementations.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Best,
>>>>>> B
>>>>>>
>>>>>
