Okay, that makes sense to me. In the case of Datasketches it's perhaps even
more necessary: it turns out it has its own concept of Coders, so there's
some "infrastructure" work if you want to ensure binary representation
compatibility. So I can do that for Datasketches, and if I'm inspired to
bring in the ones that wouldn't have a dependency, I can put those in the
"generic" sketching extension if that makes sense for folks.

On Fri, Jan 20, 2023 at 9:57 AM Kenneth Knowles <k...@apache.org> wrote:

> Oh and also I want to say that this is awesome and I've wanted to
> integrate with Datasketches for a long time but was saving it for a
> newcomer since it is (hopefully) mostly wrapping them in CombineFns. Thanks
> for doing this! I don't see the different HLL implementations as redundant
> at all - I view each of them not so much as functionality but as a linkage
> with another project / maintainership. So the one that we maintain is the
> least good, the Zetasketch one is about linking to Google/GCP/BigQuery, and
> the Apache Datasketches one is about linking with that very active project.
>
> Kenn
>
> On Fri, Jan 20, 2023 at 9:54 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> My take: it is useful to isolate dependencies. So, packages that are
>> based on specific other projects like Apache Datasketches benefit from
>> being in their own isolated module in Beam, separate from the
>> Zetasketch-based package.
>>
>> Having a generalized "sketching" package that abstracts away the details
>> so that we can swap out implementations should be a third thing independent
>> of the others IMO and could have some sort of plugin architecture. It is
>> overengineering to do so at this point. And like Byron brought up, a key
>> aspect of sketches is their serialized form being compatible, so the user
>> really needs to know exactly what implementation they are using.
>>
>> Kenn
>>
>> On Wed, Jan 18, 2023 at 12:22 PM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> Another enhancement/modification to the sketching library might be to
>>> introduce generic encodings for at least the major sketches (HLL, Bloom,
>>> Count-Min) that can translate into the major implementations. Talking with
>>> Kenn it sounds like zetasketch has the side benefit of using an encoding
>>> compatible with BigQuery, but in general I think it would be a nice thing
>>> to let users store the sketch payload in, say, files that they could then
>>> be confident would still be mergeable even if the underlying implementation
>>> of that sketch changed.
>>>
>>> On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Thanks Luke, my plan was to mostly add ones that didn't already exist.
>>>> I'd also add that there are other techniques (Max-Gumbel Reservoir
>>>> Sampling, for example) that, so far as I know, aren't in any common
>>>> library but that I happen to know how to implement, which might bias
>>>> towards the general "sketching" library as you say. I generally agree
>>>> that the implementation used should be a detail and not something
>>>> relevant to users.
>>>>
>>>> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> I would suggest adding it to the existing package(s) (either
>>>>> sdks/java/extensions or sdks/java/zetasketch or both depending on if
>>>>> you're replacing existing sketches or adding new ones) since we
>>>>> shouldn't expose a sketching library's API surface. We should make the
>>>>> API take all the relevant parameters since this allows us to move
>>>>> between variants and choose the best sketching library.
>>>>>
>>>>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> I believe that when zetasketch was added, it was also noticeably more
>>>>>> efficient than other sketch implementations. However this was a number of
>>>>>> years ago, and I don't know whether it still has an advantage or not.
>>>>>>
>>>>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I was looking at adding at least a couple of the sketches from the
>>>>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>>>>> folks had a preference for adding to the existing "sketching" extension
>>>>>>> vs splitting it out into its own extension?
>>>>>>>
>>>>>>> The reason I ask is that there's some overlap (which already exists
>>>>>>> in zetasketch) between the sketches available in Datasketches vs Beam
>>>>>>> today, particularly HyperLogLog which would have 3 implementations if we
>>>>>>> were to add all of them.
>>>>>>>
>>>>>>> I don't really have a strong opinion, though personally I'd probably
>>>>>>> lean towards a single sketching extension (zetasketch being something
>>>>>>> of a special case as it exists for format compatibility as far as I
>>>>>>> can tell). But I could see how that could be confusing if you had the
>>>>>>> Apache Datasketches implementation and the existing implementation
>>>>>>> derived from the clearspring implementations.
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Best,
>>>>>>> B
>>>>>>>
>>>>>>
