Oh, and I also want to say that this is awesome. I've wanted to integrate with Datasketches for a long time but was saving it for a newcomer, since it is (hopefully) mostly a matter of wrapping the sketches in CombineFns. Thanks for doing this!

I don't see the different HLL implementations as redundant at all. I view each of them not so much as functionality but as a linkage with another project / maintainership. So the one that we maintain is the least good, the Zetasketch one is about linking to Google/GCP/BigQuery, and the Apache Datasketches one is about linking with that very active project.

Kenn

On Fri, Jan 20, 2023 at 9:54 AM Kenneth Knowles <k...@apache.org> wrote:

> My take: it is useful to isolate dependencies. So, packages that are based
> on specific other projects like Apache Datasketches benefit from being in
> their own isolated module in Beam, separate from the Zetasketch-based
> package.
>
> Having a generalized "sketching" package that abstracts away the details
> so that we can swap out implementation should be a third thing independent
> of the others IMO and could have some sort of plugin architecture. It is
> overengineering to do so at this point. And like Byron brought up, a key
> aspect of sketches is their serialized form being compatible so the user
> really needs to know exactly what implementation they are using.
>
> Kenn
>
> On Wed, Jan 18, 2023 at 12:22 PM Byron Ellis via dev <dev@beam.apache.org>
> wrote:
>
>> Another enhancement/modification to the sketching library might be to
>> introduce generic encodings for at least the major sketches (HLL, Bloom,
>> Count-Min) that can translate into the major implementations. Talking with
>> Kenn it sounds like zetasketch has the side benefit of using an encoding
>> compatible with BigQuery, but in general I think it would be a nice thing
>> to let users store the sketch payload in, say, files that they could then
>> be confident would still be mergeable even if the underlying implementation
>> of that sketch changed.
>>
>> On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis <byronel...@google.com>
>> wrote:
>>
>>> Thanks Luke, my plan was to mostly add ones that didn't already exist.
>>> I'd also add that there are other techniques (Max-Gumbel Reservoir Sampling
>>> for example) that aren't in any common library so far as I know that I
>>> happen to know how to implement which might bias towards the general
>>> "sketching" library as you say. I generally agree that implementation used
>>> should be a detail and not something relevant to users.
>>>
>>> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I would suggest adding it to the existing package(s) (either
>>>> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
>>>> replacing existing sketches or adding new ones) since we shouldn't expose
>>>> the sketching libraries' API surface. We should make the API take all the
>>>> relevant parameters since this allows us to move between variants and
>>>> choose the best sketching library.
>>>>
>>>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev
>>>> <dev@beam.apache.org> wrote:
>>>>
>>>>> I believe that when zetasketch was added, it was also noticeably more
>>>>> efficient than other sketch implementations. However this was a number of
>>>>> years ago, and I don't know whether it still has an advantage or not.
>>>>>
>>>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev
>>>>> <dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I was looking at adding at least a couple of the sketches from the
>>>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>>>> folks had a preference for adding to the existing "sketching" extension
>>>>>> vs splitting it out into its own extension?
>>>>>>
>>>>>> The reason I ask is that there's some overlap (which already exists
>>>>>> in zetasketch) between the sketches available in Datasketches vs Beam
>>>>>> today, particularly HyperLogLog which would have 3 implementations if we
>>>>>> were to add all of them.
>>>>>>
>>>>>> I don't really have a strong opinion, though personally I'd probably
>>>>>> lean towards a single sketching extension (zetasketch being something of
>>>>>> a special case as it exists for format compatibility as far as I can tell).
>>>>>> But I could see how that could be confusing if you had the Apache
>>>>>> Datasketches implementation and the existing implementation derived from
>>>>>> the clearspring implementations.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Best,
>>>>>> B
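
For anyone following along, the "mostly wrapping them in CombineFns" remark at the top of the thread can be illustrated roughly as below. This is only a minimal sketch under stated assumptions, not the proposed extension: `SimpleCombineFn` is a hypothetical local stand-in that mirrors the four core methods of Beam's `Combine.CombineFn` (so the example compiles without Beam on the classpath), and `ToyHll` is a deliberately tiny, made-up HyperLogLog, not the Datasketches, Zetasketch, or clearspring implementation. A real integration would presumably delegate to the library's sketch and union types and would also need a Coder for the accumulator.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in mirroring the four core methods of Beam's Combine.CombineFn.
interface SimpleCombineFn<InputT, AccumT, OutputT> {
  AccumT createAccumulator();
  AccumT addInput(AccumT accumulator, InputT input);
  AccumT mergeAccumulators(Iterable<AccumT> accumulators);
  OutputT extractOutput(AccumT accumulator);
}

// Toy HyperLogLog for illustration only (assumption: 2^10 registers, FNV-1a based hash).
final class ToyHll {
  static final int P = 10;      // number of index bits
  static final int M = 1 << P;  // number of registers
  final byte[] registers = new byte[M];

  void add(String value) {
    long h = hash64(value);
    int idx = (int) (h >>> (64 - P));  // top P bits select a register
    long rest = h << P;                // remaining bits, left-aligned
    int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
    if (rank > registers[idx]) {
      registers[idx] = (byte) rank;
    }
  }

  // Merging is a register-wise max, which is what makes the sketch combinable.
  void mergeFrom(ToyHll other) {
    for (int i = 0; i < M; i++) {
      registers[i] = (byte) Math.max(registers[i], other.registers[i]);
    }
  }

  double estimate() {
    double sum = 0;
    int zeros = 0;
    for (byte r : registers) {
      sum += Math.pow(2.0, -r);
      if (r == 0) {
        zeros++;
      }
    }
    double alpha = 0.7213 / (1.0 + 1.079 / M);
    double raw = alpha * M * M / sum;
    // Small-range correction: fall back to linear counting while empty registers remain.
    if (raw <= 2.5 * M && zeros > 0) {
      return M * Math.log((double) M / zeros);
    }
    return raw;
  }

  // FNV-1a over the UTF-8 bytes, followed by a 64-bit finalizer to spread the bits.
  static long hash64(String s) {
    long h = 0xcbf29ce484222325L;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x100000001b3L;
    }
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
  }
}

// The wrapper itself: each CombineFn method is a thin call into the sketch.
final class ToyHllCombineFn implements SimpleCombineFn<String, ToyHll, Double> {
  public ToyHll createAccumulator() {
    return new ToyHll();
  }

  public ToyHll addInput(ToyHll accumulator, String input) {
    accumulator.add(input);
    return accumulator;
  }

  public ToyHll mergeAccumulators(Iterable<ToyHll> accumulators) {
    ToyHll merged = new ToyHll();
    for (ToyHll a : accumulators) {
      merged.mergeFrom(a);
    }
    return merged;
  }

  public Double extractOutput(ToyHll accumulator) {
    return accumulator.estimate();
  }
}

final class SketchCombineDemo {
  public static void main(String[] args) {
    ToyHllCombineFn fn = new ToyHllCombineFn();
    ToyHll left = fn.createAccumulator();
    ToyHll right = fn.createAccumulator();
    for (int i = 0; i < 600; i++) {
      fn.addInput(left, "user-" + i);
    }
    for (int i = 400; i < 1000; i++) {
      fn.addInput(right, "user-" + i);
    }
    ToyHll merged = fn.mergeAccumulators(Arrays.asList(left, right));
    System.out.println("estimated distinct users: " + fn.extractOutput(merged));
  }
}
```

The demo feeds two overlapping batches into separate accumulators and merges them, which is the same shape a runner uses when combining partial results. It also hints at why the serialized form matters in the discussion above: two accumulators can only be merged if they agree on the register layout and hash function, so sketches produced by different implementations are generally not interchangeable.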