I think this thread is sufficient. On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <r...@google.com> wrote:
> Hi, > > So do we need a vote for the final list of actions? Or is this thread > enough to go ahead and raise the PR's? > > Cheers > > Reza > > On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote: > >> >> >> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <r...@google.com> wrote: >>> >>>> *Ahmet: FWIW, There is a python implementation only for this >>>> version: >>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> >>>> * >>>> Eventually we will be able to make use of cross language transforms to >>>> help with feature parity. Until then, are we ok with marking this >>>> deprecated in python, even though we do not have another solution. Or leave >>>> it as is in Python now, as it does not have sketch capability so can only >>>> be used for outputting results directly from the pipeline. >>>> >>> >> If it is our intention to add the capability eventually, IMO it makes >> sense to mark the existing functionality deprecated in Python as well. >> >> >>>> *Reuven: I think this is the sort of thing that has been experimental >>>> forever, and therefore not experimental (e.g. the entire triggering API is >>>> experimental as are all our file-based sinks). I think that many users use >>>> this, and probably store the state implicitly in streaming pipelines.* >>>> True, I have an old action item to try and go through and PR against >>>> old @experimental annotations but need to find time. So for this >>>> discussion; I guess this should be marked as deprecated if we change it >>>> even though its @experimental. >>>> >>> >>> Agreed. >>> >>> >>>> *Rob: I'm not following this--by naming things after their >>>> implementation rather than their intent I think they will be harder to >>>> search for. * >>>> This is to add to the name the implementation, after the intent. For >>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to >>>> search for and it is clear which implementation is used. Allowing for a >>>> potentially better implementation ApproximateCountDistinct<FooBar>. >>>> >>> >>> OK, if we have both I'm more OK with that. This is better than the names >>> like HllCount, which seems to be what was suggested. >>> >>> Another approach would be to have a required parameter which is an enum >>>> of the implementation options. >>>> ApproximateCountDistinct.of().usingImpl(ZETA) ? >>>> >>> >>> Ideally this could be an optional parameter, or possibly only required >>> during update until we figure out a good way for the runner to plug this in >>> appropreately. >>> >>> Rob/Kenn: On Combiner discussion, should we tie action items from the >>>> needs of this thread to this larger discussion? >>>> >>>> Cheers >>>> Reza >>>> >>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <rober...@google.com> >>>> wrote: >>>> >>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <k...@apache.org> >>>>> wrote: >>>>> >>>>>> Wow. Nice summary, yes. Major calls to action: >>>>>> >>>>>> 0. Never allow a combiner that does not include the format of its >>>>>> state clear in its name/URN. The "update compatibility" problem makes >>>>>> their >>>>>> internal accumulator state essentially part of their public API. >>>>>> Combiners >>>>>> named for what they do are an inherent risk, since we might have a new >>>>>> way >>>>>> to do the same operation with different implementation-detail state. >>>>>> >>>>> >>>>> It seems this will make for a worse user experience, motivated solely >>>>> by limitations in our implementation. I think we can do better. >>>>> Hypothetical idea: what if upgrade required access to the original graph >>>>> (or at least metadata about it) during construction? In this case an >>>>> ApproximateDistinct could look at what was used last time and try to do >>>>> the >>>>> same, but be free to do something better when unconstrained. Another >>>>> approach would be to encode several alternative expansions in the Beam >>>>> graph and let the runner do the picking (based on prior submission). >>>>> (Making the CombineFn, as opposed to the composite, have several >>>>> alternatives seems harder to reason about, but maybe worth pursuing as >>>>> well). >>>>> >>>>> This is not unique to Combiners, but any stateful DoFn, or composite >>>>> operations with non-trivial internal structure (and coders). This has been >>>>> discussed a lot, perhaps there are some ideas there we could borrow? >>>>> >>>>> And they will match search terms better, which is a major problem. >>>>>> >>>>> >>>>> I'm not following this--by naming things after their implementation >>>>> rather than their intent I think they will be harder to search for. >>>>> >>>>> >>>>>> 1. Point users to HllCount. This seems to be the best of the three. >>>>>> Does it have a name that is clear enough about the format of its state? >>>>>> Noting that its Java package name includes zetasketch, perhaps. >>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc. >>>>>> >>>>> >>>>> +1 >>>>> >>>>> >>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Thank you for writing this summary. >>>>>>>> >>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <r...@google.com> wrote: >>>>>>>> >>>>>>>>> Hi everyone; >>>>>>>>> >>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count >>>>>>>>> algorithms. >>>>>>>>> >>>>>>>>> Today there are several options for Approximate Algorithms in >>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would >>>>>>>>> like to >>>>>>>>> canvas opinions here on the possibility of rationalizing these API's >>>>>>>>> by >>>>>>>>> removing obsolete / less efficient implementations. >>>>>>>>> The current situation: >>>>>>>>> >>>>>>>>> There are three options available to users: ApproximateUnique.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>, >>>>>>>>> ApproximateDistinct.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>> and HllCount.java >>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>. >>>>>>>>> A quick summary of these API's as I understand them: >>>>>>>>> >>>>>>>>> HllCount.java >>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>: >>>>>>>>> Marked as @Experimental >>>>>>>>> >>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on >>>>>>>>> data streams based on the ZetaSketch >>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed >>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam. >>>>>>>>> >>>>>>>>> ApproximateUnique.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>: >>>>>>>>> Not Marked with experimental >>>>>>>>> >>>>>>>>> This API does not expose the ability to create sketches so it's >>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do >>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). >>>>>>>>> It's also >>>>>>>>> less precise for the same amount of memory used: the error bounds in >>>>>>>>> the >>>>>>>>> doc comments give : >>>>>>>>> >>>>>>>>> /* The error is about >>>>>>>>> >>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */ >>>>>>>>> >>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X >>>>>>>>> larger than the HLL++ error. >>>>>>>>> >>>>>>>> >>>>>>>> FWIW, There is a python implementation only for this version: >>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> ApproximateDistinct.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>> Marked with @Experimental >>>>>>>>> >>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the >>>>>>>>> paper published in 2013. It is exposing sketches via a >>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this >>>>>>>>> implementation compared to the HLLCount and we need to be careful to >>>>>>>>> ensure >>>>>>>>> that if we were to change any of these API's that the binary format >>>>>>>>> of the >>>>>>>>> sketches should never change, there could be users who have stored >>>>>>>>> previous >>>>>>>>> sketches using ApproximateDistinct and it will be important to try and >>>>>>>>> ensure they do not have a bad experience. >>>>>>>>> >>>>>>>>> >>>>>>>>> Proposals: >>>>>>>>> >>>>>>>>> There are two classes of users expected for these algorithms: >>>>>>>>> >>>>>>>>> 1) Users who simply use the transform to estimate the size of >>>>>>>>> their data set in Beam >>>>>>>>> >>>>>>>>> 2) Users who want to create sketches and store them, either for >>>>>>>>> interoperability with other systems, or as features to be used in >>>>>>>>> further >>>>>>>>> data processing. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> For use case 1, it is possible to make use of naming which does >>>>>>>>> not expose the implementation, however for use case 2 it is important >>>>>>>>> for >>>>>>>>> the implementation to be explicit as sketches produced with one >>>>>>>>> implementation will not work with other implementations. >>>>>>>>> >>>>>>>>> ApproximateUnique.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java> >>>>>>>>> : >>>>>>>>> >>>>>>>>> This one does not provide sketches and based on the notes above, >>>>>>>>> is not as efficient as HLLCount. However it does have a very >>>>>>>>> searchable >>>>>>>>> name and is likely to be the one users will gravitate to when >>>>>>>>> searching for >>>>>>>>> Approximate unique algorithms but do not need the capabilities of >>>>>>>>> sketches. >>>>>>>>> >>>>>>>>> Ideally we should think about switching the implementation of this >>>>>>>>> transform to wrap HLLCount. However this could mean changing things >>>>>>>>> in a >>>>>>>>> way which is not transparent to the end developer. Although as a >>>>>>>>> result of >>>>>>>>> the change they would get a better implementation for free on an >>>>>>>>> upgrade :-) >>>>>>>>> >>>>>>>>> Another option would be to mark this transform as @Deprecated and >>>>>>>>> create a new transform ApproximateCountDistinct which would wrap >>>>>>>>> HLLCount. The name is also much clearer. >>>>>>>>> >>>>>>>> >>>>>>>> Marking it deprecated instead of changing its implementation will >>>>>>>> probably create a less surprising experience for the users. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> ApproximateDistinct.java >>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>> >>>>>>>>> This transform does generate sketches as output and given its >>>>>>>>> marked as @Experimental, one option we would have is to create a name >>>>>>>>> which >>>>>>>>> includes the algorithm implementation details, for example >>>>>>>>> ApproximateCountDistinctClearSpring. >>>>>>>>> >>>>>>>> >>>>>>>> Can we remove this, if it is both experimental and providing the >>>>>>>> same utility as HllCount? Is the concern that users might have stored >>>>>>>> sketches? >>>>>>>> >>>>>>> >>>>>>> I think this is the sort of thing that has been experimental >>>>>>> forever, and therefore not experimental (e.g. the entire triggering API >>>>>>> is >>>>>>> experimental as are all our file-based sinks). I think that many users >>>>>>> use >>>>>>> this, and probably store the state implicitly in streaming pipelines. >>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> HllCount.java >>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html> >>>>>>>>> . >>>>>>>>> >>>>>>>>> Again we have a few options here, as the name does not include >>>>>>>>> search words like Approximate, we can create a wrapper which has a >>>>>>>>> name >>>>>>>>> similar to ApproximateCountDistinctZetaSketch. >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Thoughts? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Cheers >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Reza >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> >>>>>>>>> This email may be confidential and privileged. If you received >>>>>>>>> this communication by mistake, please don't forward it to anyone else, >>>>>>>>> please erase all copies and attachments, and please let me know that >>>>>>>>> it has >>>>>>>>> gone to the wrong person. >>>>>>>>> >>>>>>>>> The above terms reflect a potential business arrangement, are >>>>>>>>> provided solely as a basis for further discussion, and are not >>>>>>>>> intended to >>>>>>>>> be and do not constitute a legally binding obligation. No legally >>>>>>>>> binding >>>>>>>>> obligations will be created, implied, or inferred until an agreement >>>>>>>>> in >>>>>>>>> final form is executed in writing by all parties involved. >>>>>>>>> >>>>>>>> >>>> >>>> -- >>>> >>>> This email may be confidential and privileged. If you received this >>>> communication by mistake, please don't forward it to anyone else, please >>>> erase all copies and attachments, and please let me know that it has gone >>>> to the wrong person. >>>> >>>> The above terms reflect a potential business arrangement, are provided >>>> solely as a basis for further discussion, and are not intended to be and do >>>> not constitute a legally binding obligation. No legally binding obligations >>>> will be created, implied, or inferred until an agreement in final form is >>>> executed in writing by all parties involved. >>>> >>> > > -- > > This email may be confidential and privileged. If you received this > communication by mistake, please don't forward it to anyone else, please > erase all copies and attachments, and please let me know that it has gone > to the wrong person. > > The above terms reflect a potential business arrangement, are provided > solely as a basis for further discussion, and are not intended to be and do > not constitute a legally binding obligation. No legally binding obligations > will be created, implied, or inferred until an agreement in final form is > executed in writing by all parties involved. >