Hi, I will be making time for completing this in March, with completion and reviews planned for April.
Cheers Reza On Wed, 27 Nov 2019 at 02:29, Robert Bradshaw <rober...@google.com> wrote: > I think this thread is sufficient. > > On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <r...@google.com> wrote: > >> Hi, >> >> So do we need a vote for the final list of actions? Or is this thread >> enough to go ahead and raise the PR's? >> >> Cheers >> >> Reza >> >> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote: >> >>> >>> >>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <rober...@google.com> >>> wrote: >>> >>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <r...@google.com> wrote: >>>> >>>>> *Ahmet: FWIW, There is a python implementation only for this >>>>> version: >>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> >>>>> * >>>>> Eventually we will be able to make use of cross language transforms to >>>>> help with feature parity. Until then, are we ok with marking this >>>>> deprecated in python, even though we do not have another solution. Or >>>>> leave >>>>> it as is in Python now, as it does not have sketch capability so can only >>>>> be used for outputting results directly from the pipeline. >>>>> >>>> >>> If it is our intention to add the capability eventually, IMO it makes >>> sense to mark the existing functionality deprecated in Python as well. >>> >>> >>>>> *Reuven: I think this is the sort of thing that has been experimental >>>>> forever, and therefore not experimental (e.g. the entire triggering API is >>>>> experimental as are all our file-based sinks). I think that many users use >>>>> this, and probably store the state implicitly in streaming pipelines.* >>>>> True, I have an old action item to try and go through and PR against >>>>> old @experimental annotations but need to find time. So for this >>>>> discussion; I guess this should be marked as deprecated if we change it >>>>> even though its @experimental. >>>>> >>>> >>>> Agreed. >>>> >>>> >>>>> *Rob: I'm not following this--by naming things after their >>>>> implementation rather than their intent I think they will be harder to >>>>> search for. * >>>>> This is to add to the name the implementation, after the intent. For >>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to >>>>> search for and it is clear which implementation is used. Allowing for a >>>>> potentially better implementation ApproximateCountDistinct<FooBar>. >>>>> >>>> >>>> OK, if we have both I'm more OK with that. This is better than the >>>> names like HllCount, which seems to be what was suggested. >>>> >>>> Another approach would be to have a required parameter which is an >>>>> enum of the implementation options. >>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ? >>>>> >>>> >>>> Ideally this could be an optional parameter, or possibly only required >>>> during update until we figure out a good way for the runner to plug this in >>>> appropreately. >>>> >>>> Rob/Kenn: On Combiner discussion, should we tie action items from the >>>>> needs of this thread to this larger discussion? >>>>> >>>>> Cheers >>>>> Reza >>>>> >>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <rober...@google.com> >>>>> wrote: >>>>> >>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <k...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Wow. Nice summary, yes. Major calls to action: >>>>>>> >>>>>>> 0. Never allow a combiner that does not include the format of its >>>>>>> state clear in its name/URN. The "update compatibility" problem makes >>>>>>> their >>>>>>> internal accumulator state essentially part of their public API. >>>>>>> Combiners >>>>>>> named for what they do are an inherent risk, since we might have a new >>>>>>> way >>>>>>> to do the same operation with different implementation-detail state. >>>>>>> >>>>>> >>>>>> It seems this will make for a worse user experience, motivated solely >>>>>> by limitations in our implementation. I think we can do better. >>>>>> Hypothetical idea: what if upgrade required access to the original graph >>>>>> (or at least metadata about it) during construction? In this case an >>>>>> ApproximateDistinct could look at what was used last time and try to do >>>>>> the >>>>>> same, but be free to do something better when unconstrained. Another >>>>>> approach would be to encode several alternative expansions in the Beam >>>>>> graph and let the runner do the picking (based on prior submission). >>>>>> (Making the CombineFn, as opposed to the composite, have several >>>>>> alternatives seems harder to reason about, but maybe worth pursuing as >>>>>> well). >>>>>> >>>>>> This is not unique to Combiners, but any stateful DoFn, or composite >>>>>> operations with non-trivial internal structure (and coders). This has >>>>>> been >>>>>> discussed a lot, perhaps there are some ideas there we could borrow? >>>>>> >>>>>> And they will match search terms better, which is a major problem. >>>>>>> >>>>>> >>>>>> I'm not following this--by naming things after their implementation >>>>>> rather than their intent I think they will be harder to search for. >>>>>> >>>>>> >>>>>>> 1. Point users to HllCount. This seems to be the best of the three. >>>>>>> Does it have a name that is clear enough about the format of its state? >>>>>>> Noting that its Java package name includes zetasketch, perhaps. >>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc. >>>>>>> >>>>>> >>>>>> +1 >>>>>> >>>>>> >>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thank you for writing this summary. >>>>>>>>> >>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <r...@google.com> wrote: >>>>>>>>> >>>>>>>>>> Hi everyone; >>>>>>>>>> >>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count >>>>>>>>>> algorithms. >>>>>>>>>> >>>>>>>>>> Today there are several options for Approximate Algorithms in >>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would >>>>>>>>>> like to >>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's >>>>>>>>>> by >>>>>>>>>> removing obsolete / less efficient implementations. >>>>>>>>>> The current situation: >>>>>>>>>> >>>>>>>>>> There are three options available to users: >>>>>>>>>> ApproximateUnique.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>, >>>>>>>>>> ApproximateDistinct.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>>> and HllCount.java >>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>. >>>>>>>>>> A quick summary of these API's as I understand them: >>>>>>>>>> >>>>>>>>>> HllCount.java >>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>: >>>>>>>>>> Marked as @Experimental >>>>>>>>>> >>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on >>>>>>>>>> data streams based on the ZetaSketch >>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed >>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam. >>>>>>>>>> >>>>>>>>>> ApproximateUnique.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>: >>>>>>>>>> Not Marked with experimental >>>>>>>>>> >>>>>>>>>> This API does not expose the ability to create sketches so it's >>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do >>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). >>>>>>>>>> It's also >>>>>>>>>> less precise for the same amount of memory used: the error bounds in >>>>>>>>>> the >>>>>>>>>> doc comments give : >>>>>>>>>> >>>>>>>>>> /* The error is about >>>>>>>>>> >>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */ >>>>>>>>>> >>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X >>>>>>>>>> larger than the HLL++ error. >>>>>>>>>> >>>>>>>>> >>>>>>>>> FWIW, There is a python implementation only for this version: >>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> ApproximateDistinct.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>>> Marked with @Experimental >>>>>>>>>> >>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the >>>>>>>>>> paper published in 2013. It is exposing sketches via a >>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this >>>>>>>>>> implementation compared to the HLLCount and we need to be careful to >>>>>>>>>> ensure >>>>>>>>>> that if we were to change any of these API's that the binary format >>>>>>>>>> of the >>>>>>>>>> sketches should never change, there could be users who have stored >>>>>>>>>> previous >>>>>>>>>> sketches using ApproximateDistinct and it will be important to try >>>>>>>>>> and >>>>>>>>>> ensure they do not have a bad experience. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Proposals: >>>>>>>>>> >>>>>>>>>> There are two classes of users expected for these algorithms: >>>>>>>>>> >>>>>>>>>> 1) Users who simply use the transform to estimate the size of >>>>>>>>>> their data set in Beam >>>>>>>>>> >>>>>>>>>> 2) Users who want to create sketches and store them, either for >>>>>>>>>> interoperability with other systems, or as features to be used in >>>>>>>>>> further >>>>>>>>>> data processing. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> For use case 1, it is possible to make use of naming which does >>>>>>>>>> not expose the implementation, however for use case 2 it is >>>>>>>>>> important for >>>>>>>>>> the implementation to be explicit as sketches produced with one >>>>>>>>>> implementation will not work with other implementations. >>>>>>>>>> >>>>>>>>>> ApproximateUnique.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java> >>>>>>>>>> : >>>>>>>>>> >>>>>>>>>> This one does not provide sketches and based on the notes above, >>>>>>>>>> is not as efficient as HLLCount. However it does have a very >>>>>>>>>> searchable >>>>>>>>>> name and is likely to be the one users will gravitate to when >>>>>>>>>> searching for >>>>>>>>>> Approximate unique algorithms but do not need the capabilities of >>>>>>>>>> sketches. >>>>>>>>>> >>>>>>>>>> Ideally we should think about switching the implementation of >>>>>>>>>> this transform to wrap HLLCount. However this could mean changing >>>>>>>>>> things in >>>>>>>>>> a way which is not transparent to the end developer. Although as a >>>>>>>>>> result >>>>>>>>>> of the change they would get a better implementation for free on an >>>>>>>>>> upgrade >>>>>>>>>> :-) >>>>>>>>>> >>>>>>>>>> Another option would be to mark this transform as @Deprecated and >>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap >>>>>>>>>> HLLCount. The name is also much clearer. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Marking it deprecated instead of changing its implementation will >>>>>>>>> probably create a less surprising experience for the users. >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> ApproximateDistinct.java >>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>>>>> >>>>>>>>>> This transform does generate sketches as output and given its >>>>>>>>>> marked as @Experimental, one option we would have is to create a >>>>>>>>>> name which >>>>>>>>>> includes the algorithm implementation details, for example >>>>>>>>>> ApproximateCountDistinctClearSpring. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Can we remove this, if it is both experimental and providing the >>>>>>>>> same utility as HllCount? Is the concern that users might have stored >>>>>>>>> sketches? >>>>>>>>> >>>>>>>> >>>>>>>> I think this is the sort of thing that has been experimental >>>>>>>> forever, and therefore not experimental (e.g. the entire triggering >>>>>>>> API is >>>>>>>> experimental as are all our file-based sinks). I think that many users >>>>>>>> use >>>>>>>> this, and probably store the state implicitly in streaming pipelines. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> HllCount.java >>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html> >>>>>>>>>> . >>>>>>>>>> >>>>>>>>>> Again we have a few options here, as the name does not include >>>>>>>>>> search words like Approximate, we can create a wrapper which has a >>>>>>>>>> name >>>>>>>>>> similar to ApproximateCountDistinctZetaSketch. >>>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Cheers >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Reza >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> This email may be confidential and privileged. If you received >>>>>>>>>> this communication by mistake, please don't forward it to anyone >>>>>>>>>> else, >>>>>>>>>> please erase all copies and attachments, and please let me know that >>>>>>>>>> it has >>>>>>>>>> gone to the wrong person. >>>>>>>>>> >>>>>>>>>> The above terms reflect a potential business arrangement, are >>>>>>>>>> provided solely as a basis for further discussion, and are not >>>>>>>>>> intended to >>>>>>>>>> be and do not constitute a legally binding obligation. No legally >>>>>>>>>> binding >>>>>>>>>> obligations will be created, implied, or inferred until an agreement >>>>>>>>>> in >>>>>>>>>> final form is executed in writing by all parties involved. >>>>>>>>>> >>>>>>>>> >>>>> >>>>> -- >>>>> >>>>> This email may be confidential and privileged. If you received this >>>>> communication by mistake, please don't forward it to anyone else, please >>>>> erase all copies and attachments, and please let me know that it has gone >>>>> to the wrong person. >>>>> >>>>> The above terms reflect a potential business arrangement, are provided >>>>> solely as a basis for further discussion, and are not intended to be and >>>>> do >>>>> not constitute a legally binding obligation. No legally binding >>>>> obligations >>>>> will be created, implied, or inferred until an agreement in final form is >>>>> executed in writing by all parties involved. >>>>> >>>> >> >> -- >> >> This email may be confidential and privileged. If you received this >> communication by mistake, please don't forward it to anyone else, please >> erase all copies and attachments, and please let me know that it has gone >> to the wrong person. >> >> The above terms reflect a potential business arrangement, are provided >> solely as a basis for further discussion, and are not intended to be and do >> not constitute a legally binding obligation. No legally binding obligations >> will be created, implied, or inferred until an agreement in final form is >> executed in writing by all parties involved. >> >