On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <[email protected]> wrote: > *Ahmet: FWIW, There is a python implementation only for this > version: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 > <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> > * > Eventually we will be able to make use of cross language transforms to > help with feature parity. Until then, are we ok with marking this > deprecated in python, even though we do not have another solution. Or leave > it as is in Python now, as it does not have sketch capability so can only > be used for outputting results directly from the pipeline. > > *Reuven: I think this is the sort of thing that has been experimental > forever, and therefore not experimental (e.g. the entire triggering API is > experimental as are all our file-based sinks). I think that many users use > this, and probably store the state implicitly in streaming pipelines.* > True, I have an old action item to try and go through and PR against > old @experimental annotations but need to find time. So for this > discussion; I guess this should be marked as deprecated if we change it > even though its @experimental. >
Agreed. > *Rob: I'm not following this--by naming things after their implementation > rather than their intent I think they will be harder to search for. * > This is to add to the name the implementation, after the intent. For > example ApproximateCountDistinctZetaSketch, I believe should be easy to > search for and it is clear which implementation is used. Allowing for a > potentially better implementation ApproximateCountDistinct<FooBar>. > OK, if we have both I'm more OK with that. This is better than the names like HllCount, which seems to be what was suggested. Another approach would be to have a required parameter which is an enum of > the implementation options. ApproximateCountDistinct.of().usingImpl(ZETA) ? > Ideally this could be an optional parameter, or possibly only required during update until we figure out a good way for the runner to plug this in appropreately. Rob/Kenn: On Combiner discussion, should we tie action items from the needs > of this thread to this larger discussion? > > Cheers > Reza > > On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <[email protected]> wrote: > >> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <[email protected]> wrote: >> >>> Wow. Nice summary, yes. Major calls to action: >>> >>> 0. Never allow a combiner that does not include the format of its state >>> clear in its name/URN. The "update compatibility" problem makes their >>> internal accumulator state essentially part of their public API. Combiners >>> named for what they do are an inherent risk, since we might have a new way >>> to do the same operation with different implementation-detail state. >>> >> >> It seems this will make for a worse user experience, motivated solely by >> limitations in our implementation. I think we can do better. Hypothetical >> idea: what if upgrade required access to the original graph (or at least >> metadata about it) during construction? In this case an ApproximateDistinct >> could look at what was used last time and try to do the same, but be free >> to do something better when unconstrained. Another approach would be to >> encode several alternative expansions in the Beam graph and let the runner >> do the picking (based on prior submission). (Making the CombineFn, as >> opposed to the composite, have several alternatives seems harder to reason >> about, but maybe worth pursuing as well). >> >> This is not unique to Combiners, but any stateful DoFn, or composite >> operations with non-trivial internal structure (and coders). This has been >> discussed a lot, perhaps there are some ideas there we could borrow? >> >> And they will match search terms better, which is a major problem. >>> >> >> I'm not following this--by naming things after their implementation >> rather than their intent I think they will be harder to search for. >> >> >>> 1. Point users to HllCount. This seems to be the best of the three. Does >>> it have a name that is clear enough about the format of its state? Noting >>> that its Java package name includes zetasketch, perhaps. >>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc. >>> >> >> +1 >> >> >>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <[email protected]> wrote: >>> >>>> >>>> >>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote: >>>> >>>>> Thank you for writing this summary. >>>>> >>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote: >>>>> >>>>>> Hi everyone; >>>>>> >>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count >>>>>> algorithms. >>>>>> >>>>>> Today there are several options for Approximate Algorithms in Apache >>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to >>>>>> canvas >>>>>> opinions here on the possibility of rationalizing these API's by removing >>>>>> obsolete / less efficient implementations. >>>>>> The current situation: >>>>>> >>>>>> There are three options available to users: ApproximateUnique.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>, >>>>>> ApproximateDistinct.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>> and HllCount.java >>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>. >>>>>> A quick summary of these API's as I understand them: >>>>>> >>>>>> HllCount.java >>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>: >>>>>> Marked as @Experimental >>>>>> >>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data >>>>>> streams based on the ZetaSketch >>>>>> <https://github.com/google/zetasketch> implementation.Detailed >>>>>> design of this class, see https://s.apache.org/hll-in-beam. >>>>>> >>>>>> ApproximateUnique.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>: >>>>>> Not Marked with experimental >>>>>> >>>>>> This API does not expose the ability to create sketches so it's not >>>>>> suitable for the OLAP use case that HLL++ is geared towards (do >>>>>> pre-aggregation into sketches to allow interactive query speeds). It's >>>>>> also >>>>>> less precise for the same amount of memory used: the error bounds in the >>>>>> doc comments give : >>>>>> >>>>>> /* The error is about >>>>>> >>>>>> {@code 2 * / sqrt(sampleSize)},) */ >>>>>> >>>>>> Compared to the default HLLCount sketch size, its error is 10X larger >>>>>> than the HLL++ error. >>>>>> >>>>> >>>>> FWIW, There is a python implementation only for this version: >>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>>> >>>>> >>>>> >>>>>> ApproximateDistinct.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>> Marked with @Experimental >>>>>> >>>>>> This is a re-implementation of the HLL++ algorithm, based on the >>>>>> paper published in 2013. It is exposing sketches via a >>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this >>>>>> implementation compared to the HLLCount and we need to be careful to >>>>>> ensure >>>>>> that if we were to change any of these API's that the binary format of >>>>>> the >>>>>> sketches should never change, there could be users who have stored >>>>>> previous >>>>>> sketches using ApproximateDistinct and it will be important to try and >>>>>> ensure they do not have a bad experience. >>>>>> >>>>>> >>>>>> Proposals: >>>>>> >>>>>> There are two classes of users expected for these algorithms: >>>>>> >>>>>> 1) Users who simply use the transform to estimate the size of their >>>>>> data set in Beam >>>>>> >>>>>> 2) Users who want to create sketches and store them, either for >>>>>> interoperability with other systems, or as features to be used in further >>>>>> data processing. >>>>>> >>>>>> >>>>>> >>>>>> For use case 1, it is possible to make use of naming which does not >>>>>> expose the implementation, however for use case 2 it is important for the >>>>>> implementation to be explicit as sketches produced with one >>>>>> implementation >>>>>> will not work with other implementations. >>>>>> >>>>>> ApproximateUnique.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java> >>>>>> : >>>>>> >>>>>> This one does not provide sketches and based on the notes above, is >>>>>> not as efficient as HLLCount. However it does have a very searchable name >>>>>> and is likely to be the one users will gravitate to when searching for >>>>>> Approximate unique algorithms but do not need the capabilities of >>>>>> sketches. >>>>>> >>>>>> Ideally we should think about switching the implementation of this >>>>>> transform to wrap HLLCount. However this could mean changing things in a >>>>>> way which is not transparent to the end developer. Although as a result >>>>>> of >>>>>> the change they would get a better implementation for free on an upgrade >>>>>> :-) >>>>>> >>>>>> Another option would be to mark this transform as @Deprecated and >>>>>> create a new transform ApproximateCountDistinct which would wrap >>>>>> HLLCount. The name is also much clearer. >>>>>> >>>>> >>>>> Marking it deprecated instead of changing its implementation will >>>>> probably create a less surprising experience for the users. >>>>> >>>>> >>>>>> >>>>>> ApproximateDistinct.java >>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>> >>>>>> This transform does generate sketches as output and given its marked >>>>>> as @Experimental, one option we would have is to create a name which >>>>>> includes the algorithm implementation details, for example >>>>>> ApproximateCountDistinctClearSpring. >>>>>> >>>>> >>>>> Can we remove this, if it is both experimental and providing the same >>>>> utility as HllCount? Is the concern that users might have stored sketches? >>>>> >>>> >>>> I think this is the sort of thing that has been experimental forever, >>>> and therefore not experimental (e.g. the entire triggering API is >>>> experimental as are all our file-based sinks). I think that many users use >>>> this, and probably store the state implicitly in streaming pipelines. >>>> >>>> >>>>> >>>>>> >>>>>> >>>>>> HllCount.java >>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html> >>>>>> . >>>>>> >>>>>> Again we have a few options here, as the name does not include search >>>>>> words like Approximate, we can create a wrapper which has a name similar >>>>>> to >>>>>> ApproximateCountDistinctZetaSketch. >>>>>> >>>>> >>>>>> >>>>>> Thoughts? >>>>>> >>>>>> >>>>>> >>>>>> Cheers >>>>>> >>>>>> >>>>>> >>>>>> Reza >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> This email may be confidential and privileged. If you received this >>>>>> communication by mistake, please don't forward it to anyone else, please >>>>>> erase all copies and attachments, and please let me know that it has gone >>>>>> to the wrong person. >>>>>> >>>>>> The above terms reflect a potential business arrangement, are >>>>>> provided solely as a basis for further discussion, and are not intended >>>>>> to >>>>>> be and do not constitute a legally binding obligation. No legally binding >>>>>> obligations will be created, implied, or inferred until an agreement in >>>>>> final form is executed in writing by all parties involved. >>>>>> >>>>> > > -- > > This email may be confidential and privileged. If you received this > communication by mistake, please don't forward it to anyone else, please > erase all copies and attachments, and please let me know that it has gone > to the wrong person. > > The above terms reflect a potential business arrangement, are provided > solely as a basis for further discussion, and are not intended to be and do > not constitute a legally binding obligation. No legally binding obligations > will be created, implied, or inferred until an agreement in final form is > executed in writing by all parties involved. >
