On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <[email protected]> wrote:
> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <[email protected]> wrote: > >> *Ahmet: FWIW, There is a python implementation only for this >> version: >> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> >> * >> Eventually we will be able to make use of cross language transforms to >> help with feature parity. Until then, are we ok with marking this >> deprecated in python, even though we do not have another solution. Or leave >> it as is in Python now, as it does not have sketch capability so can only >> be used for outputting results directly from the pipeline. >> > If it is our intention to add the capability eventually, IMO it makes sense to mark the existing functionality deprecated in Python as well. >> *Reuven: I think this is the sort of thing that has been experimental >> forever, and therefore not experimental (e.g. the entire triggering API is >> experimental as are all our file-based sinks). I think that many users use >> this, and probably store the state implicitly in streaming pipelines.* >> True, I have an old action item to try and go through and PR against >> old @experimental annotations but need to find time. So for this >> discussion; I guess this should be marked as deprecated if we change it >> even though its @experimental. >> > > Agreed. > > >> *Rob: I'm not following this--by naming things after their implementation >> rather than their intent I think they will be harder to search for. * >> This is to add to the name the implementation, after the intent. For >> example ApproximateCountDistinctZetaSketch, I believe should be easy to >> search for and it is clear which implementation is used. Allowing for a >> potentially better implementation ApproximateCountDistinct<FooBar>. >> > > OK, if we have both I'm more OK with that. This is better than the names > like HllCount, which seems to be what was suggested. > > Another approach would be to have a required parameter which is an enum >> of the implementation options. >> ApproximateCountDistinct.of().usingImpl(ZETA) ? >> > > Ideally this could be an optional parameter, or possibly only required > during update until we figure out a good way for the runner to plug this in > appropreately. > > Rob/Kenn: On Combiner discussion, should we tie action items from the >> needs of this thread to this larger discussion? >> >> Cheers >> Reza >> >> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <[email protected]> >> wrote: >> >>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <[email protected]> wrote: >>> >>>> Wow. Nice summary, yes. Major calls to action: >>>> >>>> 0. Never allow a combiner that does not include the format of its state >>>> clear in its name/URN. The "update compatibility" problem makes their >>>> internal accumulator state essentially part of their public API. Combiners >>>> named for what they do are an inherent risk, since we might have a new way >>>> to do the same operation with different implementation-detail state. >>>> >>> >>> It seems this will make for a worse user experience, motivated solely by >>> limitations in our implementation. I think we can do better. Hypothetical >>> idea: what if upgrade required access to the original graph (or at least >>> metadata about it) during construction? In this case an ApproximateDistinct >>> could look at what was used last time and try to do the same, but be free >>> to do something better when unconstrained. Another approach would be to >>> encode several alternative expansions in the Beam graph and let the runner >>> do the picking (based on prior submission). (Making the CombineFn, as >>> opposed to the composite, have several alternatives seems harder to reason >>> about, but maybe worth pursuing as well). >>> >>> This is not unique to Combiners, but any stateful DoFn, or composite >>> operations with non-trivial internal structure (and coders). This has been >>> discussed a lot, perhaps there are some ideas there we could borrow? >>> >>> And they will match search terms better, which is a major problem. >>>> >>> >>> I'm not following this--by naming things after their implementation >>> rather than their intent I think they will be harder to search for. >>> >>> >>>> 1. Point users to HllCount. This seems to be the best of the three. >>>> Does it have a name that is clear enough about the format of its state? >>>> Noting that its Java package name includes zetasketch, perhaps. >>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc. >>>> >>> >>> +1 >>> >>> >>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <[email protected]> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote: >>>>> >>>>>> Thank you for writing this summary. >>>>>> >>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote: >>>>>> >>>>>>> Hi everyone; >>>>>>> >>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count >>>>>>> algorithms. >>>>>>> >>>>>>> Today there are several options for Approximate Algorithms in Apache >>>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to >>>>>>> canvas >>>>>>> opinions here on the possibility of rationalizing these API's by >>>>>>> removing >>>>>>> obsolete / less efficient implementations. >>>>>>> The current situation: >>>>>>> >>>>>>> There are three options available to users: ApproximateUnique.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>, >>>>>>> ApproximateDistinct.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>> and HllCount.java >>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>. >>>>>>> A quick summary of these API's as I understand them: >>>>>>> >>>>>>> HllCount.java >>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>: >>>>>>> Marked as @Experimental >>>>>>> >>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data >>>>>>> streams based on the ZetaSketch >>>>>>> <https://github.com/google/zetasketch> implementation.Detailed >>>>>>> design of this class, see https://s.apache.org/hll-in-beam. >>>>>>> >>>>>>> ApproximateUnique.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>: >>>>>>> Not Marked with experimental >>>>>>> >>>>>>> This API does not expose the ability to create sketches so it's not >>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do >>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's >>>>>>> also >>>>>>> less precise for the same amount of memory used: the error bounds in the >>>>>>> doc comments give : >>>>>>> >>>>>>> /* The error is about >>>>>>> >>>>>>> {@code 2 * / sqrt(sampleSize)},) */ >>>>>>> >>>>>>> Compared to the default HLLCount sketch size, its error is 10X >>>>>>> larger than the HLL++ error. >>>>>>> >>>>>> >>>>>> FWIW, There is a python implementation only for this version: >>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38 >>>>>> >>>>>> >>>>>> >>>>>>> ApproximateDistinct.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>> Marked with @Experimental >>>>>>> >>>>>>> This is a re-implementation of the HLL++ algorithm, based on the >>>>>>> paper published in 2013. It is exposing sketches via a >>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this >>>>>>> implementation compared to the HLLCount and we need to be careful to >>>>>>> ensure >>>>>>> that if we were to change any of these API's that the binary format of >>>>>>> the >>>>>>> sketches should never change, there could be users who have stored >>>>>>> previous >>>>>>> sketches using ApproximateDistinct and it will be important to try and >>>>>>> ensure they do not have a bad experience. >>>>>>> >>>>>>> >>>>>>> Proposals: >>>>>>> >>>>>>> There are two classes of users expected for these algorithms: >>>>>>> >>>>>>> 1) Users who simply use the transform to estimate the size of their >>>>>>> data set in Beam >>>>>>> >>>>>>> 2) Users who want to create sketches and store them, either for >>>>>>> interoperability with other systems, or as features to be used in >>>>>>> further >>>>>>> data processing. >>>>>>> >>>>>>> >>>>>>> >>>>>>> For use case 1, it is possible to make use of naming which does not >>>>>>> expose the implementation, however for use case 2 it is important for >>>>>>> the >>>>>>> implementation to be explicit as sketches produced with one >>>>>>> implementation >>>>>>> will not work with other implementations. >>>>>>> >>>>>>> ApproximateUnique.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java> >>>>>>> : >>>>>>> >>>>>>> This one does not provide sketches and based on the notes above, is >>>>>>> not as efficient as HLLCount. However it does have a very searchable >>>>>>> name >>>>>>> and is likely to be the one users will gravitate to when searching for >>>>>>> Approximate unique algorithms but do not need the capabilities of >>>>>>> sketches. >>>>>>> >>>>>>> Ideally we should think about switching the implementation of this >>>>>>> transform to wrap HLLCount. However this could mean changing things in a >>>>>>> way which is not transparent to the end developer. Although as a >>>>>>> result of >>>>>>> the change they would get a better implementation for free on an >>>>>>> upgrade :-) >>>>>>> >>>>>>> Another option would be to mark this transform as @Deprecated and >>>>>>> create a new transform ApproximateCountDistinct which would wrap >>>>>>> HLLCount. The name is also much clearer. >>>>>>> >>>>>> >>>>>> Marking it deprecated instead of changing its implementation will >>>>>> probably create a less surprising experience for the users. >>>>>> >>>>>> >>>>>>> >>>>>>> ApproximateDistinct.java >>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java> >>>>>>> >>>>>>> This transform does generate sketches as output and given its marked >>>>>>> as @Experimental, one option we would have is to create a name which >>>>>>> includes the algorithm implementation details, for example >>>>>>> ApproximateCountDistinctClearSpring. >>>>>>> >>>>>> >>>>>> Can we remove this, if it is both experimental and providing the same >>>>>> utility as HllCount? Is the concern that users might have stored >>>>>> sketches? >>>>>> >>>>> >>>>> I think this is the sort of thing that has been experimental forever, >>>>> and therefore not experimental (e.g. the entire triggering API is >>>>> experimental as are all our file-based sinks). I think that many users use >>>>> this, and probably store the state implicitly in streaming pipelines. >>>>> >>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> HllCount.java >>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html> >>>>>>> . >>>>>>> >>>>>>> Again we have a few options here, as the name does not include >>>>>>> search words like Approximate, we can create a wrapper which has a name >>>>>>> similar to ApproximateCountDistinctZetaSketch. >>>>>>> >>>>>> >>>>>>> >>>>>>> Thoughts? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Cheers >>>>>>> >>>>>>> >>>>>>> >>>>>>> Reza >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> This email may be confidential and privileged. If you received this >>>>>>> communication by mistake, please don't forward it to anyone else, please >>>>>>> erase all copies and attachments, and please let me know that it has >>>>>>> gone >>>>>>> to the wrong person. >>>>>>> >>>>>>> The above terms reflect a potential business arrangement, are >>>>>>> provided solely as a basis for further discussion, and are not intended >>>>>>> to >>>>>>> be and do not constitute a legally binding obligation. No legally >>>>>>> binding >>>>>>> obligations will be created, implied, or inferred until an agreement in >>>>>>> final form is executed in writing by all parties involved. >>>>>>> >>>>>> >> >> -- >> >> This email may be confidential and privileged. If you received this >> communication by mistake, please don't forward it to anyone else, please >> erase all copies and attachments, and please let me know that it has gone >> to the wrong person. >> >> The above terms reflect a potential business arrangement, are provided >> solely as a basis for further discussion, and are not intended to be and do >> not constitute a legally binding obligation. No legally binding obligations >> will be created, implied, or inferred until an agreement in final form is >> executed in writing by all parties involved. >> >
