Re: Cleaning up Approximate Algorithms in Beam

Robert Bradshaw Tue, 26 Nov 2019 10:29:58 -0800

I think this thread is sufficient.

On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <r...@google.com> wrote:


> Hi,
>
> So do we need a vote for the final list of actions? Or is this thread
> enough to go ahead and raise the PR's?
>
> Cheers
>
> Reza
>
> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <r...@google.com> wrote:
>>>
>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>> version: 
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>>>>  *
>>>> Eventually we will be able to make use of cross language transforms to
>>>> help with feature parity. Until then, are we ok with marking this
>>>> deprecated in python, even though we do not have another solution. Or leave
>>>> it as is in Python now, as it does not have sketch capability so can only
>>>> be used for outputting results directly from the pipeline.
>>>>
>>>
>> If it is our intention to add the capability eventually, IMO it makes
>> sense to mark the existing functionality deprecated in Python as well.
>>
>>
>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>> experimental as are all our file-based sinks). I think that many users use
>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>> True, I have an old action item to try and go through and PR against
>>>> old @experimental annotations but need to find time. So for this
>>>> discussion; I guess this should be marked as deprecated if we change it
>>>> even though its @experimental.
>>>>
>>>
>>> Agreed.
>>>
>>>
>>>> *Rob: I'm not following this--by naming things after their
>>>> implementation rather than their intent I think they will be harder to
>>>> search for. *
>>>> This is to add to the name the implementation, after the intent. For
>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>> search for and it is clear which implementation is used. Allowing for a
>>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>>
>>>
>>> OK, if we have both I'm more OK with that. This is better than the names
>>> like HllCount, which seems to be what was suggested.
>>>
>>> Another approach would be to have a required  parameter which is an enum
>>>> of the implementation options.
>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>
>>>
>>> Ideally this could be an optional parameter, or possibly only required
>>> during update until we figure out a good way for the runner to plug this in
>>> appropreately.
>>>
>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>> needs of this thread to this larger discussion?
>>>>
>>>> Cheers
>>>> Reza
>>>>
>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <k...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>
>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>> state clear in its name/URN. The "update compatibility" problem makes 
>>>>>> their
>>>>>> internal accumulator state essentially part of their public API. 
>>>>>> Combiners
>>>>>> named for what they do are an inherent risk, since we might have a new 
>>>>>> way
>>>>>> to do the same operation with different implementation-detail state.
>>>>>>
>>>>>
>>>>> It seems this will make for a worse user experience, motivated solely
>>>>> by limitations in our implementation. I think we can do better.
>>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>>> (or at least metadata about it) during construction? In this case an
>>>>> ApproximateDistinct could look at what was used last time and try to do 
>>>>> the
>>>>> same, but be free to do something better when unconstrained. Another
>>>>> approach would be to encode several alternative expansions in the Beam
>>>>> graph and let the runner do the picking (based on prior submission).
>>>>> (Making the CombineFn, as opposed to the composite, have several
>>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>>> well).
>>>>>
>>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>>> operations with non-trivial internal structure (and coders). This has been
>>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>>
>>>>> And they will match search terms better, which is a major problem.
>>>>>>
>>>>>
>>>>> I'm not following this--by naming things after their implementation
>>>>> rather than their intent I think they will be harder to search for.
>>>>>
>>>>>
>>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>>> Does it have a name that is clear enough about the format of its state?
>>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you for writing this summary.
>>>>>>>>
>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <r...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone;
>>>>>>>>>
>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>>> algorithms.
>>>>>>>>>
>>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would 
>>>>>>>>> like to
>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's 
>>>>>>>>> by
>>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>>> The current situation:
>>>>>>>>>
>>>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>> and HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>>
>>>>>>>>> HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>>> Marked as @Experimental
>>>>>>>>>
>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on
>>>>>>>>> data streams based on the ZetaSketch
>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>>
>>>>>>>>> ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>>> Not Marked with experimental
>>>>>>>>>
>>>>>>>>> This API does not expose the ability to create sketches so it's
>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). 
>>>>>>>>> It's also
>>>>>>>>> less precise for the same amount of memory used: the error bounds in 
>>>>>>>>> the
>>>>>>>>> doc comments give :
>>>>>>>>>
>>>>>>>>> /* The error is about
>>>>>>>>>
>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>>
>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>>> larger than the HLL++ error.
>>>>>>>>>
>>>>>>>>
>>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>> Marked with @Experimental
>>>>>>>>>
>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>>> implementation compared to the HLLCount and we need to be careful to 
>>>>>>>>> ensure
>>>>>>>>> that if we were to change any of these API's that the binary format 
>>>>>>>>> of the
>>>>>>>>> sketches should never change, there could be users who have stored 
>>>>>>>>> previous
>>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Proposals:
>>>>>>>>>
>>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>>
>>>>>>>>> 1) Users who simply use the transform to estimate the size of
>>>>>>>>> their data set in Beam
>>>>>>>>>
>>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>>> interoperability with other systems, or as features to be used in 
>>>>>>>>> further
>>>>>>>>> data processing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For use case 1, it is possible to make use of naming which does
>>>>>>>>> not expose the implementation, however for use case 2 it is important 
>>>>>>>>> for
>>>>>>>>> the implementation to be explicit as sketches produced with one
>>>>>>>>> implementation will not work with other implementations.
>>>>>>>>>
>>>>>>>>> ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>>> :
>>>>>>>>>
>>>>>>>>> This one does not provide sketches and based on the notes above,
>>>>>>>>> is not as efficient as HLLCount. However it does have a very 
>>>>>>>>> searchable
>>>>>>>>> name and is likely to be the one users will gravitate to when 
>>>>>>>>> searching for
>>>>>>>>> Approximate unique algorithms but do not need the capabilities of 
>>>>>>>>> sketches.
>>>>>>>>>
>>>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>>>> transform to wrap HLLCount. However this could mean changing things 
>>>>>>>>> in a
>>>>>>>>> way which is not transparent to the end developer.  Although as a 
>>>>>>>>> result of
>>>>>>>>> the change they would get a better implementation for free on an 
>>>>>>>>> upgrade :-)
>>>>>>>>>
>>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>>> probably create a less surprising experience for the users.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>
>>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>>> marked as @Experimental, one option we would have is to create a name 
>>>>>>>>> which
>>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>>> sketches?
>>>>>>>>
>>>>>>>
>>>>>>> I think this is the sort of thing that has been experimental
>>>>>>> forever, and therefore not experimental (e.g. the entire triggering API 
>>>>>>> is
>>>>>>> experimental as are all our file-based sinks). I think that many users 
>>>>>>> use
>>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>>> search words like Approximate, we can create a wrapper which has a 
>>>>>>>>> name
>>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Reza
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> This email may be confidential and privileged. If you received
>>>>>>>>> this communication by mistake, please don't forward it to anyone else,
>>>>>>>>> please erase all copies and attachments, and please let me know that 
>>>>>>>>> it has
>>>>>>>>> gone to the wrong person.
>>>>>>>>>
>>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>>> provided solely as a basis for further discussion, and are not 
>>>>>>>>> intended to
>>>>>>>>> be and do not constitute a legally binding obligation. No legally 
>>>>>>>>> binding
>>>>>>>>> obligations will be created, implied, or inferred until an agreement 
>>>>>>>>> in
>>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>>
>>>> This email may be confidential and privileged. If you received this
>>>> communication by mistake, please don't forward it to anyone else, please
>>>> erase all copies and attachments, and please let me know that it has gone
>>>> to the wrong person.
>>>>
>>>> The above terms reflect a potential business arrangement, are provided
>>>> solely as a basis for further discussion, and are not intended to be and do
>>>> not constitute a legally binding obligation. No legally binding obligations
>>>> will be created, implied, or inferred until an agreement in final form is
>>>> executed in writing by all parties involved.
>>>>
>>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

Re: Cleaning up Approximate Algorithms in Beam

Reply via email to