Re: Cleaning up Approximate Algorithms in Beam

Robert Bradshaw Mon, 18 Nov 2019 10:57:42 -0800

On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <[email protected]> wrote:

> *Ahmet: FWIW, There is a python implementation only for this
> version: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>  *
> Eventually we will be able to make use of cross language transforms to
> help with feature parity. Until then, are we ok with marking this
> deprecated in python, even though we do not have another solution. Or leave
> it as is in Python now, as it does not have sketch capability so can only
> be used for outputting results directly from the pipeline.
>
> *Reuven: I think this is the sort of thing that has been experimental
> forever, and therefore not experimental (e.g. the entire triggering API is
> experimental as are all our file-based sinks). I think that many users use
> this, and probably store the state implicitly in streaming pipelines.*
> True, I have an old action item to try and go through and PR against
> old @experimental annotations but need to find time. So for this
> discussion; I guess this should be marked as deprecated if we change it
> even though its @experimental.
>


Agreed.


> *Rob: I'm not following this--by naming things after their implementation
> rather than their intent I think they will be harder to search for. *
> This is to add to the name the implementation, after the intent. For
> example ApproximateCountDistinctZetaSketch, I believe should be easy to
> search for and it is clear which implementation is used. Allowing for a
> potentially better implementation ApproximateCountDistinct<FooBar>.
>

OK, if we have both I'm more OK with that. This is better than the names
like HllCount, which seems to be what was suggested.

Another approach would be to have a required  parameter which is an enum of
> the implementation options. ApproximateCountDistinct.of().usingImpl(ZETA) ?
>

Ideally this could be an optional parameter, or possibly only required
during update until we figure out a good way for the runner to plug this in
appropreately.

Rob/Kenn: On Combiner discussion, should we tie action items from the needs
> of this thread to this larger discussion?
>
> Cheers
> Reza
>
> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <[email protected]> wrote:
>
>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> Wow. Nice summary, yes. Major calls to action:
>>>
>>> 0. Never allow a combiner that does not include the format of its state
>>> clear in its name/URN. The "update compatibility" problem makes their
>>> internal accumulator state essentially part of their public API. Combiners
>>> named for what they do are an inherent risk, since we might have a new way
>>> to do the same operation with different implementation-detail state.
>>>
>>
>> It seems this will make for a worse user experience, motivated solely by
>> limitations in our implementation. I think we can do better. Hypothetical
>> idea: what if upgrade required access to the original graph (or at least
>> metadata about it) during construction? In this case an ApproximateDistinct
>> could look at what was used last time and try to do the same, but be free
>> to do something better when unconstrained. Another approach would be to
>> encode several alternative expansions in the Beam graph and let the runner
>> do the picking (based on prior submission). (Making the CombineFn, as
>> opposed to the composite, have several alternatives seems harder to reason
>> about, but maybe worth pursuing as well).
>>
>> This is not unique to Combiners, but any stateful DoFn, or composite
>> operations with non-trivial internal structure (and coders). This has been
>> discussed a lot, perhaps there are some ideas there we could borrow?
>>
>> And they will match search terms better, which is a major problem.
>>>
>>
>> I'm not following this--by naming things after their implementation
>> rather than their intent I think they will be harder to search for.
>>
>>
>>> 1. Point users to HllCount. This seems to be the best of the three. Does
>>> it have a name that is clear enough about the format of its state? Noting
>>> that its Java package name includes zetasketch, perhaps.
>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>
>>
>> +1
>>
>>
>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>> Thank you for writing this summary.
>>>>>
>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone;
>>>>>>
>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>> algorithms.
>>>>>>
>>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to 
>>>>>> canvas
>>>>>> opinions here on the possibility of rationalizing these API's by removing
>>>>>> obsolete / less efficient implementations.
>>>>>> The current situation:
>>>>>>
>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>> and HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>> A quick summary of these API's as I understand them:
>>>>>>
>>>>>> HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>> Marked as @Experimental
>>>>>>
>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>> streams based on the ZetaSketch
>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>
>>>>>> ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>> Not Marked with experimental
>>>>>>
>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's 
>>>>>> also
>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>> doc comments give :
>>>>>>
>>>>>> /* The error is about
>>>>>>
>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>
>>>>>> Compared to the default HLLCount sketch size, its error is 10X larger
>>>>>> than the HLL++ error.
>>>>>>
>>>>>
>>>>> FWIW, There is a python implementation only for this version:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>
>>>>>
>>>>>
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>> Marked with @Experimental
>>>>>>
>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>> implementation compared to the HLLCount and we need to be careful to 
>>>>>> ensure
>>>>>> that if we were to change any of these API's that the binary format of 
>>>>>> the
>>>>>> sketches should never change, there could be users who have stored 
>>>>>> previous
>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>> ensure they do not have a bad experience.
>>>>>>
>>>>>>
>>>>>> Proposals:
>>>>>>
>>>>>> There are two classes of users expected for these algorithms:
>>>>>>
>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>> data set in Beam
>>>>>>
>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>> interoperability with other systems, or as features to be used in further
>>>>>> data processing.
>>>>>>
>>>>>>
>>>>>>
>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>> expose the implementation, however for use case 2 it is important for the
>>>>>> implementation to be explicit as sketches produced with one 
>>>>>> implementation
>>>>>> will not work with other implementations.
>>>>>>
>>>>>> ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>> :
>>>>>>
>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>> not as efficient as HLLCount. However it does have a very searchable name
>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>> Approximate unique algorithms but do not need the capabilities of 
>>>>>> sketches.
>>>>>>
>>>>>> Ideally we should think about switching the implementation of this
>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>> way which is not transparent to the end developer.  Although as a result 
>>>>>> of
>>>>>> the change they would get a better implementation for free on an upgrade 
>>>>>> :-)
>>>>>>
>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>> HLLCount. The name is also much clearer.
>>>>>>
>>>>>
>>>>> Marking it deprecated instead of changing its implementation will
>>>>> probably create a less surprising experience for the users.
>>>>>
>>>>>
>>>>>>
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>
>>>>>> This transform does generate sketches as output and given its marked
>>>>>> as @Experimental, one option we would have is to create a name which
>>>>>> includes the algorithm implementation details, for example
>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>
>>>>>
>>>>> Can we remove this, if it is both experimental and providing the same
>>>>> utility as HllCount? Is the concern that users might have stored sketches?
>>>>>
>>>>
>>>> I think this is the sort of thing that has been experimental forever,
>>>> and therefore not experimental (e.g. the entire triggering API is
>>>> experimental as are all our file-based sinks). I think that many users use
>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>> .
>>>>>>
>>>>>> Again we have a few options here, as the name does not include search
>>>>>> words like Approximate, we can create a wrapper which has a name similar 
>>>>>> to
>>>>>> ApproximateCountDistinctZetaSketch.
>>>>>>
>>>>>
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>> Reza
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> This email may be confidential and privileged. If you received this
>>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>>> to the wrong person.
>>>>>>
>>>>>> The above terms reflect a potential business arrangement, are
>>>>>> provided solely as a basis for further discussion, and are not intended 
>>>>>> to
>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>> final form is executed in writing by all parties involved.
>>>>>>
>>>>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

Re: Cleaning up Approximate Algorithms in Beam

Reply via email to