Re: Cleaning up Approximate Algorithms in Beam

Reza Rokni Mon, 25 Nov 2019 17:59:22 -0800

Hi,

So do we need a vote for the final list of actions? Or is this thread
enough to go ahead and raise the PR's?


Cheers

Reza

On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <[email protected]> wrote:

>
>
> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <[email protected]> wrote:
>>
>>> *Ahmet: FWIW, There is a python implementation only for this
>>> version: 
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>>>  *
>>> Eventually we will be able to make use of cross language transforms to
>>> help with feature parity. Until then, are we ok with marking this
>>> deprecated in python, even though we do not have another solution. Or leave
>>> it as is in Python now, as it does not have sketch capability so can only
>>> be used for outputting results directly from the pipeline.
>>>
>>
> If it is our intention to add the capability eventually, IMO it makes
> sense to mark the existing functionality deprecated in Python as well.
>
>
>>> *Reuven: I think this is the sort of thing that has been experimental
>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>> experimental as are all our file-based sinks). I think that many users use
>>> this, and probably store the state implicitly in streaming pipelines.*
>>> True, I have an old action item to try and go through and PR against
>>> old @experimental annotations but need to find time. So for this
>>> discussion; I guess this should be marked as deprecated if we change it
>>> even though its @experimental.
>>>
>>
>> Agreed.
>>
>>
>>> *Rob: I'm not following this--by naming things after their
>>> implementation rather than their intent I think they will be harder to
>>> search for. *
>>> This is to add to the name the implementation, after the intent. For
>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>> search for and it is clear which implementation is used. Allowing for a
>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>
>>
>> OK, if we have both I'm more OK with that. This is better than the names
>> like HllCount, which seems to be what was suggested.
>>
>> Another approach would be to have a required  parameter which is an enum
>>> of the implementation options.
>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>
>>
>> Ideally this could be an optional parameter, or possibly only required
>> during update until we figure out a good way for the runner to plug this in
>> appropreately.
>>
>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>> needs of this thread to this larger discussion?
>>>
>>> Cheers
>>> Reza
>>>
>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>
>>>>> 0. Never allow a combiner that does not include the format of its
>>>>> state clear in its name/URN. The "update compatibility" problem makes 
>>>>> their
>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>> to do the same operation with different implementation-detail state.
>>>>>
>>>>
>>>> It seems this will make for a worse user experience, motivated solely
>>>> by limitations in our implementation. I think we can do better.
>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>> (or at least metadata about it) during construction? In this case an
>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>> same, but be free to do something better when unconstrained. Another
>>>> approach would be to encode several alternative expansions in the Beam
>>>> graph and let the runner do the picking (based on prior submission).
>>>> (Making the CombineFn, as opposed to the composite, have several
>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>> well).
>>>>
>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>> operations with non-trivial internal structure (and coders). This has been
>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>
>>>> And they will match search terms better, which is a major problem.
>>>>>
>>>>
>>>> I'm not following this--by naming things after their implementation
>>>> rather than their intent I think they will be harder to search for.
>>>>
>>>>
>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>> Does it have a name that is clear enough about the format of its state?
>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you for writing this summary.
>>>>>>>
>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone;
>>>>>>>>
>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>> algorithms.
>>>>>>>>
>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would 
>>>>>>>> like to
>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>> The current situation:
>>>>>>>>
>>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>> and HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>
>>>>>>>> HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>> Marked as @Experimental
>>>>>>>>
>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>>>> streams based on the ZetaSketch
>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>
>>>>>>>> ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>> Not Marked with experimental
>>>>>>>>
>>>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's 
>>>>>>>> also
>>>>>>>> less precise for the same amount of memory used: the error bounds in 
>>>>>>>> the
>>>>>>>> doc comments give :
>>>>>>>>
>>>>>>>> /* The error is about
>>>>>>>>
>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>
>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>> larger than the HLL++ error.
>>>>>>>>
>>>>>>>
>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>> Marked with @Experimental
>>>>>>>>
>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>> implementation compared to the HLLCount and we need to be careful to 
>>>>>>>> ensure
>>>>>>>> that if we were to change any of these API's that the binary format of 
>>>>>>>> the
>>>>>>>> sketches should never change, there could be users who have stored 
>>>>>>>> previous
>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>
>>>>>>>>
>>>>>>>> Proposals:
>>>>>>>>
>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>
>>>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>>>> data set in Beam
>>>>>>>>
>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>> interoperability with other systems, or as features to be used in 
>>>>>>>> further
>>>>>>>> data processing.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>>>> expose the implementation, however for use case 2 it is important for 
>>>>>>>> the
>>>>>>>> implementation to be explicit as sketches produced with one 
>>>>>>>> implementation
>>>>>>>> will not work with other implementations.
>>>>>>>>
>>>>>>>> ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>> :
>>>>>>>>
>>>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>>>> not as efficient as HLLCount. However it does have a very searchable 
>>>>>>>> name
>>>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>>>> Approximate unique algorithms but do not need the capabilities of 
>>>>>>>> sketches.
>>>>>>>>
>>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>>> transform to wrap HLLCount. However this could mean changing things in 
>>>>>>>> a
>>>>>>>> way which is not transparent to the end developer.  Although as a 
>>>>>>>> result of
>>>>>>>> the change they would get a better implementation for free on an 
>>>>>>>> upgrade :-)
>>>>>>>>
>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>
>>>>>>>
>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>> probably create a less surprising experience for the users.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>
>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>> marked as @Experimental, one option we would have is to create a name 
>>>>>>>> which
>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>
>>>>>>>
>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>> sketches?
>>>>>>>
>>>>>>
>>>>>> I think this is the sort of thing that has been experimental forever,
>>>>>> and therefore not experimental (e.g. the entire triggering API is
>>>>>> experimental as are all our file-based sinks). I think that many users 
>>>>>> use
>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>> .
>>>>>>>>
>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Reza
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> This email may be confidential and privileged. If you received this
>>>>>>>> communication by mistake, please don't forward it to anyone else, 
>>>>>>>> please
>>>>>>>> erase all copies and attachments, and please let me know that it has 
>>>>>>>> gone
>>>>>>>> to the wrong person.
>>>>>>>>
>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>> provided solely as a basis for further discussion, and are not 
>>>>>>>> intended to
>>>>>>>> be and do not constitute a legally binding obligation. No legally 
>>>>>>>> binding
>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>>
>>> The above terms reflect a potential business arrangement, are provided
>>> solely as a basis for further discussion, and are not intended to be and do
>>> not constitute a legally binding obligation. No legally binding obligations
>>> will be created, implied, or inferred until an agreement in final form is
>>> executed in writing by all parties involved.
>>>
>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: Cleaning up Approximate Algorithms in Beam

Reply via email to