Re: Cleaning up Approximate Algorithms in Beam

Ahmet Altay Mon, 25 Nov 2019 14:02:31 -0800

On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <[email protected]>
wrote:


> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <[email protected]> wrote:
>
>> *Ahmet: FWIW, There is a python implementation only for this
>> version: 
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>>  *
>> Eventually we will be able to make use of cross language transforms to
>> help with feature parity. Until then, are we ok with marking this
>> deprecated in python, even though we do not have another solution. Or leave
>> it as is in Python now, as it does not have sketch capability so can only
>> be used for outputting results directly from the pipeline.
>>
>
If it is our intention to add the capability eventually, IMO it makes sense
to mark the existing functionality deprecated in Python as well.


>> *Reuven: I think this is the sort of thing that has been experimental
>> forever, and therefore not experimental (e.g. the entire triggering API is
>> experimental as are all our file-based sinks). I think that many users use
>> this, and probably store the state implicitly in streaming pipelines.*
>> True, I have an old action item to try and go through and PR against
>> old @experimental annotations but need to find time. So for this
>> discussion; I guess this should be marked as deprecated if we change it
>> even though its @experimental.
>>
>
> Agreed.
>
>
>> *Rob: I'm not following this--by naming things after their implementation
>> rather than their intent I think they will be harder to search for. *
>> This is to add to the name the implementation, after the intent. For
>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>> search for and it is clear which implementation is used. Allowing for a
>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>
>
> OK, if we have both I'm more OK with that. This is better than the names
> like HllCount, which seems to be what was suggested.
>
> Another approach would be to have a required  parameter which is an enum
>> of the implementation options.
>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>
>
> Ideally this could be an optional parameter, or possibly only required
> during update until we figure out a good way for the runner to plug this in
> appropreately.
>
> Rob/Kenn: On Combiner discussion, should we tie action items from the
>> needs of this thread to this larger discussion?
>>
>> Cheers
>> Reza
>>
>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> Wow. Nice summary, yes. Major calls to action:
>>>>
>>>> 0. Never allow a combiner that does not include the format of its state
>>>> clear in its name/URN. The "update compatibility" problem makes their
>>>> internal accumulator state essentially part of their public API. Combiners
>>>> named for what they do are an inherent risk, since we might have a new way
>>>> to do the same operation with different implementation-detail state.
>>>>
>>>
>>> It seems this will make for a worse user experience, motivated solely by
>>> limitations in our implementation. I think we can do better. Hypothetical
>>> idea: what if upgrade required access to the original graph (or at least
>>> metadata about it) during construction? In this case an ApproximateDistinct
>>> could look at what was used last time and try to do the same, but be free
>>> to do something better when unconstrained. Another approach would be to
>>> encode several alternative expansions in the Beam graph and let the runner
>>> do the picking (based on prior submission). (Making the CombineFn, as
>>> opposed to the composite, have several alternatives seems harder to reason
>>> about, but maybe worth pursuing as well).
>>>
>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>> operations with non-trivial internal structure (and coders). This has been
>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>
>>> And they will match search terms better, which is a major problem.
>>>>
>>>
>>> I'm not following this--by naming things after their implementation
>>> rather than their intent I think they will be harder to search for.
>>>
>>>
>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>> Does it have a name that is clear enough about the format of its state?
>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>
>>>
>>> +1
>>>
>>>
>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>> Thank you for writing this summary.
>>>>>>
>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everyone;
>>>>>>>
>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>> algorithms.
>>>>>>>
>>>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to 
>>>>>>> canvas
>>>>>>> opinions here on the possibility of rationalizing these API's by 
>>>>>>> removing
>>>>>>> obsolete / less efficient implementations.
>>>>>>> The current situation:
>>>>>>>
>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>> and HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>
>>>>>>> HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>> Marked as @Experimental
>>>>>>>
>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>>> streams based on the ZetaSketch
>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>
>>>>>>> ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>> Not Marked with experimental
>>>>>>>
>>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's 
>>>>>>> also
>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>> doc comments give :
>>>>>>>
>>>>>>> /* The error is about
>>>>>>>
>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>
>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>> larger than the HLL++ error.
>>>>>>>
>>>>>>
>>>>>> FWIW, There is a python implementation only for this version:
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>
>>>>>>
>>>>>>
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>> Marked with @Experimental
>>>>>>>
>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>> implementation compared to the HLLCount and we need to be careful to 
>>>>>>> ensure
>>>>>>> that if we were to change any of these API's that the binary format of 
>>>>>>> the
>>>>>>> sketches should never change, there could be users who have stored 
>>>>>>> previous
>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>> ensure they do not have a bad experience.
>>>>>>>
>>>>>>>
>>>>>>> Proposals:
>>>>>>>
>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>
>>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>>> data set in Beam
>>>>>>>
>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>> interoperability with other systems, or as features to be used in 
>>>>>>> further
>>>>>>> data processing.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>>> expose the implementation, however for use case 2 it is important for 
>>>>>>> the
>>>>>>> implementation to be explicit as sketches produced with one 
>>>>>>> implementation
>>>>>>> will not work with other implementations.
>>>>>>>
>>>>>>> ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>> :
>>>>>>>
>>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>>> not as efficient as HLLCount. However it does have a very searchable 
>>>>>>> name
>>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>>> Approximate unique algorithms but do not need the capabilities of 
>>>>>>> sketches.
>>>>>>>
>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>>> way which is not transparent to the end developer.  Although as a 
>>>>>>> result of
>>>>>>> the change they would get a better implementation for free on an 
>>>>>>> upgrade :-)
>>>>>>>
>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>
>>>>>>
>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>> probably create a less surprising experience for the users.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>
>>>>>>> This transform does generate sketches as output and given its marked
>>>>>>> as @Experimental, one option we would have is to create a name which
>>>>>>> includes the algorithm implementation details, for example
>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>
>>>>>>
>>>>>> Can we remove this, if it is both experimental and providing the same
>>>>>> utility as HllCount? Is the concern that users might have stored 
>>>>>> sketches?
>>>>>>
>>>>>
>>>>> I think this is the sort of thing that has been experimental forever,
>>>>> and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>> .
>>>>>>>
>>>>>>> Again we have a few options here, as the name does not include
>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Reza
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> This email may be confidential and privileged. If you received this
>>>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>>>> erase all copies and attachments, and please let me know that it has 
>>>>>>> gone
>>>>>>> to the wrong person.
>>>>>>>
>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>> provided solely as a basis for further discussion, and are not intended 
>>>>>>> to
>>>>>>> be and do not constitute a legally binding obligation. No legally 
>>>>>>> binding
>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>
>>>>>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Reply via email to