Re: Cleaning up Approximate Algorithms in Beam

Reuven Lax Wed, 13 Nov 2019 10:02:04 -0800

On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <[email protected]> wrote:


> Thank you for writing this summary.
>
> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <[email protected]> wrote:
>
>> Hi everyone;
>>
>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>> algorithms.
>>
>> Today there are several options for Approximate Algorithms in Apache Beam
>> 2.16 with HLLCount being the most recently added. Would like to canvas
>> opinions here on the possibility of rationalizing these API's by removing
>> obsolete / less efficient implementations.
>> The current situation:
>>
>> There are three options available to users: ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>> and HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>> A quick summary of these API's as I understand them:
>>
>> HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>> Marked as @Experimental
>>
>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>> streams based on the ZetaSketch <https://github.com/google/zetasketch>
>> implementation.Detailed design of this class, see
>> https://s.apache.org/hll-in-beam.
>>
>> ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>> Not Marked with experimental
>>
>> This API does not expose the ability to create sketches so it's not
>> suitable for the OLAP use case that HLL++ is geared towards (do
>> pre-aggregation into sketches to allow interactive query speeds). It's also
>> less precise for the same amount of memory used: the error bounds in the
>> doc comments give :
>>
>> /* The error is about
>>
>> {@code 2 * / sqrt(sampleSize)},) */
>>
>> Compared to the default HLLCount sketch size, its error is 10X larger
>> than the HLL++ error.
>>
>
> FWIW, There is a python implementation only for this version:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>
>
>
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>> Marked with @Experimental
>>
>> This is a re-implementation of the HLL++ algorithm, based on the paper
>> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
>> have not run any benchmarks to compare this implementation compared to the
>> HLLCount and we need to be careful to ensure that if we were to change any
>> of these API's that the binary format of the sketches should never change,
>> there could be users who have stored previous sketches using
>> ApproximateDistinct and it will be important to try and ensure they do not
>> have a bad experience.
>>
>>
>> Proposals:
>>
>> There are two classes of users expected for these algorithms:
>>
>> 1) Users who simply use the transform to estimate the size of their data
>> set in Beam
>>
>> 2) Users who want to create sketches and store them, either for
>> interoperability with other systems, or as features to be used in further
>> data processing.
>>
>>
>>
>> For use case 1, it is possible to make use of naming which does not
>> expose the implementation, however for use case 2 it is important for the
>> implementation to be explicit as sketches produced with one implementation
>> will not work with other implementations.
>>
>> ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>> :
>>
>> This one does not provide sketches and based on the notes above, is not
>> as efficient as HLLCount. However it does have a very searchable name and
>> is likely to be the one users will gravitate to when searching for
>> Approximate unique algorithms but do not need the capabilities of sketches.
>>
>> Ideally we should think about switching the implementation of this
>> transform to wrap HLLCount. However this could mean changing things in a
>> way which is not transparent to the end developer.  Although as a result of
>> the change they would get a better implementation for free on an upgrade :-)
>>
>> Another option would be to mark this transform as @Deprecated and create
>> a new transform ApproximateCountDistinct which would wrap HLLCount. The
>> name is also much clearer.
>>
>
> Marking it deprecated instead of changing its implementation will probably
> create a less surprising experience for the users.
>
>
>>
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>
>> This transform does generate sketches as output and given its marked as
>> @Experimental, one option we would have is to create a name which includes
>> the algorithm implementation details, for example
>> ApproximateCountDistinctClearSpring.
>>
>
> Can we remove this, if it is both experimental and providing the same
> utility as HllCount? Is the concern that users might have stored sketches?
>

I think this is the sort of thing that has been experimental forever, and
therefore not experimental (e.g. the entire triggering API is experimental
as are all our file-based sinks). I think that many users use this, and
probably store the state implicitly in streaming pipelines.


>
>>
>>
>> HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>> .
>>
>> Again we have a few options here, as the name does not include search
>> words like Approximate, we can create a wrapper which has a name similar to
>> ApproximateCountDistinctZetaSketch.
>>
>
>>
>> Thoughts?
>>
>>
>>
>> Cheers
>>
>>
>>
>> Reza
>>
>>
>>
>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Reply via email to