Thanks for bringing this up, Robin.

Can you please add this to the design documents webpage?
https://beam.apache.org/contribute/design-documents/

I left some comments in the doc. It is great that this is finally open,
and even better that it is becoming part of Beam.

I am not sure this feature should go into 'sdks/java/core', because
it seems quite a specific case. Maybe it should go in the sketching
module so it is easier to find, or maybe in its own extension if
the mix of dependencies may be an issue. The gcp module could then
depend on that extension, since I suppose the ultimate goal is to
integrate it there.

CC +arnaudfournier...@gmail.com, original author of the sketching
library, who may be interested in this.


On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <ruw...@google.com> wrote:
>
> Thanks Robin! It would also be interesting if we could offer HLL_COUNT 
> functions in BeamSQL based on your proposal!
>
>
> -Rui
>
> On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <robi...@google.com> wrote:
>>
>> Hi all,
>>
>> I have written a doc proposing we integrate the HyperLogLog++ algorithm into 
>> Beam as a new combiner. The algorithm solves the count-distinct problem, and 
>> the intermediate sketch (aggregator) format will be compatible with sketches 
>> computed via the HLL_COUNT functions in Google Cloud BigQuery (because they 
>> will be based on the same implementation: ZetaSketch). The tracking JIRA 
>> issue is BEAM-7013.
>>
>> The API design proposed in the doc is subject to change and open to 
>> comments. Please feel free to comment if you have any thoughts.
>>
>> Cheers,
>> Robin
