Thanks for bringing this up, Robin. Can you please add it to the design documents webpage?
https://beam.apache.org/contribute/design-documents/
I left some comments in the doc. It is great that this is finally open, and even better that it becomes part of Beam.

I am not sure this feature should go into 'sdks/java/core', because it seems a quite specific case. Maybe it should go in the sketching module, where it would be easier to find, or in its own extension if the 'mix' of dependencies may be an issue. The gcp module could then depend on that extension, since I suppose the ultimate goal is to integrate it there.

CC +arnaudfournier...@gmail.com, original author of the sketching library, who may be interested in this.

On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <ruw...@google.com> wrote:
>
> Thanks Robin! It would also be interesting if we could offer HLL_COUNT
> functions in BeamSQL based on your proposal!
>
>
> -Rui
>
> On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <robi...@google.com> wrote:
>>
>> Hi all,
>>
>> I have written a doc proposing we integrate the HyperLogLog++ algorithm into
>> Beam as a new combiner. The algorithm solves the count-distinct problem, and
>> the intermediate sketch (aggregator) format will be compatible with sketches
>> computed via the HLL_COUNT functions in Google Cloud BigQuery (because they
>> will be based on the same implementation: ZetaSketch). The tracking JIRA
>> issue is BEAM-7013.
>>
>> The API design proposed in the doc is subject to change and open to
>> comments. Please feel free to comment if you have any thoughts.
>>
>> Cheers,
>> Robin