[ https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967499#comment-14967499 ]
Matthew Hayes commented on DATAFU-91:
-------------------------------------

We haven't documented any recommendations yet (filed DATAFU-100). The thing to keep in mind is that each instance of HyperLogLogPlus allocates a pretty large byte array. I can't remember the exact numbers, but I think for the default precision of 20 it is hundreds of KB. So in your example if the cardinality of "a" is large you are going to allocate a lot of large byte arrays that will need to be transmitted from combiner to reducer.

So I would avoid using it in "group by" situations unless you know the key cardinality is quite small. This UDF is better suited for "group all" scenarios where you have a lot of input data. Also if the input data is much smaller than the byte array then you could be worse off using this UDF. If you can accept worse precision then the byte array could be made smaller.

By the way, I saw in the streaming library (https://github.com/addthis/stream-lib/tree/master/src/main/java/com/clearspring/analytics/stream/cardinality) that there are some new classes that look interesting. CountThenEstimate for example does an exact count and switches to an estimator once a threshold has been reached. This avoids allocating the byte array unless it's needed. I filed DATAFU-101 as future work.

> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement this as AlgebraicEvalFunc. It seems like it could be. I believe the Java MapReduce version leverages the combiner. If you want to try making this Algebraic we would be happy to accept a patch :)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
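
For a rough sense of the sizes discussed in the comment above, here is a back-of-the-envelope sketch. It assumes a dense HyperLogLog layout of 2^p registers at 6 bits each and the textbook standard error of about 1.04/sqrt(2^p); the function names are made up for illustration, and the actual in-memory layout in stream-lib or DataFu may pack registers differently, so treat the results as order-of-magnitude estimates only.

```python
import math


def hll_dense_bytes(precision: int, bits_per_register: int = 6) -> int:
    """Approximate size in bytes of a dense HyperLogLog register array.

    Assumes 2**precision registers packed at bits_per_register bits each
    (an assumption; real implementations may use a different packing).
    """
    registers = 1 << precision
    return (registers * bits_per_register + 7) // 8


def hll_relative_error(precision: int) -> float:
    """Textbook HyperLogLog standard error, roughly 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << precision)


def group_by_sketch_bytes(num_keys: int, precision: int) -> int:
    """Worst-case bytes of sketch state if every group key has its own sketch."""
    return num_keys * hll_dense_bytes(precision)


if __name__ == "__main__":
    for p in (12, 14, 20):
        kib = hll_dense_bytes(p) / 1024
        err = hll_relative_error(p) * 100
        print(f"p={p}: ~{kib:.0f} KiB per sketch, ~{err:.2f}% standard error")

    # "group by" with many distinct keys versus a single "group all" sketch
    gib = group_by_sketch_bytes(10_000, 20) / 2**30
    print(f"10,000 keys at p=20: ~{gib:.1f} GiB of sketch state")
```

Under these assumptions a single sketch at precision 20 comes to 768 KiB, consistent with the "hundreds of KB" estimate, while lowering the precision to 14 shrinks it to 12 KiB at the cost of a larger error. A "group by" with thousands of keys multiplies the per-sketch size by the number of keys, which is why "group all" is the safer fit.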