[ 
https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967499#comment-14967499
 ] 

Matthew Hayes commented on DATAFU-91:
-------------------------------------

We haven't documented any recommendations yet (filed DATAFU-100).  The thing to 
keep in mind is that each instance of HyperLogLogPlus allocates a large byte 
array.  I can't remember the exact numbers, but I think for the default 
precision of 20 it is hundreds of KB.  So in your example, if the cardinality 
of "a" is large, you are going to allocate a lot of large byte arrays (at least 
one per group), and they all need to be transmitted from combiner to reducer.  
So I would avoid using it in "group by" situations unless you know the key 
cardinality is quite small.  This UDF is better suited for "group all" 
scenarios where you have a lot of input data.  Also, if the input data for a 
group is much smaller than the byte array, you could be worse off using this 
UDF than counting exactly.  If you can accept a less accurate estimate, a lower 
precision makes the byte array smaller.
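
To put rough numbers on the point above, here is a back-of-the-envelope 
sketch.  It assumes a dense register array of 2^precision registers packed at 
6 bits each; that packing is an assumption for illustration, and the actual 
in-memory and serialized layout in stream-lib may differ.  The function names 
are hypothetical and not part of DataFu or stream-lib.

```python
def hll_register_bytes(precision: int, bits_per_register: int = 6) -> int:
    """Approximate size in bytes of a dense HyperLogLog register array.

    Assumes 2**precision registers packed at bits_per_register bits each
    (an illustrative assumption, not the library's documented layout).
    """
    registers = 1 << precision
    return (registers * bits_per_register + 7) // 8


def group_by_sketch_bytes(precision: int, distinct_keys: int) -> int:
    """Lower bound on sketch bytes if every group key carries one sketch."""
    return hll_register_bytes(precision) * distinct_keys


if __name__ == "__main__":
    per_sketch = hll_register_bytes(20)
    print("one sketch at precision 20:", per_sketch, "bytes")
    print("group all (1 key):", group_by_sketch_bytes(20, 1), "bytes")
    print("group by 1,000,000 keys:", group_by_sketch_bytes(20, 1_000_000), "bytes")
    print("one sketch at precision 14:", hll_register_bytes(14), "bytes")
```

Under these assumptions a single sketch at precision 20 is 786,432 bytes, so a 
"group by" over a million keys would involve hundreds of GB of sketches, while 
dropping to precision 14 brings each sketch down to about 12 KB.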

By the way, I saw in the stream-lib library 
(https://github.com/addthis/stream-lib/tree/master/src/main/java/com/clearspring/analytics/stream/cardinality)
that there are some new classes that look interesting.  CountThenEstimate, for 
example, does an exact count and switches to an estimator once a threshold has 
been reached.  This avoids allocating the byte array unless it's needed.  I 
filed DATAFU-101 as future work.
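
The count-then-estimate idea can be sketched roughly as follows.  This is not 
stream-lib's implementation: the class names are hypothetical, and a simple 
linear-counting bitmap stands in for HyperLogLog to keep the example short.  
It only illustrates that the large array is allocated lazily, once the exact 
set grows past a threshold.

```python
import hashlib
import math


class LinearCounter:
    """Toy estimator (linear counting); a stand-in for a real HLL sketch."""

    def __init__(self, num_bits=1 << 16):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8)  # the "large" allocation

    def offer(self, item):
        # Hash the item to a bit position and set that bit.
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index // 8] |= 1 << (index % 8)

    def cardinality(self):
        set_bits = sum(bin(b).count("1") for b in self.bits)
        # Clamp to one zero bit so a saturated bitmap does not hit log(0).
        zero_bits = max(self.num_bits - set_bits, 1)
        return -self.num_bits * math.log(zero_bits / self.num_bits)


class CountThenEstimateSketch:
    """Counts exactly in a set, then switches to an estimator."""

    def __init__(self, threshold=1000, estimator_factory=LinearCounter):
        self.threshold = threshold
        self.estimator_factory = estimator_factory
        self.exact = set()
        self.estimator = None  # not allocated until needed

    def offer(self, item):
        if self.estimator is not None:
            self.estimator.offer(item)
            return
        self.exact.add(item)
        if len(self.exact) > self.threshold:
            # Threshold crossed: allocate the estimator and replay the set.
            self.estimator = self.estimator_factory()
            for seen in self.exact:
                self.estimator.offer(seen)
            self.exact = None

    def cardinality(self):
        if self.estimator is None:
            return len(self.exact)
        return round(self.estimator.cardinality())


if __name__ == "__main__":
    small = CountThenEstimateSketch(threshold=1000)
    for i in range(500):
        small.offer(i % 100)
    print(small.cardinality(), small.estimator is None)  # 100 True

    big = CountThenEstimateSketch(threshold=1000)
    for i in range(5000):
        big.offer(i)
    print(big.cardinality(), big.estimator is None)  # approximate count, False
```

Groups with few distinct values never pay for the bitmap, which is the 
property that would make this kind of hybrid attractive for "group by" use.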

> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, 
> hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement 
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java 
> MapReduce version leverages the combiner. If you want to try making this 
> Algebraic we would be happy to accept a patch :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
