[ 
https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962214#comment-14962214
 ] 

Ido Hadanny commented on DATAFU-91:
-----------------------------------

[~matterhayes] - quick question:
Is there a recommendation for when using HyperLogLog is a mistake and will 
hurt performance?
We were using HLL as a "silver bullet" for every case of the form:

my_table = load 'my_table' as (a: int, b: int);
g = group my_table by a;
c = foreach g generate HyperLogLog(my_table.b);

However, this performs terribly (3x slower than count distinct) when the 
cardinality of "a" is large.
To our surprise, using the new Algebraic HLL made it even worse (5x slower): we 
checked, and the same performance problem that occurred in the reducers with the 
accumulating version now occurs in the mappers/combiners.
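
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if every group gets its own fixed-size sketch, then many small groups pay the full sketch cost each, while exact distinct counting only stores the few values it sees. The toy cost model below is not DataFu's implementation. The precision P and the dense-register layout are hypothetical choices for illustration, and a sketch with a sparse mode may behave differently.

```python
# Toy cost model, NOT DataFu's implementation: compares the per-group state
# held by exact distinct counting with that of a fixed-size sketch.
from collections import defaultdict

P = 12  # assumed precision (hypothetical): 2**P registers per sketch
REGISTERS_PER_SKETCH = 2 ** P


def state_sizes(rows):
    """Return (exact_slots, sketch_slots) for an iterable of (a, b) pairs.

    exact_slots  = distinct b values stored across all groups of a
    sketch_slots = registers allocated if each group of a has one dense sketch
    """
    distinct_b = defaultdict(set)
    for a, b in rows:
        distinct_b[a].add(b)
    exact_slots = sum(len(values) for values in distinct_b.values())
    sketch_slots = len(distinct_b) * REGISTERS_PER_SKETCH
    return exact_slots, sketch_slots


# High cardinality of "a": many groups, few values each.
many_small = [(a, b) for a in range(10_000) for b in range(3)]
# Low cardinality of "a": few groups, many values each.
few_large = [(a, b) for a in range(3) for b in range(10_000)]

print(state_sizes(many_small))  # (30000, 40960000)
print(state_sizes(few_large))   # (30000, 12288)
```

Under these assumptions both inputs hold the same number of distinct (a, b) pairs, but the sketch state grows with the number of groups rather than with the data, which would match the slowdown appearing wherever the per-group sketches are built (reducers, or mappers/combiners in the Algebraic version).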



> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, 
> hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement 
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java 
> MapReduce version leverages the combiner. If you want to try making this 
> Algebraic we would be happy to accept a patch :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
