[ https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962214#comment-14962214 ]
Ido Hadanny commented on DATAFU-91:
-----------------------------------

[~matterhayes] - quick question: do we have a recommendation for when using HyperLogLog is a mistake and will hurt performance?

We were using HLL as a "silver bullet" everywhere we had a pattern like:

    my_table = load 'my_table' as (a: int, b: int);
    g = group my_table by a;
    c = foreach g generate HyperLogLog(my_table.b);

However, this performs terribly (3x worse than count distinct) when the cardinality of "a" is large.

To our surprise, using the new Algebraic HLL made it even worse (5x). We checked, and the same performance problem that occurred in the reducers with the accumulating version now occurs in the mappers/combiners.

> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java
> MapReduce version leverages the combiner. If you want to try making this
> Algebraic we would be happy to accept a patch :)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)