[ https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537745#comment-16537745 ]
Matthew Hayes commented on DATAFU-100:
--------------------------------------

I ran some tests comparing HyperLogLogPlus to using DISTINCT. The tests are listed below. The metric for each is (number of maps times avg map time) plus (number of reduces times avg reduce time). This captures the total amount of work done. In each pair, the first number is HLLP and the second is DISTINCT.

1) 1 billion A-Z letters over 10 files (14589 vs. 5029)
2) 1 billion values between 0 and 1 million, over 10 files (14943 vs. 11684)
3) 250 million values (large keys) between 0 and 1 million, over 5 files (6032 vs. 6214)

So generally I find that the UDF is either slower than DISTINCT or only marginally better. Given this, I think it's better to deprecate the UDF. Even the improvement in #3 doesn't seem significant enough to justify giving up the exact count.

> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
>                 Key: DATAFU-100
>                 URL: https://issues.apache.org/jira/browse/DATAFU-100
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Matthew Hayes
>            Priority: Minor
>
> We should provide recommendations about how to use HyperLogLogPlus effectively.
> For example 1) how should the precision value be used, 2) when would a count
> distinct be better, etc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)