[ 
https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537745#comment-16537745
 ] 

Matthew Hayes commented on DATAFU-100:
--------------------------------------

I ran some tests comparing HyperLogLogPlus (HLLP) to using DISTINCT.  The tests 
are listed below.  The metric for each is the number of maps times the avg map 
time, plus the number of reduces times the avg reduce time.  This captures the 
total amount of work done.  In each result the first number is HLLP and the 
second number is DISTINCT.
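
As a minimal sketch, the metric described above can be written as a small 
function.  It assumes the reduce term is a count times an average, like the map 
term.  The function name and the sample task counts and times are hypothetical 
and are not taken from the actual test runs.

```python
def total_work(num_maps, avg_map_time, num_reduces, avg_reduce_time):
    """Total task time for a job: map work plus reduce work.

    Assumption: both terms are (number of tasks) * (average task time),
    in whatever time unit the job counters report.
    """
    return num_maps * avg_map_time + num_reduces * avg_reduce_time


if __name__ == "__main__":
    # Hypothetical job: 10 map tasks averaging 120 units each,
    # 2 reduce tasks averaging 45 units each.
    print(total_work(10, 120, 2, 45))
```

Because this sums time across all tasks rather than measuring wall-clock time, 
it is not affected by how many tasks happened to run in parallel.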

1) 1 billion A-Z letters over 10 files (14589 vs. 5029)

2) 1 billion values between 0 and 1 million, over 10 files (14943 vs. 11684)

3) 250 million values (large keys) between 0 and 1 million, over 5 files (6032 
vs. 6214)
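
To make the three results easier to compare, the sketch below computes the 
ratio of HLLP work to DISTINCT work from the numbers reported above.  The 
dictionary labels and the helper name are illustrative only.  A ratio above 1 
means HLLP did more total work than DISTINCT.

```python
# (HLLP, DISTINCT) totals as reported for the three tests above.
RESULTS = {
    "1) A-Z letters": (14589, 5029),
    "2) values 0 to 1 million": (14943, 11684),
    "3) large keys": (6032, 6214),
}


def relative_cost(hllp, distinct):
    """Ratio of HLLP work to DISTINCT work; > 1 means HLLP was costlier."""
    return hllp / distinct


if __name__ == "__main__":
    for name, (hllp, distinct) in RESULTS.items():
        print(f"{name}: HLLP/DISTINCT = {relative_cost(hllp, distinct):.2f}")
```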

So generally I find that the UDF is either slower than DISTINCT or only 
marginally better.  Given this, I think it's better to deprecate the UDF.  Even 
for #3, the improvement (6032 vs. 6214, roughly 3%) doesn't seem significant 
enough to justify giving up the exact count.

> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
>                 Key: DATAFU-100
>                 URL: https://issues.apache.org/jira/browse/DATAFU-100
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Matthew Hayes
>            Priority: Minor
>
> We should provide recommendations about how to use HyperLogLogPlus 
> effectively.  For example 1) how should the precision value be used, 2) when 
> would a count distinct be better, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
