[ https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539080#comment-16539080 ]

Matthew Hayes commented on DATAFU-100:
--------------------------------------

Here are the max map and reduce times. The first pair is HLLP and the second
pair is DISTINCT. By this metric HLLP is also worse.

1) 1 billion A-Z letters over 10 files (1072, 519 vs 356, 80)

2) 1 billion values between 0 and 1 million, over 10 files (385, 799 vs 253, 487)

3) 250 million values (large keys) between 0 and 1 million, over 5 files (62, 297 vs 50, 318)

Regarding the number of files, I think more files could only contribute to
worse performance in this particular case, since each additional file adds
per-task overhead.

> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
>                 Key: DATAFU-100
>                 URL: https://issues.apache.org/jira/browse/DATAFU-100
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Matthew Hayes
>            Priority: Minor
>         Attachments: DATAFU-100.patch
>
>
> We should provide recommendations about how to use HyperLogLogPlus effectively.
> For example: 1) how should the precision value be used, 2) when would a count
> distinct be better, etc.



