[ https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539080#comment-16539080 ]
Matthew Hayes commented on DATAFU-100:
--------------------------------------
Here are the max map and reduce times for each test; the first pair is HLLP and
the second pair is DISTINCT. By this metric HLLP is also worse.
1) 1 billion A-Z letters over 10 files (1072, 519 vs 356, 80)
2) 1 billion values between 0 and 1 million, over 10 files (385, 799 vs 253, 487)
3) 250 million values (large keys) between 0 and 1 million, over 5 files (62, 297 vs 50, 318)
Regarding the number of files, I think more files could only hurt performance in
this particular case, since each additional file adds overhead.
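
For reference, here is roughly what the two jobs being compared look like in Pig
(a minimal sketch: the path, relation names, and jar name are placeholders, and
HLLP is DataFu's datafu.pig.stats.HyperLogLogPlusPlus UDF, which also accepts an
optional precision argument):

    -- Minimal sketch; register the DataFu jar first, e.g.: REGISTER datafu-1.2.0.jar;
    DEFINE HLLP datafu.pig.stats.HyperLogLogPlusPlus();

    data = LOAD 'input' AS (val:chararray);

    -- Approximate distinct count via the HyperLogLogPlus sketch
    approx = FOREACH (GROUP data ALL) GENERATE HLLP(data.val) AS approx_count;

    -- Exact distinct count via DISTINCT
    uniq = DISTINCT data;
    exact = FOREACH (GROUP uniq ALL) GENERATE COUNT(uniq) AS exact_count;

The UDF is algebraic, so partial sketches are merged in the combiner, which is
the main reason one might expect it to beat DISTINCT; the numbers above show
that it generally does not here.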
> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
> Key: DATAFU-100
> URL: https://issues.apache.org/jira/browse/DATAFU-100
> Project: DataFu
> Issue Type: Improvement
> Reporter: Matthew Hayes
> Priority: Minor
> Attachments: DATAFU-100.patch
>
>
> We should provide recommendations about how to use HyperLogLogPlus
> effectively. For example: 1) how should the precision value be used, 2) when
> would a count distinct be better, etc.
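
On point 1 of the quoted description: the standard HyperLogLog error bound (a
general result, not specific to this patch) ties accuracy directly to the
precision p. The estimate's standard error is about

    SE ≈ 1.04 / sqrt(2^p)

so p = 20 (2^20 registers, on the order of a megabyte of state) gives roughly
0.1% error, while p = 12 (4096 registers, a few KB) gives about 1.6%. Any
documented recommendation on precision could start from this trade-off.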