[jira] [Commented] (DATAFU-91) pig version of HyperLogLog estimator should be Algebraic and use combiners

Jan Willem (JIRA) Tue, 09 Jun 2015 12:32:41 -0700

    [ 
https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579449#comment-14579449
 ]


Jan Willem commented on DATAFU-91:
----------------------------------

Hi there. We were looking for a solution like this as well (just ran into a 
OutOfMemory due to similar usage).

The contract for the function returned by getInitial is that it receives the 
input tuple, and returns an intermediate result. The function returned by 
getIntermediate receives an intermediate input and outputs an intermediate 
result. The function returned by getFinal receives an intermediate input and 
outputs a final result.

Applied to this case, the input to 'Initial' should be a hash (correct), and 
the output should be an intermediate type, a serialized HyperLogLogPlus would 
be my choice (wrapped in a tuple).
The 'Intermediate' should input a bag of HyperLogLogPlus, and output a 
HyperLogLogPlus.
It's not very clear to me if the 'Final' should input a bag of 'Intermediate' 
outputs, or just a single tuple. Looking at the example code for COUNT 
(http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup)
 it should take as input a bag of HyperLogLogPlus objects, just like 
'Intermediate'. The output should indeed be a single number, the cardinality.

Schematic:
Initial function: Tuple(InputType) -> Tuple(HyperLogLogPlus)
Intermediate function: Bag(HyperLogLogPlus) -> Tuple(HyperLogLogPlus)
Final function: Bag(HyperLogLogPlus) -> Tuple(Long)

In short, your countDistinct function shouldn't expect a Long or a 
DataByteArray, it should always be a DataByteArray (so you should be able to 
get rid of the 'instanceof' type switch in countDisctinct, if you fix the 
output of the Initial function).

Also, your diff shows that you get rid of the Accumulator functions 
(accumulate, cleanup and getValue). The documentation for AlgebraicEvalFunc 
(http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/AlgebraicEvalFunc.html) 
says: "IMPORTANT: the implementation of the Accumulator interface that this 
class provides is good, but it is simulated. For maximum efficiency, it is 
important to manually implement the accumulator interface." So you should 
probably keep them.

Great work though, this is a much requested feature, if you ask me. At least 
this guy 
http://stackoverflow.com/questions/28908217/why-is-data-fu-implementing-hyperloglog-as-an-accumulator-and-not-as-algebraic
 will be happy (it's not me).

Thanks!

> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, 
> hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement 
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java 
> MapReduce version leverages the combiner. If you want to try making this 
> Algebraic we would be happy to accept a patch :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (DATAFU-91) pig version of HyperLogLog estimator should be Algebraic and use combiners

Reply via email to