[
https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579449#comment-14579449
]
Jan Willem commented on DATAFU-91:
----------------------------------
Hi there. We were looking for a solution like this as well (just ran into a
OutOfMemory due to similar usage).
The contract for the function returned by getInitial is that it receives the
input tuple, and returns an intermediate result. The function returned by
getIntermediate receives an intermediate input and outputs an intermediate
result. The function returned by getFinal receives an intermediate input and
outputs a final result.
Applied to this case, the input to 'Initial' should be a hash (correct), and
the output should be an intermediate type, a serialized HyperLogLogPlus would
be my choice (wrapped in a tuple).
The 'Intermediate' should input a bag of HyperLogLogPlus, and output a
HyperLogLogPlus.
It's not very clear to me if the 'Final' should input a bag of 'Intermediate'
outputs, or just a single tuple. Looking at the example code for COUNT
(http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup)
it should take as input a bag of HyperLogLogPlus objects, just like
'Intermediate'. The output should indeed be a single number, the cardinality.
Schematic:
Initial function: Tuple(InputType) -> Tuple(HyperLogLogPlus)
Intermediate function: Bag(HyperLogLogPlus) -> Tuple(HyperLogLogPlus)
Final function: Bag(HyperLogLogPlus) -> Tuple(Long)
In short, your countDistinct function shouldn't expect a Long or a
DataByteArray, it should always be a DataByteArray (so you should be able to
get rid of the 'instanceof' type switch in countDisctinct, if you fix the
output of the Initial function).
Also, your diff shows that you get rid of the Accumulator functions
(accumulate, cleanup and getValue). The documentation for AlgebraicEvalFunc
(http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/AlgebraicEvalFunc.html)
says: "IMPORTANT: the implementation of the Accumulator interface that this
class provides is good, but it is simulated. For maximum efficiency, it is
important to manually implement the accumulator interface." So you should
probably keep them.
Great work though, this is a much requested feature, if you ask me. At least
this guy
http://stackoverflow.com/questions/28908217/why-is-data-fu-implementing-hyperloglog-as-an-accumulator-and-not-as-algebraic
will be happy (it's not me).
Thanks!
> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
> Key: DATAFU-91
> URL: https://issues.apache.org/jira/browse/DATAFU-91
> Project: DataFu
> Issue Type: Bug
> Affects Versions: 1.3.0
> Reporter: Ido Hadanny
> Assignee: Ido Hadanny
> Priority: Minor
> Fix For: 1.3.0
>
> Attachments: hyper-log-log-algebraic-3.diff,
> hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java
> MapReduce version leverages the combiner. If you want to try making this
> Algebraic we would be happy to accept a patch :)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)