[
https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581699#comment-14581699
]
Jan Willem commented on DATAFU-91:
----------------------------------
I am under the impression that Initial and Intermediate should output the same
type. In both the AVG implementation
(http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.8.0/org/apache/pig/builtin/AVG.java)
and the UDF manual on the wiki (https://wiki.apache.org/pig/UDFManual, search
for "algebraic"), they return a tuple containing the combined information.
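To illustrate what I mean by "the same type", here is a rough plain-Java sketch
of the AVG-style pattern. It does not use the Pig API; the class and method
names (AvgPartial, initial, intermediate, finalValue) are made up for this
example, and I am assuming the partial result is a (sum, count) pair.

```java
import java.util.List;

// Hypothetical stand-in for the tuple that the Initial and Intermediate
// steps pass along. Not the Pig API, only the shape of the idea.
final class AvgPartial {
    final double sum;
    final long count;

    AvgPartial(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // "Initial": wrap a single input value as a (sum, count) pair.
    static AvgPartial initial(double value) {
        return new AvgPartial(value, 1L);
    }

    // "Intermediate": merge partials into one partial of the same shape,
    // so its output can be fed back into another Intermediate or into Final.
    static AvgPartial intermediate(List<AvgPartial> partials) {
        double sum = 0.0;
        long count = 0L;
        for (AvgPartial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new AvgPartial(sum, count);
    }

    // "Final": merge once more and reduce to the answer.
    // Returns NaN for an empty input (0.0 / 0).
    static double finalValue(List<AvgPartial> partials) {
        AvgPartial merged = intermediate(partials);
        return merged.sum / merged.count;
    }
}
```

Because initial and intermediate both produce an AvgPartial, the later steps
never need to check what kind of value they were handed.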
I'm probably just obsessing about the type cast, but you could get around it by
using a tagged union: a two-element tuple whose first element indicates whether
the second field holds a Long or a serialized HyperLogLogPlus. So:
(1, <longvalue>) or (2, <HyperLogLogPlus>). Alternatively, the first element
could hold the number of items, with a count of 1 as the special case where the
second field is just a long value: (1, <longvalue>) or (<value larger than 1>,
<HyperLogLogPlus>).
It would only add a little to the size, and get rid of the instanceof.
It's just a matter of opinion, I guess.
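For what it's worth, a minimal plain-Java sketch of the first variant
(tag 1 = long, tag 2 = serialized estimator) could look like the following.
Everything here is hypothetical: TaggedPartial, the tag constants and describe
are names I made up, and the serialized HyperLogLogPlus is represented as an
opaque byte[] so the example has no dependency on Pig or on the estimator
library.

```java
// Hypothetical tagged union: (1, <longvalue>) or (2, <serialized estimator>).
final class TaggedPartial {
    static final int TAG_LONG = 1;
    static final int TAG_SKETCH = 2;

    final int tag;
    final long longValue;     // meaningful only when tag == TAG_LONG
    final byte[] sketchBytes; // meaningful only when tag == TAG_SKETCH

    private TaggedPartial(int tag, long longValue, byte[] sketchBytes) {
        this.tag = tag;
        this.longValue = longValue;
        this.sketchBytes = sketchBytes;
    }

    static TaggedPartial ofLong(long value) {
        return new TaggedPartial(TAG_LONG, value, null);
    }

    static TaggedPartial ofSketch(byte[] serialized) {
        return new TaggedPartial(TAG_SKETCH, 0L, serialized);
    }

    // The consumer branches on the tag instead of using instanceof.
    static String describe(TaggedPartial p) {
        switch (p.tag) {
            case TAG_LONG:
                return "long:" + p.longValue;
            case TAG_SKETCH:
                return "sketch:" + p.sketchBytes.length + " bytes";
            default:
                throw new IllegalStateException("unknown tag " + p.tag);
        }
    }
}
```

In the real UDF the switch would add the long to an estimator or merge the
deserialized one; the point is only that the branch is on an explicit tag
rather than on the runtime class of the field.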
> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
> Key: DATAFU-91
> URL: https://issues.apache.org/jira/browse/DATAFU-91
> Project: DataFu
> Issue Type: Bug
> Affects Versions: 1.3.0
> Reporter: Ido Hadanny
> Assignee: Ido Hadanny
> Priority: Minor
> Fix For: 1.3.0
>
> Attachments: hyper-log-log-algebraic-3.diff,
> hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement
> this as AlgebraicEvalFunc. It seems like it could be. I believe the Java
> MapReduce version leverages the combiner. If you want to try making this
> Algebraic we would be happy to accept a patch :)