[ https://issues.apache.org/jira/browse/HIVE-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809811#comment-13809811 ]
Phabricator commented on HIVE-5657:
-----------------------------------
navis has commented on the revision "HIVE-5657 [jira] TopN produces incorrect
results with count(distinct)".
INLINE COMMENTS
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:268 Right,
it should be -1. I made a mistake while refactoring.
ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:255 For distinct, it
does not store values. It checks the key and decides whether to forward all or
exclude all. I'm not sure the previous version was better. This time I focused
on simplifying the flow of the RS operator.
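To illustrate the key-only decision described above, here is a minimal Java
sketch. It is not the code in TopNHash.java; the class KeyOnlyTopNSketch and
the method shouldForward are hypothetical names, keys are plain strings, and
ascending order is assumed. It keeps only the N smallest keys seen so far and
stores no row values.

```java
import java.util.TreeSet;

// Hypothetical illustration only; not Hive's TopNHash.
final class KeyOnlyTopNSketch {
    private final int limit;
    // Only keys are retained; row values are never stored.
    private final TreeSet<String> topKeys = new TreeSet<>();

    KeyOnlyTopNSketch(int limit) {
        this.limit = limit;
    }

    /** Returns true if a row carrying this key should be forwarded. */
    boolean shouldForward(String key) {
        if (limit <= 0) {
            return false; // nothing can be within the top 0 keys
        }
        if (topKeys.contains(key)) {
            return true; // key is already within the current top N
        }
        if (topKeys.size() < limit) {
            topKeys.add(key); // still room: admit the key
            return true;
        }
        if (key.compareTo(topKeys.last()) < 0) {
            topKeys.pollLast(); // evict the current largest key
            topKeys.add(key);
            return true;
        }
        return false; // key sorts after every retained key: exclude the row
    }
}
```

In this sketch, rows forwarded for a key that is later evicted are not
recalled; the assumption is that a downstream limit still applies, so the
filter only has to avoid dropping rows that belong in the final result.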
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:255 Yes,
right. Previously, the keys looked like this:
[distributeKey:distinctKey1]
[distributeKey:distinctKey2]
and each row was serialized as a whole by the OI
structOI[structOI(distributeKey):UnionOI(distinctKey)]
Now the key is prepared like this:
[distributeKey]
[distinctKey1,distinctKey2]
and each part is serialized directly by its inner OI: structOI(distributeKey)
and UnionOI(distinctKey).
I don't feel good about introducing the new interface KeySerializer, but
serializing distributeKey multiple times seemed worse.
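The difference between the two key layouts can be sketched in Java as follows.
This is an illustration under assumptions, not the patch itself: the class
KeyLayoutSketch and all of its methods are hypothetical, keys are strings
encoded as UTF-8, and a single leading byte stands in for the union tag. It
shows only that the second layout produces the same bytes while serializing
the distribution key once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Hive's ReduceSinkOperator.
final class KeyLayoutSketch {
    // Counts serializations of the distribution key (for illustration).
    static int distributeKeySerializations = 0;

    static byte[] serializeDistributeKey(String key) {
        distributeKeySerializations++;
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // A leading tag byte marks which distinct expression the key belongs to.
    static byte[] serializeDistinctKey(int tag, String key) {
        byte[] body = key.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 1];
        out[0] = (byte) tag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Previous layout: the whole key, including the distribution key,
    // is serialized again for every distinct key.
    static List<byte[]> wholeRowKeys(String distributeKey, List<String> distinctKeys) {
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i < distinctKeys.size(); i++) {
            keys.add(concat(serializeDistributeKey(distributeKey),
                            serializeDistinctKey(i, distinctKeys.get(i))));
        }
        return keys;
    }

    // New layout: the distribution key is serialized once and reused,
    // and each distinct key is serialized on its own.
    static List<byte[]> splitKeys(String distributeKey, List<String> distinctKeys) {
        byte[] prefix = serializeDistributeKey(distributeKey);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i < distinctKeys.size(); i++) {
            keys.add(concat(prefix, serializeDistinctKey(i, distinctKeys.get(i))));
        }
        return keys;
    }
}
```

With two distinct keys, wholeRowKeys serializes the distribution key twice and
splitKeys serializes it once, and both return byte-identical keys.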
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:125
Yes.
ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java:211 Changed
the name because the old one confusingly suggested that the RS is for a
MapAggr GBY, which it is not.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:243 I
didn't know there was a VectorReduceSinkOperator when I started this, which
made me include more refactoring than was needed just to fix the problem. I
think the current version of the patch is much simpler than the original. But
if it makes merging vectorization hard, I might create a minimal patch
containing only the fix.
REVISION DETAIL
https://reviews.facebook.net/D13797
To: JIRA, navis
Cc: sershe
> TopN produces incorrect results with count(distinct)
> ----------------------------------------------------
>
> Key: HIVE-5657
> URL: https://issues.apache.org/jira/browse/HIVE-5657
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Navis
> Priority: Critical
> Attachments: D13797.1.patch, example.patch, HIVE-5657.1.patch.txt
>
>
> Attached patch illustrates the problem.
> limit_pushdown test has various other cases of aggregations and distincts,
> incl. count-distinct, that work correctly (that said, src dataset is bad for
> testing these things because every count, for example, produces one record
> only), so something must be special about this.
> I am not very familiar with the distinct code and these nuances; if someone
> knows a quick fix, feel free to take this, otherwise I will probably start
> looking next week.
--
This message was sent by Atlassian JIRA
(v6.1#6144)