[
https://issues.apache.org/jira/browse/HIVE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201691#comment-14201691
]
Rui Li commented on HIVE-8542:
------------------------------
I think the problem is that the sorting keys and the partition keys in RS are
not identical. The partition key is the group-by key, but the sorting keys are
the group-by key followed by the distinct key. Since RangePartitioner
partitions the data by ranges of the sorting keys, and we have quite a few
reducers (31), records with the same group-by key can go to different reducers,
so the final results are not properly grouped. We get correct results if the
number of reducers is set to 1.
To reproduce: as long as the number of reducers is large, any group-by +
distinct query can reveal this issue.
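A minimal sketch of the effect described above, in plain Python rather than
Hive or Spark code. It is a toy model: range_partition, reducers_per_group, the
sample records and the range bounds are all hypothetical, and bisect only
approximates how a range partitioner assigns keys to ranges. It shows that
ranges cut on the full sorting key (group-by key + distinct key) can split one
group across reducers, while ranges cut on the group-by key alone cannot.

```python
from bisect import bisect_left

def range_partition(key, bounds):
    """Toy range partitioner: index of the first upper bound >= key."""
    return bisect_left(bounds, key)

def reducers_per_group(records, key_fn, bounds):
    """Map each group-by key to the set of reducers its records land on."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[0], set()).add(range_partition(key_fn(rec), bounds))
    return seen

# Each record is (group-by key, distinct key); values are made up.
records = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)]

# Bounds cut on the full sorting key, giving 3 reducers (assumed values).
split = reducers_per_group(records, lambda r: r, [("a", 2), ("b", 1)])

# Bounds cut on the group-by key only, giving 2 reducers (assumed values).
grouped = reducers_per_group(records, lambda r: (r[0],), [("a",)])

print(split)    # each group is spread over more than one reducer
print(grouped)  # each group stays on a single reducer
```

With a single reducer there are no bounds, so every record maps to reducer 0,
which matches the observation that one reducer gives correct results.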
> Enable groupby_map_ppr.q and groupby_map_ppr_multi_distinct.q [Spark Branch]
> ----------------------------------------------------------------------------
>
> Key: HIVE-8542
> URL: https://issues.apache.org/jira/browse/HIVE-8542
> Project: Hive
> Issue Type: Test
> Components: Spark
> Reporter: Chao
> Assignee: Rui Li
>
> Currently, in the Spark branch, the results for these two test files are very
> different from MR's. We need to find the cause of this and identify
> any potential bug in our current implementation.