[jira] [Commented] (HIVE-28480) Disable SMB on partition hash generator mismatch across join branches in previous RS

Sungwoo Park (Jira) Tue, 27 Aug 2024 01:30:04 -0700


    [ https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876949#comment-17876949 ]


Sungwoo Park commented on HIVE-28480:
-------------------------------------

As this is a (critical) correctness problem, I would like to suggest setting 
Priority to Critical.


> Disable SMB on partition hash generator mismatch across join branches in 
> previous RS
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-28480
>                 URL: https://issues.apache.org/jira/browse/HIVE-28480
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Himanshu Mishra
>            Assignee: Himanshu Mishra
>            Priority: Major
>              Labels: pull-request-available
>
> SMB join conversion replaces the JOIN op, together with the last RS op in 
> each joining branch, with MERGEJOIN. We therefore need to ensure that the RS 
> ops preceding those RS ops, in both branches, partition using the same hash 
> generator.
> The hash code generator differs depending on ReducerTraits.UNIFORM, i.e. 
> [ReduceSinkOperator#computeMurmurHash() or 
> ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
>  leading to different hash codes for the same value.
> Skip the SMB join in such cases.
> h3. Reproduction:
> Consider the following query, where the join would get converted to SMB. Auto 
> reducer parallelism is enabled, which ensures more than one reducer task.
>  
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
>     SELECT k, COUNT(DISTINCT v), SUM(v)
>     FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
>     SELECT k, COUNT(v)
>     FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>  
> The expected result is:
>  
> {code:java}
> a   1   20  a   2 {code}
> but on the master branch, the result is:
>  
> {code:java}
> a   1   20  NULL    NULL {code}
>  
>  
> Here, for COUNT(DISTINCT), the RS key is (k, v) while the partition key is 
> still k. In such a scenario, the [reducer trait UNIFORM is not 
> set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
>  The hash code for "a" from the 2nd subquery is generated using murmurHash 
> (270516725), while the one from the 1st is generated using bucketHash 
> (1086686554), so rows with key "a" reach different reducer tasks.
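
The mismatch described in the quoted report can be illustrated with a small, self-contained sketch. It is not Hive code: the class name, the `Trait` enum, and both helper methods are hypothetical, and the partition formula (non-negative hash modulo the reducer count) is an assumption borrowed from the usual hash-partitioning convention. Only the two hash values for "a" come from the issue description.

```java
import java.util.EnumSet;
import java.util.Set;

public class HashMismatchSketch {

    // Hypothetical stand-in for the reducer traits; only UNIFORM matters here.
    enum Trait { UNIFORM, OTHER }

    // Assumed partitioning rule: clear the sign bit, then take the modulo.
    static int partitionFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical guard: both branches must agree on whether UNIFORM is set,
    // because that trait decides which hash generator the RS uses.
    static boolean sameHashGenerator(Set<Trait> left, Set<Trait> right) {
        return left.contains(Trait.UNIFORM) == right.contains(Trait.UNIFORM);
    }

    public static void main(String[] args) {
        // Hash codes reported in the issue for the key "a".
        int murmurHashOfA = 270516725;   // 2nd subquery
        int bucketHashOfA = 1086686554;  // 1st subquery (COUNT(DISTINCT))
        int numReducers = 2;

        System.out.println("murmurHash -> reducer "
                + partitionFor(murmurHashOfA, numReducers));
        System.out.println("bucketHash -> reducer "
                + partitionFor(bucketHashOfA, numReducers));

        // 1st branch: UNIFORM not set; 2nd branch: UNIFORM set.
        Set<Trait> firstBranch = EnumSet.noneOf(Trait.class);
        Set<Trait> secondBranch = EnumSet.of(Trait.UNIFORM);
        System.out.println("SMB allowed: "
                + sameHashGenerator(firstBranch, secondBranch));
    }
}
```

Under these assumptions and with two reducers, the two hash codes for "a" map to different partitions (1 and 0), so the merge join never sees matching rows from both branches in the same task. The guard returns false for this pair of branches, which corresponds to the case where SMB conversion should be skipped.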



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
