kasakrisz commented on code in PR #4864: URL: https://github.com/apache/hive/pull/4864#discussion_r1395359081
########## ql/src/test/queries/clientpositive/auto_sortmerge_join_17.q: ########## @@ -0,0 +1,20 @@ +CREATE TABLE tbl1_n5(key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS; + +insert into tbl1_n5(key, value) +values +(0, 'val_0'), +(2, 'val_2'), +(9, 'val_9'); + +explain +SELECT t1.key from +(SELECT key , row_number() over(partition by key order by value desc) as rk from tbl1_n5) t1 +join +( SELECT key,count(distinct value) as cp_count from tbl1_n5 group by key) t2 +on t1.key = t2.key where rk = 1; + +SELECT t1.key from +(SELECT key , row_number() over(partition by key order by value desc) as rk from tbl1_n5) t1 Review Comment: To convert merge join to SMB join both join branches must have the same number of RS operators: https://github.com/apache/hive/blob/3ca97725854df10e7cd67514a3219e996453eab1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java#L749-L774 Let's see the plan of query: ``` SELECT t1.key from (SELECT key , row_number() over(partition by key order by value desc) as rk from tbl1_n5) t1 join ( SELECT key,count(distinct value) as cp_count from tbl1_n5 group by key) t2 on t1.key = t2.key where rk = 1; ``` Plan before join conversion: ``` TS[0]-FIL[26]-RS[2]-SEL[3]-PTF[4]-FIL[25]-SEL[5]-RS[18]-JOIN[20]-FS[22] -SEL[28]-GBY[29]-RS[30]-GBY[31]-RS[32] TS[9]-FIL[27]-GBY[12]-RS[13]-GBY[14]-SEL[15]-GBY[17]-RS[19]-JOIN[20] ``` Any merge join need RS operators as parents because we need the data sorted. Additional RS operators are needed because * in the first branch we have PTF * in the second branch we have GBY operator. This is the problematic branch because it has 3 GBYs and two of them are going to be in the same reducer after the conversion: ``` TS[0]-FIL[26]-RS[2]-SEL[3]-PTF[4]-FIL[25]-SEL[5]-SEL[28]-GBY[29]-RS[30]-GBY[31]-RS[32] -MERGEJOIN[33]-FS[22] TS[9]-FIL[27]-GBY[12]-RS[13]-GBY[14]-SEL[15]-GBY[17]-DUMMY_STORE[34]-MERGEJOIN[33] ``` ``` RS[13]-GBY[14]-SEL[15]-GBY[17]-DUMMY_STORE[34]-MERGEJOIN[33] ``` So if the PTF is removed from the query the Join branches going to have different number of RS operators and the SMB join conversion not happen hence no repro. We can use anything else to achieve the same number RS in both branch without PTF. Please see: https://github.com/apache/hive/blob/17525f169b9a08cd715bfb42899e45b7c689c77a/ql/src/test/results/clientpositive/llap/subquery_in_having.q.out#L263-L391 We can go with this subquery repro instead of the one with PTF. I don't have a strong opinion about this but it doesn't seems to be less complex though. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
