kasakrisz commented on PR #3706:
URL: https://github.com/apache/hive/pull/3706#issuecomment-1293439464

   Thanks @scarlin-cloudera for investigating this issue. This patch is a 
possible solution.
   I would like to share another approach: IIUC the issue is caused by the 
extra key column that the distinct adds to the RS located in the mapper. 
   
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L5753
   Without TNK the plan of the query mentioned in the jira looks like this:
   ```
   Map
     TS
       SEL
         GBY (l_orderkey, l_partkey)
           RS (l_orderkey, l_partkey)
   Reduce
     GBY (KEY._col0)
       RS (col0)
   ...
   ```
   A TNK is created on top of each RS, with its keys taken from the 
corresponding RS. Both TNKs are then pushed down to the TS, and when the TNKs 
are merged the one with 2 keys is kept.
   
   How about skipping TNK creation in `TopNKeyProcessor` if the RS has keys 
defined because of distinct
   
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java#L424-L426
 and keeping the existing behavior when no distinct aggregates are present?
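
The idea can be sketched roughly as below. This is a self-contained illustration only, not Hive code: `RsSketch`, `hasDistinctKeys`, `TopNKeySketch` and `shouldCreateTopNKey` are hypothetical names, and the real `TopNKeyProcessor` checks other conditions that are not modelled here.

```java
import java.util.List;

/** Hypothetical stand-in for the parts of an RS descriptor that matter here. */
final class RsSketch {
    final List<String> keyColumns;
    // Assumption: one list of column indices per distinct aggregate; empty when there is none.
    final List<List<Integer>> distinctColumnIndices;

    RsSketch(List<String> keyColumns, List<List<Integer>> distinctColumnIndices) {
        this.keyColumns = keyColumns;
        this.distinctColumnIndices = distinctColumnIndices;
    }

    /** True when at least one key of this RS exists only because of a distinct aggregate. */
    boolean hasDistinctKeys() {
        return distinctColumnIndices != null && !distinctColumnIndices.isEmpty();
    }
}

/** Hypothetical decision helper; it models only the proposed distinct check. */
final class TopNKeySketch {
    private TopNKeySketch() {
    }

    /** Skip TNK creation for an RS with distinct keys, keep the existing behavior otherwise. */
    static boolean shouldCreateTopNKey(RsSketch rs) {
        if (rs.keyColumns.isEmpty()) {
            return false; // nothing to build a TNK from
        }
        return !rs.hasDistinctKeys();
    }
}
```

Applied to the plan above, the mapper RS (l_orderkey, l_partkey) would be skipped, while the reducer-side RS (col0) would still get its TNK.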
   
   I would expect that only TNK (l_orderkey) remains.
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

