kasakrisz commented on PR #3706: URL: https://github.com/apache/hive/pull/3706#issuecomment-1293439464
Thanks @scarlin-cloudera for investigating this issue. This patch is a possible solution. I would like to share another approach.

IIUC the issue is caused by the extra key column that the distinct adds to the RS located in the mapper.
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L5753

Without TNK, the plan of the query mentioned in the Jira looks like this:
```
Map
  TS
  SEL
  GBY (l_orderkey, l_partkey)
  RS (l_orderkey, l_partkey)
Reduce
  GBY (KEY._col0)
  RS (col0)
  ...
```
A TNK is created on top of each RS, with its keys taken from the corresponding RS. Both TNKs are then pushed down to the TS, and when the TNKs are merged, the one with 2 keys is accepted.

How about skipping TNK creation in `TopNKeyProcessor` if the RS has keys defined because of distinct
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java#L424-L426
and keeping the existing behavior when no distinct aggregates are present? I would expect that only TNK (l_orderkey) remains.

What do you think?
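To make the suggestion concrete, here is a minimal, self-contained sketch of the check. It does not use Hive's classes: `RsSketch`, `hasDistinctKeys`, and `shouldCreateTopNKey` are hypothetical stand-ins, and the assumption that an RS carrying distinct keys can be recognized by a non-empty list of distinct column indices is mine, not something confirmed against `ReduceSinkDesc`.

```java
import java.util.List;

// Hypothetical stand-in for the parts of a ReduceSink descriptor relevant here.
// This is NOT Hive's ReduceSinkDesc; the field names are assumptions for illustration.
final class RsSketch {
    final List<String> keyColumns;
    final List<List<Integer>> distinctColumnIndices;

    RsSketch(List<String> keyColumns, List<List<Integer>> distinctColumnIndices) {
        this.keyColumns = keyColumns;
        this.distinctColumnIndices = distinctColumnIndices;
    }

    // Assumption: an RS has extra keys "because of distinct" exactly when
    // at least one group of distinct column indices is recorded on it.
    boolean hasDistinctKeys() {
        return distinctColumnIndices != null && !distinctColumnIndices.isEmpty();
    }
}

final class TopNKeySketch {
    // Proposed rule: skip TNK creation for an RS whose keys include distinct columns,
    // and keep the existing behavior (create the TNK) otherwise.
    static boolean shouldCreateTopNKey(RsSketch rs) {
        return !rs.hasDistinctKeys();
    }

    public static void main(String[] args) {
        // Map-side RS from the plan above: l_partkey is the extra key added by distinct.
        RsSketch mapSideRs =
            new RsSketch(List.of("l_orderkey", "l_partkey"), List.of(List.of(1)));
        // Reduce-side RS from the plan above: no distinct keys.
        RsSketch reduceSideRs = new RsSketch(List.of("col0"), List.of());

        System.out.println(shouldCreateTopNKey(mapSideRs));    // prints false
        System.out.println(shouldCreateTopNKey(reduceSideRs)); // prints true
    }
}
```

With this rule, no TNK would be created for the map-side RS with 2 keys, so only the TNK derived from the reduce-side RS would be left to push down.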
