[GitHub] [hive] kasakrisz commented on pull request #3706: HIVE-26671: Incorrect results with Top N Key optimization

GitBox Thu, 27 Oct 2022 05:16:47 -0700


kasakrisz commented on PR #3706:
URL: https://github.com/apache/hive/pull/3706#issuecomment-1293439464

Thanks @scarlin-cloudera for investigating this issue. This patch is a
possible solution.
I would like to share another approach: IIUC the issue is caused by the
extra key column that the distinct adds to the RS located in the mapper.

https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L5753
Without TNK the plan of the query mentioned in the jira looks like this:
```
Map
TS
SEL
GBY (l_orderkey, l_partkey)
RS (l_orderkey, l_partkey)
Reduce
GBY (KEY._col0)
RS (col0)
...
```
A TNK is created on top of each RS, with its keys coming from the
corresponding RS. Both TNKs are then pushed down until they reach the TS, and
at TNK merging the one with 2 keys is accepted.

How about skipping TNK creation in `TopNKeyProcessor` if the RS has keys
defined because of distinct

https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java#L424-L426
and keeping the existing behavior when no distinct aggregates are present?
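
To make the idea concrete, here is a rough, standalone sketch of the check. It does not use Hive classes: `TopNKeySkipSketch`, `ReduceSink`, `keysAddedForDistinct` and `createTopNKeys` are hypothetical names made up for illustration, and the real change would live in `TopNKeyProcessor` and rely on whatever `ReduceSinkDesc` exposes at the lines linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TopNKeySkipSketch {

    /** Hypothetical, simplified stand-in for an RS descriptor (not Hive's ReduceSinkDesc). */
    static final class ReduceSink {
        final List<String> keys;
        // Assumption: true when the RS carries extra key columns because of a distinct.
        final boolean keysAddedForDistinct;

        ReduceSink(List<String> keys, boolean keysAddedForDistinct) {
            this.keys = keys;
            this.keysAddedForDistinct = keysAddedForDistinct;
        }
    }

    /** Returns the key list of every TNK that would be created, one per eligible RS. */
    static List<List<String>> createTopNKeys(List<ReduceSink> sinks) {
        List<List<String>> topNKeys = new ArrayList<>();
        for (ReduceSink rs : sinks) {
            if (rs.keysAddedForDistinct) {
                // Proposed behavior: no TNK on top of an RS whose keys include distinct columns.
                continue;
            }
            // Existing behavior: the TNK takes its keys from the corresponding RS.
            topNKeys.add(rs.keys);
        }
        return topNKeys;
    }

    public static void main(String[] args) {
        // The two RS operators from the plan above.
        ReduceSink mapperRs = new ReduceSink(Arrays.asList("l_orderkey", "l_partkey"), true);
        ReduceSink reducerRs = new ReduceSink(Arrays.asList("col0"), false);
        System.out.println(createTopNKeys(Arrays.asList(mapperRs, reducerRs)));
    }
}
```

In this sketch only the reducer-side RS yields a TNK, so there is no two-key TNK left to be accepted at merging time. Pushdown and merging themselves are not modeled here.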

I would expect that only TNK (l_orderkey) remains.
What do you think?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
