jcamachor commented on a change in pull request #1878:
URL: https://github.com/apache/hive/pull/1878#discussion_r564886525
##########
File path: ql/src/test/results/clientpositive/llap/subquery_in.q.out
##########
@@ -408,12 +408,18 @@ STAGE PLANS:
expressions: (UDFToDouble(_col0) / _col1) (type: double)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
- Reduce Output Operator
- key expressions: _col0 (type: double)
- null sort order: z
- sort order: +
- Map-reduce partition columns: _col0 (type: double)
+ Group By Operator
Review comment:
@vineetgarg02, I was checking this. The previous plan executed an inner
join; this plan executes a semijoin. From looking at the code, it seems
that for a semijoin we always create a map-side group by operator, without
considering whether that group by would reduce the input data:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L9406
That may not be too bad, since the group by can internally switch to
streaming mode if it is not reducing the input size.
From your comment, though, I gather that some optimization may have kicked
in to remove that group by. Could you elaborate?
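
To make the streaming fallback mentioned above concrete, here is a minimal sketch of a map-side distinct that gives up on hashing when it is not shrinking its input. This is not Hive's implementation: the function name and the `check_interval` and `min_reduction` parameters are hypothetical, loosely echoing settings such as `hive.groupby.mapaggr.checkinterval` and `hive.map.aggr.hash.min.reduction`, and the real operator's logic may differ. It also assumes the downstream semijoin tolerates duplicate keys.

```python
from typing import Hashable, Iterable, Iterator


def map_side_distinct(
    keys: Iterable[Hashable],
    check_interval: int = 4,      # hypothetical: rows to observe before deciding
    min_reduction: float = 0.5,   # hypothetical: max tolerated distinct/input ratio
) -> Iterator[Hashable]:
    """Deduplicate keys in a hash table, or pass rows through if that is not paying off."""
    seen: dict = {}               # dict keeps insertion order, so output is deterministic
    rows_in = 0
    streaming = False

    for key in keys:
        if streaming:
            # Hashing was abandoned: forward every row unchanged.
            yield key
            continue

        rows_in += 1
        seen.setdefault(key, None)

        # One-time efficiency check after `check_interval` rows.
        if rows_in == check_interval and len(seen) > min_reduction * rows_in:
            # Too many distinct keys: flush what was buffered and stop hashing.
            yield from seen
            seen.clear()
            streaming = True

    # Emit whatever is still buffered (empty if we switched to streaming).
    yield from seen


if __name__ == "__main__":
    # Few distinct keys: the hash table stays in use and duplicates are removed.
    print(list(map_side_distinct([1, 1, 2, 1, 2, 2, 1, 1])))  # [1, 2]
    # Mostly distinct keys: the operator switches to pass-through after 4 rows.
    print(list(map_side_distinct([1, 2, 3, 4, 4, 4, 5])))     # [1, 2, 3, 4, 4, 4, 5]
```

In the second call the duplicates of `4` reach the join untouched, which is the cost of the fallback: the group by stops saving work but also stops consuming memory for a hash table that was not helping.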