[
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749290#comment-13749290
]
Phabricator commented on HIVE-3562:
-----------------------------------
navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed
down to map stage".
Thanks for a review.
INLINE COMMENTS
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I thought it's
a convention of hive to make configurations for enable/disable and threshold
each, which was requested several times till now for me for other issues. I'll
address that.
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Legend :
A(a) --> key A, value a, row A(a)
If each mapper takes rows lite this
MAP1(RS) : 40(a)-10(b)-30(c)-10(d)-70(e)-80(f)
MAP2(RS) : 90(g)-80(h)-60(i)-40(j)-30(k)-20(l)
MAP3(RS) : 40(m)-50(n)-30(o)-30(p)-60(q)-70(r)
REDUCER :
10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)-60(i,q)-70(e,r)-80(f,h)-90(g)
REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
with this patch
MAP1 : 40(a)-10(b)-30(c)-10(d)
MAP2 : 40(j)-30(k)-20(l)
MAP3 : 40(m)-50(n)-30(o)-30(p)
REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)
REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 I
thought limit 0 is not realistic and can be used for some kind of testing. I'll
address that case but making Hash seemed a little much.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157
ReduceSinkHash makes arrays for key/value/hashes. It's for compensating those
(not exact but I like the number).
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 I
think there was a test for RS operator which has no collector. I'll check that.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Right.
I've missed that.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 It's
possible to add key/value to hash which is flushed right before. But I just
forwarded it.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 Right.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436
It's not serializable class. But I'll add transient mod.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 I'll
leave log message in the processOp() method.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 Return
value can be false only if it's for GBY. False means the key exists already in
RS hash.
For case A(a)->A(b)->B(c) with limit 5, the first A(a) will be stored into
hash. For the second A(b), value b should be stored, which means values should
be byte[][][] using additional array for each index consuming more memory. But
I suspected that case could be a common case with map aggregation. So I just
decided to forward A(b) to output collector.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 It
does not remove anything. This is called after index is removed from RS hash.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Good.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 Main
difference of HashForOrder from HashForGroupBy is that it should store
duplicated keys. As described in comments, A->B->B->C limit 3 should store
A,B,B not A,B,C. I believe this behavior can be possibly implemented with
simple TreeSet but thought this is way simpler.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I
prefer Map rather than Set but I'll change that.
ql/src/test/queries/clientpositive/limit_pushdown.q:14 ok
ql/src/test/queries/clientpositive/limit_pushdown.q:43 ok.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46
This is not exact description. RS for limit (not last RS). I'll change this,
too.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264
Looks promising. Need more investigation on implementation of distinct.
REVISION DETAIL
https://reviews.facebook.net/D5967
BRANCH
HIVE-3562
ARCANIST PROJECT
hive
To: JIRA, tarball, ashutoshc, navis
Cc: njain
> Some limit can be pushed down to map stage
> ------------------------------------------
>
> Key: HIVE-3562
> URL: https://issues.apache.org/jira/browse/HIVE-3562
> Project: Hive
> Issue Type: Bug
> Reporter: Navis
> Assignee: Navis
> Priority: Trivial
> Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch,
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch,
> HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree,
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira