[jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage

Phabricator (JIRA) Fri, 23 Aug 2013 20:58:30 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749290#comment-13749290
 ]


Phabricator commented on HIVE-3562:
-----------------------------------

navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed 
down to map stage".

  Thanks for a review.

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I thought it's 
a convention of hive to make configurations for enable/disable and threshold 
each, which was requested several times till now for me for other issues. I'll 
address that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Legend : 
A(a) --> key A, value a, row A(a)

  If each mapper takes rows lite this
  MAP1(RS) : 40(a)-10(b)-30(c)-10(d)-70(e)-80(f)
  MAP2(RS) : 90(g)-80(h)-60(i)-40(j)-30(k)-20(l)
  MAP3(RS) : 40(m)-50(n)-30(o)-30(p)-60(q)-70(r)

  REDUCER : 
10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)-60(i,q)-70(e,r)-80(f,h)-90(g)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)

  with this patch

  MAP1 : 40(a)-10(b)-30(c)-10(d)
  MAP2 : 40(j)-30(k)-20(l)
  MAP3 : 40(m)-50(n)-30(o)-30(p)

  REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 I 
thought limit 0 is not realistic and can be used for some kind of testing. I'll 
address that case but making Hash seemed a little much.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 
ReduceSinkHash makes arrays for key/value/hashes. It's for compensating those 
(not exact but I like the number).
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 I 
think there was a test for RS operator which has no collector. I'll check that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Right. 
I've missed that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 It's 
possible to add key/value to hash which is flushed right before. But I just 
forwarded it.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 Right.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 
It's not serializable class. But I'll add transient mod.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 I'll 
leave log message in the processOp() method.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 Return 
value can be false only if it's for GBY. False means the key exists already in 
RS hash.
  For case A(a)->A(b)->B(c) with limit 5, the first A(a) will be stored into 
hash. For the second A(b), value b should be stored, which means values should 
be byte[][][] using additional array for each index consuming more memory. But 
I suspected that case could be a common case with map aggregation. So I just 
decided to forward A(b) to output collector.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 It 
does not remove anything. This is called after index is removed from RS hash.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Good.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 Main 
difference of HashForOrder from HashForGroupBy is that it should store 
duplicated keys. As described in comments, A->B->B->C limit 3 should store 
A,B,B not A,B,C. I believe this behavior can be possibly implemented with 
simple TreeSet but thought this is way simpler.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I 
prefer Map rather than Set but I'll change that.
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 ok
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 ok.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 
This is not exact description. RS for limit (not last RS). I'll change this, 
too.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 
Looks promising. Need more investigation on implementation of distinct.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

                
> Some limit can be pushed down to map stage
> ------------------------------------------
>
>                 Key: HIVE-3562
>                 URL: https://issues.apache.org/jira/browse/HIVE-3562
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, 
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, 
> HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage

Reply via email to