[ https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phabricator updated HIVE-3286: ------------------------------ Attachment: HIVE-3286.D4287.8.patch navis updated the revision "HIVE-3286 [jira] Explicit skew join on user provided condition". Reviewers: JIRA fix mixed merge. my bad. REVISION DETAIL https://reviews.facebook.net/D4287 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java ql/src/java/org/apache/hadoop/hive/ql/io/SkewedKeyPartitioner.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/InlineSkewJoinOptimizer.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/JoinReorder.java ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g ql/src/java/org/apache/hadoop/hive/ql/parse/QBJoinTree.java ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java ql/src/java/org/apache/hadoop/hive/ql/plan/SkewContext.java ql/src/test/queries/clientpositive/skewjoin_explict.q ql/src/test/results/clientpositive/skewjoin_explict.q.out To: JIRA, navis Cc: njain > Explicit skew join on user provided condition > --------------------------------------------- > > Key: HIVE-3286 > URL: https://issues.apache.org/jira/browse/HIVE-3286 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Reporter: Navis > Assignee: Navis > Priority: Minor > Attachments: HIVE-3286.D4287.5.patch, HIVE-3286.D4287.6.patch, > HIVE-3286.D4287.7.patch, HIVE-3286.D4287.8.patch > > > Join operation on table with skewed data takes most of execution time > handling the skewed keys. But mostly we already know about that and even know > what is look like the skewed keys. > If we can explicitly assign reducer slots for the skewed keys, total > execution time could be greatly shortened. > As for a start, I've extended join grammar something like this. > {code} > select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 > < 100, a.key < 150); > {code} > which means if above query is executed by 20 reducers, one reducer for > a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= > a.key < 150, and 17 reducers for others (could be extended to assign more > than one reducer later) > This can be only used with common-inner-equi joins. And skew condition should > be composed of join keys only. > Work till done now will be updated shortly after code cleanup. > ---------------------------- > Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially > at runtime, and first 'true' one decides skew group for the row. Each skew > group has reserved partition slot(s), to which all rows in a group would be > assigned. > The number of partition slot reserved for each group is decided also at > runtime by simple calculation of percentage. If a skew group is "CLUSTER BY > 20 PERCENT" and total partition slot (=number of reducer) is 20, that group > will reserve 4 partition slots, etc. > "DISTRIBUTE BY" decides how the rows in a group is dispersed in the range of > reserved slots (If there is only one slot for a group, this is meaningless). > Currently, three distribution policies are available: RANDOM, KEYS, > <expression>. > 1. RANDOM : rows of driver** alias are dispersed by random and rows of > non-driver alias are duplicated for all the slots (default if not specified) > 2. KEYS : determined by hash value of keys (same with previous) > 3. expression : determined by hash of object evaluated by user-provided > expression > Only possible with inner, equi, common-joins. Not yet supports join tree > merging. > Might be used by other RS users like "SORT BY" or "GROUP BY" > If there exists column statistics for the key, it could be possible to apply > automatically. > For example, if 20 reducers are used for the query below, > {code} > select count(*) from src a join src b on a.key=b.key skew on ( > a.key = '0' CLUSTER BY 10 PERCENT, > b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key), > cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS); > {code} > group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19 and others will > reserve slots 0~5. > For a row with key='0' from alias a, the row is randomly assigned in the > range of 6~7 (driver alias) : 6 or 7 > For a row with key='0' from alias b, the row is disributed for all slots in > 6~7 (non-driver alias) : 6 and 7 > For a row with key='50', the row is assigned in the range of 8~11 by hashcode > of upper(b.key) : 8 + (hash(upper(key)) % 4) > For a row with key='500', the row is assigned in the range of 12~19 by > hashcode of join key : 12 + (hash(key) % 8) > For a row with key='200', this is not belong to any skew group : hash(key) % 6 > *expressions in skew condition : > 1. all expressions should be made of expression in join condition, which > means if join condition is "a.key=b.key", user can make any expression with > "a.key" or "b.key". But if join condition is a.key+1=b.key, user cannot make > expression with "a.key" solely (should make expression with "a.key+1"). > 2. all expressions should reference one and only-one side of aliases. For > example, simple constant expressions or expressions referencing both side of > join condition ("a.key+b.key<100") is not allowed. > 3. all functions in expression should be deteministic and stateless. > 4. if "DISTRIBUTED BY expression" is used, distibution expression also should > have same alias with skew expression. > **driver alias : > 1. driver alias means the sole referenced alias from skew expression, which > is important for RANDOM distribution. rows of driver alias are assigned to > single slot randomly, but rows of non-driver alias are duplicated for all the > slots. So, driver alias should be the biggest one in join aliases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira