[ 
https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545639#comment-13545639
 ] 

Phabricator commented on HIVE-3286:
-----------------------------------

njain has commented on the revision "HIVE-3286 [jira] Explicit skew join on 
user provided condition".

  1. Added stub for optimizer (not completed cause I cannot imagine how the 
histogram look like)

  >> call the optimizer. Come up with any definition. This may change over 
time, but if you dont
  do this now, it will never get integrated with the table metadata.

INLINE COMMENTS
  ql/src/test/queries/clientpositive/skewjoin_explict.q:34 For all the negative 
stuff, it would be much simpler if an error is thrown instead of
  silently ignoring the hint. It becomes much easier to debug/enforce in 
production
  environment.
  ql/src/test/queries/clientpositive/skewjoin_explict.q:24 Can you add some 
tests which select some columns:

  select .. from
  (subq1 involving skewed join) s1
  join
  (subq2 involving skewed join) s2
  on join;


  Add some tests with auto-convert join to true (both where the map-join is 
picked and
  the map-join is not picked). Ideally, for map-join, the skew should matter.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/InlineSkewJoinOptimizer.java:51 
Is this even being called ?

REVISION DETAIL
  https://reviews.facebook.net/D4287

To: JIRA, navis
Cc: njain

                
> Explicit skew join on user provided condition
> ---------------------------------------------
>
>                 Key: HIVE-3286
>                 URL: https://issues.apache.org/jira/browse/HIVE-3286
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: HIVE-3286.D4287.5.patch, HIVE-3286.D4287.6.patch, 
> HIVE-3286.D4287.7.patch, HIVE-3286.D4287.8.patch
>
>
> Join operation on table with skewed data takes most of execution time 
> handling the skewed keys. But mostly we already know about that and even know 
> what is look like the skewed keys.
> If we can explicitly assign reducer slots for the skewed keys, total 
> execution time could be greatly shortened.
> As for a start, I've extended join grammar something like this.
> {code}
> select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 
> < 100, a.key < 150);
> {code}
> which means if above query is executed by 20 reducers, one reducer for 
> a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= 
> a.key < 150, and 17 reducers for others (could be extended to assign more 
> than one reducer later)
> This can be only used with common-inner-equi joins. And skew condition should 
> be composed of join keys only.
> Work till done now will be updated shortly after code cleanup.
> ----------------------------
> Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially 
> at runtime, and first 'true' one decides skew group for the row. Each skew 
> group has reserved partition slot(s), to which all rows in a group would be 
> assigned. 
> The number of partition slot reserved for each group is decided also at 
> runtime by simple calculation of percentage. If a skew group is "CLUSTER BY 
> 20 PERCENT" and total partition slot (=number of reducer) is 20, that group 
> will reserve 4 partition slots, etc.
> "DISTRIBUTE BY" decides how the rows in a group is dispersed in the range of 
> reserved slots (If there is only one slot for a group, this is meaningless). 
> Currently, three distribution policies are available: RANDOM, KEYS, 
> <expression>. 
> 1. RANDOM : rows of driver** alias are dispersed by random and rows of 
> non-driver alias are duplicated for all the slots (default if not specified)
> 2. KEYS : determined by hash value of keys (same with previous)
> 3. expression : determined by hash of object evaluated by user-provided 
> expression
> Only possible with inner, equi, common-joins. Not yet supports join tree 
> merging.
> Might be used by other RS users like "SORT BY" or "GROUP BY"
> If there exists column statistics for the key, it could be possible to apply 
> automatically.
> For example, if 20 reducers are used for the query below,
> {code}
> select count(*) from src a join src b on a.key=b.key skew on (
>    a.key = '0' CLUSTER BY 10 PERCENT,
>    b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
>    cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
> {code}
> group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19 and others will 
> reserve slots 0~5.
> For a row with key='0' from alias a, the row is randomly assigned in the 
> range of 6~7 (driver alias) : 6 or 7
> For a row with key='0' from alias b, the row is disributed for all slots in 
> 6~7 (non-driver alias) : 6 and 7
> For a row with key='50', the row is assigned in the range of 8~11 by hashcode 
> of upper(b.key) : 8 + (hash(upper(key)) % 4)
> For a row with key='500', the row is assigned in the range of 12~19 by 
> hashcode of join key : 12 + (hash(key) % 8)
> For a row with key='200', this is not belong to any skew group : hash(key) % 6
> *expressions in skew condition : 
> 1. all expressions should be made of expression in join condition, which 
> means if join condition is "a.key=b.key", user can make any expression with 
> "a.key" or "b.key". But if join condition is a.key+1=b.key, user cannot make 
> expression with "a.key" solely (should make expression with "a.key+1"). 
> 2. all expressions should reference one and only-one side of aliases. For 
> example, simple constant expressions or expressions referencing both side of 
> join condition ("a.key+b.key<100") is not allowed.
> 3. all functions in expression should be deteministic and stateless.
> 4. if "DISTRIBUTED BY expression" is used, distibution expression also should 
> have same alias with skew expression.
> **driver alias :
> 1. driver alias means the sole referenced alias from skew expression, which 
> is important for RANDOM distribution. rows of driver alias are assigned to 
> single slot randomly, but rows of non-driver alias are duplicated for all the 
> slots. So, driver alias should be the biggest one in join aliases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to