[ 
https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navis updated HIVE-3286:
------------------------
    Description: 
Join operation on table with skewed data takes most of execution time handling 
the skewed keys. But mostly we already know about that and even know what is 
look like the skewed keys.

If we can explicitly assign reducer slots for the skewed keys, total execution 
time could be greatly shortened.

As for a start, I've extended join grammar something like this.
{code}
select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 < 
100, a.key < 150);
{code}

which means if above query is executed by 20 reducers, one reducer for a.key+1 
< 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= a.key < 150, 
and 17 reducers for others (could be extended to assign more than one reducer 
later)

This can be only used with common-inner-equi joins. And skew condition should 
be composed of join keys only.

Work till done now will be updated shortly after code cleanup.


----------------------------


All expressions in the clause "SKEW ON (expr1, expr2, ...)" are called skew 
condition and consist of skew expression*, which is simple boolean expression 
for the group and optional CLUSTER/DISTRIBUTED expression. Skew expressions 
will be evaluated sequentially at runtime, deciding skew group for a row. Each 
skew group has reserved partition slot(s), to which all rows in a group would 
be assigned. 

The number of partition slot reserved for each skew group is decided by cluster 
expression. Before submitting the MR job, hive calculates size of each skew 
groups. If a skew group is "CLUSTER BY 20 PERCENT" and total partition slot 
(=number of reducer) is, say, 20, the group will reserve 4 partition slots for 
it, etc.

The optional "DISTRIBUTE BY" decides how the rows in a skew group is dispersed 
in the range of reserved slots (If there is only one slot for a group, this is 
meaningless). Currently, three distribution policies are available: RANDOM, 
KEYS, <expression>. 
1. RANDOM : rows from driver alias** are dispersed by random and rows of other 
aliases are multicasted to all slots (default behavior)
2. KEYS : rows from driver alias are dispersed by hash of keys same for 
3. expression : determined by evaluation result of user-provided expression

Only possible with inner, equi, common-joins. Not yet supports join tree 
merging or vectorization.
- Might be used by other RS users like "SORT BY" or "GROUP BY" (not-yet)
- If there are column statistics for the skewness of the key, it could be 
possible applied automatically (not-yet)

For example, if 20 reducers are used for the query below,
{code}
select count(*) from src a join src b on a.key=b.key skew on (
   a.key = '0' CLUSTER BY 10 PERCENT,
   b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
   cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
{code}

Skew group-0 would reserve 2 slots (#6~#7), for group-1, 4 slots (#8~#11), for 
group-2, 8 slots (#12~#19) and others will use remaining 6 slots (#0~#5).

For key='0' from alias a(driver alias), it will be assigned to a slot of 
group-0 : 6 or 7
For key='0' from alias b(non-driver alias), it will be multicasted to all slots 
of group-0 : 6 and 7

For key='50' from alias a(non-driver alias), it will be multicasted to all 
slots of group-1 : 8 and 9 and 10 and 11
For key='50' from alias b(driver alias), it will be assigned to a slot of 
group-1 : 8 or 9 or 10 or 11

For key='500' from alias a(driver alias), it will be assigned to a slot of 
group-2 by modulation of hashcode of DISTRIBUTE expression
For key='500' from alias b(non-driver alias), it will be multicasted to all 
slots of group-2 : #12~#19

For key='200', it's not belong to any skew group and will be processed normally 
in the range of partition slot 0~5.

*skew expression : 
1. all expressions should be made of expression in join condition, which means 
if join condition is "a.key=b.key", user can make any expression with "a.key" 
or "b.key". But if join condition is a.key+1=b.key, user cannot make expression 
with "a.key" solely (or make expression with "a.key+1"). 
2. all expressions should reference one and only-one side of aliases. For 
example, simple constant expressions or expressions referencing both side of 
join condition ("a.key+b.key<100") is not allowed.
3. all functions in expression should be deterministic and stateless.
4. DISTRIBUTED expression should have same alias with skew expression.

**driver alias :
1. driver alias means the sole referenced alias from skew expression, which is 
important for RANDOM distribution. Rows from driver alias are assigned to one 
slot, but rows from other aliases will be multicasted to all slots of the 
group. 


  was:
Join operation on table with skewed data takes most of execution time handling 
the skewed keys. But mostly we already know about that and even know what is 
look like the skewed keys.

If we can explicitly assign reducer slots for the skewed keys, total execution 
time could be greatly shortened.

As for a start, I've extended join grammar something like this.
{code}
select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 < 
100, a.key < 150);
{code}

which means if above query is executed by 20 reducers, one reducer for a.key+1 
< 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= a.key < 150, 
and 17 reducers for others (could be extended to assign more than one reducer 
later)

This can be only used with common-inner-equi joins. And skew condition should 
be composed of join keys only.

Work till done now will be updated shortly after code cleanup.


----------------------------


Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially at 
runtime, and first 'true' one decides skew group for the row. Each skew group 
has reserved partition slot(s), to which all rows in a group would be assigned. 

The number of partition slot reserved for each group is decided also at runtime 
by simple calculation of percentage. If a skew group is "CLUSTER BY 20 PERCENT" 
and total partition slot (=number of reducer) is 20, that group will reserve 4 
partition slots, etc.

"DISTRIBUTE BY" decides how the rows in a group is dispersed in the range of 
reserved slots (If there is only one slot for a group, this is meaningless). 
Currently, three distribution policies are available: RANDOM, KEYS, 
<expression>. 
1. RANDOM : rows of driver** alias are dispersed by random and rows of 
non-driver alias are duplicated for all the slots (default if not specified)
2. KEYS : determined by hash value of keys (same with previous)
3. expression : determined by hash of object evaluated by user-provided 
expression

Only possible with inner, equi, common-joins. Not yet supports join tree 
merging.
Might be used by other RS users like "SORT BY" or "GROUP BY"
If there exists column statistics for the key, it could be possible to apply 
automatically.

For example, if 20 reducers are used for the query below,
{code}
select count(*) from src a join src b on a.key=b.key skew on (
   a.key = '0' CLUSTER BY 10 PERCENT,
   b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
   cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
{code}

group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19 and others will 
reserve slots 0~5.

For a row with key='0' from alias a, the row is randomly assigned in the range 
of 6~7 (driver alias) : 6 or 7
For a row with key='0' from alias b, the row is disributed for all slots in 6~7 
(non-driver alias) : 6 and 7
For a row with key='50', the row is assigned in the range of 8~11 by hashcode 
of upper(b.key) : 8 + (hash(upper(key)) % 4)
For a row with key='500', the row is assigned in the range of 12~19 by hashcode 
of join key : 12 + (hash(key) % 8)
For a row with key='200', this is not belong to any skew group : hash(key) % 6

*expressions in skew condition : 
1. all expressions should be made of expression in join condition, which means 
if join condition is "a.key=b.key", user can make any expression with "a.key" 
or "b.key". But if join condition is a.key+1=b.key, user cannot make expression 
with "a.key" solely (should make expression with "a.key+1"). 
2. all expressions should reference one and only-one side of aliases. For 
example, simple constant expressions or expressions referencing both side of 
join condition ("a.key+b.key<100") is not allowed.
3. all functions in expression should be deteministic and stateless.
4. if "DISTRIBUTED BY expression" is used, distibution expression also should 
have same alias with skew expression.

**driver alias :
1. driver alias means the sole referenced alias from skew expression, which is 
important for RANDOM distribution. rows of driver alias are assigned to single 
slot randomly, but rows of non-driver alias are duplicated for all the slots. 
So, driver alias should be the biggest one in join aliases.



> Explicit skew join on user provided condition
> ---------------------------------------------
>
>                 Key: HIVE-3286
>                 URL: https://issues.apache.org/jira/browse/HIVE-3286
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: D4287.11.patch, HIVE-3286.12.patch.txt, 
> HIVE-3286.13.patch.txt, HIVE-3286.14.patch.txt, HIVE-3286.15.patch.txt, 
> HIVE-3286.16.patch.txt, HIVE-3286.17.patch.txt, HIVE-3286.18.patch.txt, 
> HIVE-3286.19.patch.txt, HIVE-3286.D4287.10.patch, HIVE-3286.D4287.5.patch, 
> HIVE-3286.D4287.6.patch, HIVE-3286.D4287.7.patch, HIVE-3286.D4287.8.patch, 
> HIVE-3286.D4287.9.patch
>
>
> Join operation on table with skewed data takes most of execution time 
> handling the skewed keys. But mostly we already know about that and even know 
> what is look like the skewed keys.
> If we can explicitly assign reducer slots for the skewed keys, total 
> execution time could be greatly shortened.
> As for a start, I've extended join grammar something like this.
> {code}
> select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 
> < 100, a.key < 150);
> {code}
> which means if above query is executed by 20 reducers, one reducer for 
> a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= 
> a.key < 150, and 17 reducers for others (could be extended to assign more 
> than one reducer later)
> This can be only used with common-inner-equi joins. And skew condition should 
> be composed of join keys only.
> Work till done now will be updated shortly after code cleanup.
> ----------------------------
> All expressions in the clause "SKEW ON (expr1, expr2, ...)" are called skew 
> condition and consist of skew expression*, which is simple boolean expression 
> for the group and optional CLUSTER/DISTRIBUTED expression. Skew expressions 
> will be evaluated sequentially at runtime, deciding skew group for a row. 
> Each skew group has reserved partition slot(s), to which all rows in a group 
> would be assigned. 
> The number of partition slot reserved for each skew group is decided by 
> cluster expression. Before submitting the MR job, hive calculates size of 
> each skew groups. If a skew group is "CLUSTER BY 20 PERCENT" and total 
> partition slot (=number of reducer) is, say, 20, the group will reserve 4 
> partition slots for it, etc.
> The optional "DISTRIBUTE BY" decides how the rows in a skew group is 
> dispersed in the range of reserved slots (If there is only one slot for a 
> group, this is meaningless). Currently, three distribution policies are 
> available: RANDOM, KEYS, <expression>. 
> 1. RANDOM : rows from driver alias** are dispersed by random and rows of 
> other aliases are multicasted to all slots (default behavior)
> 2. KEYS : rows from driver alias are dispersed by hash of keys same for 
> 3. expression : determined by evaluation result of user-provided expression
> Only possible with inner, equi, common-joins. Not yet supports join tree 
> merging or vectorization.
> - Might be used by other RS users like "SORT BY" or "GROUP BY" (not-yet)
> - If there are column statistics for the skewness of the key, it could be 
> possible applied automatically (not-yet)
> For example, if 20 reducers are used for the query below,
> {code}
> select count(*) from src a join src b on a.key=b.key skew on (
>    a.key = '0' CLUSTER BY 10 PERCENT,
>    b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
>    cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
> {code}
> Skew group-0 would reserve 2 slots (#6~#7), for group-1, 4 slots (#8~#11), 
> for group-2, 8 slots (#12~#19) and others will use remaining 6 slots (#0~#5).
> For key='0' from alias a(driver alias), it will be assigned to a slot of 
> group-0 : 6 or 7
> For key='0' from alias b(non-driver alias), it will be multicasted to all 
> slots of group-0 : 6 and 7
> For key='50' from alias a(non-driver alias), it will be multicasted to all 
> slots of group-1 : 8 and 9 and 10 and 11
> For key='50' from alias b(driver alias), it will be assigned to a slot of 
> group-1 : 8 or 9 or 10 or 11
> For key='500' from alias a(driver alias), it will be assigned to a slot of 
> group-2 by modulation of hashcode of DISTRIBUTE expression
> For key='500' from alias b(non-driver alias), it will be multicasted to all 
> slots of group-2 : #12~#19
> For key='200', it's not belong to any skew group and will be processed 
> normally in the range of partition slot 0~5.
> *skew expression : 
> 1. all expressions should be made of expression in join condition, which 
> means if join condition is "a.key=b.key", user can make any expression with 
> "a.key" or "b.key". But if join condition is a.key+1=b.key, user cannot make 
> expression with "a.key" solely (or make expression with "a.key+1"). 
> 2. all expressions should reference one and only-one side of aliases. For 
> example, simple constant expressions or expressions referencing both side of 
> join condition ("a.key+b.key<100") is not allowed.
> 3. all functions in expression should be deterministic and stateless.
> 4. DISTRIBUTED expression should have same alias with skew expression.
> **driver alias :
> 1. driver alias means the sole referenced alias from skew expression, which 
> is important for RANDOM distribution. Rows from driver alias are assigned to 
> one slot, but rows from other aliases will be multicasted to all slots of the 
> group. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to