[jira] [Commented] (HIVE-3286) Explicit skew join on user provided condition

Hive QA (JIRA) Wed, 27 Nov 2013 17:21:27 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834436#comment-13834436
 ]


Hive QA commented on HIVE-3286:
-------------------------------



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12616156/HIVE-3286.13.patch.txt

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/472/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/472/console

Messages:
{noformat}
**** This message was trimmed, see log for full details ****
As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_CASE TinyintLiteral" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_NULL GREATERTHAN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_NOT DecimalLiteral" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_CASE DecimalLiteral" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:108:5: 
Decision can match input such as "KW_ORDER KW_BY LPAREN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:121:5: 
Decision can match input such as "KW_CLUSTER KW_BY LPAREN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:133:5: 
Decision can match input such as "KW_PARTITION KW_BY LPAREN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:144:5: 
Decision can match input such as "KW_DISTRIBUTE KW_BY LPAREN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:155:5: 
Decision can match input such as "KW_SORT KW_BY LPAREN" using multiple 
alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:172:7: 
Decision can match input such as "STAR" using multiple alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_ARRAY" using multiple alternatives: 2, 6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_UNIONTYPE" using multiple alternatives: 5, 
6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_STRUCT" using multiple alternatives: 4, 6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_NULL" using multiple alternatives: 1, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_FALSE" using multiple alternatives: 3, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_TRUE" using multiple alternatives: 3, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_DATE StringLiteral" using multiple 
alternatives: 2, 3

As a result, alternative(s) 3 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_SORT KW_BY" 
using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_GROUP 
KW_BY" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT 
KW_OVERWRITE" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_MAP LPAREN" 
using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_ORDER 
KW_BY" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "KW_BETWEEN KW_MAP LPAREN" using multiple 
alternatives: 8, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_LATERAL 
KW_VIEW" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_CLUSTER 
KW_BY" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT 
KW_INTO" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_DISTRIBUTE 
KW_BY" using multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:524:5: 
Decision can match input such as "{AMPERSAND..BITWISEXOR, DIV..DIVIDE, 
EQUAL..EQUAL_NS, GREATERTHAN..GREATERTHANOREQUALTO, KW_AND, KW_ARRAY, 
KW_BETWEEN..KW_BOOLEAN, KW_CASE, KW_DOUBLE, KW_FLOAT, KW_IF, KW_IN, KW_INT, 
KW_LIKE, KW_MAP, KW_NOT, KW_OR, KW_REGEXP, KW_RLIKE, KW_SMALLINT, 
KW_STRING..KW_STRUCT, KW_TINYINT, KW_UNIONTYPE, KW_WHEN, 
LESSTHAN..LESSTHANOREQUALTO, MINUS..NOTEQUAL, PLUS, STAR, TILDE}" using 
multiple alternatives: 1, 3

As a result, alternative(s) 3 were disabled for that input
[INFO] 
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ hive-exec 
---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-exec ---
[INFO] Executing tasks

main:
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ hive-exec ---
[INFO] Compiling 1400 source files to 
/data/hive-ptest/working/apache-svn-trunk-source/ql/target/classes
[INFO] -------------------------------------------------------------
[WARNING] COMPILATION WARNING : 
[INFO] -------------------------------------------------------------
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[WARNING] Note: Some input files use unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
[INFO] 4 warnings 
[INFO] -------------------------------------------------------------
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] 
/data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/plan/SkewContext.java:[118,49]
 cannot find symbol
symbol  : method getRandom()
location: class org.apache.hadoop.hive.ql.io.HiveKey
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Hive .............................................. SUCCESS [2.846s]
[INFO] Hive Ant Utilities ................................ SUCCESS [9.072s]
[INFO] Hive Shims Common ................................. SUCCESS [3.459s]
[INFO] Hive Shims 0.20 ................................... SUCCESS [2.210s]
[INFO] Hive Shims Secure Common .......................... SUCCESS [2.711s]
[INFO] Hive Shims 0.20S .................................. SUCCESS [1.397s]
[INFO] Hive Shims 0.23 ................................... SUCCESS [3.679s]
[INFO] Hive Shims ........................................ SUCCESS [3.073s]
[INFO] Hive Common ....................................... SUCCESS [13.388s]
[INFO] Hive Serde ........................................ SUCCESS [11.713s]
[INFO] Hive Metastore .................................... SUCCESS [26.188s]
[INFO] Hive Query Language ............................... FAILURE [30.091s]
[INFO] Hive Service ...................................... SKIPPED
[INFO] Hive JDBC ......................................... SKIPPED
[INFO] Hive Beeline ...................................... SKIPPED
[INFO] Hive CLI .......................................... SKIPPED
[INFO] Hive Contrib ...................................... SKIPPED
[INFO] Hive HBase Handler ................................ SKIPPED
[INFO] Hive HCatalog ..................................... SKIPPED
[INFO] Hive HCatalog Core ................................ SKIPPED
[INFO] Hive HCatalog Pig Adapter ......................... SKIPPED
[INFO] Hive HCatalog Server Extensions ................... SKIPPED
[INFO] Hive HCatalog Webhcat Java Client ................. SKIPPED
[INFO] Hive HCatalog Webhcat ............................. SKIPPED
[INFO] Hive HCatalog HBase Storage Handler ............... SKIPPED
[INFO] Hive HWI .......................................... SKIPPED
[INFO] Hive ODBC ......................................... SKIPPED
[INFO] Hive Shims Aggregator ............................. SKIPPED
[INFO] Hive TestUtils .................................... SKIPPED
[INFO] Hive Packaging .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:52.444s
[INFO] Finished at: Wed Nov 27 20:19:46 EST 2013
[INFO] Final Memory: 51M/371M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project hive-exec: Compilation failure
[ERROR] 
/data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/plan/SkewContext.java:[118,49]
 cannot find symbol
[ERROR] symbol  : method getRandom()
[ERROR] location: class org.apache.hadoop.hive.ql.io.HiveKey
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hive-exec
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12616156

> Explicit skew join on user provided condition
> ---------------------------------------------
>
>                 Key: HIVE-3286
>                 URL: https://issues.apache.org/jira/browse/HIVE-3286
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: D4287.11.patch, HIVE-3286.12.patch.txt, 
> HIVE-3286.13.patch.txt, HIVE-3286.D4287.10.patch, HIVE-3286.D4287.5.patch, 
> HIVE-3286.D4287.6.patch, HIVE-3286.D4287.7.patch, HIVE-3286.D4287.8.patch, 
> HIVE-3286.D4287.9.patch
>
>
> Join operation on table with skewed data takes most of execution time 
> handling the skewed keys. But mostly we already know about that and even know 
> what is look like the skewed keys.
> If we can explicitly assign reducer slots for the skewed keys, total 
> execution time could be greatly shortened.
> As for a start, I've extended join grammar something like this.
> {code}
> select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 
> < 100, a.key < 150);
> {code}
> which means if above query is executed by 20 reducers, one reducer for 
> a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= 
> a.key < 150, and 17 reducers for others (could be extended to assign more 
> than one reducer later)
> This can be only used with common-inner-equi joins. And skew condition should 
> be composed of join keys only.
> Work till done now will be updated shortly after code cleanup.
> ----------------------------
> Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially 
> at runtime, and first 'true' one decides skew group for the row. Each skew 
> group has reserved partition slot(s), to which all rows in a group would be 
> assigned. 
> The number of partition slot reserved for each group is decided also at 
> runtime by simple calculation of percentage. If a skew group is "CLUSTER BY 
> 20 PERCENT" and total partition slot (=number of reducer) is 20, that group 
> will reserve 4 partition slots, etc.
> "DISTRIBUTE BY" decides how the rows in a group is dispersed in the range of 
> reserved slots (If there is only one slot for a group, this is meaningless). 
> Currently, three distribution policies are available: RANDOM, KEYS, 
> <expression>. 
> 1. RANDOM : rows of driver** alias are dispersed by random and rows of 
> non-driver alias are duplicated for all the slots (default if not specified)
> 2. KEYS : determined by hash value of keys (same with previous)
> 3. expression : determined by hash of object evaluated by user-provided 
> expression
> Only possible with inner, equi, common-joins. Not yet supports join tree 
> merging.
> Might be used by other RS users like "SORT BY" or "GROUP BY"
> If there exists column statistics for the key, it could be possible to apply 
> automatically.
> For example, if 20 reducers are used for the query below,
> {code}
> select count(*) from src a join src b on a.key=b.key skew on (
>    a.key = '0' CLUSTER BY 10 PERCENT,
>    b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
>    cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
> {code}
> group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19 and others will 
> reserve slots 0~5.
> For a row with key='0' from alias a, the row is randomly assigned in the 
> range of 6~7 (driver alias) : 6 or 7
> For a row with key='0' from alias b, the row is disributed for all slots in 
> 6~7 (non-driver alias) : 6 and 7
> For a row with key='50', the row is assigned in the range of 8~11 by hashcode 
> of upper(b.key) : 8 + (hash(upper(key)) % 4)
> For a row with key='500', the row is assigned in the range of 12~19 by 
> hashcode of join key : 12 + (hash(key) % 8)
> For a row with key='200', this is not belong to any skew group : hash(key) % 6
> *expressions in skew condition : 
> 1. all expressions should be made of expression in join condition, which 
> means if join condition is "a.key=b.key", user can make any expression with 
> "a.key" or "b.key". But if join condition is a.key+1=b.key, user cannot make 
> expression with "a.key" solely (should make expression with "a.key+1"). 
> 2. all expressions should reference one and only-one side of aliases. For 
> example, simple constant expressions or expressions referencing both side of 
> join condition ("a.key+b.key<100") is not allowed.
> 3. all functions in expression should be deteministic and stateless.
> 4. if "DISTRIBUTED BY expression" is used, distibution expression also should 
> have same alias with skew expression.
> **driver alias :
> 1. driver alias means the sole referenced alias from skew expression, which 
> is important for RANDOM distribution. rows of driver alias are assigned to 
> single slot randomly, but rows of non-driver alias are duplicated for all the 
> slots. So, driver alias should be the biggest one in join aliases.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (HIVE-3286) Explicit skew join on user provided condition

Reply via email to