Joins" by EdwardCapriolo

Apache Wiki Tue, 13 Jul 2010 17:22:56 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/LanguageManual/Joins" page has been changed by EdwardCapriolo.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins?action=diff&rev1=19&rev2=20

--------------------------------------------------

  <<TableOfContents>>
  
  ## page was renamed from Hive/LanguageManual/LanguageManual/Joins
- == Join Syntax ==
+ == THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax ==
  Hive supports the following syntax for joining tables:
  
  {{{
@@ -24, +24 @@

  join_condition:
      ON equality_expression ( AND equality_expression )*
  
- equality_expression: 
+ equality_expression:
      expression = expression
  }}}
+ Only equality joins, outer joins, and left semi joins are supported in Hive. 
Hive does not support join conditions that are not equality conditions as it is 
very difficult to express such conditions as a map/reduce job. Also, more than 
two tables can be joined in Hive.
- 
- Only equality joins, outer joins, and left semi joins are supported in Hive. 
Hive does not support join conditions that are not equality
- conditions as it is very difficult to express such conditions as a map/reduce 
job. Also, more than two tables can be
- joined in Hive.
  
  Some salient points to consider when writing join queries are as follows:
  
   * Only equality joins are allowed e.g.
+ 
- {{{ 
+ {{{
-   SELECT a.* FROM a JOIN b ON (a.id = b.id) 
+   SELECT a.* FROM a JOIN b ON (a.id = b.id)
  }}}
- {{{ 
+ {{{
-   SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department) 
+   SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
  }}}
-   are both valid joins, however
+  . are both valid joins, however
+ 
  {{{
    SELECT a.* FROM a JOIN b ON (a.id <> b.id)
  }}}
-   is NOT allowed
+  . is NOT allowed
+ 
   * More than 2 tables can be joined in the same query e.g.
+ 
  {{{
    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON 
(c.key = b.key2)
  }}}
-   is a valid join.
+  . is a valid join.
+ 
   * Hive converts joins over multiple tables into a single map/reduce job if 
for every table the same column is used in the join clauses e.g.
+ 
  {{{
    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON 
(c.key = b.key1)
  }}}
-   is converted into a single map/reduce job as only key1 column for b is 
involved in the join. On the other hand
+  . is converted into a single map/reduce job as only key1 column for b is 
involved in the join. On the other hand
+ 
  {{{
    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON 
(c.key = b.key2)
  }}}
-   is converted into two map/reduce jobs because key1 column from b is used in 
the first join condition and key2 column from b is used in the second one. The 
first map/reduce job joins a with b and the results are then joined with c in 
the second map/reduce job.
+  . is converted into two map/reduce jobs because key1 column from b is used 
in the first join condition and key2 column from b is used in the second one. 
The first map/reduce job joins a with b and the results are then joined with c 
in the second map/reduce job.
+ 
   * In every map/reduce stage of the join, the last table in the sequence is 
streamed through the reducers where as the others are buffered. Therefore, it 
helps to reduce the memory needed in the reducer for buffering the rows for a 
particular value of the join key by organizing the tables such that the largest 
tables appear last in the sequence. e.g. in
+ 
  {{{
    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON 
(c.key = b.key1)
  }}}
-   all the three tables are joined in a single map/reduce job and the values 
for a particular value of the key for tables a and b are buffered in the memory 
in the reducers. Then for each row retrieved from c, the join is computed with 
the buffered rows. Similarly for
+  . all the three tables are joined in a single map/reduce job and the values 
for a particular value of the key for tables a and b are buffered in the memory 
in the reducers. Then for each row retrieved from c, the join is computed with 
the buffered rows. Similarly for
+ 
  {{{
    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON 
(c.key = b.key2)
  }}}
-   there are two map/reduce jobs involved in computing the join. The first of 
these joins a with b and buffers the values of a while streaming the values of 
b in the reducers. The second of one of these jobs buffers the results of the 
first join while streaming the values of c through the reducers.
+  . there are two map/reduce jobs involved in computing the join. The first of 
these joins a with b and buffers the values of a while streaming the values of 
b in the reducers. The second of one of these jobs buffers the results of the 
first join while streaming the values of c through the reducers.
+ 
   * In every map/reduce stage of the join, the table to be streamed can be 
specified via a hint. e.g. in
+ 
  {{{
    SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = 
b.key1) JOIN c ON (c.key = b.key1)
  }}}
-   all the three tables are joined in a single map/reduce job and the values 
for a particular value of the key for tables b and c are buffered in the memory 
in the reducers. Then for each row retrieved from a, the join is computed with 
the buffered rows. 
+  . all the three tables are joined in a single map/reduce job and the values 
for a particular value of the key for tables b and c are buffered in the memory 
in the reducers. Then for each row retrieved from a, the join is computed with 
the buffered rows.
+ 
   * LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control 
over ON clauses for which there is no match. For example, this query:
+ 
  {{{
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
  }}}
-   will return a row for every row in a. This output row will be a.val,b.val 
when there is a b.key that equals a.key, and the output row will be a.val,NULL 
when there is no corresponding b.key. Rows from b which have no corresponding 
a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be written on 
one line in order to understand how it works--a is to the LEFT of b in this 
query, and so all rows from a are kept; a RIGHT OUTER JOIN will keep all rows 
from b, and a FULL OUTER JOIN will keep all rows from a and all rows from b. 
OUTER JOIN semantics should conform to standard SQL specs.
+  . will return a row for every row in a. This output row will be a.val,b.val 
when there is a b.key that equals a.key, and the output row will be a.val,NULL 
when there is no corresponding b.key. Rows from b which have no corresponding 
a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be written on 
one line in order to understand how it works--a is to the LEFT of b in this 
query, and so all rows from a are kept; a RIGHT OUTER JOIN will keep all rows 
from b, and a FULL OUTER JOIN will keep all rows from a and all rows from b. 
OUTER JOIN semantics should conform to standard SQL specs.
+ 
   * Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT 
of a join, a requirement should be in the WHERE clause, otherwise it should be 
in the JOIN clause. A big point of confusion for this issue is partitioned 
tables:
+ 
  {{{
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
    WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
  }}}
-   will join a on b, producing a list of a.val and b.val. The WHERE clause, 
however, can also reference other columns of a and b that are in the output of 
the join, and then filter them out. However, whenever a row from the JOIN has 
found a key for a and no key for b, all of the columns of b will be NULL, 
'''including the ds column'''. This is to say, you will filter out all rows of 
join output for which there was no valid b.key, and thus you have outsmarted 
your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is 
irrelevant if you reference any column of b in the WHERE clause. Instead, when 
OUTER JOINing, use this syntax:
+  . will join a on b, producing a list of a.val and b.val. The WHERE clause, 
however, can also reference other columns of a and b that are in the output of 
the join, and then filter them out. However, whenever a row from the JOIN has 
found a key for a and no key for b, all of the columns of b will be NULL, 
'''including the ds column'''. This is to say, you will filter out all rows of 
join output for which there was no valid b.key, and thus you have outsmarted 
your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is 
irrelevant if you reference any column of b in the WHERE clause. Instead, when 
OUTER JOINing, use this syntax:
+ 
  {{{
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b
    ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')
  }}}
-   ...the result is that the output of the join is pre-filtered, and you won't 
get post-filtering trouble for rows that have a valid a.key but no matching 
b.key. The same logic applies to RIGHT and FULL joins.
+  . ..the result is that the output of the join is pre-filtered, and you won't 
get post-filtering trouble for rows that have a valid a.key but no matching 
b.key. The same logic applies to RIGHT and FULL joins.
+ 
   * Joins are NOT commutative! Joins are left-associative regardless of 
whether they are LEFT or RIGHT joins.
+ 
  {{{
    SELECT a.val1, a.val2, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)
  }}}
-   ...first joins a on b, throwing away everything in a or b that does not 
have a corresponding key in the other table. The reduced table is then joined 
on c. This provides unintuitive results if there is a key that exists in both a 
and c, but not b: The whole row (including a.val1, a.val2, and a.key) is 
dropped in the "a JOIN b" step, so when the result of that is joined with c, 
any row with a c.key that had a corresponding a.key or b.key (but not both) 
will show up as NULL, NULL, NULL, c.val.
+  . ..first joins a on b, throwing away everything in a or b that does not 
have a corresponding key in the other table. The reduced table is then joined 
on c. This provides unintuitive results if there is a key that exists in both a 
and c, but not b: The whole row (including a.val1, a.val2, and a.key) is 
dropped in the "a JOIN b" step, so when the result of that is joined with c, 
any row with a c.key that had a corresponding a.key or b.key (but not both) 
will show up as NULL, NULL, NULL, c.val.
+ 
-  * LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in 
an efficient way. Since Hive currently does not support IN/EXISTS subqueries, 
you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using 
LEFT SEMI JOIN is that the right-hand-side table should only be referenced in 
the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.  
+  * LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in 
an efficient way. Since Hive currently does not support IN/EXISTS subqueries, 
you can rewrite your queries using LEFT SEMI JOIN. The restrictions of using 
LEFT SEMI JOIN is that the right-hand-side table should only be referenced in 
the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.
+ 
  {{{
    SELECT a.key, a.value
-   FROM a 
+   FROM a
-   WHERE a.key in 
+   WHERE a.key in
     (SELECT b.key
      FROM B);
  }}}
  can be rewritten to:
+ 
  {{{
     SELECT a.key, a.val
     FROM a LEFT SEMI JOIN b on (a.key = b.key)
- }}}   
+ }}}
-  * If all but one of the tables being joined are small, the join can be 
performed as a map only job. The query  
+  * If all but one of the tables being joined are small, the join can be 
performed as a map only job. The query
+ 
  {{{
    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM a join b on a.key = b.key
  }}}
  does not need a reducer. For every mapper of A, B is read completely. The 
restriction is that '''a FULL/RIGHT OUTER JOIN b''' cannot be performed
+ 
-  * If the tables being joined are bucketized, and the buckets are a multiple 
of each other, the buckets can be joined with each other. If table A has 8 
buckets are table B has 4 buckets, the following join  
+  * If the tables being joined are bucketized, and the buckets are a multiple 
of each other, the buckets can be joined with each other. If table A has 8 
buckets are table B has 4 buckets, the following join
+ 
  {{{
    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM a join b on a.key = b.key
  }}}
- can be done on the mapper only. Instead of fetching B completely for each 
mapper of A, only the required buckets are fetched. For the query above, the 
mapper processing bucket 1 for A will only fetch bucket 1 of B.
+ can be done on the mapper only. Instead of fetching B completely for each 
mapper of A, only the required buckets are fetched. For the query above, the 
mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the 
default behavior, and is governed by the following parameter
- It is not the default behavior, and is governed by the following parameter 
+ 
  {{{
    set hive.optimize.bucketmapjoin = true
  }}}
   * If the tables being joined are sorted and bucketized, and the number of 
buckets are same, a sort-merge join can be performed. The corresponding buckets 
are joined with each other at the mapper. If both A and B have 4 buckets,
+ 
  {{{
    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM A a join B b on a.key = b.key
  }}}
  can be done on the mapper only. The mapper for the bucket for A will traverse 
the corresponding bucket for B. This is not the default behavior, and the 
following parameters need to be set:
+ 
  {{{
    set 
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    set hive.optimize.bucketmapjoin = true;

[Hadoop Wiki] Update of "Hive/LanguageManual/Joins" by EdwardCapriolo

Reply via email to