[Pig Wiki] Update of "JoinFramework" by OlgaN

Apache Wiki Wed, 08 Oct 2008 18:09:42 -0700

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

------------------------------------------------------------------------------
  
  This join type takes advantage of the fact that the data of all relations is 
already partitioned by the join key or its prefix, which means that the join can 
be done completely independently on separate nodes. It further helps if the 
data is sorted on the key; otherwise it might have to be sorted before the 
join.
  
- Also if some but not other tables are partitioned on the join key; the other 
table can be shuffled before the join.
+ Also if some but not other tables are partitioned on the join key, the 
unpartitioned tables can be shuffled before the join.
  
- In the case of Hadoop, this means that the join can be done in a Map avoiding 
SORT/SHUFFLE/REDUCE stages. The performance would be even better if the 
partitions for the same key ranges were collocated on the same nodes and if the 
computation was scheduled to run on these nodes. However, for now this is 
outside of Pig's control.
+ In the case of Hadoop, this type of join means that the join can be done in a 
Map avoiding SORT/SHUFFLE/REDUCE stages. The performance would be even better 
if the partitions for the same key ranges were co-located on the same nodes and 
if the computation was scheduled to run on these nodes. However, for now this is 
outside of Pig's control.
  
  Note that `GROUP` can take advantage of this knowledge as well.
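
A minimal sketch of the idea, not Pig's implementation: assuming both inputs are already split into matching partitions on the join key and each partition is sorted by that key, every pair of partitions can be joined on its own with a single merge pass. The names `merge_join` and `partitioned_join` and the `(key, value)` row layout are hypothetical and chosen only for illustration.

```python
def merge_join(left, right):
    """Join two lists of (key, value) rows that are each sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out


def partitioned_join(left_parts, right_parts):
    """Join partition n of one input with partition n of the other.

    Each pair is independent, so in a cluster each pair could be
    handled by a separate map task with no shuffle in between.
    """
    result = []
    for left, right in zip(left_parts, right_parts):
        result.extend(merge_join(left, right))
    return result
```

For example, `partitioned_join([[("a", 1)], [("x", 3)]], [[("a", 10)], [("x", 30)]])` joins the first partitions with each other and the second partitions with each other, never comparing rows across partitions.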
  
@@ -28, +28 @@

  
  === Fragment Replicate Join (FRJ) ===
  
- This join type takes advantage of the fact that N-1 relations in the join are 
small. In this case, the small tables can be copied onto all the nodes and be 
joined with the data from the larger table. This saves the cost of sorting and 
partitioning the large table. The performance benefits can be even greater if 
small tables fit into main memory; otherwise both the small tables and the 
partition of the large need to be sorted which is still better than having to 
shuffle the large table.
+ This join type takes advantage of the fact that N-1 relations in the join are 
small. In this case, the small tables can be copied onto all the nodes and be 
joined with the data from the larger table. This saves the cost of sorting and 
partitioning the large table. The performance benefits can be even greater if 
small tables fit into main memory; otherwise, both the small tables and the 
partition of the large need to be sorted which is still better than having to 
shuffle the large table.
  
- If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.
+ If you have several larger tables in the join, it might be beneficial to 
split the join to fit FRJ pattern since it would significantly reduce the size 
of the data going into the next join and might even allow to use FRJ again.
  
- For Hadoop this means that the join can happen on the map side. 
+ For Hadoop this type of join means that the join can happen on the map side. 
  
  The data coming out of the join is not guaranteed to be sorted on the join 
key, which could cause problems for queries that follow the join with `GROUP` or 
`ORDER BY` on a prefix of the join key. This should be taken into account 
when choosing the join type.
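
As a rough illustration of the FRJ pattern for the case where the small table fits in memory (a sketch under that assumption, not Pig's implementation): each node builds an in-memory lookup table from its copy of the small relation and streams its fragment of the large relation past it. The function names and the `(key, value)` row layout are hypothetical.

```python
from collections import defaultdict


def build_lookup(small_rows):
    """Index the replicated small relation by join key."""
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)
    return table


def fragment_replicate_join(large_fragment, small_rows):
    """Join one fragment of the large relation against the small relation.

    The large fragment is streamed once and is never sorted or
    shuffled; only the small relation is held in memory.
    """
    lookup = build_lookup(small_rows)
    for key, large_value in large_fragment:
        for small_value in lookup.get(key, ()):
            yield (key, large_value, small_value)
```

Rows come out in the order of the large fragment, so the result is ordered by join key only if that fragment already was.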
  
@@ -40, +40 @@

  
  === Indexed Join (IJ) ===
  
- This join type takes advantage of the fact that one or more tables 
participating in the join have index on the group by key or its prefix. This is 
similar in structure to FRJ join but could be even more efficient since 
processing time can be proportional to the size of the non-indexed and 
hopefully smaller table.
+ This join type takes advantage of the fact that one or more tables 
participating in the join have index on the join key or its prefix. This is 
similar in structure to FRJ join but could be even more efficient since 
processing time can be proportional to the size of the non-indexed and 
hopefully smaller table.
  
  In Hadoop, this will also result in a map side join. 
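
The following sketch shows why the cost can track the size of the non-indexed table. It assumes, purely for illustration, that the index is a sorted list of keys searched with binary search; the page does not say what kind of index Pig would use, and the names below are hypothetical.

```python
from bisect import bisect_left


def index_lookup(index_keys, indexed_rows, key):
    """Return all rows of the indexed table whose key equals `key`."""
    pos = bisect_left(index_keys, key)
    matches = []
    while pos < len(index_keys) and index_keys[pos] == key:
        matches.append(indexed_rows[pos])
        pos += 1
    return matches


def indexed_join(probe_rows, indexed_rows):
    """Join by probing the index once per row of the non-indexed table.

    `indexed_rows` must be sorted by key; the extracted key list
    stands in for a real index in this sketch.
    """
    index_keys = [key for key, _ in indexed_rows]
    out = []
    for key, probe_value in probe_rows:
        for _, indexed_value in index_lookup(index_keys, indexed_rows, key):
            out.append((key, probe_value, indexed_value))
    return out
```

Only the probe rows are scanned in full; the indexed table is touched only where a probe key lands.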
  
@@ -67, +67 @@

  set optimizations.reorder 'off'
  }}}
  
- Also, a user should be able to specify a particular type of join to perform 
even if it contradicts with the choices that would be made by the optimizer. To 
support this we would need to extend the `JOIN` keyword to support outer joins 
and also to support JOIN type.
+ Also, a user should be able to specify a particular type of join to perform 
even if it contradicts the choices made by the optimizer. To support this we 
would need to extend the `JOIN` keyword to support outer joins and also to 
support Join type.
  
  {{{
  C = JOIN A by name, B by name USING <JOIN TYPE>;
