Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

------------------------------------------------------------------------------
  = Join Framework =
  
- == Objective == 
+ == Objective ==
  
- This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer =SQL= joins which in Pig are 
realized via =COGROUP= followed by =flatten= of the relations.
+ This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer SQL joins which in Pig are realized 
via =COGROUP= followed by flatten of the relations.
  
- Some of the approaches described in this document can also be applied to 
=CROSS= and =GROUP= as well. 
+ Some of the approaches described in this document can also be applied to 
CROSS and GROUP as well. 
  
  == Joins ==
  
- Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's =hash join=. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep =N-1= relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 
progress.)
+ Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's hash join. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep N-1 relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 
progress.)
  
  In some situations, more efficient join implementations can be constructed if 
more is known about the data of the relations. They are described in the 
section.
  
- === Pre-partitioned Join (PPJ) === 
+ === Pre-partitioned Join (PPJ) ===
  
  This join type takes advantage of the fact that the data of all relations is 
already partition by the join key or its prefix which means that the join can 
be done completely independently on separate nodes. It further helps if the 
data is sorted on the key; otherwise it might have to get sorted before the 
join.
  
- In the case of =Hadoop=, this means that the join can be done in a =Map= 
avoiding =SORT/SHUFFLE/REDUCE= stages. The performance would be even better if 
the partitions for the same key ranges were collocated on the same nodes and if 
the computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.
+ In the case of Hadoop, this means that the join can be done in a Map avoiding 
SORT/SHUFFLE/REDUCE stages. The performance would be even better if the 
partitions for the same key ranges were collocated on the same nodes and if the 
computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.
  
  Note that GROUP can take advantage of this knowledge as well.
  
  [Discussion of different data layout options.]
  
- === Fragment Replicate Join (FRJ) === 
+ === Fragment Replicate Join (FRJ) ===
  
  This join type takes advantage of the fact that N-1 relations in the join are 
very small and can fit into main memory of each node. In this case, the small 
tables can be copied onto all the nodes and be joined with the data from the 
larger table. This saves the cost of sorting and partitioning the large table. 
  For Hadoop this means that the join can happen on the map side. 
  
- The data coming out of the join is not guaranteed to be sorted on the join 
key which could cause problems for queries that follow join by =GROUP= or 
=ORDER BY= on the prefix of the join key. This should be taken into account 
when choosing join type.
+ The data coming out of the join is not guaranteed to be sorted on the join 
key which could cause problems for queries that follow join by GROUP or ORDER 
BY on the prefix of the join key. This should be taken into account when 
choosing join type.
  
  If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.
  

Reply via email to