Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN: http://wiki.apache.org/pig/JoinFramework
------------------------------------------------------------------------------

= Join Framework =

== Objective ==

This document provides a comprehensive view of performing joins in Pig. By JOIN here we mean traditional inner/outer SQL joins, which in Pig are realized via COGROUP followed by flatten of the relations. Some of the approaches described in this document can also be applied to CROSS and GROUP as well.

== Joins ==

Currently, Pig running on top of Hadoop executes all joins in the same way. During the map stage, the data from each relation is annotated with the index of that relation. Then the data is sorted and partitioned by the join key and provided to the reducer. This is similar to SQL's hash join. In the next-generation Pig (currently on the types branch), the data from the same relation is guaranteed to be contiguous for the same key. This allows an optimization that keeps only N-1 relations in memory.
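To make the default join concrete, here is a rough Python sketch of the mechanism described above (an illustration only, not Pig's actual implementation): each tuple is tagged with its relation's index in the map stage, tuples are grouped by the join key, and the reducer emits the cross product of the per-relation groups, which is exactly COGROUP followed by flatten:

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(relations):
    """Sketch of Pig's default join. `relations` is a list of relations,
    each a list of (key, value) tuples."""
    # Map stage: annotate each tuple with the index of its relation.
    tagged = [(key, idx, value)
              for idx, rel in enumerate(relations)
              for key, value in rel]

    # Shuffle stage: partition/group by the join key, keeping one
    # bucket per relation (this is the COGROUP).
    groups = defaultdict(lambda: [[] for _ in relations])
    for key, idx, value in tagged:
        groups[key][idx].append(value)

    # Reduce stage: flatten the cogrouped buckets via a cross product.
    # An empty bucket yields no output, giving inner-join semantics.
    out = []
    for key, per_relation in groups.items():
        for combo in product(*per_relation):
            out.append((key,) + combo)
    return out
```

Note that the N-1 relations whose buckets are held in `per_relation` must fit in memory, which is why the contiguity guarantee mentioned above matters: with contiguous input, the last relation can be streamed instead of buffered.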
(Unfortunately, we did not see the expected speedup when this optimization was tried; investigation is still in progress.)

In some situations, more efficient join implementations can be constructed if more is known about the data of the relations. They are described in the sections below.

=== Pre-partitioned Join (PPJ) ===

This join type takes advantage of the fact that the data of all relations is already partitioned by the join key or its prefix, which means that the join can be done completely independently on separate nodes. It further helps if the data is sorted on the key; otherwise it might have to be sorted before the join. In the case of Hadoop, this means that the join can be done in a Map, avoiding the SORT/SHUFFLE/REDUCE stages. The performance would be even better if the partitions for the same key ranges were collocated on the same nodes and if the computation was scheduled to run on those nodes. However, for now this is outside of Pig's control. Note that GROUP can take advantage of this knowledge as well.

[Discussion of different data layout options.]

=== Fragment Replicate Join (FRJ) ===

This join type takes advantage of the fact that N-1 relations in the join are very small and can fit into the main memory of each node. In this case, the small tables can be copied onto all the nodes and joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. For Hadoop this means that the join can happen on the map side.
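The FRJ idea can be sketched in a few lines of Python (again an illustration, not Pig code): the small relation is replicated to every node and loaded into an in-memory hash table, and each map task then streams its fragment of the large relation through it, with no sort or shuffle of the large table:

```python
from collections import defaultdict

def fragment_replicate_join(large, small):
    """Sketch of FRJ: `small` fits in memory and is replicated to every
    node; each map task joins its fragment of `large` against it."""
    # Build an in-memory hash table from the replicated small relation.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)

    # Map side: stream the large relation; no sort/shuffle/reduce needed.
    for key, value in large:
        for match in lookup.get(key, []):
            yield (key, value, match)
```

Because the large relation is only streamed, the output order is whatever order the large table's fragments arrive in, which leads to the caveat below about downstream GROUP and ORDER BY.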
The data coming out of the join is not guaranteed to be sorted on the join key, which could cause problems for queries that follow the join with GROUP or ORDER BY on a prefix of the join key. This should be taken into account when choosing the join type. If you have several larger tables in the join that can't fit into memory, it might be beneficial to split the join to fit the FRJ pattern, since splitting would significantly reduce the size of the data going into the next join and might even allow FRJ to be used again.
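The splitting idea can be illustrated with a hedged Python sketch (invented data and a generic in-memory hash join standing in for each step): joining the two small relations first keeps the intermediate result small, so the final step against the large table can again be done in memory, FRJ-style:

```python
from collections import defaultdict

def hash_join(left, right):
    """Simple in-memory inner join on the first field of each tuple."""
    lookup = defaultdict(list)
    for key, *rest in right:
        lookup[key].append(tuple(rest))
    out = []
    for key, *rest in left:
        for match in lookup.get(key, []):
            out.append((key, *rest, *match))
    return out

# Joining the two small relations first keeps the intermediate small...
small_a = [(1, 'a'), (2, 'b')]
small_b = [(1, 'x'), (2, 'y'), (3, 'z')]
intermediate = hash_join(small_a, small_b)

# ...so it can be replicated and joined FRJ-style against the large one,
# instead of a three-way join that would sort and shuffle everything.
large = [(1, 'L%d' % i) for i in range(1000)] + [(2, 'M')]
result = hash_join(large, intermediate)
```

The payoff is that only the small intermediate result, not the large relation, ever needs to be held in memory or shipped between nodes.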