Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN: http://wiki.apache.org/pig/JoinFramework ------------------------------------------------------------------------------ This join type takes advantage of the fact that the data of all relations is already partition by the join key or its prefix which means that the join can be done completely independently on separate nodes. It further helps if the data is sorted on the key; otherwise it might have to get sorted before the join. - Also if some but not other tables are partitioned on the join key; the other table can be shuffled before the join. + Also if some but not other tables are partitioned on the join key, the unpartitioned tables can be shuffled before the join. - In the case of Hadoop, this means that the join can be done in a Map avoiding SORT/SHUFFLE/REDUCE stages. The performance would be even better if the partitions for the same key ranges were collocated on the same nodes and if the computation was scheduled to run on this nodes. However, for now this is outside of Pig's control. + In the case of Hadoop, this type of join means that the join can be done in a Map avoiding SORT/SHUFFLE/REDUCE stages. The performance would be even better if the partitions for the same key ranges were co-located on the same nodes and if the computation was scheduled to run on this nodes. However, for now this is outside of Pig's control. Note that `GROUP` can take advantage of this knowledge as well. @@ -28, +28 @@ === Fragment Replicate Join (FRJ) === - This join type takes advantage of the fact that N-1 relations in the join are small. In this case, the small tables can be copied onto all the nodes and be joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. The performance benefits can be even greater if small tables fit into main memory; otherwise both the small tables and the partition of the large need to be sorted which is still better than having to shuffle the large table. + This join type takes advantage of the fact that N-1 relations in the join are small. In this case, the small tables can be copied onto all the nodes and be joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. The performance benefits can be even greater if small tables fit into main memory; otherwise, both the small tables and the partition of the large need to be sorted which is still better than having to shuffle the large table. - If you have several larger tables in the join that can't fit into memory, it might be beneficial to split the join to fit FRJ pattern since it would significantly reduce the size of the data going into the next join and might even allow to use FRJ again. + If you have several larger tables in the join, it might be beneficial to split the join to fit FRJ pattern since it would significantly reduce the size of the data going into the next join and might even allow to use FRJ again. - For Hadoop this means that the join can happen on the map side. + For Hadoop this type of join means that the join can happen on the map side. The data coming out of the join is not guaranteed to be sorted on the join key which could cause problems for queries that follow join by `GROUP` or `ORDER BY` on the prefix of the join key. This should be taken into account when choosing join type. @@ -40, +40 @@ === Indexed Join (IJ) === - This join type takes advantage of the fact that one or more tables participating in the join have index on the group by key or its prefix. This is similar in structure to FRJ join but could be even more efficient since processing time can be proportional to the size of the non-indexed and hopefully smaller table. + This join type takes advantage of the fact that one or more tables participating in the join have index on the join key or its prefix. This is similar in structure to FRJ join but could be even more efficient since processing time can be proportional to the size of the non-indexed and hopefully smaller table. In Hadoop, this will also result in a map side join. @@ -67, +67 @@ set optimizations.reorder 'off' }}} - Also, a user should be able to specify a particular type of join to perform even if it contradicts with the choices that would be made by the optimizer. To support this we would need to extend the `JOIN` keyword to support outer joins and also to support JOIN type. + Also, a user should be able to specify a particular type of join to perform even if it contradicts the choices made by the optimizer. To support this we would need to extend the `JOIN` keyword to support outer joins and also to support Join type. {{{ C = JOIN A by name, B by name USING <JOIN TYPE>;