Re: [jira] Updated: (PIG-1309) Map-side Cogroup

Mridul Muralidharan Fri, 03 Sep 2010 04:29:57 -0700


Condition (1) refers only to explicit (user-specified) statements, right?
Not to the implicit project introduced by Pig to conform to the schema?



Regards,
Mridul


On Saturday 21 August 2010 12:59 AM, Ashutosh Chauhan (JIRA) wrote:


      [ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:
----------------------------------

     Release Note:
With this patch, it is now possible to perform a map-side cogroup if the data 
is sorted and the loaders implement certain interfaces. The primary algorithm 
is based on sort-merge join, with additional restrictions.

Following preconditions must be met to use this feature:
1) No other operations can be done between load and cogroup statements.
2) Data must be sorted on join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if data contains null 
keys, they should occur before anything else.
4) Left-most loader must implement {CollectableLoader} interface as well as 
{OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well.

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
C = COGROUP A by id, B by id using 'merge';
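
For readers who want to see why the sort-order preconditions matter, here is a minimal Python sketch of a sort-merge style cogroup over two inputs. It is not the code from the patch and does not use Pig's loader interfaces; the names `merge_cogroup` and `sort_key` are hypothetical. It assumes both inputs are already sorted ascending on the key with nulls (None) first, as in conditions 2 and 3, and it simply puts null keys from both inputs into one group, which is a simplification and may not match Pig's null handling.

```python
from itertools import groupby


def sort_key(key):
    # Hypothetical helper: nulls (None) order before every non-null key.
    return (0, 0) if key is None else (1, key)


def merge_cogroup(left, right, key=lambda row: row[0]):
    """Yield (key, left_bag, right_bag) in a single pass.

    Both inputs must already be sorted ascending on the key, nulls first.
    Only one group per input is held in memory at a time.
    """
    lgroups = ((k, list(g)) for k, g in groupby(left, key))
    rgroups = ((k, list(g)) for k, g in groupby(right, key))
    l = next(lgroups, None)
    r = next(rgroups, None)
    while l is not None or r is not None:
        if r is None or (l is not None and sort_key(l[0]) < sort_key(r[0])):
            # Key present only on the left side (so far): empty right bag.
            yield l[0], l[1], []
            l = next(lgroups, None)
        elif l is None or sort_key(r[0]) < sort_key(l[0]):
            # Key present only on the right side: empty left bag.
            yield r[0], [], r[1]
            r = next(rgroups, None)
        else:
            # Same key on both sides: emit both bags together.
            yield l[0], l[1], r[1]
            l = next(lgroups, None)
            r = next(rgroups, None)


if __name__ == "__main__":
    A = [(None, "a0"), (1, "a1"), (1, "a2"), (3, "a3")]
    B = [(1, "b1"), (2, "b2"), (3, "b3")]
    for group in merge_cogroup(A, B):
        print(group)
```

Because each input is read once, front to back, the grouping needs no shuffle or reduce phase. If either input were unsorted, the same key could show up in more than one output group, which is why the sort order is a hard precondition.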


   was:
With this patch, it is now possible to perform a map-side cogroup if the data 
is sorted and the loaders implement certain interfaces. The primary algorithm 
is based on sort-merge join, with additional restrictions.

Following preconditions must be met to use this feature:
1) No other operations can be done between load and join statements.
2) Data must be sorted on join keys for all tables in ASC order.
3) Nulls are considered smaller than everything. So, if data contains null 
keys, they should occur before anything else.
4) Left-most loader must implement {CollectableLoader} interface as well as 
{OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box.

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well.

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
C = COGROUP A by id, B by id using 'merge';

Map-side Cogroup
----------------

                 Key: PIG-1309
                 URL: https://issues.apache.org/jira/browse/PIG-1309
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Ashutosh Chauhan
            Assignee: Ashutosh Chauhan
             Fix For: 0.7.0, 0.8.0

         Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
PIG_1309_7.patch


In the never-ending quest to make Pig go faster, we want to parallelize as many 
relational operations as possible. It's already possible to do Group-by 
(PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to 
add a map-side implementation of Cogroup in Pig. Details to follow.
