[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785
 ] 

Olga Natkovich commented on PIG-1506:
-------------------------------------

This is what we need to document:

In the case of GROUP/COGROUP, records with a NULL key that come from the same 
input are grouped together. For instance:

Input data:

joe     5       2.5
sam             3.0
bob             3.5

Script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.
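
For readers who prefer code, here is a rough Python model of the behavior 
above. It is only a sketch of the semantics, not how Pig implements GROUP; 
None stands in for null, and the function name group_by is made up for 
illustration.

```python
from collections import defaultdict

# Rows of the 'small' input from the example; None stands in for a null field.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]


def group_by(rows, key_index):
    """Group rows by one field; rows whose key is None all share one group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


B = group_by(rows, 1)
for key, bag in B.items():
    print(key, bag)
```

Within a single input, None behaves like any other key value, so sam and bob 
end up in the same bag.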

However, NULL keys from different inputs are treated as different keys, so 
COGROUP produces a separate output tuple for the NULL key of each input. For 
instance:

Input: the same file as above, loaded twice (a self-cogroup).

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one 
that contains tuples from the first input (with no match from the second) and 
one the other way around.
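
One way to picture this is to tag every null key with the index of the input 
it came from. The Python sketch below models that idea; it is an assumption 
made for illustration, not a description of Pig's internals, and the function 
name cogroup is hypothetical.

```python
# None stands in for a null field.
A = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]
B = list(A)  # self-cogroup: the same data loaded a second time


def cogroup(inputs, key_index):
    """Cogroup several inputs on one field.

    A null key is tagged with the index of the input it came from, so null
    keys from different inputs never land in the same output tuple.
    """
    out = {}
    for i, rows in enumerate(inputs):
        for row in rows:
            key = row[key_index]
            slot = (key, i) if key is None else (key, None)
            # One bag per input for every distinct slot.
            bags = out.setdefault(slot, [[] for _ in inputs])
            bags[i].append(row)
    return [(key, *bags) for (key, _), bags in out.items()]


C = cogroup([A, B], 1)
for record in C:
    print(record)
```

The non-null key 5 yields one tuple with a bag from each input, while the null 
key yields two tuples, each with one empty bag.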

JOIN adds another interesting twist because it follows the SQL standard: by 
default JOIN is an inner join, which throws away all records with null keys.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples with a NULL key were filtered out.
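
The same result can be modeled in Python by treating a null key as equal to 
nothing, not even another null. This is a sketch of the semantics only, not 
Pig's implementation; the function name inner_join is made up for 
illustration.

```python
# None stands in for a null field.
A = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]
B = list(A)  # the same data loaded a second time


def inner_join(left, right, key_index):
    """Inner join on one field; a null key never matches, not even another null."""
    return [
        a + b
        for a in left
        for b in right
        if a[key_index] is not None and a[key_index] == b[key_index]
    ]


C = inner_join(A, B, 1)
for record in C:
    print(record)
```

Only joe has a non-null age, so only joe's record survives the join.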


> Need to clarify the difference between null handling in JOIN and COGROUP
> ------------------------------------------------------------------------
>
>                 Key: PIG-1506
>                 URL: https://issues.apache.org/jira/browse/PIG-1506
>             Project: Pig
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Olga Natkovich
>            Assignee: Corinne Chandel
>             Fix For: 0.8.0
>
>


