CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2 weeks ago.

Joins on multiple columns have null key values matching.

IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, 
f2:int);
IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, 
f1:int, f2:int);
--- both the above are the same

dump IN;
(,1,2,3)
(,-5,5,5)
( ,100,200,300)
(  ,0,200,300)
(a,4,5,6)
(a,7,8,9)
(b,10,11,12)
(b,11,11,12)

IN_NULLS = FILTER IN BY ind is NULL;
dump IN_NULLS;
(,1,2,3)
(,-5,5,5)

J1 = JOIN IN by (ind), IN2 by (ind);
dump J1;
(  ,0,200,300,  ,0,200,300)
(a,4,5,6,a,4,5,6)
(a,4,5,6,a,7,8,9)
(a,7,8,9,a,4,5,6)
(a,7,8,9,a,7,8,9)
( ,100,200,300, ,100,200,300)
(b,10,11,12,b,10,11,12)
(b,10,11,12,b,11,11,12)
(b,11,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)

The above is the expected result of the self-join on the first column.

J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
dump J2;
(  ,0,200,300,  ,0,200,300)
( ,100,200,300, ,100,200,300)
(a,4,5,6,a,4,5,6)
(a,7,8,9,a,7,8,9)
(b,10,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)
(,-5,5,5,,-5,5,5)
(,1,2,3,,1,2,3)


The above is incorrect, since it matched the rows that have NULL for the ind 
field.

There is a work-around, by explicitly filtering for null on the join columns 
before the join, but the above still looks incorrect to me.
I suspect it is fixed in 0.6 or later, but I have not been able to find a JIRA 
ticket or message on this list about this.



Reply via email to