Cool! can't wait until CDH has >0.7... Kinda surprised that nobody encountered this problem before... Can I file a ticket?
On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates <ga...@yahoo-inc.com> wrote: > > On Apr 16, 2010, at 9:37 AM, hc busy wrote: > > What scott noticed is present when the multiple column join key is used in >> a >> distributed setting. The trap is that when you unit test the behavior/PIG >> script and it does the join right in a local environment and then you get >> F'ed after u deploy to production in distributed enviro. >> > > In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid > that will fix these issues with development and deployment differences. > > Alan. > > > >> >> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <sc...@richrelevance.com >> >wrote: >> >> CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 >>> weeks >>> ago. >>> >>> Joins on multiple columns have null key values matching. >>> >>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, >>> f1:int, f2:int); >>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, >>> f1:int, f2:int); >>> --- both the above are the same >>> >>> dump IN; >>> (,1,2,3) >>> (,-5,5,5) >>> ( ,100,200,300) >>> ( ,0,200,300) >>> (a,4,5,6) >>> (a,7,8,9) >>> (b,10,11,12) >>> (b,11,11,12) >>> >>> IN_NULLS = FILTER IN BY ind is NULL; >>> dump IN_NULLS; >>> (,1,2,3) >>> (,-5,5,5) >>> >>> J1 = JOIN IN by (ind), IN2 by (ind); >>> dump J1; >>> ( ,0,200,300, ,0,200,300) >>> (a,4,5,6,a,4,5,6) >>> (a,4,5,6,a,7,8,9) >>> (a,7,8,9,a,4,5,6) >>> (a,7,8,9,a,7,8,9) >>> ( ,100,200,300, ,100,200,300) >>> (b,10,11,12,b,10,11,12) >>> (b,10,11,12,b,11,11,12) >>> (b,11,11,12,b,10,11,12) >>> (b,11,11,12,b,11,11,12) >>> >>> The above is the expected result of the self-join on the first column. >>> >>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts); >>> dump J2; >>> ( ,0,200,300, ,0,200,300) >>> ( ,100,200,300, ,100,200,300) >>> (a,4,5,6,a,4,5,6) >>> (a,7,8,9,a,7,8,9) >>> (b,10,11,12,b,10,11,12) >>> (b,11,11,12,b,11,11,12) >>> (,-5,5,5,,-5,5,5) >>> (,1,2,3,,1,2,3) >>> >>> >>> The above is incorrect, since it matched the rows that have NULL for the >>> ind field. >>> >>> There is a work-around, by explicitly filtering for null on the join >>> columns before the join, but the above still looks incorrect to me. >>> I suspect it is fixed in 0.6 or later, but I have not been able to find a >>> JIRA ticket or message on this list about this. >>> >>> >>> >>> >>> >