[ https://issues.apache.org/jira/browse/PIG-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-1743: ---------------------------- Attachment: relation2.in relation1.in Input data for testing > Skewed join sampler generates unevenly partitioned data > ------------------------------------------------------- > > Key: PIG-1743 > URL: https://issues.apache.org/jira/browse/PIG-1743 > Project: Pig > Issue Type: Bug > Components: impl > Affects Versions: 0.7.0, 0.8.0 > Reporter: Viraj Bhat > Attachments: relation1.in, relation2.in > > > I have a data, when using the Skewed join generated uneven partitions. The > script looks like this: > {code} > Data1 = LOAD '/user/viraj/relation1.in' AS (ref,intVal); > Data2 = LOAD '/user/viraj/relation2.in' using PigStorage('\u0001') AS > (ID:chararray, Key:chararray, DomainKey:chararray); > JoinData = JOIN Data1 BY ref LEFT OUTER , Data2 BY ID using 'skewed' PARALLEL > 10; > STORE JoinData into 'skewedoutput' using PigStorage('\u0001'); > {code} > The output generated has the following part files of varying sizes > {quote} > $ hadoop fs -ls /user/viraj/skewedoutput > Found 10 items > -rw------- 3 viraj users 2090 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00000 > -rw------- 3 viraj users 19380 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00001 > -rw------- 3 viraj users 2090 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00002 > -rw------- 3 viraj users 9690 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00003 > -rw------- 3 viraj users 2090 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00004 > -rw------- 3 viraj users 2090 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00005 > -rw------- 3 viraj users 0 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00006 > -rw------- 3 viraj users 0 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00007 > -rw------- 3 viraj users 0 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00008 > -rw------- 3 viraj users 0 2010-11-23 03:44 > /user/viraj/skewedoutput/part-r-00009 > {quote} > Attaching input datasets. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.