[ https://issues.apache.org/jira/browse/PIG-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852010#action_12852010 ]
Daniel Dai commented on PIG-1327:
---------------------------------

Hi, Ankur,
I tested on current trunk. I see:

joinTopNData: {TopNData_stored::a: chararray,TopNData_stored::b: chararray,TopNData_stored::c: long,proj::InterimData::A::a: chararray,proj::InterimData::A::b: chararray,proj::InterimData::B::y: chararray,proj::InterimData::B::z: long,proj::C::e: chararray,proj::C::f: chararray}
2010-03-31 11:10:08,889 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for C
2010-03-31 11:10:08,889 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for C
2010-03-31 11:10:08,889 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-03-31 11:10:08,889 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for B
2010-03-31 11:10:08,890 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - Columns pruned for A: $2
2010-03-31 11:10:08,890 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for A
2010-03-31 11:10:08,890 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for TopNData_stored
2010-03-31 11:10:08,890 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for TopNData_stored

I used dummy data files: data1, data2, and data3 each contain the single line "1\t1\t1". The output has 9 columns, which matches the schema. Can you check it again?

> Incorrect column pruning after multiple JOIN operations
> -------------------------------------------------------
>
>                 Key: PIG-1327
>                 URL: https://issues.apache.org/jira/browse/PIG-1327
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Ankur
>
> In a script with multiple JOIN and GROUP operations, the column pruner
> incorrectly removes some of the fields that it shouldn't.
> Here is a script that demonstrates the issue:
>
> A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
> B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
> C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, f:chararray);
> join1 = JOIN B by x, A by a;
> filtered1 = FILTER join1 BY y == b;
> InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
> join2 = JOIN InterimData BY b LEFT OUTER, C BY d PARALLEL 2;
> proj = FOREACH join2 GENERATE a,b,y,z,e,f;
> TopNPrj = FOREACH proj GENERATE a, (( e is not null and e != '') ? e : 'None') , z;
> TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
> TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), SUM(TopNPrj.z) as views;
> TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
> TopNDataCount = FOREACH TopNDataRegrp { OrderedData = ORDER TopNDataSum BY views desc; LimitedData = LIMIT OrderedData 50; GENERATE LimitedData; }
> TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
> store TopNData into 'tmpTopN';
> TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
> joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) PARALLEL 2;
> describe joinTopNData;
> STORE joinTopNData INTO 'output';
>
> The column 'f' from relation 'C' participating in the 2nd JOIN is missing
> from the final join output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.