[jira] Commented: (PIG-1327) Incorrect column pruning after multiple JOIN operations

Daniel Dai (JIRA) Wed, 31 Mar 2010 11:15:52 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852010#action_12852010
 ]


Daniel Dai commented on PIG-1327:
---------------------------------

Hi, Ankur,
I tested on current trunk. I see:

joinTopNData: {TopNData_stored::a: chararray,TopNData_stored::b: 
chararray,TopNData_stored::c: long,proj::InterimData::A::a: 
chararray,proj::InterimData::A::b: chararray,proj::InterimData::B::y: 
chararray,proj::InterimData::B::z: long,proj::C::e: chararray,proj::C::f: 
chararray}
2010-03-31 11:10:08,889 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for C
2010-03-31 11:10:08,889 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for C
2010-03-31 11:10:08,889 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-03-31 11:10:08,889 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for B
2010-03-31 11:10:08,890 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - Columns pruned for A: 
$2
2010-03-31 11:10:08,890 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for A
2010-03-31 11:10:08,890 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for 
TopNData_stored
2010-03-31 11:10:08,890 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for TopNData_stored

I used dummy data files: data1, data2, and data3 each contain the single line 
"1\t1\t1". The output has 9 columns, which matches the schema. Can you check it 
again?
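
For anyone repeating the check above, here is a minimal sketch in Python. It is 
not part of the original test; the helper names write_dummy_inputs and 
column_counts are made up for illustration. It writes the three one-line input 
files and counts tab-separated columns in a file. The name of the part file 
that Pig writes under 'output' depends on the setup, so the path is left to 
the caller.

```python
import os
import tempfile


def write_dummy_inputs(directory, names=("data1", "data2", "data3")):
    # Create each input file with one row of three tab-separated fields,
    # as described in the comment. Returns the paths that were written.
    paths = []
    for name in names:
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write("1\t1\t1\n")
        paths.append(path)
    return paths


def column_counts(path):
    # Return the set of distinct tab-separated column counts in a file.
    # Completely empty lines are skipped; rows made only of empty
    # fields (tabs) are still counted.
    counts = set()
    with open(path) as f:
        for line in f:
            row = line.rstrip("\n")
            if row:
                counts.add(len(row.split("\t")))
    return counts


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    for p in write_dummy_inputs(workdir):
        print(p, column_counts(p))
```

Running column_counts on the stored result of joinTopNData should give {9} if 
no column was pruned by mistake, since the described schema has 9 fields.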

> Incorrect column pruning after multiple JOIN operations
> -------------------------------------------------------
>
>                 Key: PIG-1327
>                 URL: https://issues.apache.org/jira/browse/PIG-1327
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Ankur
>
> In a script with multiple JOIN and GROUP operations, the column pruner 
> incorrectly removes some fields that it should keep. Here is a script 
> that demonstrates the issue:
> A = LOAD 'data1' USING PigStorage() AS (a:chararray, b:chararray, c:long);
> B = LOAD 'data2' USING PigStorage() AS (x:chararray, y:chararray, z:long);
> C = LOAD 'data3' using PigStorage() AS (d:chararray, e:chararray, 
> f:chararray);
> join1 = JOIN B by x, A by a;
> filtered1 = FILTER join1  BY y == b;
> InterimData = FOREACH filtered1 GENERATE a, b, c, y, z;
> join2 = JOIN InterimData BY b LEFT OUTER, C BY d  PARALLEL 2;
> proj = FOREACH join2 GENERATE a,b,y,z,e,f;
> TopNPrj = FOREACH proj GENERATE a, (( e is not null and e != '') ? e : 
> 'None') , z;
> TopNDataGrp = GROUP TopNPrj BY (a, e) PARALLEL 2;
> TopNDataSum = FOREACH TopNDataGrp GENERATE flatten(group) as (a, e), 
> SUM(TopNPrj.z) as views;
> TopNDataRegrp = GROUP TopNDataSum BY (a) PARALLEL 2;
> TopNDataCount = FOREACH TopNDataRegrp { OrderedData = ORDER TopNDataSum BY 
> views desc; LimitedData = LIMIT OrderedData 50; GENERATE LimitedData; }
> TopNData = FOREACH TopNDataCount GENERATE flatten($0) as (a, e, views);
> store TopNData into 'tmpTopN';
> TopNData_stored = load 'tmpTopN' as (a:chararray, b:chararray, c:long);
> joinTopNData = JOIN TopNData_stored BY (a,b) RIGHT OUTER, proj BY (a,b) 
> PARALLEL 2;
> describe joinTopNData;
> STORE  joinTopNData  INTO 'output';
> The column 'f' from relation 'C' participating in the 2nd JOIN is missing 
> from the final join output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
