[ 
https://issues.apache.org/jira/browse/PIG-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846087#action_12846087
 ] 

Daniel Dai commented on PIG-1301:
---------------------------------

Thanks for reporting. I tried the script on trunk. Seems on trunk we have fixed 
that. The code you mentioned do have the problem you mentioned. But in the 
trunk, we already changed the code to:

{code}
boolean anyPruned = false;
for (LOProject loProject : projectFinder.getProjectSet()) {
    Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0, 
loProject.getCol());
    if (columns.contains(pair)) {
        anyPruned = true;
        break;
    }
}
{code}

The fix will come with next Pig release. 

> Problem pruning columns with UDF
> --------------------------------
>
>                 Key: PIG-1301
>                 URL: https://issues.apache.org/jira/browse/PIG-1301
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Andrew Groh
>             Fix For: 0.7.0
>
>
> I just upgraded to pig 0.6.0.
> I have a pig file like
> raw = load 'foo.csv' using PigStorage() as (field1:chararray, 
> field2:chararray);
> define contains com.mycompany.pig.Contains();
> rawactions = foreach raw generate contains(field1, field2) as junk,  field1;
> reqcnt = foreach rawactions generate field1;
> dump reqcnt
> When I try to run this code, I get an error:
> Problem with input: (Name: Project 1-40 Projections: [1] Overloaded: false 
> Operator Key: 1-40) of User-defined function: (Name: UserFunc 1-39 function: 
> com.mycompany.pig.Contains Operator Key: 1-39)
> Thrown from line 98 of LOUserFunction.java
> This was caused by another FrontEndException 
> Attempt to access field: 1 from schema: {field1: chararray}
> from Schema.java
> I also investigated changing the pig code
> if you change
> rawactions = foreach raw generate contains(field1, field2) as junk,  field1;
> to either
> rawactions = foreach raw generate contains(field2, field2) as junk,  field1;
> or
> rawactions = foreach raw generate contains(field2, field2) as junk,  field1;
> or if you change
> reqcnt = foreach rawactions generate field1;
> to
> reqcnt = foreach rawactions generate field1, junk;
> It all works correctly.
> The problem appears to be that it prunes out field2, but then gets confused 
> and does not prune out the plan associated with the UDF contains, since 
> field1 is not pruned.  So if the UDF only references field2 it will get 
> removed, if it only references field1 the field will have not been pruned and 
> it can run.
> I eventually tracked this down to the code around 947 of LOForEach.java
>             for (LOProject loProject : projectFinder.getProjectSet()) {
>                 Pair<Integer, Integer> pair = new Pair<Integer, Integer>(0,
>                         loProject.getCol());
>                 if (!columns.contains(pair)) {
>                     allPruned = false;
>                     break;
>                 }
>             }
>             if (allPruned) {
>                 planToRemove.add(i);
>             }
> In the example pig, allPruned is false for the plan associated the UDF.  This 
> is because field1 is both a column for the UDF and for the ForEach in 
> general.  Since field1 is not pruned, the plan is not removed and bad things 
> happen later.
> I don't really understand the pruning code all that well, so I don't have a 
> fix for it.  I hope that it will be clear to someone who understands this 
> code better.  I can provide a better test case for this if necessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to