[ 
https://issues.apache.org/jira/browse/PIG-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846222#action_12846222
 ] 

Daniel Dai commented on PIG-1289:
---------------------------------

Yes, it is safe not to push a filter up a branch that will be producing nulls.
I might be wrong, but what I did is try to be a little more aggressive. The
only extra values an outer join produces are nulls, so if the filter does not
test for null, we can still push it up even when it is on the inner branch.

E.g.:
A = load 'foo' as (q, r, s);
B = load 'bar' as (t, u, v);
C = join A by q left outer, B by t;
D = filter C by t > 0;

The output of C consists of two parts:
A + B
A + "null"

If we filter after the join, the result is a union over these two parts:
filter(A + B) union filter(A + "null")

If the filter does not test for null (e.g., t > 0), then filter(A + "null")
produces no rows, because a comparison against null is never true, so
filter(A + B) union filter(A + "null") = filter(A + B)

In this case, the outer join is equivalent to a regular join (since all
generated null B records are filtered away), so we can still push the filter up.
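
The argument above can be checked on toy data. What follows is only a sketch
in Python, not Pig code and not the patch itself: the helper names
(left_outer_join, inner_join, t_gt_0) and the sample rows are made up for
illustration, and null is modelled as None with the assumption that a
comparison against null never passes a filter.

```python
def left_outer_join(a_rows, b_rows, b_width):
    """Join on the first field; unmatched A rows are padded with nulls."""
    out = []
    for a in a_rows:
        matches = [b for b in b_rows if b[0] == a[0]]
        if matches:
            out.extend(a + b for b in matches)
        else:
            out.append(a + (None,) * b_width)
    return out


def inner_join(a_rows, b_rows):
    """Regular join on the first field; unmatched rows are dropped."""
    return [a + b for a in a_rows for b in b_rows if b[0] == a[0]]


def t_gt_0(t):
    """Models 't > 0': a null operand never satisfies the comparison."""
    return t is not None and t > 0


# Hypothetical sample data: A is (q, r, s), B is (t, u, v).
A = [(1, 'a', 'b'), (2, 'c', 'd'), (3, 'e', 'f'), (-4, 'g', 'h')]
B = [(1, 'x', 'y'), (3, 'z', 'w'), (-4, 'p', 'q')]

# Filter applied after the outer join (t is field index 3 of the joined row).
filter_after_join = [row for row in left_outer_join(A, B, 3) if t_gt_0(row[3])]

# Filter pushed up onto B, followed by a regular join.
pushed_filter = inner_join(A, [b for b in B if t_gt_0(b[0])])

assert filter_after_join == pushed_filter

# A filter that tests for null is NOT safe to push this way.
null_after_join = [row for row in left_outer_join(A, B, 3) if row[3] is None]
null_pushed = inner_join(A, [b for b in B if b[0] is None])

assert null_after_join != null_pushed
```

The second pair of assertions covers the case from the issue description: a
filter such as "f3 is null" depends on the null rows the outer join
generates, so it has to stay after the join.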

> PIG Join fails while doing a filter on joined data
> --------------------------------------------------
>
>                 Key: PIG-1289
>                 URL: https://issues.apache.org/jira/browse/PIG-1289
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Karim Saadah
>            Assignee: Daniel Dai
>            Priority: Minor
>             Fix For: 0.7.0
>
>         Attachments: PIG-1289-1.patch
>
>
> PIG Join fails while doing a filter on joined data
> Here are the steps to reproduce it:
> -bash-3.1$ pig -latest -x local
> grunt> a = load 'first.dat' using PigStorage('\u0001') as (f1:int, 
> f2:chararray);
> grunt> DUMP a;
> (1,A)
> (2,B)
> (3,C)
> (4,D)
> grunt> b = load 'second.dat' using PigStorage() as (f3:chararray);
> grunt> DUMP b;
> (A)
> (D)
> (E)
> grunt> c = join a by f2 LEFT OUTER, b by f3;
> grunt> DUMP c;
> (1,A,A)
> (2,B,)
> (3,C,)
> (4,D,D)
> grunt> describe c;
> c: {a::f1: int,a::f2: chararray,b::f3: chararray}
> grunt> d = filter c by (f3 is null or f3 =='');
> grunt> dump d;
> 2010-03-03 15:00:37,129 [main] INFO  
> org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned 
> for b
> 2010-03-03 15:00:37,129 [main] INFO  
> org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
> for b
> 2010-03-03 15:00:37,129 [main] INFO  
> org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned 
> for a
> 2010-03-03 15:00:37,130 [main] INFO  
> org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
> for a
> 2010-03-03 15:00:37,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1002: Unable to store alias d
> This one is failing too:
> grunt> d = filter c by (b::f3 is null or b::f3 =='');
> or this one not returning results as expected:
> grunt> d = foreach c generate f1 as f1, f2 as f2, f3 as f3;
> grunt> e = filter d by (f3 is null or f3 =='');
> grunt> DUMP e;
> (1,A,)
> (2,B,)
> (3,C,)
> (4,D,)
> while the expected result is
> (2,B,)
> (3,C,)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
