[ 
https://issues.apache.org/jira/browse/PIG-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913120#action_12913120
 ] 

Alan Gates commented on PIG-1633:
---------------------------------

This is a design decision we made when implementing nested foreach.  Each 
expression in the generate list has its own pipeline.  The advantage is that 
it was easy to implement.  The disadvantage is that certain operators (like 
your random function) are invoked multiple times.  This is inefficient, and 
for nondeterministic functions it also produces inconsistent results.  We 
could not think of any use cases where users would have nondeterministic 
functions, so we did not worry about that too much.  If you have a real use 
case we would be interested.
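
A rough Python sketch of the behaviour described above may help. It is not 
Pig's implementation, only a simulation under stated assumptions: the helper 
names (make_counting_udf, per_expression_pipelines, evaluate_once) are 
hypothetical, and randomint stands in for the reporter's RANDOMINT UDF.

```python
import random


def make_counting_udf(seed):
    """Return a stand-in for RANDOMINT plus a counter of how often it ran."""
    rng = random.Random(seed)
    calls = {"n": 0}

    def randomint(n):
        # Integer in [0, n), mirroring the reporter's description of RANDOMINT.
        calls["n"] += 1
        return rng.randrange(n)

    return randomint, calls


def per_expression_pipelines(rows, randomint):
    """Assumed model of nested foreach: every generate expression that
    mentions r re-runs the definition of r in its own pipeline."""
    out = []
    for data in rows:
        random_col = randomint(4)                # pipeline for `r as random`
        quarter = 1 if randomint(4) == 3 else 0  # pipeline for `(r == 3)?1:0`
        out.append((data, random_col, quarter))
    return out


def evaluate_once(rows, randomint):
    """Model of the two-step workaround: r is computed once per row and
    then reused by every later expression."""
    out = []
    for data in rows:
        r = randomint(4)
        out.append((data, r, 1 if r == 3 else 0))
    return out


rows = list("fghijklm")

split_udf, split_calls = make_counting_udf(seed=0)
split = per_expression_pipelines(rows, split_udf)

shared_udf, shared_calls = make_counting_udf(seed=0)
shared = evaluate_once(rows, shared_udf)

print("per-expression pipelines:", split, "UDF calls:", split_calls["n"])
print("evaluated once:          ", shared, "UDF calls:", shared_calls["n"])
```

In the first model the UDF runs twice per row, so the random column and the 
quarter column can disagree. In the second it runs once per row and the two 
columns are always consistent.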

> Using an alias within Nested Foreach causes indeterminate behaviour
> -------------------------------------------------------------------
>
>                 Key: PIG-1633
>                 URL: https://issues.apache.org/jira/browse/PIG-1633
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0, 0.5.0, 0.6.0, 0.7.0
>            Reporter: Viraj Bhat
>
> I have created a RANDOMINT function which generates random integers between 0 
> and the specified value (exclusive). For example, RANDOMINT(4) returns random 
> integers between 0 and 3 (inclusive).
> {code}
> $hadoop fs -cat rand.dat
> f
> g
> h
> i
> j
> k
> l
> m
> {code}
> The pig script is as follows:
> {code}
> register math.jar;
> A = load 'rand.dat' using PigStorage() as (data);
> B = foreach A {
>         r = math.RANDOMINT(4);
>         generate
>                 data,
>                 r as random,
>                 ((r == 3)?1:0) as quarter;
>         };
> dump B;
> {code}
> The results are as follows:
> {code}
> (f,0,0)
> (g,3,0)
> (h,0,0)
> (i,2,0)
> (j,3,0)
> (k,2,0)
> (l,0,1)
> (m,1,0)
> {code}
> Note the row (j,3,0): random is 3, yet quarter is 0. This happens because r is 
> defined in the foreach block but referenced twice in the generate clause, and 
> the two references produce different values.
> Rewriting the script as shown below solves the issue. The M/R jobs from both 
> scripts are the same, so the nested form is just a matter of convenience.
> {code}
> A = load 'rand.dat' using PigStorage() as (data);
> B = foreach A generate
>         data,
>         math.RANDOMINT(4) as r;
> C = foreach B generate
>         data,
>         r,
>         ((r == 3)?1:0) as quarter;
> dump C;
> {code}
> Is this issue related to PIG-747?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
