Hi all,
I ran into a problem: the "group" operator returns its results in a different
order on different execution engines such as "spark" and "mapreduce"
(PIG-4282 <https://issues.apache.org/jira/browse/PIG-4282>).
groupdistinct.pig is:
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    generate group, MIN(E);
};
dump C;
input1.txt is:
10 89
20 78
10 68
10 89
20 92
The mapreduce output is:
(10,68),(20,78)
The spark output is:
(20,78),(10,68)
The two results contain the same tuples but in a different order, because the
values of the "group" field are not emitted in the same sequence.
Is there any way to guarantee that the "group" field keeps the order of the
input when using the "group" operator in Pig?
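For reference, here is a minimal sketch of the only workaround I can think of. It is not a fix for PIG-4282, and it sorts by key rather than preserving input order. Since a Pig relation is an unordered bag unless it is explicitly sorted, an "order ... by" right before the dump should make both engines print the same sequence. The aliases age, min_gpa and F are names introduced here for illustration only.

```pig
-- Same script as groupdistinct.pig, with named output fields and an explicit sort.
A = load 'input1.txt' as (age:int, gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    -- name the fields so the sort key can be referenced without the 'group' keyword
    generate group as age, MIN(E) as min_gpa;
};
-- explicit sort: the output order no longer depends on how the engine groups
F = order C by age;
dump F;
```

With the input1.txt above, this should print (10,68) and then (20,78) on either engine, at the cost of an extra sort.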
Best regards
Zhang,Liyun