Hi Remi,
Thanks for your reply. I agree that "group makes no guarantee by contract":
the order of the results is not necessarily the same as the order of the input.
So we need to make some changes in
org.apache.pig.test.TestForEachNestedPlan.testInnerDistinct() and
org.apache.pig.test.TestForEachNestedPlan.testInnerOrderByAliasReuse(),
because those two tests check the result of group against the input order.
I have submitted PIG-4282_1.patch. Could anyone help review it? Many thanks.
TestForEachNestedPlan.testInnerDistinct(), line 219:
    List<Tuple> expectedResults =
        Util.getTuplesFromConstantTupleStrings(
            new String[] {"(10,68)", "(20,78)"});
    int counter = 0;
    // judges the result of group according to the input sequence
    while (iter.hasNext()) {
        assertEquals(expectedResults.get(counter++).toString(),
            iter.next().toString());
    }
    assertEquals(expectedResults.size(), counter);
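One possible way to make such a check independent of group order is to compare sorted copies of the expected and actual rows. The following is only a minimal sketch, assuming rows are compared by their string form as in the snippet above. The class name and the helpers sortedCopy and sameRows are hypothetical and are not part of Pig's test utilities or of PIG-4282_1.patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class OrderInsensitiveCompare {
    // Hypothetical helper: returns a sorted copy, leaving the input list untouched.
    static List<String> sortedCopy(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        Collections.sort(copy);
        return copy;
    }

    // Hypothetical helper: true when both lists hold the same rows
    // (duplicates included), regardless of the order they arrive in.
    static boolean sameRows(List<String> expected, List<String> actual) {
        return sortedCopy(expected).equals(sortedCopy(actual));
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("(10,68)", "(20,78)");
        List<String> sparkOrder = Arrays.asList("(20,78)", "(10,68)");
        System.out.println(sameRows(expected, sparkOrder));
    }
}
```

Sorting both sides keeps duplicate rows significant, which a set-based comparison would not.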
Best Regards
Zhang,Liyun
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Thursday, December 18, 2014 10:56 PM
To: [email protected]
Subject: RE: Is there any way to guarantee the sequence of "group" field as the
input when using "group" operator in pig
Hi all,
If you need any kind of ordering in the output, use the "sort" operator
(ORDER BY in Pig Latin). It was designed for such needs. The fact that
different engines produce differently ordered groups is due to each engine's
specific optimizations. If you ask Pig to re-order the groups, you remove any
benefit of those optimizations. I would rather keep group the way it is,
because I know I can rely on sort and pay its price when I need ordering, or
have the best speed when I don't need any specific ordering.
My conclusion is: group makes no guarantee by contract, so this is neither a
problem nor a bug. It is a misuse of "group" where "sort" was needed.
Regards,
Remi
-----Original Message-----
From: Zhang, Liyun [mailto:[email protected]]
Sent: Thursday, December 18, 2014 07:38
To: [email protected]
Subject: Is there any way to guarantee the sequence of "group" field as the
input when using "group" operator in pig
Hi all,
I ran into a problem: the "group" operator produces different results on
different engines such as "spark" and "mapreduce"
(PIG-4282 <https://issues.apache.org/jira/browse/PIG-4282>).
groupdistinct.pig:
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    generate group, MIN(E);
};
dump C;
input1.txt is:
10 89
20 78
10 68
10 89
20 92
The mapreduce output is:
(10,68),(20,78)
The spark output is:
(20,78),(10,68)
These two results differ because the order of the 'group' field is not the
same.
Is there any way to guarantee that the order of the "group" field follows the
input when using the "group" operator in Pig?
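To make the expected values concrete, here is a plain-Java sketch of what the script computes on input1.txt: group by age, take the distinct gpa values, and emit the minimum per group. It is only an illustration under stated assumptions and does not reflect how Pig, MapReduce, or Spark execute the plan. The class and method names are hypothetical. A sorted map is used here purely so that the key order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class GroupMinSketch {
    // Hypothetical helper: each row is {age, gpa}; returns "(age,minGpa)" per age.
    static List<String> minDistinctGpaByAge(int[][] rows) {
        // TreeMap iterates keys in ascending order; TreeSet drops duplicate gpa values.
        Map<Integer, TreeSet<Integer>> groups = new TreeMap<>();
        for (int[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new TreeSet<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, TreeSet<Integer>> e : groups.entrySet()) {
            // first() is the smallest element of the sorted set, i.e. MIN(E).
            out.add("(" + e.getKey() + "," + e.getValue().first() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] input = { {10, 89}, {20, 78}, {10, 68}, {10, 89}, {20, 92} };
        System.out.println(minDistinctGpaByAge(input));
    }
}
```

With a hash-based map instead of a sorted one, the per-group values would be the same but the iteration order of the groups would not be guaranteed, which is the situation described above.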
Best regards
Zhang,Liyun