Where exactly are you getting duplicates? I'm not sure I understand your question. Can you give an example please?
On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:

> Hello everyone,
>
> I have a foreach statement and inside of it, I use an order by. After the
> order by, I have a UDF. Example like this:
>
> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>
> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>
> service_flavors = FOREACH logs_g {
>     t = ORDER logs BY status;
>     GENERATE group.date as dates, group.site as site, group.profile as profile,
>              FLATTEN(MY_UDF(t)) as (generic_status);
> };
>
> The problem is that I get duplicate results.. I know that MY_UDF is
> running on mappers, but shouldn't each mapper take 1 group from the logs_g?
> Is something wrong with order by? I tried to add order by parallel but I
> get syntax errors...
>
> My problem is resolved if I put GROUP logs BY (date, site, profile)
> PARALLEL 1; But this is not a scalable solution. Can someone help me pls? I
> am using pig 0.11
>
> Cheers,
> Anastasis