Hello everyone, I have a FOREACH statement with a nested ORDER BY inside it, and I pass the ordered bag to a UDF. For example:
logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
service_flavors = FOREACH logs_g {
    t = ORDER logs BY status;
    GENERATE group.date as dates, group.site as site, group.profile as profile,
             FLATTEN(MY_UDF(t)) as (generic_status);
};

The problem is that I get duplicate results. I know that MY_UDF is running on mappers, but shouldn't each mapper take one group from logs_g? Is something wrong with the ORDER BY? I tried adding PARALLEL to the ORDER BY, but I get syntax errors.

The problem goes away if I use GROUP logs BY (date, site, profile) PARALLEL 1; but that is not a scalable solution. Can someone help me, please?

I am using Pig 0.11.

Cheers,
Anastasis