Hello everyone,

I have a FOREACH statement with a nested ORDER BY, and I pass the ordered bag
to a UDF. For example:


logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();

logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;

service_flavors = FOREACH logs_g {
        t = ORDER logs BY status;
        GENERATE group.date as dates, group.site as site,
                 group.profile as profile,
                 FLATTEN(MY_UDF(t)) as (generic_status);
};

The problem is that I get duplicate results. I know that MY_UDF is running on
mappers, but shouldn't each mapper take one group from logs_g? Is something
wrong with the ORDER BY? I tried adding PARALLEL to the ORDER BY, but I get
syntax errors.

The problem goes away if I use GROUP logs BY (date, site, profile) PARALLEL 1,
but that is not a scalable solution. Can someone help me, please? I am using
Pig 0.11.

Cheers,
Anastasis
