Yes, of course. My output looks like this:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.

When I put PARALLEL 1 on the GROUP BY, I get:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.


On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> Where exactly are you getting duplicates? I'm not sure I understand your
> question. Can you give an example please?
> 
> 
> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
> andronat_...@hotmail.com> wrote:
> 
>> Hello everyone,
>> 
>> I have a FOREACH statement with a nested ORDER BY inside it. After the
>> ORDER BY, I call a UDF. For example:
>> 
>> 
>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>> 
>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>> 
>> service_flavors = FOREACH logs_g {
>>        t = ORDER logs BY status;
>>        GENERATE group.date as dates, group.site as site,
>>                 group.profile as profile,
>>                 FLATTEN(MY_UDF(t)) as (generic_status);
>> };
>> 
>> The problem is that I get duplicate results. I know that MY_UDF is
>> running on mappers, but shouldn't each mapper take 1 group from logs_g?
>> Is something wrong with the ORDER BY? I tried to add PARALLEL to the
>> ORDER BY, but I get syntax errors...
>> 
>> My problem is resolved if I use GROUP logs BY (date, site, profile)
>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
>> please? I am using Pig 0.11.
>> 
>> Cheers,
>> Anastasis
