BTW, is this somehow related[1]?

[1]: 
http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
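For anyone comparing with that thread, the pattern in question boils down to a nested sort inside a FOREACH. A minimal sketch (relation and field names mirror the quoted script below; MY_UDF is a placeholder for the custom eval function):

```pig
logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();

-- PARALLEL is only accepted on operators that force a reduce phase
-- (GROUP, COGROUP, ORDER, JOIN, DISTINCT, CROSS), not on a nested
-- ORDER inside a FOREACH block -- hence the syntax error reported below.
logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;

service_flavors = FOREACH logs_g {
    -- nested sort: applied per group, on the reducer that owns the group
    t = ORDER logs BY status;
    GENERATE group.date AS date, group.site AS site, group.profile AS profile,
             FLATTEN(MY_UDF(t)) AS (generic_status);
};
```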

On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:

> Yes, of course, my output is like that:
> 
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> .
> .
> .
> 
> and when I put PARALLEL 1 in GROUP BY I get:
> 
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> .
> .
> .
> 
> 
> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> 
>> Where exactly are you getting duplicates? I'm not sure I understand your
>> question. Can you give an example please?
>> 
>> 
>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>> andronat_...@hotmail.com> wrote:
>> 
>>> Hello everyone,
>>> 
>>> I have a FOREACH statement, and inside it I use an ORDER BY. After the
>>> ORDER BY, I call a UDF. For example:
>>> 
>>> 
>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>> 
>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>> 
>>> service_flavors = FOREACH logs_g {
>>>     t = ORDER logs BY status;
>>>     GENERATE group.date AS dates, group.site AS site, group.profile AS profile,
>>>              FLATTEN(MY_UDF(t)) AS (generic_status);
>>> };
>>> 
>>> The problem is that I get duplicate results. I know that MY_UDF is
>>> running on mappers, but shouldn't each mapper take one group from logs_g?
>>> Is something wrong with the ORDER BY? I tried adding PARALLEL to the
>>> ORDER BY, but I get syntax errors...
>>>
>>> My problem is resolved if I use GROUP logs BY (date, site, profile)
>>> PARALLEL 1; but that is not a scalable solution. Can someone help me,
>>> please? I am using Pig 0.11.
>>> 
>>> Cheers,
>>> Anastasis
> 
