Yes, of course, my output is like that:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.
and when I put PARALLEL 1 in GROUP BY I get:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.

On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> Where exactly are you getting duplicates? I'm not sure I understand your
> question. Can you give an example please?
>
>
> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
> andronat_...@hotmail.com> wrote:
>
>> Hello everyone,
>>
>> I have a foreach statement, and inside it I use an order by. After the
>> order by, I have a UDF. An example looks like this:
>>
>>
>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>
>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>
>> service_flavors = FOREACH logs_g {
>>     t = ORDER logs BY status;
>>     GENERATE group.date as dates, group.site as site, group.profile as profile,
>>              FLATTEN(MY_UDF(t)) as (generic_status);
>> };
>>
>> The problem is that I get duplicate results. I know that MY_UDF is
>> running on mappers, but shouldn't each mapper take 1 group from logs_g?
>> Is something wrong with the order by? I tried to add PARALLEL to the
>> order by, but I get syntax errors...
>>
>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>> PARALLEL 1; but this is not a scalable solution. Can someone help me pls? I
>> am using Pig 0.11.
>>
>> Cheers,
>> Anastasis