Hi again, I added this in my UDF:

    if (!((DataBag) input.get(0)).isSorted()) {
        throw new IOException("It's not sorted");
    }

And the exception is thrown. Why? I don't understand it. I specified ORDER BY in the nested foreach.

Thank you for helping me, by the way!

On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> No... that wouldn't be related since you're not doing a GROUP ALL.
>
> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
> wrong in your UDF. The output of your UDF is going to be a string that is
> some generic status, right? My uneducated guess is that there's a bug in
> your UDF. To confirm, do you get the correct result if you replace your UDF
> with an out-of-the-box one, e.g. COUNT?
>
>
> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:
>
>> BTW, is this somehow related [1]?
>>
>> [1]:
>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
>>
>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:
>>
>>> Yes, of course, my output is like that:
>>>
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>>
>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>>
>>>
>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>>
>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>> question. Can you give an example please?
>>>>
>>>>
>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have a foreach statement and inside of it, I use an order by. After the
>>>>> order by, I have a UDF. Example like this:
>>>>>
>>>>>
>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>
>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>
>>>>> service_flavors = FOREACH logs_g {
>>>>>     t = ORDER logs BY status;
>>>>>     GENERATE group.date as dates, group.site as site, group.profile as profile,
>>>>>         FLATTEN(MY_UDF(t)) as (generic_status);
>>>>> };
>>>>>
>>>>> The problem is that I get duplicate results.. I know that MY_UDF is
>>>>> running on mappers, but shouldn't each mapper take 1 group from the logs_g?
>>>>> Is something wrong with order by? I tried to add order by parallel but I
>>>>> get syntax errors...
>>>>>
>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>> PARALLEL 1; But this is not a scalable solution. Can someone help me pls? I
>>>>> am using pig 0.11
>>>>>
>>>>> Cheers,
>>>>> Anastasis
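
A note on the isSorted() check at the top of this message: one way to test the ordering itself, rather than relying on what the bag reports about itself, is to walk the bag and compare adjacent tuples. This is only a sketch. It assumes (not verified against Pig 0.11) that isSorted() describes the bag implementation rather than the actual order of its contents. OrderCheck and isOrdered are hypothetical names, not part of Pig, and the code uses plain Java types so it runs without Pig on the classpath. Since DataBag is an Iterable<Tuple>, a bag could be passed in the same way.

```java
import java.util.Comparator;
import java.util.Iterator;

// Hypothetical helper, not part of Pig: checks ordering by iterating
// over the elements instead of asking the container whether it is sorted.
final class OrderCheck {
    private OrderCheck() {}

    // Returns true if no element compares smaller than its predecessor.
    // An empty or single-element input counts as ordered.
    static <T> boolean isOrdered(Iterable<T> items, Comparator<? super T> cmp) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return true;
        }
        T prev = it.next();
        while (it.hasNext()) {
            T cur = it.next();
            if (cmp.compare(prev, cur) > 0) {
                return false; // found a pair that is out of order
            }
            prev = cur;
        }
        return true;
    }
}
```

Inside the UDF, the comparator would compare the status field of two tuples. Which field index that is depends on the schema of logs, so it is left out here.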