I found the problem. I used some private variables in my UDF class, thinking that Pig would create a new object of my class for every tuple it gets. But that is not the case, of course.
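For anyone who runs into the same thing, here is a minimal sketch of the pitfall in plain Java. It does not use the Pig API and it is not my actual UDF: LeakyUdf and CleanUdf are hypothetical stand-ins for an EvalFunc subclass, and the calls in main play the role of Pig invoking exec repeatedly on one shared instance. The assumption is only what is described above, namely that one object handles many input tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in classes; none of these names come from Pig.
class StatefulUdfSketch {

    // Buggy variant: state lives in a private field, so it survives
    // from one exec() call to the next on the same instance.
    static class LeakyUdf {
        private final List<String> seen = new ArrayList<>();

        List<String> exec(List<String> bag) {
            seen.addAll(bag);              // never cleared between calls
            return new ArrayList<>(seen);  // also contains earlier groups
        }
    }

    // Fixed variant: all per-call state is local to exec().
    static class CleanUdf {
        List<String> exec(List<String> bag) {
            List<String> seen = new ArrayList<>(bag);
            return seen;
        }
    }

    public static void main(String[] args) {
        List<String> group1 = Arrays.asList("CREAM-CE", "SRMv2");
        List<String> group2 = Arrays.asList("CREAM-CE");

        // One instance is reused for both groups, as assumed above.
        LeakyUdf leaky = new LeakyUdf();
        leaky.exec(group1);
        System.out.println("leaky: " + leaky.exec(group2)); // 3 entries, 2 are stale

        CleanUdf clean = new CleanUdf();
        clean.exec(group1);
        System.out.println("clean: " + clean.exec(group2)); // 1 entry
    }
}
```

If a field is really needed (for example, to reuse a buffer), resetting it at the top of exec should give the same effect as the local variable in CleanUdf.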
Sorry for the inconvenience,
Anastasis

On Feb 28, 2014, at 2:07 AM, Anastasis Andronidis <andronat_...@hotmail.com> wrote:

> I also just found out that the bag from the nested ORDER BY is
> org.apache.pig.data.InternalCachedBag and not
> org.apache.pig.data.SortedDataBag.
>
> Should it be like that?
>
> On Feb 28, 2014, at 1:51 AM, Anastasis Andronidis <andronat_...@hotmail.com>
> wrote:
>
>> Hi again,
>>
>> I added this in my UDF:
>>
>> if(!((DataBag) input.get(0)).isSorted()) {
>>     throw new IOException("It's not sorted");
>> }
>>
>> And the exception is thrown. Why? I don't understand it. I specified ORDER BY
>> in the nested foreach.
>>
>> Thank you for helping me btw!
>>
>> On Feb 28, 2014, at 1:12 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>
>>> No... that wouldn't be related since you're not doing a GROUP ALL.
>>>
>>> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
>>> wrong in your UDF. The output of your UDF is going to be a string that is
>>> some generic status, right? My uneducated guess is that there's a bug in
>>> your UDF. To confirm, do you get the correct result if you replace your UDF
>>> with an out-of-the-box one, e.g. COUNT?
>>>
>>>
>>> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
>>> andronat_...@hotmail.com> wrote:
>>>
>>>> BTW, is this somehow related [1]?
>>>>
>>>>
>>>> [1]:
>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
>>>>
>>>> On Feb 27, 2014, at 11:20 PM, Anastasis Andronidis <
>>>> andronat_...@hotmail.com> wrote:
>>>>
>>>>> Yes, of course, my output is like this:
>>>>>
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> .
>>>>> .
>>>>> .
>>>>>
>>>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>>>
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> .
>>>>> .
>>>>> .
>>>>>
>>>>>
>>>>> On Feb 27, 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>>>> question. Can you give an example please?
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>>>> andronat_...@hotmail.com> wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I have a foreach statement and inside of it, I use an order by. After
>>>>>>> the order by, I have a UDF. An example looks like this:
>>>>>>>
>>>>>>>
>>>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>>>
>>>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>>>
>>>>>>> service_flavors = FOREACH logs_g {
>>>>>>>     t = ORDER logs BY status;
>>>>>>>     GENERATE group.date as dates, group.site as site,
>>>>>>>              group.profile as profile,
>>>>>>>              FLATTEN(MY_UDF(t)) as (generic_status);
>>>>>>> };
>>>>>>>
>>>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>>>> running on mappers, but shouldn't each mapper take 1 group from the
>>>>>>> logs_g? Is something wrong with order by? I tried to add PARALLEL to
>>>>>>> the order by but I get syntax errors...
>>>>>>>
>>>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>>>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
>>>>>>> please? I am using Pig 0.11.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anastasis