I found the problem. I used some private variables in my class. I was thinking 
that for every tuple it receives, Pig would create a new object of my class. But 
this is not the case, of course.
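
To illustrate the cause described above, here is a minimal Java sketch. It
deliberately does not use the Pig API: the class names LeakyStatusUdf and
FixedStatusUdf, and the plain List<String> standing in for a DataBag, are
assumptions made purely for illustration. It only mimics the situation where
one object has exec() called repeatedly, once per group, so state kept in a
private field carries over from one call to the next.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UDF; NOT the real Pig EvalFunc API.
// The same object is reused, and exec() is called once per group.
class LeakyStatusUdf {
    // Bug: this field survives between exec() calls on the same object.
    private final List<String> seen = new ArrayList<>();

    List<String> exec(List<String> bag) {
        for (String status : bag) {
            if (!seen.contains(status)) {
                seen.add(status);
            }
        }
        // Returns everything seen so far, including earlier groups.
        return new ArrayList<>(seen);
    }
}

// Same logic, but the working state is local to each call.
class FixedStatusUdf {
    List<String> exec(List<String> bag) {
        // Fix: a fresh list per call, so each group starts clean.
        List<String> seen = new ArrayList<>();
        for (String status : bag) {
            if (!seen.contains(status)) {
                seen.add(status);
            }
        }
        return seen;
    }
}
```

If a field really is needed, resetting it at the top of exec() would have the
same effect as the local variable in FixedStatusUdf.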

Sorry for the inconvenience
Anastasis

On 28 Feb 2014, at 2:07 AM, Anastasis Andronidis <andronat_...@hotmail.com> 
wrote:

> I also just found out that the bag from the nested order by is 
> org.apache.pig.data.InternalCachedBag and not 
> org.apache.pig.data.SortedDataBag
> 
> Should it be like that?
> 
> On 28 Feb 2014, at 1:51 AM, Anastasis Andronidis <andronat_...@hotmail.com> 
> wrote:
> 
>> Hi again,
>> 
>> I added this in my UDF:
>> 
>>    if(!((DataBag) input.get(0)).isSorted()) {
>>        throw new IOException("It's not sorted");
>>    }
>> 
>> And the exception is thrown. Why? I don't understand it. I specified ORDER BY 
>> in the nested foreach.
>> 
>> Thank you for helping me btw!
>> 
>> On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>> 
>>> No... that wouldn't be related since you're not doing a GROUP ALL.
>>> 
>>> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
>>> wrong in your UDF. The output of your UDF is going to be a string that is
>>> some generic status right? My uneducated guess is that there's a bug in
>>> your UDF. To confirm, do you get the correct result if you replace your UDF
>>> with an out of the box one e.g. COUNT?
>>> 
>>> 
>>> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
>>> andronat_...@hotmail.com> wrote:
>>> 
>>>> BTW, is this somehow related [1]?
>>>> 
>>>> 
>>>> [1]:
>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
>>>> 
>>>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
>>>> andronat_...@hotmail.com> wrote:
>>>> 
>>>>> Yes, of course, my output is like that:
>>>>> 
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> .
>>>>> .
>>>>> .
>>>>> 
>>>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>>> 
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>>> .
>>>>> .
>>>>> .
>>>>> 
>>>>> 
>>>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>>>> 
>>>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>>>> question. Can you give an example please?
>>>>>> 
>>>>>> 
>>>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>>>> andronat_...@hotmail.com> wrote:
>>>>>> 
>>>>>>> Hello everyone,
>>>>>>> 
>>>>>>> I have a foreach statement and inside of it, I use an order by. After
>>>>>>> the order by, I have a UDF. Example like this:
>>>>>>> 
>>>>>>> 
>>>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>>> 
>>>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>>> 
>>>>>>> service_flavors = FOREACH logs_g {
>>>>>>>     t = ORDER logs BY status;
>>>>>>>     GENERATE group.date as dates, group.site as site,
>>>>>>>              group.profile as profile,
>>>>>>>              FLATTEN(MY_UDF(t)) as (generic_status);
>>>>>>> };
>>>>>>> 
>>>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>>>> running on mappers, but shouldn't each mapper take 1 group from the
>>>>>>> logs_g? Is something wrong with order by? I tried to add PARALLEL to
>>>>>>> the order by, but I get syntax errors...
>>>>>>> 
>>>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>>>> PARALLEL 1; But this is not a scalable solution. Can someone help me
>>>>>>> pls? I am using pig 0.11
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Anastasis
>>>>> 
>>>> 
>>>> 
>> 
>> 
> 
> 
