Re: Nested foreach with order by
I found the problem. My UDF class used some private instance variables, and I assumed that Pig would create a new object of my class for every tuple it passes in. That is not the case, of course: the same object is reused across calls, so its state carries over.

Sorry for the inconvenience,
Anastasis
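For anyone who hits the same thing, here is a minimal sketch of the pitfall in plain Java. It deliberately does not use the Pig API: `LeakyUdf`, `CleanUdf` and `UdfStateDemo` are hypothetical stand-ins that only mimic "one object, many exec() calls", and the thread does not show what the real MY_UDF kept in its fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a UDF that keeps working state in a private field.
class LeakyUdf {
    private final List<String> seen = new ArrayList<>(); // survives across calls

    List<String> exec(List<String> bag) {
        seen.addAll(bag);              // values from earlier groups are still in here
        return new ArrayList<>(seen);
    }
}

// Same logic, but all working state is local to exec().
class CleanUdf {
    List<String> exec(List<String> bag) {
        List<String> seen = new ArrayList<>(); // fresh for every call
        seen.addAll(bag);
        return seen;
    }
}

class UdfStateDemo {
    public static void main(String[] args) {
        List<String> group1 = Arrays.asList("CREAM-CE", "SRMv2");
        List<String> group2 = Arrays.asList("CREAM-CE");

        LeakyUdf leaky = new LeakyUdf();   // one instance reused for both groups
        leaky.exec(group1);
        System.out.println("leaky: " + leaky.exec(group2)); // [CREAM-CE, SRMv2, CREAM-CE]

        CleanUdf clean = new CleanUdf();
        clean.exec(group1);
        System.out.println("clean: " + clean.exec(group2)); // [CREAM-CE]
    }
}
```

If a UDF really needs fields, resetting them at the top of every exec() call should have the same effect as keeping the state local.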
Nested foreach with order by
Hello everyone,

I have a foreach statement and inside of it I use an order by. After the order by, I have a UDF. An example looks like this:

    logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
    logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
    service_flavors = FOREACH logs_g {
        t = ORDER logs BY status;
        GENERATE group.date as dates, group.site as site, group.profile as profile,
                 FLATTEN(MY_UDF(t)) as (generic_status);
    };

The problem is that I get duplicate results. I know that MY_UDF is running on mappers, but shouldn't each mapper take 1 group from logs_g? Is something wrong with the order by? I tried to add PARALLEL to the order by, but I get syntax errors.

My problem is resolved if I put:

    GROUP logs BY (date, site, profile) PARALLEL 1;

But this is not a scalable solution. Can someone help me, please? I am using Pig 0.11.

Cheers,
Anastasis
Re: Nested foreach with order by
Where exactly are you getting duplicates? I'm not sure I understand your question. Can you give an example please?
Re: Nested foreach with order by
Yes, of course. My output looks like this:

    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
    (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
    . . .

and when I put PARALLEL 1 in the GROUP BY I get:

    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
    (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
    (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
    . . .
Re: Nested foreach with order by
BTW, is this somehow related [1]?

[1]: http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
Re: Nested foreach with order by
No... that wouldn't be related, since you're not doing a GROUP ALL. The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going wrong in your UDF. The output of your UDF is going to be a string that is some generic status, right?

My uneducated guess is that there's a bug in your UDF. To confirm, do you get the correct result if you replace your UDF with an out-of-the-box one, e.g. COUNT?
Re: Nested foreach with order by
Hi again,

I added this to my UDF:

    if (!((DataBag) input.get(0)).isSorted()) {
        throw new IOException("It's not sorted");
    }

and the exception is thrown. Why? I don't understand it; I specified ORDER BY in the nested foreach.

Thank you for helping me, btw!
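One possible reading, not confirmed anywhere in this thread, is that isSorted() reports on the kind of bag rather than on the order of the tuples it holds. Under that assumption, a check that walks the elements and compares neighbours would say more about what the ORDER BY actually produced. A minimal sketch in plain Java, where a List of status strings stands in for the DataBag and `OrderCheck`/`isInOrder` are hypothetical names, not Pig API:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class OrderCheck {
    // Returns true when no element is followed by a smaller one under cmp.
    static <T> boolean isInOrder(Iterable<T> items, Comparator<? super T> cmp) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return true;               // an empty sequence is trivially ordered
        }
        T prev = it.next();
        while (it.hasNext()) {
            T cur = it.next();
            if (cmp.compare(prev, cur) > 0) {
                return false;          // found a pair that is out of order
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up status values, only for illustration.
        List<String> sorted = Arrays.asList("CRITICAL", "OK", "WARNING");
        List<String> unsorted = Arrays.asList("OK", "CRITICAL");
        System.out.println(isInOrder(sorted, Comparator.naturalOrder()));   // true
        System.out.println(isInOrder(unsorted, Comparator.naturalOrder())); // false
    }
}
```

Inside a real UDF the same loop would run over the bag's iterator and compare the status field of consecutive tuples.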
Re: Nested foreach with order by
I also just found out that the bag coming from the nested order by is an org.apache.pig.data.InternalCachedBag and not an org.apache.pig.data.SortedDataBag. Is it supposed to be like that?