Re: Nested foreach with order by

2014-02-28 Thread Anastasis Andronidis
I found the problem. I used some private variables in my class. I was thinking 
that for every tuple I get, Pig would create a new object of my class. But 
of course this is not the case.
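
To illustrate the cause described above: when one UDF object handles many 
inputs, anything kept in an instance field carries over from one call to the 
next. The sketch below is plain Java and does not use Pig's API; LeakyUdf, 
CleanUdf and their exec methods are hypothetical names made up for the 
illustration, not the code from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a UDF class. It only simulates an object
// that is created once and then called for many groups.
class StatefulUdfDemo {

    // Buggy variant: the private field survives between calls,
    // so results from earlier groups leak into later ones.
    static class LeakyUdf {
        private final List<String> seen = new ArrayList<>();

        List<String> exec(List<String> statuses) {
            seen.addAll(statuses);
            return new ArrayList<>(seen);
        }
    }

    // Fixed variant: per-call state lives in a local variable,
    // so every call starts from a clean slate.
    static class CleanUdf {
        List<String> exec(List<String> statuses) {
            List<String> seen = new ArrayList<>(statuses);
            return seen;
        }
    }

    public static void main(String[] args) {
        LeakyUdf leaky = new LeakyUdf();
        leaky.exec(Arrays.asList("OK"));
        // Second call on the same object still remembers the first group.
        System.out.println(leaky.exec(Arrays.asList("CRITICAL"))); // [OK, CRITICAL]

        CleanUdf clean = new CleanUdf();
        clean.exec(Arrays.asList("OK"));
        System.out.println(clean.exec(Arrays.asList("CRITICAL"))); // [CRITICAL]
    }
}
```

If a field really is needed, resetting it at the start of every call has the 
same effect as using a local variable.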

Sorry for the inconvenience
Anastasis




Nested foreach with order by

2014-02-27 Thread Anastasis Andronidis
Hello everyone,

I have a FOREACH statement and, inside it, I use an ORDER BY. After the ORDER 
BY, I call a UDF. For example:


logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();

logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;

service_flavors = FOREACH logs_g {
    t = ORDER logs BY status;
    GENERATE group.date as dates, group.site as site, group.profile as profile,
             FLATTEN(MY_UDF(t)) as (generic_status);
};

The problem is that I get duplicate results. I know that MY_UDF is running on 
mappers, but shouldn't each mapper take one group from logs_g? Is something 
wrong with the ORDER BY? I tried to add PARALLEL to the ORDER BY, but I get 
syntax errors.

The problem goes away if I use GROUP logs BY (date, site, profile) PARALLEL 1; 
but this is not a scalable solution. Can someone please help me? I am using 
Pig 0.11.

Cheers,
Anastasis

Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
Where exactly are you getting duplicates? I'm not sure I understand your
question. Can you give an example please?




Re: Nested foreach with order by

2014-02-27 Thread Anastasis Andronidis
Yes, of course. My output looks like this:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.

And when I put PARALLEL 1 in the GROUP BY, I get:

(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
(20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
(20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
.
.
.





Re: Nested foreach with order by

2014-02-27 Thread Anastasis Andronidis
BTW, is this somehow related [1]?


[1]: 
http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E




Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
No... that wouldn't be related since you're not doing a GROUP ALL.

The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
wrong in your UDF. The output of your UDF is going to be a string that is
some generic status, right? My uneducated guess is that there's a bug in
your UDF. To confirm, do you get the correct result if you replace your UDF
with an out-of-the-box one, e.g. COUNT?






Re: Nested foreach with order by

2014-02-27 Thread Anastasis Andronidis
Hi again,

I added this in my UDF:

    if (!((DataBag) input.get(0)).isSorted()) {
        throw new IOException("It's not sorted");
    }

And the exception is thrown. Why? I don't understand it. I specified ORDER BY 
in the nested FOREACH.

Thank you for helping me btw!




Re: Nested foreach with order by

2014-02-27 Thread Anastasis Andronidis
I also just found out that the bag from the nested order by is 
org.apache.pig.data.InternalCachedBag and not org.apache.pig.data.SortedDataBag

Is that how it should be?
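
One possible reading, not confirmed against the Pig source, is that 
isSorted() describes the kind of bag rather than the order of the tuples it 
currently holds, so a bag filled in sorted order could still report false. 
Under that assumption, a check that looks at the elements themselves would be 
more telling. The sketch below is plain Java over any Iterable; 
isNonDecreasing is a hypothetical helper name, not a Pig method.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

// Hypothetical helper: checks ordering by walking the elements,
// without relying on what the container says about itself.
class OrderCheckDemo {

    static <T> boolean isNonDecreasing(Iterable<T> items, Comparator<? super T> cmp) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return true; // an empty sequence is trivially ordered
        }
        T prev = it.next();
        while (it.hasNext()) {
            T cur = it.next();
            if (cmp.compare(prev, cur) > 0) {
                return false; // found a pair that is out of order
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        Comparator<String> byStatus = Comparator.naturalOrder();
        System.out.println(isNonDecreasing(Arrays.asList("CRITICAL", "OK", "WARNING"), byStatus)); // true
        System.out.println(isNonDecreasing(Arrays.asList("OK", "CRITICAL"), byStatus)); // false
    }
}
```

In a UDF, the same loop could run over the bag's iterator with a comparator on 
the status field, which tests the actual tuple order instead of the bag class.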
