No... that wouldn't be related since you're not doing a GROUP ALL.

The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
wrong in your UDF. The output of your UDF is going to be a string that is
some generic status, right? My uneducated guess is that there's a bug in
your UDF. To confirm, do you get the correct result if you replace your UDF
with an out-of-the-box one, e.g. COUNT?
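
A rough, untested sketch of that check, reusing the load and group statements
from your script. The alias `check` and the field name `n` are just
placeholders I made up; everything else is copied from your example:

```pig
logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();

logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;

-- Same nested block as before, but with the built-in COUNT in place of MY_UDF
check = FOREACH logs_g {
      t = ORDER logs BY status;
      GENERATE group.date as dates, group.site as site,
               group.profile as profile, COUNT(t) as n;
};

DUMP check;
```

If this gives one row per (date, site, profile) group with PARALLEL 2, the
grouping itself is fine and the duplicates most likely come from MY_UDF.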


On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
andronat_...@hotmail.com> wrote:

> BTW, is this somehow related[1]?
>
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
>
> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
> andronat_...@hotmail.com> wrote:
>
> > Yes, of course, my output is like that:
> >
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > .
> > .
> > .
> >
> > and when I put PARALLEL 1 in GROUP BY I get:
> >
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > .
> > .
> > .
> >
> >
> > On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> >
> >> Where exactly are you getting duplicates? I'm not sure I understand your
> >> question. Can you give an example please?
> >>
> >>
> >> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
> >> andronat_...@hotmail.com> wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> I have a foreach statement and inside of it, I use an order by. After the
> >>> order by, I have a UDF. Example like this:
> >>>
> >>>
> >>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
> >>>
> >>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
> >>>
> >>> service_flavors = FOREACH logs_g {
> >>>       t = ORDER logs BY status;
> >>>       GENERATE group.date as dates, group.site as site,
> >>>                group.profile as profile,
> >>>                FLATTEN(MY_UDF(t)) as (generic_status);
> >>> };
> >>>
> >>> The problem is that I get duplicate results. I know that MY_UDF is
> >>> running on mappers, but shouldn't each mapper take 1 group from the logs_g?
> >>> Is something wrong with order by? I tried to add PARALLEL to the order by,
> >>> but I get syntax errors...
> >>>
> >>> My problem is resolved if I put GROUP logs BY (date, site, profile)
> >>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
> >>> please? I am using Pig 0.11.
> >>>
> >>> Cheers,
> >>> Anastasis
> >
>
>
