Yes, you're right.
I fixed it before running the query.

Thanks.

On Wed, Feb 18, 2009 at 12:10 AM, Dmitriy Ryaboy <[email protected]> wrote:

> "r3 = GROUP r0 BY domain;" should probably read "r3 = GROUP r0 BY sld;"
> right?
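>
> With that change, the tail of the script would presumably read as follows
> (an untested sketch; the "AS sld" alias is an assumption, borrowed from the
> original script, just to name the group column):
>
> r3 = GROUP r0 BY sld;
> r4 = FOREACH r3 {
>      r5 = r0.domain;
>      r6 = DISTINCT r5;
>      GENERATE group AS sld, COUNT(r6) AS domains;
> }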
>
> -Dmitriy
>
> On Tue, Feb 17, 2009 at 4:49 PM, Alan Gates <[email protected]> wrote:
> > Is it the join or group by that is running out of memory?  You can tell
> > by whether it is the first or second map reduce job that is having
> > problems.
> >
> > How much memory do your grid machines have?  If you can up the memory
> > that will help.
> >
> > What version of pig are you running?  The top of trunk code has some
> > changes that process a nested distinct in the combiner, which should
> > prevent you from running out of memory there.
> >
> > Alan.
> >
> > On Feb 17, 2009, at 1:30 PM, Tamir Kamara wrote:
> >
> >> Thanks Alan. That is indeed better.
> >>
> >> But now I'm getting stuck on memory problems. I think the reducers are
> >> out of heap memory. The log I attached is from a machine that runs 2
> >> reducers simultaneously with Xmx640m, io.sort.factor 50 and
> >> io.sort.mb 200.
> >> I think the reducers work OK until they start logging a lot of:
> >> SpillableMemoryManager: low memory handler called
> >>
> >> How can I resolve this issue?
> >>
> >>
> >>
> >> On Tue, Feb 17, 2009 at 6:43 PM, Alan Gates <[email protected]> wrote:
> >>
> >>> A couple of pointers:
> >>>
> >>> Group bys where you do a foreach/generate immediately after that
> >>> contains no UDF accomplish nothing other than reorganizing your data,
> >>> so you can drop those.
> >>>
> >>> To accomplish a distinct count, use distinct nested in a foreach.
> >>>
> >>> So your script should look like:
> >>>
> >>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
> >>> e:int, o:int);
> >>> traffic1 = FOREACH traffic GENERATE domain, subnet;
> >>>
> >>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
> >>> sld:chararray, org:chararray);
> >>> us_subnets = FILTER subnet_info BY country eq 'us';
> >>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
> >>>
> >>> jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;
> >>>
> >>> r0 = FOREACH jr GENERATE sld, domain;
> >>>
> >>> r3 = GROUP r0 BY domain;
> >>> r4 = FOREACH r3 {
> >>>      r5 = r0.domain;
> >>>      r6 = distinct r5;
> >>>      GENERATE group, COUNT(r6) as domains;
> >>> }
> >>>
> >>> store r4 into 'sld-domains-count';
> >>>
> >>> Alan.
> >>>
> >>> On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I have the following query where I want to generate (sld, count of
> >>>> distinct domains).
> >>>> The traffic data comes with domain and subnet, and the sld is obtained
> >>>> from a second file (with a join).
> >>>> I had a problem generating this in a simple fashion, especially the
> >>>> distinct domains part. Would you have a look at the script below and
> >>>> help me figure out if there's a way to simplify this?
> >>>>
> >>>> Thanks,
> >>>> Tamir
> >>>>
> >>>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
> >>>> e:int, o:int);
> >>>> traffic1 = FOREACH traffic GENERATE domain, subnet;
> >>>>
> >>>> traffic_by_subnet = GROUP traffic1 BY subnet;
> >>>> traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet,
> >>>> traffic1.domain;
> >>>>
> >>>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
> >>>> sld:chararray, org:chararray);
> >>>> us_subnets = FILTER subnet_info BY country eq 'us';
> >>>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
> >>>>
> >>>> jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
> >>>>
> >>>> r0 = FOREACH jr GENERATE sld, domain;
> >>>> r1 = GROUP r0 BY sld;
> >>>> r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
> >>>> r3 = GROUP r2 BY domain;
> >>>> r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
> >>>>
> >>>> store r4 into 'sld-domains-count';
> >>>>
> >>>
> >>>
> >
> >
>
