Re: Query Help

Alan Gates Tue, 17 Feb 2009 13:50:57 -0800

Is it the join or group by that is running out of memory? You can tell by whether it is the first or second map reduce job that is having problems.

How much memory do your grid machines have? If you can up the memory that will help.

What version of pig are you running? The top of trunk code has some changes that process a nested distinct in the combiner, which should prevent you from running out of memory there.
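
If upgrading is not an option, one possible workaround is to deduplicate the (sld, domain) pairs with a top-level DISTINCT before grouping, so that no single reducer has to hold a whole bag of domains for a nested distinct. This is only a sketch: it assumes the goal is a distinct-domain count per sld as in the original question, reuses the file and field names from the script quoted further down, introduces hypothetical relation names (pairs, uniq_pairs, by_sld, counts), and has not been run against your data.

```pig
-- Same inputs and filter as the script quoted in this thread.
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic1 BY subnet, us_subnets1 BY subnet;

-- Keep only the two fields needed, then drop duplicate (sld, domain) pairs.
-- Relation names from here on are hypothetical, chosen for illustration.
pairs = FOREACH jr GENERATE sld, domain;
uniq_pairs = DISTINCT pairs;

-- Each remaining row is one distinct domain for its sld, so a plain COUNT suffices.
by_sld = GROUP uniq_pairs BY sld;
counts = FOREACH by_sld GENERATE group AS sld, COUNT(uniq_pairs) AS domains;

STORE counts INTO 'sld-domains-count';
```

The trade-off is an extra map reduce job for the DISTINCT. In exchange, the final COUNT is a plain aggregate over already-unique rows, so it should be able to use the combiner instead of building a bag per group.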


Alan.

On Feb 17, 2009, at 1:30 PM, Tamir Kamara wrote:

Thanks Alan. That is indeed better.
But now I'm getting stuck on memory problems. I think the reducers are out of heap memory. The log I attached is from a machine that runs 2 reducers
simultaneously with Xmx640m, io.sort.factor 50 and io.sort.mb 200.
I think the reducers work OK until they start logging a lot of:
SpillableMemoryManager: low memory handler called

How can I resolve this issue?

On Tue, Feb 17, 2009 at 6:43 PM, Alan Gates <[email protected]> wrote:

A couple of pointers:
A group by followed immediately by a foreach/generate that contains no UDF accomplishes nothing other than reorganizing your data, so you can drop those.

To accomplish a distinct count, use distinct nested in a foreach.

So your script should look like:
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;

r0 = FOREACH jr GENERATE sld, domain;

r3 = GROUP r0 BY sld;
r4 = FOREACH r3 {
      r5 = r0.domain;
      r6 = distinct r5;
      GENERATE group AS sld, COUNT(r6) AS domains;
}

store r4 into 'sld-domains-count';

Alan.

On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:

Hi,
I have the following query where I want to generate (sld, count of distinct domains).
The traffic data comes with domain and subnet, and the sld is obtained from a second file (with a join).
I had a problem generating this in a simple fashion, especially the distinct domains part. Would you have a look at the script below and
help me figure out if there's a way to simplify this?

Thanks,
Tamir
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;

traffic_by_subnet = GROUP traffic1 BY subnet;
traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet, traffic1.domain;
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;

r0 = FOREACH jr GENERATE sld, domain;
r1 = GROUP r0 BY sld;
r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
r3 = GROUP r2 BY domain;
r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;

store r4 into 'sld-domains-count';
