Thanks, Alan. That is indeed better.

But now I'm running into memory problems. I think the reducers are running out
of heap memory. The log I attached is from a machine that runs 2 reducers
simultaneously with Xmx640m, io.sort.factor 50 and io.sort.mb 200.
I think the reducers work fine until they start logging a lot of:
SpillableMemoryManager: low memory handler called

How can I resolve this issue?



On Tue, Feb 17, 2009 at 6:43 PM, Alan Gates <[email protected]> wrote:

> A couple of pointers:
>
> Group bys where you do a foreach/generate immediately after that contains
> no UDF accomplish nothing other than reorganizing your data, so you can drop
> those.
>
> To accomplish a distinct count, use distinct nested in a foreach.
>
> So your script should look like:
>
> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
> e:int, o:int);
> traffic1 = FOREACH traffic GENERATE domain, subnet;
>
> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
> sld:chararray, org:chararray);
> us_subnets = FILTER subnet_info BY country eq 'us';
> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
>
> jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;
>
> r0 = FOREACH jr GENERATE sld, domain;
>
> r3 = GROUP r0 BY sld;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group, COUNT(r6) as domains;
> }
>
> store r4 into 'sld-domains-count';
>
> Alan.
>
> On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:
>
>> Hi,
>>
>> I have the following query where I want to generate (sld, count of distinct
>> domains).
>> The traffic data comes with domain and subnet, and the sld is obtained from a
>> second file (with a join).
>> I had a problem generating this in a simple fashion, especially the distinct
>> domains part. Would you have a look at the script below and
>> help me figure out if there's a way to simplify this?
>>
>> Thanks,
>> Tamir
>>
>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
>> e:int, o:int);
>> traffic1 = FOREACH traffic GENERATE domain, subnet;
>>
>> traffic_by_subnet = GROUP traffic1 BY subnet;
>> traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet,
>> traffic1.domain;
>>
>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
>> sld:chararray, org:chararray);
>> us_subnets = FILTER subnet_info BY country eq 'us';
>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
>>
>> jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
>>
>> r0 = FOREACH jr GENERATE sld, domain;
>> r1 = GROUP r0 BY sld;
>> r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
>> r3 = GROUP r2 BY domain;
>> r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
>>
>> store r4 into 'sld-domains-count';
>>
>
>
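
For readers following the thread, here is a minimal Python sketch of the result the original question asks for: one (sld, count of distinct domains) pair per sld, computed from (sld, domain) rows like those in r0 after the join. It only illustrates the grouping logic and is not how Pig executes the script. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def distinct_domain_counts(rows):
    """Count distinct domains per sld.

    rows: iterable of (sld, domain) pairs, shaped like relation r0
    in the thread (hypothetical in-memory stand-in for that relation).
    """
    domains_by_sld = defaultdict(set)
    for sld, domain in rows:
        # A set drops duplicate domains within each sld group.
        domains_by_sld[sld].add(domain)
    return {sld: len(domains) for sld, domains in domains_by_sld.items()}


# Hypothetical sample rows; the second row repeats the first on purpose.
sample = [
    ("isp-a.net", "example.com"),
    ("isp-a.net", "example.com"),
    ("isp-a.net", "example.org"),
    ("isp-b.net", "example.com"),
]
counts = distinct_domain_counts(sample)
```

Note that this sketch holds every group's set of distinct domains in memory at once, so it is only suitable for small inputs.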
