A couple of pointers:

A GROUP BY followed immediately by a FOREACH/GENERATE that contains no UDF accomplishes nothing other than reorganizing your data, so you can drop those steps (traffic_by_subnet/traffic_by_subnet1 and r1/r2 in your script).

To get a distinct count, use DISTINCT nested inside a FOREACH.

So your script should look like:

traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;

subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;

r0 = FOREACH jr GENERATE sld, domain;

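-- group by sld, since the goal is one count per sld
r3 = GROUP r0 BY sld;
r4 = FOREACH r3 {
        r5 = r0.domain;
        -- drop repeated domains within each sld before counting
        r6 = distinct r5;
        GENERATE group AS sld, COUNT(r6) as domains;
}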

store r4 into 'sld-domains-count';
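
As a rough alternative sketch (not part of the script above), the same count can also be had by de-duplicating the (sld, domain) pairs first and then grouping. This assumes r0 is defined as above; the relation names pairs, by_sld and counts are made up for illustration.

```
-- unique (sld, domain) pairs
pairs = DISTINCT r0;
-- one group per sld, each holding its unique pairs
by_sld = GROUP pairs BY sld;
-- the number of pairs in a group is the number of distinct domains
counts = FOREACH by_sld GENERATE group AS sld, COUNT(pairs) AS domains;
```

Which form runs better probably depends on your data, so treat this as something to try rather than a recommendation.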

Alan.
On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:

Hi,

I have the following query where I want to generate (sld, count of distinct
domains).
The traffic data comes with domain and subnet, and the sld is obtained from a
second file (with a join).
I had a problem generating this in a simple fashion, especially the
distinct domains part. Would you take a look at the script below and
help me figure out if there's a way to simplify this?

Thanks,
Tamir

traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;

traffic_by_subnet = GROUP traffic1 BY subnet;
traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet,
traffic1.domain;

subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;

r0 = FOREACH jr GENERATE sld, domain;
r1 = GROUP r0 BY sld;
r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
r3 = GROUP r2 BY domain;
r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;

store r4 into 'sld-domains-count';
