A couple of pointers:
A GROUP BY followed immediately by a FOREACH/GENERATE that contains no
UDF accomplishes nothing other than reorganizing your data, so you can
drop those.
To get a distinct count, use DISTINCT nested inside a FOREACH.
So your script should look like:
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;
r0 = FOREACH jr GENERATE sld, domain;
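-- group by sld so each group holds every domain seen for that sld
r3 = GROUP r0 BY sld;
r4 = FOREACH r3 {
    r5 = r0.domain;
    -- de-duplicate the domains within the group before counting
    r6 = DISTINCT r5;
    GENERATE group AS sld, COUNT(r6) AS domains;
}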
store r4 into 'sld-domains-count';
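If the nested form feels awkward, another way to get (sld, count of
distinct domains) is to de-duplicate the (sld, domain) pairs before
grouping. This is only a sketch, assuming r0 is defined as above; the
aliases r0_unique, by_sld and sld_counts are made-up names, and I have
not run it against your data:

```pig
-- keep one tuple per (sld, domain) pair
r0_unique = DISTINCT r0;
-- every tuple left in a group is a different domain for that sld
by_sld = GROUP r0_unique BY sld;
sld_counts = FOREACH by_sld GENERATE group AS sld, COUNT(r0_unique) AS domains;
```

The nested DISTINCT keeps everything in one FOREACH, while this version
spends an extra pass on the top-level DISTINCT.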
Alan.
On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:
Hi,
I have the following query, where I want to generate (sld, count of distinct
domains).
The traffic data comes with domain and subnet, and the sld is obtained from a
second file (with a join).
I had a problem generating this in a simple fashion, especially the distinct
domains part. Would you have a look at the script below and help me figure out
if there's a way to simplify this?
Thanks,
Tamir
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;
traffic_by_subnet = GROUP traffic1 BY subnet;
traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet,
traffic1.domain;
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
r0 = FOREACH jr GENERATE sld, domain;
r1 = GROUP r0 BY sld;
r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
r3 = GROUP r2 BY domain;
r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
store r4 into 'sld-domains-count';