Hi, I have the following script, where I want to generate (sld, count of distinct domains). The traffic data comes with domain and subnet, and the sld is obtained from a second file (via a join). I had trouble generating this in a simple fashion, especially the distinct-domains part. Would you have a look at the script below and help me figure out whether there's a way to simplify it?
Thanks,
Tamir

traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;
traffic_by_subnet = GROUP traffic1 BY subnet;
traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet, traffic1.domain;

subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
r0 = FOREACH jr GENERATE sld, domain;
r1 = GROUP r0 BY sld;
r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
r3 = GROUP r2 BY domain;
r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
store r4 into 'sld-domains-count';
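
For reference, the result I am after could perhaps be sketched more directly as below. This is only a rough, untested sketch: it assumes the Pig version in use supports a nested DISTINCT inside FOREACH and the == operator for string comparison (the script above uses eq), and the relation names pairs, us, joined, sld_domain, by_sld and result are made up for illustration.

```pig
-- Keep only the fields needed from the traffic data.
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
pairs = FOREACH traffic GENERATE domain, subnet;

-- Keep only US subnets and their sld.
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country == 'us';  -- assumption: == works for strings here
us = FOREACH us_subnets GENERATE subnet, sld;

-- Join on subnet directly, without grouping the traffic first.
joined = JOIN pairs BY subnet, us BY subnet;
sld_domain = FOREACH joined GENERATE us::sld AS sld, pairs::domain AS domain;

-- Group by sld and count the distinct domains inside each group.
by_sld = GROUP sld_domain BY sld;
result = FOREACH by_sld {
    uniq = DISTINCT sld_domain.domain;
    GENERATE group AS sld, COUNT(uniq) AS domains;
};

STORE result INTO 'sld-domains-count';
```

The idea is to join (domain, subnet) pairs against the US subnets first, then group once by sld and de-duplicate the domains within each group, instead of grouping by subnet, then by sld, then by domain.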
