Thanks Alan. That is indeed better. But now I'm running into memory problems: I think the reducers are running out of heap memory. The log I attached is from a machine that runs 2 reducers simultaneously with Xmx640m, io.sort.factor 50 and io.sort.mb 200. I think the reducers work fine until they start logging a lot of these messages:
SpillableMemoryManager: low memory handler called
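For reference, here is the rough arithmetic behind those settings as a small Python sketch. It only uses the numbers quoted above; the function name `heap_budget` is made up for illustration, and it assumes the io.sort.mb buffer is carved out of each task JVM's -Xmx heap (I am not sure that holds for reducers the same way it does for map tasks).

```python
# Rough heap budget for one worker machine, using the settings quoted above.
# Assumption (not confirmed for reducers): io.sort.mb comes out of the
# same -Xmx heap as everything else in the task JVM.

REDUCERS_PER_NODE = 2   # reducers running simultaneously on the machine
XMX_MB = 640            # -Xmx640m per task JVM
IO_SORT_MB = 200        # io.sort.mb


def heap_budget(tasks: int, xmx_mb: int, sort_mb: int) -> dict:
    """Return the total heap across tasks and the sort buffer's share of one heap."""
    return {
        "total_heap_mb": tasks * xmx_mb,          # heap summed over all task JVMs
        "sort_fraction": sort_mb / xmx_mb,        # share of one heap used by the buffer
        "left_after_sort_mb": xmx_mb - sort_mb,   # heap left per task for everything else
    }


budget = heap_budget(REDUCERS_PER_NODE, XMX_MB, IO_SORT_MB)
print(budget)
```

If that assumption is right, each task would have well under 640 MB available for the bags Pig builds in the reducer, which might be why the low memory handler fires so often.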
How can I resolve this issue?

On Tue, Feb 17, 2009 at 6:43 PM, Alan Gates <[email protected]> wrote:

> A couple of pointers:
>
> Group bys where you do a foreach/generate immediately after that contains
> no UDF accomplish nothing other than reorganizing your data, so you can
> drop those.
>
> To accomplish a distinct count, use distinct nested in a foreach.
>
> So your script should look like:
>
> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
>     e:int, o:int);
> traffic1 = FOREACH traffic GENERATE domain, subnet;
>
> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
>     sld:chararray, org:chararray);
> us_subnets = FILTER subnet_info BY country eq 'us';
> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
>
> jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;
>
> r0 = FOREACH jr GENERATE sld, domain;
>
> r3 = GROUP r0 BY domain;
> r4 = FOREACH r3 {
>     r5 = r0.domain;
>     r6 = distinct r5;
>     GENERATE group, COUNT(r6) as domains;
> }
>
> store r4 into 'sld-domains-count';
>
> Alan.
>
> On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:
>
>> Hi,
>>
>> I have the following query where i want to generate (sld, count of
>> distinct domains).
>> The traffic data comes with domain, subnet and the sld is obtained by a
>> second file (with a join).
>> I had a problem with generating this in a simple fashion and especially
>> with the distinct domains part. Would you have a look on the script below
>> and help me figure out if there's a way to simplify this ?
>>
>> Thanks,
>> Tamir
>>
>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int,
>>     e:int, o:int);
>> traffic1 = FOREACH traffic GENERATE domain, subnet;
>>
>> traffic_by_subnet = GROUP traffic1 BY subnet;
>> traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet,
>>     traffic1.domain;
>>
>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray,
>>     sld:chararray, org:chararray);
>> us_subnets = FILTER subnet_info BY country eq 'us';
>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
>>
>> jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
>>
>> r0 = FOREACH jr GENERATE sld, domain;
>> r1 = GROUP r0 BY sld;
>> r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
>> r3 = GROUP r2 BY domain;
>> r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
>>
>> store r4 into 'sld-domains-count';
