Yes, you're right. I fixed it before running the query. Thanks.
On Wed, Feb 18, 2009 at 12:10 AM, Dmitriy Ryaboy <[email protected]> wrote:
> "r3 = GROUP r0 BY domain;" should probably read "r3 = GROUP r0 BY sld;",
> right?
>
> -Dmitriy
>
> On Tue, Feb 17, 2009 at 4:49 PM, Alan Gates <[email protected]> wrote:
> > Is it the join or the group by that is running out of memory? You can tell
> > by whether it is the first or second map reduce job that is having problems.
> >
> > How much memory do your grid machines have? If you can up the memory, that
> > will help.
> >
> > What version of pig are you running? The top of trunk code has some changes
> > that process a nested distinct in the combiner, which should prevent you
> > from running out of memory there.
> >
> > Alan.
> >
> > On Feb 17, 2009, at 1:30 PM, Tamir Kamara wrote:
> >
> >> Thanks Alan. That is indeed better.
> >>
> >> But now I'm getting stuck on memory problems. I think the reducers are out
> >> of heap memory. The log I attached is from a machine that runs 2 reducers
> >> simultaneously with Xmx640m, io.sort.factor 50 and io.sort.mb 200.
> >> I think the reducers work OK until they start logging a lot of:
> >> SpillableMemoryManager: low memory handler called
> >>
> >> How can I resolve this issue?
> >>
> >> On Tue, Feb 17, 2009 at 6:43 PM, Alan Gates <[email protected]> wrote:
> >>
> >>> A couple of pointers:
> >>>
> >>> Group bys where you do a foreach/generate immediately after that contains
> >>> no UDF accomplish nothing other than reorganizing your data, so you can
> >>> drop those.
> >>>
> >>> To accomplish a distinct count, use distinct nested in a foreach.
> >>>
> >>> So your script should look like:
> >>>
> >>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
> >>> traffic1 = FOREACH traffic GENERATE domain, subnet;
> >>>
> >>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
> >>> us_subnets = FILTER subnet_info BY country eq 'us';
> >>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
> >>>
> >>> jr = JOIN traffic1 BY subnet, us_subnets1 by subnet;
> >>>
> >>> r0 = FOREACH jr GENERATE sld, domain;
> >>>
> >>> r3 = GROUP r0 BY domain;
> >>> r4 = FOREACH r3 {
> >>>     r5 = r0.domain;
> >>>     r6 = distinct r5;
> >>>     GENERATE group, COUNT(r6) as domains;
> >>> }
> >>>
> >>> store r4 into 'sld-domains-count';
> >>>
> >>> Alan.
> >>>
> >>> On Feb 16, 2009, at 11:36 PM, Tamir Kamara wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I have the following query where I want to generate (sld, count of
> >>>> distinct domains).
> >>>> The traffic data comes with domain and subnet, and the sld is obtained
> >>>> from a second file (with a join).
> >>>> I had a problem generating this in a simple fashion, especially the
> >>>> distinct domains part. Would you have a look at the script below and
> >>>> help me figure out if there's a way to simplify this?
> >>>>
> >>>> Thanks,
> >>>> Tamir
> >>>>
> >>>> traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
> >>>> traffic1 = FOREACH traffic GENERATE domain, subnet;
> >>>>
> >>>> traffic_by_subnet = GROUP traffic1 BY subnet;
> >>>> traffic_by_subnet1 = FOREACH traffic_by_subnet GENERATE group AS subnet, traffic1.domain;
> >>>>
> >>>> subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
> >>>> us_subnets = FILTER subnet_info BY country eq 'us';
> >>>> us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;
> >>>>
> >>>> jr = JOIN traffic_by_subnet1 BY subnet, us_subnets1 by subnet;
> >>>>
> >>>> r0 = FOREACH jr GENERATE sld, domain;
> >>>> r1 = GROUP r0 BY sld;
> >>>> r2 = FOREACH r1 GENERATE group as sld, flatten(r0.domain) as domain;
> >>>> r3 = GROUP r2 BY domain;
> >>>> r4 = FOREACH r3 GENERATE r2.sld, COUNT(group) as domains;
> >>>>
> >>>> store r4 into 'sld-domains-count';
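
For reference, here is a sketch of the script suggested in the thread with the correction discussed above applied, so that the grouping key is sld rather than domain. It is assembled from the quoted messages and has not been run. It keeps the thread's 2009-era syntax (for example `eq` for string comparison), which may need adjusting on other Pig versions, and the file names and schemas are those given in the original question.

```pig
-- Keep only the fields needed from the traffic data.
traffic = LOAD 'traffic.txt' AS (domain:chararray, subnet:long, w:int, e:int, o:int);
traffic1 = FOREACH traffic GENERATE domain, subnet;

-- Keep only US subnets and the fields needed for the join.
subnet_info = LOAD 'subnet_info.txt' AS (subnet:long, country:chararray, sld:chararray, org:chararray);
us_subnets = FILTER subnet_info BY country eq 'us';
us_subnets1 = FOREACH us_subnets GENERATE subnet, sld;

-- Attach an sld to every traffic record via its subnet.
jr = JOIN traffic1 BY subnet, us_subnets1 BY subnet;
r0 = FOREACH jr GENERATE sld, domain;

-- Group by sld (the correction from the thread), then count the
-- distinct domains inside each group with a nested DISTINCT.
r3 = GROUP r0 BY sld;
r4 = FOREACH r3 {
    r5 = r0.domain;
    r6 = DISTINCT r5;
    GENERATE group, COUNT(r6) AS domains;
}

STORE r4 INTO 'sld-domains-count';
```

Each output record is an (sld, count of distinct domains) pair, which is the result asked for in the original question. Whether the nested distinct is evaluated in the combiner depends on the Pig version, as noted in the thread.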
