Julien, I tried with 2048M per task child, no luck: I still have two reduces that don't go through.
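For reference, this is roughly what I added to conf/hadoop-site.xml on the master and the slaves (typed from memory, so take it as a sketch of my setting rather than an exact copy):

  <!-- heap size passed to every map/reduce child JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>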
Is it somehow related to the number of reduces? On this cluster I have 4 servers:
- dual Xeon dual core (8 cores)
- 8 GB RAM
- 4 disks

I set mapred.reduce.tasks and mapred.map.tasks to 16, because: 4 servers with 4 disks each (what do you think?). The exact properties are pasted below the quote.

If this job is simply too big for my cluster, would adding reduce tasks subdivide the problem into smaller reduces? Actually I think not, because I guess the input keys for the same domain go to the same reduce? So my two last reduce tasks would be the two biggest domains in my DB?

L.

On Sun, Aug 16, 2009 at 6:39 PM, Julien Nioche <[email protected]> wrote:
> Hi,
>
> The reducing step of the updatedb requires quite a lot of memory indeed. See
> https://issues.apache.org/jira/browse/NUTCH-702 for a discussion on this
> subject.
> BTW you'll have to specify the parameter mapred.child.java.opts in your
> conf/hadoop-site.xml so that the value is sent to the Hadoop slaves. Another
> way to do that is to specify it on the command line with:
> -D mapred.child.java.opts=-Xmx2000m
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/8/16 MoD <[email protected]>
>
>> Hi,
>>
>> During the CrawlDb map-reduce job, the reduce workers fail one by one with:
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
>>         at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
>>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>>         at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>>         at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> I have the default 1 GB per JVM:
>>
>> /opt/java/jre/bin/java -Xmx1000m
>>
>> Running out of memory on a Java process is somewhat surprising.
>> Does this job really need over 1 GB of RAM per node?
>>
>> Oh, by the way, I don't have swap files; the system has 8 GB and doesn't
>> seem to be short of RAM.
>>
>> My command line:
>>
>> nu...@titaniumpelican search $ ./bin/nutch updatedb
>> hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
>> hdfs://titaniumpelican:9000/user/nutch/crawl/segments
>> CrawlDb update: starting
>> CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
>> CrawlDb update: segments:
>> [hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
>> CrawlDb update: additions allowed: false
>> CrawlDb update: URL normalizing: false
>> CrawlDb update: URL filtering: false
>> CrawlDb update: Merging segment data into db.
>> java.lang.OutOfMemoryError: Java heap space
>>
>> Question: why does this job cut the work into 140 map tasks?
>>
>> Regards,
>> Louis
>>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
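PS: the task-count settings mentioned above, as I believe they currently sit in my conf/hadoop-site.xml (again typed from memory, so a sketch rather than an exact copy; the value 16 comes from 4 servers x 4 disks):

  <!-- default number of map and reduce tasks per job
       (the map count is only a hint; actual maps follow the input splits) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>16</value>
  </property>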
