Hi Edwin,

I should have specified that this is a custom change I've made to the CrawlDbReducer, which looks like this:

    switch (datum.getStatus()) {
      // collect other info
      case CrawlDatum.STATUS_LINKED:
        if (maxLinks != -1 && linked.size() >= maxLinks) break;

where maxLinks is a variable which I initialize from the configure() method:

    maxLinks = job.getInt("db.fetch.links.max", -1);

I suppose this approach could be improved by considering the top N links by weight instead of just taking them as they come.
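Something along those lines might look like the rough, untested sketch below. TopNLinks is only a hypothetical helper (it is not part of Nutch), I am assuming the weight of an inlink is simply its CrawlDatum score, and inside reduce() each incoming datum would have to be copied first (e.g. with CrawlDatum.set()), since Hadoop reuses the value objects:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    import org.apache.nutch.crawl.CrawlDatum;

    /** Hypothetical helper: keep only the N highest-scoring inlinks
     *  rather than the first N seen. */
    public class TopNLinks {

      private final int maxLinks;   // -1 means "no cap", as above

      // min-heap ordered by score, so the weakest inlink is evicted first
      private final PriorityQueue<CrawlDatum> topLinks =
          new PriorityQueue<CrawlDatum>(16, new Comparator<CrawlDatum>() {
            public int compare(CrawlDatum a, CrawlDatum b) {
              return Float.compare(a.getScore(), b.getScore());
            }
          });

      public TopNLinks(int maxLinks) {
        this.maxLinks = maxLinks;
      }

      /** Offer a (copied) STATUS_LINKED datum; keep it only if it is
       *  currently among the top N by score. */
      public void add(CrawlDatum link) {
        if (maxLinks == -1 || topLinks.size() < maxLinks) {
          topLinks.add(link);
        } else if (topLinks.peek().getScore() < link.getScore()) {
          topLinks.poll();          // drop the current weakest inlink
          topLinks.add(link);
        }
      }

      /** The surviving inlinks, in no particular order. */
      public Iterable<CrawlDatum> get() {
        return topLinks;
      }
    }

The STATUS_LINKED case would then offer each copied datum to this helper instead of appending it to linked unconditionally, and the reducer would read the surviving inlinks back from get().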
Julien

2009/3/19 Edwin Chu <edwinche...@gmail.com>

> Thanks Julien
>
> I looked into nutch-default.xml and I can't find a directive that can
> control the number of incoming links to be taken into account for scoring
> a document. I can find db.max.inlinks, but it looks like it only controls
> the invertlinks process. Could you tell me how to do it?
>
> Regards
> Edwin
>
> On Fri, Mar 20, 2009 at 6:45 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
>
> > Hi Edwin,
> >
> > I had a similar issue, which I solved by capping the number of incoming
> > links taken into account for scoring a document. Another option is to
> > use the patch I submitted on JIRA (NUTCH-702), which does lazy
> > instantiation of the metadata; that should save a lot of RAM (and CPU).
> >
> > HTH
> >
> > Julien
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > 2009/3/19 Edwin Chu <edwinche...@gmail.com>
> >
> > > Hi,
> > >
> > > I am using the trunk version of Nutch in a cluster of 5 EC2 nodes to
> > > crawl the Internet. Each node has 7GB of memory and I have configured
> > > mapred.child.java.opts to -Xmx3000m in hadoop-site.xml. When I tried
> > > to update the crawldb of about 20M urls with a crawl segment of 5M of
> > > fetched content, I got the following error:
> > >
> > > java.lang.OutOfMemoryError: Java heap space
> > >   at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
> > >   at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:311)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > >
> > > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > >   at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> > >   at org.apache.nutch.crawl.CrawlDatum.<init>(CrawlDatum.java:135)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:95)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > >
> > > Does anyone have an idea about this problem? I assumed that the output
> > > of the reduce function is written to the filesystem immediately instead
> > > of being held in memory longer than necessary, otherwise the system
> > > would not be able to scale. I think the 3GB limit is the maximum,
> > > because there is no swap space in EC2 and each node can run a maximum
> > > of 2 map/reduce tasks.
> > >
> > > Thank you very much.
> > >
> > > Regards
> > >
> > > Edwin

--
DigitalPebble Ltd
http://www.digitalpebble.com