Hi Edwin,

I should have specified that this is a custom change I've made to the
CrawlDbReducer, which looks like this:

  switch (datum.getStatus()) {                // collect other info
      case CrawlDatum.STATUS_LINKED:
        if (maxLinks != -1 && linked.size() >= maxLinks) break;

where maxLinks is a field that I initialize in the configure() method:

    maxLinks = job.getInt("db.fetch.links.max", -1);
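
For completeness, the whole change is just that field plus one line in
configure(); roughly like this (a sketch only, the rest of the method stays
as it is in trunk):

  private int maxLinks;                                // custom field in CrawlDbReducer

  public void configure(JobConf job) {
    // ... existing trunk initialisation (retry limit, scoring filters, etc.) ...
    maxLinks = job.getInt("db.fetch.links.max", -1);   // -1 = no cap (default)
  }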

I suppose this approach could be improved by considering the top N links by
weight instead of just taking them as they come.
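
Something along these lines would do it: a small helper keeping a bounded
min-heap keyed on CrawlDatum.getScore(), so the lowest-weighted inlink gets
evicted once the cap is reached. The class name and the wiring are just an
untested illustration, not part of Nutch or of my patch:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  import org.apache.nutch.crawl.CrawlDatum;

  /**
   * Keeps at most maxLinks inlinked CrawlDatums, retaining the highest-scoring
   * ones. A bounded min-heap: the weakest entry sits at the head and is evicted
   * when a better one arrives. (Illustrative helper, not part of Nutch.)
   */
  public class TopNLinked {

    private final int maxLinks;
    private final PriorityQueue<CrawlDatum> heap;

    public TopNLinked(int maxLinks) {
      this.maxLinks = maxLinks;
      this.heap = new PriorityQueue<CrawlDatum>(Math.max(1, maxLinks),
          new Comparator<CrawlDatum>() {
            public int compare(CrawlDatum a, CrawlDatum b) {
              return Float.compare(a.getScore(), b.getScore());
            }
          });
    }

    /**
     * Offer a linked datum; drop the lowest-scoring one if over the cap.
     * Callers should pass a copy of the datum, since Hadoop reuses the
     * objects it hands to reduce().
     */
    public void add(CrawlDatum datum) {
      if (maxLinks == -1 || heap.size() < maxLinks) {   // -1 = no cap
        heap.add(datum);
      } else if (heap.peek().getScore() < datum.getScore()) {
        heap.poll();                                    // evict the weakest inlink
        heap.add(datum);
      }
    }

    /** The retained inlinks, to be passed on to the scoring filters. */
    public List<CrawlDatum> getLinked() {
      return new ArrayList<CrawlDatum>(heap);
    }
  }

The STATUS_LINKED case in the reducer would then offer each datum to the
helper instead of appending it to linked unconditionally, and the retained
entries would be handed to the scoring filters as before.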

Julien


2009/3/19 Edwin Chu <edwinche...@gmail.com>

> Thanks Julien
> I looked into nutch-default.xml and I can't find a directive that controls
> the number of incoming links taken into account when scoring a document. I
> can find db.max.inlinks, but it looks like it only controls the invertlinks
> process. Could you tell me how to do it?
>
> Regards
> Edwin
>
> On Fri, Mar 20, 2009 at 6:45 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>
> > Hi Edwin,
> >
> > I had a similar issue, which I solved by capping the number of incoming
> > links taken into account when scoring a document. Another option is to
> > use the patch I submitted on JIRA (NUTCH-702), which does lazy
> > instantiation of the metadata; that should save a lot of RAM (and CPU).
> >
> > HTH
> >
> > Julien
> >
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > 2009/3/19 Edwin Chu <edwinche...@gmail.com>
> >
> > > Hi,
> > > I am using the trunk version of Nutch in a cluster of 5 EC2 nodes to
> > > crawl the Internet. Each node has 7GB of memory and I have set
> > > mapred.child.java.opts to -Xmx3000m in hadoop-site.xml. When I tried to
> > > update a crawldb of about 20M URLs with a crawl segment of about 5M
> > > fetched pages, I got the following error:
> > >
> > > java.lang.OutOfMemoryError: Java heap space
> > >   at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
> > >   at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:311)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > >
> > > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > >   at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
> > >   at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
> > >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> > >   at org.apache.nutch.crawl.CrawlDatum.<init>(CrawlDatum.java:135)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:95)
> > >   at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
> > >   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> > >
> > >
> > > Does anyone have an idea about this problem? I assumed that the output
> > > of the reduce function is written to the filesystem immediately instead
> > > of being held in memory longer than necessary, otherwise the system
> > > would not be able to scale. I think the 3GB limit is the maximum,
> > > because there is no swap space on EC2 and each node can run a maximum
> > > of 2 map/reduce tasks.
> > >
> > > Thank you very much.
> > >
> > > Regards
> > >
> > > Edwin
> > >
> >
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
