Hi,

During the CrawlDb MapReduce job,
the reduce workers fail one by one with:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)


I have the default 1 GB per JVM:

/opt/java/jre/bin/java -Xmx1000m
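
In case it helps, I assume the 1000m comes from mapred.child.java.opts in my
hadoop-site.xml; this is just a sketch of what I expect that property looks
like, not a verbatim copy of my config:

  <!-- hadoop-site.xml: JVM options passed to each map/reduce child task;
       raising -Xmx here would raise the per-task heap -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>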


Running out of memory in a Java process here is somewhat surprising.
Does this job need more than 1 GB of RAM per node?

By the way, I don't have any swap files; the system has 8 GB and doesn't
seem to be short on RAM.

My command line:

nu...@titaniumpelican search $ ./bin/nutch  updatedb
hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
hdfs://titaniumpelican:9000/user/nutch/crawl/segments
CrawlDb update: starting
CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
CrawlDb update: segments:
[hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
java.lang.OutOfMemoryError: Java heap space


Question: why does this job split the work into 140 map tasks?

Regards,
Louis
