Hi,
During the CrawlDb map-reduce job, the reduce workers fail one by one with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
Each task JVM has the default 1 GB heap:
/opt/java/jre/bin/java -Xmx1000m
Running out of memory in a Java process is somewhat surprising.
Does this job need more than 1 GB of RAM per node?
By the way, I don't have any swap; the machine has 8 GB of RAM and doesn't seem to be short on memory.
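In case the answer is simply to give the task JVMs more heap: my understanding is that the per-task -Xmx comes from mapred.child.java.opts, so something like the snippet below in the Hadoop site configuration should raise it. The -Xmx2000m value is only an illustration, not something I have tested.

  <property>
    <!-- heap size passed to each map/reduce child JVM; illustrative value -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>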
My command line:
nu...@titaniumpelican search $ ./bin/nutch updatedb
hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
hdfs://titaniumpelican:9000/user/nutch/crawl/segments
CrawlDb update: starting
CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
CrawlDb update: segments:
[hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
java.lang.OutOfMemoryError: Java heap space
Question: why did this job split the work into 140 map tasks?
Regards,
Louis