See https://issues.apache.org/jira/browse/NUTCH-702 for a patch that reduces the memory consumption. You are not setting the heap size the right way: it should be done in hadoop-site.xml using the parameter mapred.child.java.opts. Changing hadoop-env.sh modifies the amount of memory used by the services (JobTracker, DataNodes, etc.) but not by the Hadoop tasks.
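As a rough illustration (the -Xmx value below is only an example, not a recommendation), the per-task heap would be set in hadoop-site.xml along these lines:

  <property>
    <name>mapred.child.java.opts</name>
    <!-- example value only: size the heap so that the number of
         concurrent tasks per node times this value fits in RAM -->
    <value>-Xmx2000m</value>
  </property>

With 2 concurrent tasks per tasktracker, as in your config, something like -Xmx2000m per child JVM should still leave headroom for the DataNode and TaskTracker daemons on a 6GB node.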
HTH

Julien

2009/7/2 lei wang <[email protected]>

> Hi everyone, these days a Nutch problem occurred when I tested Nutch to
> index 2 million pages.
>
> When the program steps into the reduce stage of the crawldb update, the
> error messages are given below. Before this test, I tried to crawl and
> index 1 million pages, and Nutch went well.
> I altered HADOOP_HEAPSIZE in hadoop-env.sh to 1000m, 2000m, even to my
> memory size of 6GB, but it made no difference. I also changed the value
> of the property "mapred.child.java.opts" to 200m, 1000m, 2000m, 4000m,
> up to 6GB, time and again, but it seems to make no difference.
>
> I have 9 machines (CPU Core 2 2.8GHz, 6GB RAM): 1 master and 8 slaves
> (tasktrackers).
> I attach the hadoop-env.sh and hadoop-site.xml files after this message;
> I appreciate your help very much.
> ============================================================
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.HashMap.<init>(HashMap.java:209)
>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:43)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>         at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
>         at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:70)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
>         at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>         at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>         at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> =====================================================================
> hadoop-env.sh
>
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME. All others are
> # optional. When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use. Required.
> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> export JAVA_HOME=/usr/lib/jvm/java-6-sun
> export HADOOP_HOME=/home/had/nutch-1.0
> export HADOOP_CONF_DIR=/home/had/nutch-1.0/conf
> export HADOOP_LOG_DIR=/home/had/nutch-1.0/logs
> export NUTCH_HOME=/home/had/nutch-1.0
> export NUTCH_CONF_DIR=/home/had/nutch-1.0/conf
>
> # Extra Java CLASSPATH elements. Optional.
> # export HADOOP_CLASSPATH=
>
> # The maximum amount of heap to use, in MB. Default is 1000.
> export HADOOP_HEAPSIZE=4000
> export NUTCH_HEAPSIZE=4000
>
> ==========================================================================
>
> hadoop-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://distributed1:9000/</value>
>   <description>The name of the default file system. Either the literal
>   string "local" or a host:port for DFS.</description>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>distributed1:9001</value>
>   <description>The host and port that the MapReduce job tracker runs at.
>   If "local", then jobs are run in-process as a single map and reduce
>   task.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.tasks.maximum</name>
>   <value>2</value>
>   <description>
>   The maximum number of tasks that will be run simultaneously by a
>   task tracker. This should be adjusted according to the heap size per
>   task, the amount of RAM available, and CPU consumption of each task.
>   </description>
> </property>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>799</value>
>   <description>
>   This should be a prime number larger than a multiple of the number of
>   slave hosts, e.g. for 3 nodes set this to 17
>   </description>
> </property>
>
> <property>
>   <name>io.file.buffer.size</name>
>   <value>131072</value> (I tested setting this value to 4096, but it
>   made no difference)
>   <description>The size of buffer for use in sequence files.
>   The size of this buffer should probably be a multiple of hardware
>   page size (4096 on Intel x86), and it determines how much data is
>   buffered during read and write operations.</description>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>29</value>
>   <description>
>   This should be a prime number close to a low multiple of slave hosts,
>   e.g. for 3 nodes set this to 7
>   </description>
> </property>
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/home/had/nutch-1.0/tmp</value>
>   <description>A base for other temporary directories.</description>
> </property>
>
> <property>
>   <name>dfs.name.dir</name>
>   <value>/home/had/nutch-1.0/filesystem/name</value>
>   <description>Determines where on the local filesystem the DFS name node
>   should store the name table. If this is a comma-delimited list of
>   directories then the name table is replicated in all of the
>   directories, for redundancy.</description>
> </property>
>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/home/had/nutch-1.0/filesystem/data</value>
>   <description>Determines where on the local filesystem an DFS data node
>   should store its blocks. If this is a comma-delimited list of
>   directories, then data will be stored in all named directories,
>   typically on different devices. Directories that do not exist are
>   ignored.</description>
> </property>
>
> <property>
>   <name>dfs.replication</name>
>   <value>1</value>
>   <description>Default block replication.
>   The actual number of replications can be specified when the file is
>   created. The default is used if replication is not specified in create
>   time.</description>
> </property>
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1000m</value>
>   <description>
>   You can specify other Java options for each map or reduce task here,
>   but most likely you will want to adjust the heap size.
>   </description>
> </property>
>
> <property>
>   <name>mapred.system.dir</name>
>   <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
> </property>
>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
> </property>
>
> </configuration>
> ====================================================================

--
DigitalPebble Ltd
http://www.digitalpebble.com
