HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and does NOT affect the memory used by the map/reduce tasks. Maybe you should invest a bit of time reading about Hadoop first?
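If you want to give more heap to the child JVMs that actually run the map/reduce tasks (i.e. the fetcher and parser), the property to look at is mapred.child.java.opts rather than HADOOP_HEAPSIZE. Something along these lines - the file names depend on your Hadoop version (mapred-site.xml on 0.20+, hadoop-site.xml on older releases) and the -Xmx value is just an example:

  # conf/hadoop-env.sh - heap for the daemons only (NameNode, JobTracker, TaskTracker...)
  export HADOOP_HEAPSIZE=2000

  <!-- conf/mapred-site.xml - heap for the task JVMs, which is where the fetch/parse runs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>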
As for your memory problem, it could be due to the parsing and not the fetching. If you don't already do so, I suggest that you separate the fetching from the parsing. First, that will tell you which part fails; and if it does fail in the parsing, you would not need to refetch the content :-) See the PS below.

2009/12/5 MilleBii <[email protected]>

> My fetch cycle failed on the following initial error:
>
> java.io.IOException: Task process exit with nonzero status of 65.
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>
> Then it makes a second attempt, and after 3 hours I bump into this error
> (although I had doubled HADOOP_HEAPSIZE):
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Any idea what the initial error is or could be?
> For the second one, I'm going to reduce the number of threads... but I'm
> wondering if there could be a memory leak? And I don't know how to trace that.
>
> --
> -MilleBii-

--
DigitalPebble Ltd
http://www.digitalpebble.com
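PS: to split the two steps, you could do something along these lines. I'm quoting the property name and commands from memory, so double-check them against your Nutch version; the segment path is just a placeholder:

  <!-- conf/nutch-site.xml: stop the fetcher from parsing while it fetches -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  # fetch first, then parse the same segment as a separate map/reduce job
  bin/nutch fetch crawl/segments/20091205123456
  bin/nutch parse crawl/segments/20091205123456

That way an OOM in the parsing only costs you the parse step, and the fetched content stays on disk.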
