Thanks again Julien. Yes, I'm going to buy myself the Hadoop book; I thought I could do without, but I realize that I need to make good use of Hadoop.
Didn't know you could split fetching & parsing: so I suppose you just issue nutch fetch <segment> -noParsing, followed by nutch parse <segment>. I will try it on my next run (rough command sequence in the PS at the bottom).

2009/12/5 Julien Nioche <[email protected]>

> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
> does NOT affect the memory used for the map/reduce jobs. Maybe you should
> invest a bit of time reading about Hadoop first?
>
> As for your memory problem, it could be due to the parsing and not the
> fetching. If you don't already do so, I suggest that you separate the
> fetching from the parsing. First, that will tell you which part fails; and
> if it does fail in the parsing, then you would not need to refetch the
> content.
>
> J.
>
> 2009/12/5 MilleBii <[email protected]>
>
> > My fetch cycle failed with the following initial error:
> >
> > java.io.IOException: Task process exit with nonzero status of 65.
> >         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
> >
> > Then it makes a second attempt, and after 3 hours I hit this error
> > (although I had doubled HADOOP_HEAPSIZE):
> >
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > Any idea what the initial error is or could be?
> > For the second one, I'm going to reduce the number of threads... but I'm
> > wondering if there could be a memory leak? And I don't know how to trace
> > that.
> >
> > --
> > -MilleBii-
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com

--
-MilleBii-
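PS: roughly what I plan to try on the next run, assuming a Nutch 1.0-style bin/nutch script; the segment path below is just a placeholder for whatever my generate step produces, and the memory note reflects standard Hadoop configuration rather than anything specific to this crawl:

  SEGMENT=crawl/segments/20091205120000   # placeholder segment name

  # fetch only: -noParsing skips parsing during the fetch job,
  # so an OutOfMemoryError in a parser no longer forces a refetch
  bin/nutch fetch $SEGMENT -noParsing

  # parse the already-fetched content as a separate job
  bin/nutch parse $SEGMENT

  # note: HADOOP_HEAPSIZE (hadoop-env.sh) only sizes the Hadoop daemons;
  # the heap of the map/reduce task JVMs is set via mapred.child.java.opts
  # (e.g. -Xmx1000m) in the Hadoop job configuration

If I understand the docs correctly, setting fetcher.parse to false in nutch-site.xml should have the same effect as passing -noParsing on the command line.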
