Still failing on a 300k run of fetching (about 4 hours). I first get a long series of OutOfMemoryErrors, but it keeps fetching somehow, and then it ends with:

attempt_200912070739_0011_m_000000_0: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
But the job never ends, not even on error... so I have to shut it down (kill and restart Hadoop). I increased NUTCH_HEAPSIZE, no luck. Any idea what to do further? I'd like not to reduce the run size.

2009/12/6 MilleBii <[email protected]>

> New and longer run... I get plenty of: failed with:
> java.lang.OutOfMemoryError: Java heap space
> Fetching still goes on, not sure if this is the expected behavior.
>
> 2009/12/6 MilleBii <[email protected]>
>
>> Works fine, and my memory problem had to do with the fact that I had too
>> many threads...
>>
>> 2009/12/5 MilleBii <[email protected]>
>>
>>> Thx again Julien,
>>>
>>> Yes, I'm going to buy myself the Hadoop book, because I thought I could
>>> do without, but I realize that I need to make good use of Hadoop.
>>>
>>> Didn't know you could split fetching & parsing: so I suppose you just
>>> issue nutch fetch <segment> -noParsing, followed by nutch parse
>>> <segment>. I will try that on my next run.
>>>
>>> 2009/12/5 Julien Nioche <[email protected]>
>>>
>>>> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons
>>>> and does NOT affect the memory used for the map/reduce jobs. Maybe you
>>>> should invest a bit of time reading about Hadoop first?
>>>>
>>>> As for your memory problem, it could be due to the parsing and not the
>>>> fetching. If you don't already do so, I suggest that you separate the
>>>> fetching from the parsing. First, that will tell you which part fails;
>>>> and if it does fail in the parsing, you would not need to refetch the
>>>> content.
>>>>
>>>> J.
>>>>
>>>> 2009/12/5 MilleBii <[email protected]>
>>>>
>>>> > My fetch cycle failed with the following initial error:
>>>> >
>>>> > java.io.IOException: Task process exit with nonzero status of 65.
>>>> >     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>>>> >
>>>> > Then it makes a second attempt, and after 3 hours I hit this error
>>>> > (although I had doubled HADOOP_HEAPSIZE):
>>>> >
>>>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> >
>>>> > Any idea what the initial error is or could be?
>>>> > For the second one, I'm going to reduce the number of threads... but
>>>> > I'm wondering if there could be a memory leak? And I don't know how
>>>> > to trace that.
>>>>
>>>> --
>>>> DigitalPebble Ltd
>>>> http://www.digitalpebble.com

--
-MilleBii-
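For reference, a minimal sketch of the split fetch/parse run suggested above, assuming Nutch 1.0-style commands (the <segment> path is a placeholder, exactly as written in the thread) and a Hadoop 0.19/0.20 install; the mapred.child.java.opts and fetcher.threads.fetch properties are standard settings of that era but are not spelled out in the thread itself:

    # Fetch without parsing, then parse the same segment in a separate job,
    # so an OutOfMemoryError points at either the fetcher or the parser.
    bin/nutch fetch <segment> -noParsing
    bin/nutch parse <segment>

    # Per-task JVM heap is set via mapred.child.java.opts in
    # conf/hadoop-site.xml (mapred-site.xml on 0.20+), for example:
    #   <property>
    #     <name>mapred.child.java.opts</name>
    #     <value>-Xmx1000m</value>
    #   </property>
    # The number of fetcher threads can be lowered through
    # fetcher.threads.fetch in conf/nutch-site.xml.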
