Still failing on a 300k fetch run (about 4 hours)

I first get a long series of OutOfMemoryErrors, but it keeps fetching
somehow, and then it ends with:
attempt_200912070739_0011_m_000000_0: Exception in thread "Thread for
syncLogs" java.lang.OutOfMemoryError: Java heap space

But the job never ends, not even with an error... so I have to shut it down
(kill and restart Hadoop).
I increased NUTCH_HEAPSIZE, but no luck.

Any idea what to try next? I'd prefer not to reduce the run size.
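
(Side note, in case I have this wrong: as far as I understand,
NUTCH_HEAPSIZE / HADOOP_HEAPSIZE only size the client and daemon JVMs,
while the heap of the actual map/reduce task JVMs comes from
mapred.child.java.opts. Roughly something like this in mapred-site.xml
(or hadoop-site.xml on older Hadoop), the 1024m being only an example
value:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>

Is that the right knob to turn?)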

2009/12/6 MilleBii <[email protected]>

> New and longer run... I get plenty of: failed with:
> java.lang.OutOfMemoryError: Java heap space
> Fetching still goes on; not sure if this is the expected behavior.
>
>
> 2009/12/6 MilleBii <[email protected]>
>
>> It works fine, and my memory problem turned out to be due to having too
>> many threads...
>>
>> 2009/12/5 MilleBii <[email protected]>
>>
>>> Thx again Julien,
>>>
>>> Yes, I'm going to buy myself the Hadoop book; I thought I could do
>>> without it, but I realize I need to make good use of Hadoop.
>>>
>>> I didn't know you could split fetching & parsing: so I suppose you just
>>> issue nutch fetch <segment> -noParsing, followed by nutch parse <segment>
>>> (see the commands below). I will try it on my next run.
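>>> If I read the Fetcher usage right, that would be something like this,
>>> with -threads optional and 10 only an example value:
>>>
>>>   bin/nutch fetch <segment> -threads 10 -noParsing
>>>   bin/nutch parse <segment>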
>>>
>>>
>>>
>>> 2009/12/5 Julien Nioche <[email protected]>
>>>
>>>> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons
>>>> and does NOT affect the memory used for the map/reduce jobs. Maybe you
>>>> should invest a bit of time reading about Hadoop first?
>>>>
>>>> As for your memory problem, it could be due to the parsing and not the
>>>> fetching. If you don't already do so, I suggest that you separate the
>>>> fetching from the parsing. First, that will tell you which part fails;
>>>> plus, if it does fail in the parsing, you would not need to refetch the
>>>> content.
>>>>
>>>> J.
>>>>
>>>> 2009/12/5 MilleBii <[email protected]>
>>>>
>>>> > My fetch cycle failed with the following initial error:
>>>> >
>>>> > java.io.IOException: Task process exit with nonzero status of 65.
>>>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>>>> >
>>>> > Then it makes a second attempt, and after 3 hours I hit this error
>>>> > (although I had doubled HADOOP_HEAPSIZE):
>>>> >
>>>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> >
>>>> >
>>>> > Any idea what the initial error is or could be?
>>>> > For the second one, I'm going to reduce the number of threads... but
>>>> > I'm wondering if there could be a memory leak? And I don't know how to
>>>> > trace that.
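>>>> > (If it is a real leak, I guess one could append something like
>>>> >
>>>> >   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
>>>> >
>>>> > to the task JVM options and browse the resulting .hprof with jhat,
>>>> > but that is only a guess on my part.)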
>>>> >
>>>> > --
>>>> > -MilleBii-
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> DigitalPebble Ltd
>>>> http://www.digitalpebble.com
>>>>
>>>
>>>
>>>
>>> --
>>> -MilleBii-
>>>
>>
>>
>>
>> --
>> -MilleBii-
>>
>
>
>
> --
> -MilleBii-
>



-- 
-MilleBii-
