Hi,
I'm running Nutch 1.4 on a 5-node cluster.
I'm trying to crawl large xlsx files (~60 MB).
Every time I run Nutch I get an "Error: Java heap space" during parsing,
though eventually the parsing succeeds.
I imagine the task fails, Hadoop retries it, and then it succeeds.
I can't figure out why this behavior keeps repeating itself.
If the parser can't handle the file because of insufficient memory, how does
it ALWAYS succeed on the second try?
Also, the first couple of times I ran it with 4 GB of memory. Then I thought
that maybe the first try fails and the second succeeds because the job is
always "on the threshold" of the memory it needs.
So I gave it 8 GB, but the same behavior persists: the first try fails with a
heap space error, the second try succeeds.
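To be concrete about what "gave it 8 GB" means: I raised the per-task child
JVM heap. A rough sketch of the setting, assuming mapred.child.java.opts is
the knob the parse tasks actually pick up on this Hadoop version:

    <!-- mapred-site.xml: heap for each map/reduce child JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx8192m</value>
    </property>
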
Anyway, I'm not sure if this is related, but the job report printed after it
finished includes this line:
Total committed heap usage (bytes): 26456621056 (roughly 24.6 GiB).
So in fact it uses much less memory than it could.


Any idea?

