Hi,

I'm running Nutch 1.4 on a 5-node cluster and trying to crawl big xlsx files (~60 MB). Every time I run Nutch I get an "Error: Java heap space" while parsing, though eventually the parsing succeeds. I imagine the task fails, Hadoop retries it, and then it succeeds. I can't figure out why this behavior keeps repeating itself. If the parser can't handle the file because of insufficient memory, how does it ALWAYS succeed on the second try?

The first couple of times I ran it with 4 GB of memory. Then I thought that maybe the first attempt fails and the second succeeds because it is always "on the threshold" of the memory it needs, so I gave it 8 GB, but the same behavior persists: the first try fails with the heap space error, the second try succeeds.

Anyway, I'm not sure if this report line is related, but after the job finished it says: Total committed heap usage (bytes): 26456621056. So in fact it uses much less memory than it could.
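For reference, this is roughly how I raised the per-task heap (a sketch of the mapred-site.xml entry, assuming the standard Hadoop 1.x mapred.child.java.opts property; I'm pasting from memory, so the exact value on our cluster may differ):

  <property>
    <name>mapred.child.java.opts</name>
    <!-- per-task JVM heap; this is the 8 GB setting mentioned above -->
    <value>-Xmx8192m</value>
  </property>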
Any idea?