Hello all,

I ran a crawl with parsing disabled, and then later tried to parse the segment using the parse command. However, there seems to be a bug, as I keep getting StackOverflowErrors on certain pages. First I got one on an XML document that the text parser was trying to parse. I disabled the text parser plugin, and it then got past that page with no problem. Later, with the text parser plugin still disabled, I got the same error on an HTML document. This is what happened:

051130 102144 Parsing [http://www.ibcr.org/PAGE_EN/P1_01E.htm] with [EMAIL PROTECTED]
java.lang.StackOverflowError
java.io.IOException: Job failed!
       at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
       at org.apache.nutch.crawl.ParseSegment.parse(ParseSegment.java:91)
       at org.apache.nutch.crawl.ParseSegment.main(ParseSegment.java:110)

This is with the 0.8-dev version of Nutch, with the default plugins enabled.

Is this a known issue? Is there some way around this? Is the parser somehow getting into an infinite loop?
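For what it's worth, a StackOverflowError like this usually points at unbounded recursion rather than a true infinite loop. Here is a minimal, hypothetical sketch (not Nutch code) of how a parser that recurses once per nesting level can blow the default JVM thread stack on pathologically deep input:

```java
public class DeepRecursionDemo {
    // Hypothetical stand-in for a recursive parser: one stack frame per
    // nesting level. A real DOM/text parser that recurses per element
    // behaves the same way on very deep or malformed markup.
    static int descend(int depth) {
        if (depth == 0) return 0;
        return 1 + descend(depth - 1);
    }

    public static void main(String[] args) {
        try {
            // Far deeper than a default JVM thread stack can hold.
            descend(10_000_000);
            System.out.println("no overflow");
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at deep nesting");
        }
    }
}
```

If that is what is happening here, raising the thread stack size (the JVM's -Xss option) might get past the page, but the underlying fix would be bounding the recursion depth in the offending parser.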

Thanks for any help you can give.

-Matt Zytaruk