Hello all,
I ran a crawl with parsing disabled, and then later tried to parse the
segment using the parse command. However, there seems to be a bug, as I
keep getting StackOverflowErrors on certain pages. First, I got one on
an XML document that the text parser was trying to parse. I disabled the
text parse plugin, and the job then got past that page with no problem.
Later on, with the text parse plugin still disabled, I got the same
error on an HTML document. This is what happened:
051130 102144 Parsing [http://www.ibcr.org/PAGE_EN/P1_01E.htm] with
[EMAIL PROTECTED]
java.lang.StackOverflowError
java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.crawl.ParseSegment.parse(ParseSegment.java:91)
at org.apache.nutch.crawl.ParseSegment.main(ParseSegment.java:110)
This is with the 0.8-dev version of Nutch, with the default plugins enabled.
Is this a known issue? Is there some way around it? Is the parser
somehow getting into an infinite loop?
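In case it helps with reproducing, the workflow was roughly the
following (the segment path is illustrative, and I'm assuming the usual
fetcher.parse property is how parsing gets disabled during the fetch):

```shell
# Disable parsing at fetch time in conf/nutch-site.xml:
#   <property>
#     <name>fetcher.parse</name>
#     <value>false</value>
#   </property>

# Then parse the fetched segment separately (path is illustrative):
bin/nutch parse crawl/segments/20051130102144
```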
Thanks for any help you can give.
-Matt Zytaruk