You could try looking at these two discussions:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

----- Original Message ----
From: Tsengtan A Shuy <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

I successfully ran the whole-web crawl on my new Ubuntu OS, and I am
ready to fix the bug.  I need someone to guide me to the most up-to-date
source code and the bug assignment.

Thank you in advance!! 

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-----Original Message-----
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

Hi All,

We are getting an OOM exception while processing
http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the
Nutch-497 patch to our source code, but the error actually occurs in
the parse method.
Does anybody have any idea about this?  Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.toUpperCase(String.java:2637)
    at java.lang.String.toUpperCase(String.java:2660)
    at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


Regards,
Shailendra








       