You could try looking at these two discussions:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

--Kai

----- Original Message ----
From: Tsengtan A Shuy <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:45:59 AM
Subject: RE: OOM error during parsing with nekohtml

I successfully ran the whole-web crawl on my new Ubuntu OS, and I am ready to fix the bug. I need someone to guide me to the most up-to-date source code and the bug assignment.

Thank you in advance!!

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]
Sent: Monday, July 16, 2007 3:05 AM
To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Subject: OOM error during parsing with nekohtml

Hi All,

We are getting an OOM exception while processing http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied the Nutch-497 patch to our source code, but the error actually occurs in the parse method. Does anybody have any idea about this?

Here is the complete stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.toUpperCase(String.java:2637)
    at java.lang.String.toUpperCase(String.java:2660)
    at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

Regards,
Shailendra