I think this may be a bug.

-----Original Message-----
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 8:28 PM
To: [email protected]
Subject: OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

I have Nutch running on a Compaq DL 380 with 1 GB of RAM. It is not my best machine, but I am only doing a limited crawl of about 52 URLs. When I run the crawl with depth = 3 or even 6, it completes; when I run it at depth 10, it runs out of memory.

Two questions:

1. How do I restart the crawl? I have seen the tutorial, which says:

"Recover the pages already fetched and then restart the fetcher. You'll need to create a file fetcher.done in the segment directory and then: updatedb, generate and fetch. Assuming your index is at /index

% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages, and don't want to have to re-fetch them again, this is the best way."

But I have more than one segment. Do I need to do this only for the most recent one, or for all of them?

2. How do I index what I have already crawled? I have seen the indexing section in the tutorial, but when I run bin/nutch invertlinks under Cygwin, it gives me:

Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks

The fetcher exited with:

060302 165825 SEVERE error writing output:java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software
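
For reference, the recovery steps quoted from the tutorial could be sketched as a small dry-run helper. This is only an illustration under assumptions: the function names `list_unfinished_segments` and `print_recovery_commands` are hypothetical, the commands are reproduced as quoted rather than verified against a particular Nutch release, and the script prints them instead of running them. It does not settle whether one segment or all of them need the treatment; it only shows which segments have no fetcher.done marker.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch. Nothing here invokes bin/nutch.

# Print the name of every segment under <index_dir>/segments that has
# no fetcher.done marker file yet.
list_unfinished_segments() {
    for seg in "$1"/segments/*/; do
        [ -d "$seg" ] || continue
        [ -f "${seg}fetcher.done" ] || basename "$seg"
    done
}

# Print (do not execute) the four recovery commands quoted from the
# tutorial, filled in for one index directory and one segment name.
print_recovery_commands() {
    index_dir=$1
    segment=$2
    seg_path="$index_dir/segments/$segment"
    echo "touch $seg_path/fetcher.done"
    echo "bin/nutch updatedb $index_dir/db/ $seg_path/"
    echo "bin/nutch generate $index_dir/db/ $seg_path/"
    echo "bin/nutch fetch $seg_path"
}
```

For example, `print_recovery_commands /index 2005somesegment` prints the same four commands as the tutorial excerpt, so they can be reviewed before anything is run by hand.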
