I have Nutch running on a Compaq DL380 with 1 GB of RAM. It is not my best machine, but I am only doing a limited crawl of about 52 URLs. With depth = 3 or even 6 the crawl completes; at depth = 10 it has been running out of memory. Two questions:

1. How do I restart the crawl? I have seen the tutorial, which says: "
Recover the pages already fetched and then restart the fetcher. You'll need to create a file fetcher.done in the segment directory and then run updatedb, generate, and fetch. Assuming your index is at /index:

  % touch /index/segments/2005somesegment/fetcher.done
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
  % bin/nutch generate /index/db/ /index/segments/2005somesegment/
  % bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages and don't want to have to re-fetch them, this is the best way."

But I have more than one segment. Do I only need to do this for the most recent segment, or for all of them?

2. How do I index what I have already crawled? I have seen the indexing section in the tutorial, but when I run bin/nutch invertlinks under Cygwin it gives me:

  Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks

The fetcher exited with:

  060302 165825 SEVERE error writing output:java.lang.OutOfMemoryError: Java heap space
  java.lang.OutOfMemoryError: Java heap space
  Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
          at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
          at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
          at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software
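P.S. In case it helps whoever answers: here is how I read the tutorial's recovery steps, extended to every segment under my /index directory. This is only my sketch of what I think the tutorial means; /index and the segment name are from my setup, I am not sure whether every segment actually needs the fetcher.done treatment, and to be safe the script only prints the bin/nutch commands instead of running them:

```shell
#!/bin/sh
# My reading of the tutorial's recovery steps, applied to every segment.
# Assumptions (please correct me): the crawl lives under $1, segments sit
# in $1/segments, and marking ALL of them with fetcher.done is right.
recover_segments() {
    index=$1
    for seg in "$index"/segments/*/; do
        [ -d "$seg" ] || continue           # skip when no segments match
        touch "${seg}fetcher.done"          # mark this segment as fully fetched
        # the command I believe should run next for this segment:
        echo "bin/nutch updatedb $index/db/ $seg"
    done
    # then regenerate a fetch list and fetch whatever was missed:
    echo "bin/nutch generate $index/db/ $index/segments/"
    echo "bin/nutch fetch <the segment that generate just created>"
}

recover_segments /index
```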

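Also, on the heap error: am I right that the stock bin/nutch launcher script reads a NUTCH_HEAPSIZE environment variable (in megabytes) when it builds the java command line? If so, I was planning to retry the depth-10 crawl along these lines (urls.txt and /index are just my paths, and 768 is a guess at a safe ceiling on a 1 GB box):

```shell
# Assumption: bin/nutch honors NUTCH_HEAPSIZE (MB) to set the JVM's -Xmx.
# With 1 GB of physical RAM, ~768 MB seemed like a safe heap ceiling.
export NUTCH_HEAPSIZE=768
# then re-run the crawl, e.g.:
#   bin/nutch crawl urls.txt -dir /index -depth 10
```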