I have Nutch running on a Compaq DL380 with 1 GB of RAM. It is not my best machine, but I am only doing a limited crawl of about 52 URLs. With depth = 3 or even 6 the crawl completes; at depth = 10 it has been running out of memory. Two questions:

1. How do I restart the crawl? I have seen the tutorial, which says: "
Recover the pages already fetched and then restart the fetcher. You'll need to create a file fetcher.done in the segment directory and then run updatedb, generate, and fetch. Assuming your index is at /index:

  % touch /index/segments/2005somesegment/fetcher.done
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
  % bin/nutch generate /index/db/ /index/segments/2005somesegment/
  % bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages and don't want to have to re-fetch them, this is the best way."

But I have more than one segment. Do I only need to do this for the most recent segment, or for all of them?

2. How do I index what I have already crawled? I have seen the indexing section in the tutorial, but when I run bin/nutch invertlinks under Cygwin it gives me:

  Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks

The fetcher exited with:

  060302 165825 SEVERE error writing output:java.lang.OutOfMemoryError: Java heap space
  java.lang.OutOfMemoryError: Java heap space
  Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
          at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
          at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
          at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software
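P.S. In case it helps whoever answers: here is how I read the tutorial's recovery steps, extended to every segment under my /index directory. This is only my sketch of what I think the tutorial means; /index and the segment name are from my setup, I am not sure whether every segment actually needs the fetcher.done treatment, and to be safe the script only prints the bin/nutch commands instead of running them:

```shell
#!/bin/sh
# My reading of the tutorial's recovery steps, applied to every segment.
# Assumptions (please correct me): the crawl lives under $1, segments sit
# in $1/segments, and marking ALL of them with fetcher.done is right.
recover_segments() {
    index=$1
    for seg in "$index"/segments/*/; do
        [ -d "$seg" ] || continue           # skip when no segments match
        touch "${seg}fetcher.done"          # mark this segment as fully fetched
        # the command I believe should run next for this segment:
        echo "bin/nutch updatedb $index/db/ $seg"
    done
    # then regenerate a fetch list and fetch whatever was missed:
    echo "bin/nutch generate $index/db/ $index/segments/"
    echo "bin/nutch fetch <the segment that generate just created>"
}

recover_segments /index
```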

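Also, on the heap error: am I right that the stock bin/nutch launcher script reads a NUTCH_HEAPSIZE environment variable (in megabytes) when it builds the java command line? If so, I was planning to retry the depth-10 crawl along these lines (urls.txt and /index are just my paths, and 768 is a guess at a safe ceiling on a 1 GB box):

```shell
# Assumption: bin/nutch honors NUTCH_HEAPSIZE (MB) to set the JVM's -Xmx.
# With 1 GB of physical RAM, ~768 MB seemed like a safe heap ceiling.
export NUTCH_HEAPSIZE=768
# then re-run the crawl, e.g.:
#   bin/nutch crawl urls.txt -dir /index -depth 10
```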