Kai_testing Middleton wrote:
> I am running a nutch crawl of 19 sites. I wish to let this crawl go for
> about two days and then gracefully stop it (I don't expect it to complete
> by then). Is there a way to do this? I want it to stop crawling and then
> build the Lucene index. Note that I used the simple nutch crawl command
> rather than the "whole web" crawling methodology:
>
>     nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

I use an iterative approach with a script similar to the one Sami blogs about here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
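For concreteness, here is a minimal dry-run sketch of that iterative loop. The generate/fetch/updatedb subcommands are the standard Nutch step commands, but the launcher path, crawl directory, batch size, and segment names below are assumptions; the script only prints the commands it would run, so you can inspect the plan before adapting it to your install:

```shell
#!/bin/sh
# Dry-run sketch of an iterative crawl loop (assumed paths -- adapt to
# your setup). Each pass fetches one batch, so you can stop cleanly
# between passes when your time window runs out, then index whatever
# has been fetched so far.

NUTCH="bin/nutch"                   # assumed path to the Nutch launcher
CRAWLDB="/usr/tmp/19sites/crawldb"
SEGMENTS="/usr/tmp/19sites/segments"
TOPN=10000                          # URLs per pass (the batch size)
PASSES=10                           # roughly equivalent to -depth 10

plan_crawl() {
    i=1
    while [ "$i" -le "$PASSES" ]; do
        # Select the next batch of top-scoring URLs to fetch.
        echo "$NUTCH generate $CRAWLDB $SEGMENTS -topN $TOPN"
        # In a real run you would pick up the segment generate just
        # created, e.g. SEGMENT=$(ls -d "$SEGMENTS"/* | tail -1);
        # the name below is a placeholder for the dry run.
        echo "$NUTCH fetch $SEGMENTS/segment-$i"
        # Fold the fetch results back into the crawldb so the next
        # generate pass sees the newly discovered links.
        echo "$NUTCH updatedb $CRAWLDB $SEGMENTS/segment-$i"
        i=$((i + 1))
    done
    # After the last pass (or when the window closes), build the index
    # from the fetched segments -- see the indexing commands for your
    # Nutch/Solr version.
}

plan_crawl
```

Because each pass is a separate generate/fetch/updatedb cycle, stopping after any pass leaves the crawldb consistent, which is what makes the "graceful stop" possible.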
I then issue a crawl of 10,000 URLs at a time and repeat the process for as long as the time window allows. Because I use Solr to store the crawl results, the index stays available during the crawl window. I'm a relative newbie as well, though, so I look forward to what the experts say.

regards
Ian
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
