hello all,

I am quite new to Nutch; currently I am trying to crawl a few thousand URLs with static and dynamic content. I am using Nutch 0.7.1.

I now have some sites in my webdb which I don't want to crawl/fetch any more. How can I make sure they aren't included in newly generated fetchlists? I first thought I could exclude these sites by adding them to the regex-urlfilter file. Unfortunately, I realized that the urlfilter regular expressions are only applied to newly crawled URLs >>before<< they are inserted into the webdb (when running updatedb).
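For reference, this is the kind of exclusion rule I tried in conf/regex-urlfilter.txt (example.com is just a placeholder for one of the sites I want to skip):

```
# skip everything from this site
-^http://www\.example\.com/
# accept everything else
+.
```

As far as I can tell, rules like these only filter URLs on their way into the webdb, not URLs that are already stored there.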

So is there a simple way to keep Nutch from crawling sites which are already included in the webdb?

thanks for your help
  Thimo



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general