Hello all,
I am quite new to Nutch. I am currently trying to crawl a few
thousand URLs with static and dynamic content, using Nutch 0.7.1.
I now have some sites in my webdb which I no longer want to
crawl/fetch. How can I make sure they are not included in newly
generated fetchlists? I first thought I could exclude these sites by
adding them to the regex-urlfilter file. Unfortunately, I discovered
that the urlfilter regular expressions are only applied to newly
crawled URLs >>before<< they are inserted into the webdb (when
running updatedb).
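For reference, the kind of exclusion rules I tried adding to
conf/regex-urlfilter.txt look like the following (the host names are
just placeholders, not my actual sites):

```
# exclude these hosts from fetching
-^http://www\.example\.com/
-^http://www\.example\.org/

# accept anything else
+.
```

As far as I understand, each line is a '+' (include) or '-' (exclude)
prefix followed by a Java regular expression, and the first matching
rule wins; but these rules never seem to be consulted for URLs that
are already stored in the webdb.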
So is there a simple way to keep Nutch from crawling sites that
are already included in the webdb?
Thanks for your help,
Thimo
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general