Hello all,
I am quite new to Nutch. I am currently trying to crawl a few
thousand URLs with static and dynamic content, using Nutch 0.7.1.
I now have some sites in my webdb which I no longer want to
crawl/fetch. How can I make sure they are not included in newly
generated fetchlists? I first thought I could exclude these sites by
adding them to the regex-urlfilter file. Unfortunately, I discovered
that the urlfilter regular expressions are only applied to newly
crawled URLs >>before<< they are inserted into the webdb (when
running updatedb).
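For reference, the kind of exclusion rules I tried adding to
conf/regex-urlfilter.txt look like the following (the host names are
just placeholders, not my actual sites):

```
# exclude these hosts from fetching
-^http://www\.example\.com/
-^http://www\.example\.org/

# accept anything else
+.
```

As far as I understand, each line is a '+' (include) or '-' (exclude)
prefix followed by a Java regular expression, and the first matching
rule wins; but these rules never seem to be consulted for URLs that
are already stored in the webdb.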
So is there a simple way to keep Nutch from crawling sites that
are already included in the webdb?
Thanks for your help,
Thimo
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general