Hi,

in the simplest case, you don't change the configuration (including URL filters) while a crawl is running. It would be a waste of CPU time to filter and normalize at every step; it's enough to do this when new URLs are found (during inject and when outlinks are extracted during parse).
If rules may change at any time you have to filter (and normalize) in other steps as well.

Sebastian

On 08/02/2013 12:20 AM, Os Tyler wrote:
> Answering my own question here, please correct me if I'm wrong.
>
> In order for the entries in regex-urlfilter.txt to be relevant to your crawl
> and indexing, you need to manually edit 'bin/crawl' and remove -noFilter from
> the 'nutch generate' command.
>
> Additionally, you need to edit the portion that calls 'nutch solrindex' and
> add '-filter' to the solrindex call.
>
> ________________________________
> From: Os Tyler
> Sent: Tuesday, July 30, 2013 3:26 PM
> To: [email protected]
> Subject: regex-urlfilter test shows negative, but URL still crawled
>
> I have an entry in regex-urlfilter.txt designed to prevent crawling of urls
> that are part of our UPS search app.
>
> # skip URLs from the UPS search app
> -\?ups=
> -index.php/ups\?aa
>
> When I test the urls, it appears that regex-urlfilter should exclude them,
> for example:
> echo "http://redacted.com/index.php/ups?aa" |
> /usr/local/apache-nutch/bin/nutch org/apache/nutch/net/URLFilterChecker
> -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>
> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> -http://redacted.com/index.php/ups?aa
>
> But when I run 'crawl', it does not skip these urls.
>
> Thanks for any help in showing me what I'm missing here.
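
For anyone who wants to reason about the rules without a Nutch install, here is a rough Python sketch of how I understand the regex URL filter to evaluate regex-urlfilter.txt: rules are tried top to bottom, a pattern matches if it is found anywhere in the URL, the first matching rule decides, and a URL that matches no rule is dropped. This is not Nutch code. The names `parse_rules` and `url_filter` are made up for illustration, the final `+.` catch-all is an assumption (it is not shown in the quoted excerpt), and Python's `re` stands in for Java regex, which behaves the same for these two patterns.

```python
import re

# The two "-" rules are copied from the quoted message. The trailing "+."
# catch-all is an assumption; the thread does not show the rest of the file.
RULES_TEXT = r"""
# skip URLs from the UPS search app
-\?ups=
-index.php/ups\?aa

# accept anything else (assumed, not shown in the thread)
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_filter(url, rules):
    """Return the URL if it is accepted, or None if it is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):  # substring-style match, first hit decides
            return url if accept else None
    return None  # no rule matched: treated as rejected in this sketch


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for candidate in (
        "http://redacted.com/index.php/ups?aa",
        "http://redacted.com/about.html",
    ):
        print(candidate, "->", url_filter(candidate, RULES))
```

Under these assumptions the first URL is rejected, which agrees with the URLFilterChecker output in the quoted message. So the rules themselves look fine, and the open question is only whether a given crawl step runs the filter at all, which is the point above about inject and parse versus the other steps.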

