I have an entry in regex-urlfilter.txt designed to prevent crawling of URLs 
that are part of our UPS search app.

# skip URLs from the UPS search app
-\?ups=
-index.php/ups\?aa

When I test the URLs with URLFilterChecker, regex-urlfilter appears to 
exclude them, for example:
echo "http://redacted.com/index.php/ups?aa" | /usr/local/apache-nutch/bin/nutch \
org/apache/nutch/net/URLFilterChecker -filterName \
org.apache.nutch.urlfilter.regex.RegexURLFilter

Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://redacted.com/index.php/ups?aa
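
To double-check my reading of the rules outside of Nutch, here is a minimal 
Python sketch of how I understand the regex filter to apply them. It assumes 
that rules are tried top to bottom, that a pattern may match anywhere in the 
URL, that the first matching rule decides, and that a URL matching no rule is 
rejected. The filter_url function and the RULES list are names I made up for 
illustration, not part of Nutch, and Python's re module is only an 
approximation of Java's regex engine.

```python
import re

# The two rules from my regex-urlfilter.txt entry, as (sign, pattern) pairs.
# "-" means a match excludes the URL, "+" means a match includes it.
RULES = [
    ("-", r"\?ups="),
    ("-", r"index.php/ups\?aa"),
]


def filter_url(url, rules=RULES):
    """Return the URL if it is accepted, or None if it is rejected.

    Assumption: the first rule whose pattern is found anywhere in the URL
    decides the outcome, and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None


if __name__ == "__main__":
    for url in (
        "http://redacted.com/index.php/ups?aa",
        "http://redacted.com/search?ups=1",
    ):
        verdict = "+" if filter_url(url) else "-"
        print(verdict + url)
```

Under those assumptions both sample URLs are rejected, which agrees with the 
URLFilterChecker output above. The sketch also shows that order matters: if a 
catch-all accept rule such as "+." were listed before my two rules, it would 
match first and the URL would be accepted.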

But when I run 'crawl', these URLs are not skipped; they still get crawled.

Thanks for any help in showing me what I'm missing here.
