regex-urlfilter test shows negative, but URL still crawled

Os Tyler Tue, 30 Jul 2013 15:28:27 -0700

I have entries in regex-urlfilter.txt designed to prevent crawling of URLs 
that are part of our UPS search app.


# skip URLs from the UPS search app
-\?ups=
-index.php/ups\?aa
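
For reference, here is a minimal sketch of how I understand these rules to be evaluated: patterns are tried top to bottom, the first one found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches nothing is rejected. The two "-" rules are the ones from my file; the trailing "+." catch-all and the function name accepts are assumptions added for illustration, and this is not the actual Nutch code.

```python
import re

# Rules in the order they would appear in regex-urlfilter.txt.
# The two "-" rules come from my file; the trailing "+." catch-all is an
# assumption (a typical last line), not something shown in my excerpt.
RULES = [
    ("-", r"\?ups="),
    ("-", r"index.php/ups\?aa"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        # Unanchored search: the pattern may match anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: the URL is rejected.
    return False


print(accepts("http://redacted.com/index.php/ups?aa"))  # False
print(accepts("http://redacted.com/index.php/about"))   # True
```

Under these assumptions the order of the rules matters: a "+" rule placed above the two "-" rules that also matches the URL would accept it before the exclusions are ever tried.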

When I test the urls, it appears that regex-urlfilter should exclude them, for 
example:
echo "http://redacted.com/index.php/ups?aa" | /usr/local/apache-nutch/bin/nutch \
  org/apache/nutch/net/URLFilterChecker -filterName \
  org.apache.nutch.urlfilter.regex.RegexURLFilter

Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://redacted.com/index.php/ups?aa

But when I run 'crawl', these URLs are not skipped; they are still crawled.

Thanks for any help in showing me what I'm missing here.
