Answering my own question here; please correct me if I'm wrong.

For the entries in regex-urlfilter.txt to take effect during crawling and 
indexing, you need to manually edit 'bin/crawl' and remove -noFilter from 
the 'nutch generate' command.

Additionally, you need to edit the line that calls 'nutch solrindex' and add 
'-filter' to that call, so the same URL filters are applied at indexing time.
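
For anyone who would rather script the two edits than make them by hand, here 
is a rough sketch using sed. The function name patch_crawl_script is made up 
for illustration. It assumes your bin/crawl has a 'nutch generate' line that 
contains ' -noFilter' and a 'nutch solrindex' command that sits on a single 
line, so '-filter' can be appended at the end of it. The script differs 
between Nutch versions, so check yours first and run this only once.

```shell
#!/bin/sh
# Hypothetical helper: patch a crawl script so URL filters are applied.
# Assumptions (verify against your own bin/crawl):
#   - the 'nutch generate' line carries a ' -noFilter' argument
#   - the 'nutch solrindex' command is written on one line
patch_crawl_script() {
  script="$1"

  # Keep a backup of the untouched script next to it.
  cp "$script" "$script.bak"

  # 1. Remove -noFilter from the generate command.
  # 2. Append -filter to the end of the solrindex command.
  sed -e '/nutch generate/s/ -noFilter//' \
      -e '/nutch solrindex/s/$/ -filter/' \
      "$script.bak" > "$script"
}
```

Usage would be something like 'patch_crawl_script /usr/local/apache-nutch/bin/crawl' 
(adjust the path to your install), followed by a diff against the .bak file to 
confirm that only those two lines changed.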

________________________________
From: Os Tyler
Sent: Tuesday, July 30, 2013 3:26 PM
To: [email protected]
Subject: regex-urlfilter test shows negative, but URL still crawled

I have an entry in regex-urlfilter.txt designed to prevent crawling of urls 
that are part of our UPS search app.

# skip URLs from the UPS search app
-\?ups=
-index.php/ups\?aa

When I test the urls, it appears that regex-urlfilter should exclude them, for 
example:
echo "http://redacted.com/index.php/ups?aa" | /usr/local/apache-nutch/bin/nutch \
  org/apache/nutch/net/URLFilterChecker -filterName \
  org.apache.nutch.urlfilter.regex.RegexURLFilter

Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://redacted.com/index.php/ups?aa

But when I run 'crawl', it does not skip these urls.

Thanks for any help in showing me what I'm missing here.
