Re: regex-urlfilter test shows negative, but URL still crawled

Sebastian Nagel Thu, 01 Aug 2013 15:39:12 -0700

Hi,

In the simplest case, you don't change the configuration (including the URL
filters) while a crawl is running. It would be a waste of CPU time to filter
and normalize at every step; it's enough to do this when new URLs are found
(during inject and when outlinks are extracted during parse).


If the rules may change at any time, you have to filter (and normalize) in
the other steps as well.
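
To illustrate why a URL that tests negative can still be crawled, here is a
rough toy model in Python. It is not Nutch code. It assumes the regex filter
applies the first rule whose pattern is found anywhere in the URL and rejects
URLs that match no rule. The function names, the catch-all "+." rule, and the
/about URL are made up for the example.

```python
import re

def accepts(url, rules):
    """First rule whose regex is found in the URL decides; no match means reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

def generate(crawldb, rules, no_filter):
    """Toy stand-in for the generate step: select URLs from the crawldb."""
    if no_filter:
        # Filters are skipped, so everything already in the crawldb is selected.
        return sorted(crawldb)
    return sorted(u for u in crawldb if accepts(u, rules))

# Skip rules from the quoted message, plus an assumed catch-all "+." at the end.
RULES = [("-", r"\?ups="), ("-", r"index.php/ups\?aa"), ("+", r".")]

# Assumption: the UPS URL entered the crawldb before the skip rules were added.
crawldb = {
    "http://redacted.com/index.php/ups?aa",
    "http://redacted.com/about",  # hypothetical URL for contrast
}

print(generate(crawldb, RULES, no_filter=True))
print(generate(crawldb, RULES, no_filter=False))
```

With no_filter=True the UPS URL is still selected although the filter itself
rejects it. With no_filter=False only the /about URL remains.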

Sebastian

On 08/02/2013 12:20 AM, Os Tyler wrote:
> Answering my own question here, please correct me if I'm wrong.
> 
> In order for the entries in regex-urlfilter.txt to be relevant to your crawl 
> and indexing, you need to manually edit 'bin/crawl' and remove -noFilter from 
> the 'nutch generate' command.
> 
> Additionally, you need to edit the portion that calls 'nutch solrindex' and 
> add '-filter' to the solrindex call.
> 
> ________________________________
> From: Os Tyler
> Sent: Tuesday, July 30, 2013 3:26 PM
> To: [email protected]
> Subject: regex-urlfilter test shows negative, but URL still crawled
> 
> I have an entry in regex-urlfilter.txt designed to prevent crawling of urls 
> that are part of our UPS search app.
> 
> # skip URLs from the UPS search app
> -\?ups=
> -index.php/ups\?aa
> 
> When I test the urls, it appears that regex-urlfilter should exclude them, 
> for example:
> echo "http://redacted.com/index.php/ups?aa" | 
> /usr/local/apache-nutch/bin/nutch org/apache/nutch/net/URLFilterChecker 
> -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> 
> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> -http://redacted.com/index.php/ups?aa
> 
> But when I run 'crawl', it does not skip these urls.
> 
> Thanks for any help in showing me what I'm missing here.
> 
>
