RE: Why aren't my path exclusions getting excluded in the Nutch index to Solr?

Markus Jelsma Mon, 22 Jul 2013 10:44:39 -0700


 
 
-----Original message-----
> From:dogrdon <dgor...@planning.org>
> Sent: Monday 22nd July 2013 17:02
> To: user@nutch.apache.org
> Subject: RE: Why aren't my path exclusions getting excluded in the Nutch 
> index to Solr?
> 
> Thanks, I was wondering if some kind of reset like that needed to happen. 
> 
> I am fairly certain that putting the exclusion regexes before the inclusion
> regex has been the answer to my problem. But I have been testing this by
> just deleting the entire crawl directory to wipe out old crawls (still in
> testing here so it's fine).


Oh yes, order matters. The first match (include or exclude) is taken.
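
As a rough illustration of the first-match rule (a Python sketch, not Nutch's actual code; the rules, the example.org URLs, and the `accept` helper are all made up for this example, and rejecting URLs that match no rule is an assumption of the sketch):

```python
import re

# Hypothetical rules in file order: '-' excludes, '+' includes.
RULES = [
    ("-", re.compile(r"/private/")),                # exclusion listed first
    ("+", re.compile(r"^https?://example\.org/")),  # inclusion listed second
]

def accept(url, rules=RULES):
    """Return True if the first rule that matches the URL is an include."""
    for sign, pattern in rules:
        if pattern.search(url):   # partial match anywhere in the URL
            return sign == "+"    # first match decides, later rules are ignored
    return False                  # assumption: no matching rule means rejected

print(accept("http://example.org/private/a"))                         # exclusion wins
print(accept("http://example.org/private/a", list(reversed(RULES))))  # inclusion wins
```

With the exclusion first the private URL is rejected; with the same two rules reversed, the inclusion matches first and the exclusion is never consulted.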

> 
> Though when this goes to production, how do I "refilter the database" with
> Nutch? And does this assume that I am indexing to Solr?

This doesn't assume any search engine back end; it only operates on Nutch's own 
data structures. The index has to be updated separately. Lucene has supported 
regex queries for a while now, so you can use those to update the index. Lucene 
requires the regex to match the whole ID / URL, so some regexes have to be rewritten.
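
To show what "match the whole ID / URL" means for rewriting, here is a small Python sketch. It only illustrates the anchoring difference (partial match versus whole-string match); Lucene's regex syntax is not identical to Python's, and the `to_whole_match` helper and the example URL are hypothetical.

```python
import re

def to_whole_match(pattern):
    """Rewrite a partial-match pattern so it can match an entire URL.

    Sketch only: handles a leading '^' and a trailing unescaped '$',
    nothing more elaborate.
    """
    prefix = "" if pattern.startswith("^") else ".*"
    core = pattern[1:] if pattern.startswith("^") else pattern
    end_anchored = core.endswith("$") and not core.endswith("\\$")
    if end_anchored:
        core = core[:-1]
    suffix = "" if end_anchored else ".*"
    return prefix + core + suffix

url = "http://example.org/private/report.html"

# A partial match finds the path segment anywhere in the URL.
print(re.search(r"/private/", url) is not None)

# A whole-string match with the same pattern fails ...
print(re.fullmatch(r"/private/", url) is not None)

# ... until the pattern is widened to cover the entire URL.
print(to_whole_match(r"/private/"))
print(re.fullmatch(to_whole_match(r"/private/"), url) is not None)
```

So an exclusion like `/private/` would need to become something like `.*/private/.*` before it can be used as a whole-ID regex query.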
> 
> thanks again
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Why-aren-t-my-path-exclusions-getting-excluded-in-the-Nutch-index-to-Solr-tp4079172p4079483.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
