You can use an HtmlParseFilter to set a metadata attribute recording 
whether or not the page contains the phrase.  The problem with this is 
that all of the content is still stored.  You could also change 
ParseOutputFormat to write a page out only if the word is present, 
although that is a bit of a hack.
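
As a rough sketch of the first option, here is just the tagging logic in 
plain Java.  It deliberately does not use any Nutch classes, because the 
exact HtmlParseFilter method signature depends on the Nutch version.  The 
names PhraseTagger, countMatches, tag, and the "contains.phrase" key are 
all made up for illustration, and simple substring matching stands in for 
the Lucene query scoring Brian describes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseTagger {
    // Hypothetical metadata key; this is not a Nutch constant.
    public static final String KEY = "contains.phrase";

    // Counts how many of the given phrases occur in the text (case-insensitive).
    public static int countMatches(String text, List<String> phrases) {
        String haystack = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : phrases) {
            if (haystack.contains(phrase.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // Builds metadata recording whether at least minHits phrases matched.
    public static Map<String, String> tag(String text, List<String> phrases, int minHits) {
        Map<String, String> metadata = new HashMap<>();
        boolean matched = countMatches(text, phrases) >= minHits;
        metadata.put(KEY, Boolean.toString(matched));
        return metadata;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("nutch", "focused crawl");
        System.out.println(tag("Apache Nutch is a web crawler", phrases, 1));
        System.out.println(tag("An unrelated page", phrases, 1));
    }
}
```

Inside a real filter, the text would come from the parsed page and the 
flag would go into the parse metadata, so a later step could check it 
before indexing.  As noted, this alone does not stop the content from 
being stored.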

This may be an area where we need to add an extension point, if one 
doesn't already exist.  I am sure there are many more people out there 
who would like to store pages selectively based on their content.

Dennis Kubes

Brian Whitman wrote:
> In doing whole-internet focused crawls we'd like a parse/injector filter.
> 
> Say we only want pages in our nutch db and index that have the word 
> "nutch" in them. I'd like to express the rule as a lucene boolean query, 
> contents:nutch, because in our real world scenario the match is more 
> fuzzy and involves many phrases and terms. It's not just a regular 
> expression.
> 
> If the query does not match or matches under a threshold score, I don't 
> want to add the fetched/parsed document to the index, nor (more 
> importantly) have the generator find outlinks from that page for future 
> crawls.
> 
> This is somewhat like a url filter, but instead of filtering by url 
> content I want to filter by parsed page content.
> 
> Where would I add this in nutch?
> 
> -Brian

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general