You can use an HtmlParseFilter to set a metadata attribute recording whether or not the page contains the phrase. The problem with this is that all of the content is still stored. You could also change ParseOutputFormat to write a page out only if it contains the word, although that is a bit of a hack.
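To make the metadata idea concrete, here is a rough sketch of the decision logic only, in plain Java with no Nutch or Lucene dependency. Everything in it is an assumption for illustration: the class name ContentGate, the metadata key "focus.keep", and the scoring (fraction of configured terms found in the text) are all hypothetical, and the scoring is only a crude stand-in for a real Lucene query score. In an actual plugin this logic would sit inside an HtmlParseFilter implementation and write to the parse metadata rather than to a plain Map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: decides whether parsed text is on topic.
final class ContentGate {
    // Metadata key invented for this sketch.
    static final String KEEP_KEY = "focus.keep";

    private final List<String> terms;
    private final double threshold;

    ContentGate(List<String> terms, double threshold) {
        this.terms = terms;
        this.threshold = threshold;
    }

    // Fraction of the configured terms that occur in the text (case-insensitive
    // substring match). A stand-in for a real query score, not Lucene scoring.
    double score(String text) {
        if (terms.isEmpty()) {
            return 0.0;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        long hits = terms.stream()
                .filter(t -> lower.contains(t.toLowerCase(Locale.ROOT)))
                .count();
        return (double) hits / terms.size();
    }

    // Tags the metadata with the keep/drop decision and leaves the content alone,
    // which is why the content is still stored with this approach.
    void tag(String parsedText, Map<String, String> parseMeta) {
        boolean keep = score(parsedText) >= threshold;
        parseMeta.put(KEEP_KEY, Boolean.toString(keep));
    }

    public static void main(String[] args) {
        ContentGate gate = new ContentGate(Arrays.asList("nutch", "crawl"), 0.5);
        Map<String, String> meta = new HashMap<>();
        gate.tag("Nutch is an open source web search engine.", meta);
        System.out.println(meta);
    }
}
```

A later step (indexing, or a modified output format) would then have to read the flag and skip pages marked false; the flag by itself does not stop the content from being written or the outlinks from being followed.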
This may be an area where we need to add an extension point, if one doesn't already exist. I am sure there are many more people out there who would like to store pages selectively based on their content.

Dennis Kubes

Brian Whitman wrote:
> In doing whole-internet focused crawls we'd like a parse/injector filter.
>
> Say we only want pages in our nutch db and index that have the word
> "nutch" in them. I'd like to express the rule as a lucene boolean query,
> contents:nutch, because in our real world scenario the match is more
> fuzzy and involves many phrases and terms. It's not just a regular
> expression.
>
> If the query does not match or matches under a threshold score, I don't
> want to add the fetched/parsed document to the index, nor (more
> importantly) have the generator find outlinks from that page for future
> crawls.
>
> This is somewhat like a url filter, but instead of filtering by url
> content I want to filter by parsed page content.
>
> Where would I add this in nutch?
>
> -Brian

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
