When doing whole-internet focused crawls, we'd like a parse/injector filter.
Say we only want pages in our Nutch db and index that contain the word "nutch". I'd like to express the rule as a Lucene boolean query, contents:nutch, because in our real-world scenario the match is fuzzier and involves many phrases and terms; it is not just a regular expression.

If the query does not match, or matches with a score below a threshold, I don't want to add the fetched/parsed document to the index, nor (more importantly) have the generator pick up outlinks from that page for future crawls.

This is somewhat like a URL filter, but instead of filtering on the URL I want to filter on the parsed page content. Where would I add this in Nutch?

-Brian

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
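
The behaviour asked for above can be sketched in plain Java. This is only an illustration under stated assumptions, not Nutch or Lucene code: `ParsedPage`, `ContentFilterSketch`, `score`, `accept` and `outlinksToFollow` are hypothetical names, and the score is a crude "fraction of phrases found" stand-in for whatever score a real Lucene query would produce.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a fetched and parsed page (not a Nutch class).
final class ParsedPage {
    final String url;
    final String text;
    final List<String> outlinks;

    ParsedPage(String url, String text, List<String> outlinks) {
        this.url = url;
        this.text = text;
        this.outlinks = outlinks;
    }
}

final class ContentFilterSketch {

    // Crude substitute for a query score: the fraction of the given
    // phrases that occur (case-insensitively, as substrings) in the text.
    static double score(String text, List<String> phrases) {
        if (phrases.isEmpty()) {
            return 0.0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String phrase : phrases) {
            if (haystack.contains(phrase.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (double) hits / phrases.size();
    }

    // A page is kept (indexed) only if its score reaches the threshold.
    static boolean accept(ParsedPage page, List<String> phrases, double threshold) {
        return score(page.text, phrases) >= threshold;
    }

    // Outlinks of rejected pages are dropped so they never reach future crawls.
    static List<String> outlinksToFollow(ParsedPage page, List<String> phrases, double threshold) {
        return accept(page, phrases, threshold) ? page.outlinks : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("nutch");
        ParsedPage page = new ParsedPage(
                "http://example.com/a",
                "Apache Nutch is a web crawler.",
                Arrays.asList("http://example.com/b"));

        System.out.println("index page: " + accept(page, phrases, 0.5));
        System.out.println("outlinks kept: " + outlinksToFollow(page, phrases, 0.5));
    }
}
```

The point of the sketch is that one decision gates both effects: a rejected page is neither indexed nor allowed to contribute outlinks.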
