Carl Cerecke wrote:

> Hi,
>
> I'm wondering what the best approach is to restrict a crawl to a certain
> topic. I know that I can restrict what is crawled with a regex on the URL,
> but I also need to restrict pages based on their content (whether they
> are on topic or not).
>
> For example, say I wanted to crawl pages about Antarctica. First I start
> off with a handful of pages and inject them into the crawldb, then I
> generate a fetchlist and start fetching pages. I update the crawldb with
> links from what has just been fetched, and then during the next fetch
> (and subsequent fetches) I want to filter which pages end up in the
> segment based on their content (using, perhaps, some sort of
> Antarctica-related-keyword score). Somehow I also need to tell the
> crawldb about the URLs I've fetched that aren't Antarctica-related
> pages (so we don't fetch them again).
>
> This seems like the sort of problem other people have solved. Any
> pointers? Am I on the right track here? I'm using Nutch 0.9.
The easiest way to do this is to implement a ScoringFilter plugin, which
promotes wanted pages and demotes unwanted ones. Please see the Javadoc
for ScoringFilter for details.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
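
To make that concrete, below is a minimal sketch of the kind of topical
scoring such a plugin could apply. TopicScorer, its keyword list, and the
promote/demote values are all hypothetical, not part of Nutch. In an actual
plugin this logic would run in the filter's parse-time hook
(passScoreAfterParsing in the 0.9 interface; check the Javadoc for the exact
method signatures), and the resulting score would feed into
generatorSortValue so that demoted pages sort to the bottom of future
fetchlists.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical topical scorer: the kind of logic a custom ScoringFilter
 * plugin could apply to parsed page text. The class name, keyword list,
 * and thresholds are illustrative only.
 */
public class TopicScorer {

    // Example topic vocabulary; in practice this would be loaded
    // from the plugin's configuration rather than hard-coded.
    private final Set<String> keywords = new HashSet<String>(Arrays.asList(
            "antarctica", "antarctic", "penguin", "icebreaker",
            "mcmurdo", "glacier", "south pole"));

    private final float promoteScore;  // score for on-topic pages
    private final float demoteScore;   // score for off-topic pages
    private final int minHits;         // keyword hits needed to be on-topic

    public TopicScorer(float promoteScore, float demoteScore, int minHits) {
        this.promoteScore = promoteScore;
        this.demoteScore = demoteScore;
        this.minHits = minHits;
    }

    /** Count how many distinct topic keywords occur in the page text. */
    public int countHits(String pageText) {
        String lower = pageText.toLowerCase();
        int hits = 0;
        for (String kw : keywords) {
            if (lower.contains(kw)) {
                hits++;
            }
        }
        return hits;
    }

    /**
     * Promote pages that mention enough topic keywords, demote the rest.
     * A demoted page keeps a near-zero score, so its record stays in the
     * crawldb but it never competes for a slot on a future fetchlist.
     */
    public float score(String pageText) {
        return countHits(pageText) >= minHits ? promoteScore : demoteScore;
    }

    public static void main(String[] args) {
        TopicScorer scorer = new TopicScorer(1.0f, 0.0f, 2);
        System.out.println(scorer.score("Penguins breed on the Antarctic coast")); // 1.0
        System.out.println(scorer.score("Latest football results"));               // 0.0
    }
}

Note that this also covers the second half of the question: off-topic URLs
are not dropped from the crawldb, they are simply kept with a demoted score,
so the generator never selects them for fetching again.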
