Carl Cerecke wrote:
> Andrzej Bialecki wrote:
>> Carl Cerecke wrote:
>>> Hi,
>>>
>>> I'm wondering what the best approach is to restrict a crawl to a
>>> certain topic. I know that I can restrict what is crawled by a
>>> regex on the URL, but I also need to restrict pages based on their
>>> content (whether they are on topic or not).
>>>
>>> For example, say I wanted to crawl pages about Antarctica. First I
>>> start off with a handful of pages and inject them into the crawldb,
>>> generate a fetchlist, and start sucking the pages down. I update
>>> the crawldb with links from what has just been sucked down, and
>>> then during the next fetch (and subsequent fetches) I want to
>>> filter which pages end up in the segment based on their content
>>> (using, perhaps, some sort of antarctica-related-keyword score).
>>> Somehow I also need to tell the crawldb about the URLs which I've
>>> sucked down but which aren't antarctica-related pages (so we don't
>>> suck them down again).
>>>
>>> This seems like the sort of problem other people have solved. Any
>>> pointers? Am I on the right track here? Using Nutch 0.9.
>>
>> The easiest way to do this is to implement a ScoringFilter plugin,
>> which promotes wanted pages and demotes unwanted ones. Please see
>> the Javadoc for ScoringFilter for details.
>
> I've given this a crack and it mostly seems to work, except I'm not
> sure how to get the score back into the crawldb. After reading the
> Javadoc, I figured that passScoreAfterParsing() was the method I
> need to implement; all the others are just simple one-liners for
> this case. Unfortunately, passScoreAfterParsing() is the only method
> without a CrawlDatum argument, so I can't call datum.setScore(). I
> did notice that OPICScoringFilter does this in
> passScoreAfterParsing():
>
>   parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...);
>
> and I tried that in my own scoring filter, but I just get the zero
> from datum.setScore(0.0f) in initialScore().
>
> Couple of questions then:
> 1. Does it make sense to put the relevancy scoring code into
>    passScoreAfterParsing()?
> 2. If so, how do I get the score into the crawldb?
>
> I'm a bit vague on how all these bits connect together under the
> hood at the moment...
Spent all day on this, but no luck. I'm sure I'm missing something
obvious. Glad for any pointers in the right direction.

Cheers,
Carl.
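
For what it's worth, below is a rough sketch of what such a topical scoring filter could look like. Everything in it that is not quoted above is an assumption: the class and package names, the "topic.score" metadata key and the keyword heuristic are invented for illustration, and the method signatures follow my reading of the 0.9-era ScoringFilter interface and OPICScoringFilter, so check them against the Javadoc in your own checkout. The point it tries to illustrate is that, as far as I can tell from the 0.9 sources, the value stashed in the parse metadata by passScoreAfterParsing() never reaches the crawldb on its own; it comes back into play in distributeScoreToOutlink(), which receives the ParseData and can set scores on the linked CrawlDatums written to crawl_parse, and those are what updateDbScore() folds into the crawldb during the updatedb job.

// TopicScoringFilter -- a sketch only.  The class/package name, the
// "topic.score" metadata key and the keyword list are invented for this
// example; method signatures follow my reading of the 0.9 ScoringFilter
// interface and OPICScoringFilter, so verify them against your checkout.
package org.example.scoring;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class TopicScoringFilter implements ScoringFilter {

  // Key used to carry the relevance value inside the ParseData (made up).
  private static final String TOPIC_SCORE_KEY = "topic.score";

  private static final String[] KEYWORDS = {"antarctica", "antarctic", "south pole"};

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Seed pages are assumed on-topic; everything else starts at zero.
  public void injectedScore(Text url, CrawlDatum datum) { datum.setScore(1.0f); }
  public void initialScore(Text url, CrawlDatum datum) { datum.setScore(0.0f); }

  // Fetch the most promising pages first.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    return datum.getScore() * initSort;
  }

  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) {
    // nothing needed before parsing for a purely content-based score
  }

  // The parsed text is only available here, but there is no CrawlDatum to
  // update -- so stash the relevance in the parse metadata for later.
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    float relevance = relevance(parse.getText());
    parse.getData().getContentMeta().set(TOPIC_SCORE_KEY, Float.toString(relevance));
  }

  // Called per outlink when the segment is written: pull the relevance back
  // out of the ParseData and put it on the outlink's datum.  These linked
  // datums are what the updatedb job feeds to updateDbScore() below.
  public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
      ParseData parseData, CrawlDatum target, CrawlDatum adjust,
      int allCount, int validCount) throws ScoringFilterException {
    String value = parseData.getContentMeta().get(TOPIC_SCORE_KEY);
    float relevance = (value == null) ? 0.0f : Float.parseFloat(value);
    // Outlinks of off-topic pages inherit (nearly) nothing and sort last.
    target.setScore(relevance);
    return adjust;
  }

  // Called during updatedb for every URL: this is where the score actually
  // lands in the crawldb.  Here: keep the best relevance any parent gave us.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked)
      throws ScoringFilterException {
    float best = datum.getScore();
    for (int i = 0; i < inlinked.size(); i++) {
      CrawlDatum linked = (CrawlDatum) inlinked.get(i);
      if (linked.getScore() > best) best = linked.getScore();
    }
    datum.setScore(best);
  }

  // Let on-topic pages rank higher at search time as well.
  public float indexerScore(Text url, org.apache.lucene.document.Document doc,
      CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks,
      float initScore) throws ScoringFilterException {
    return initScore * (1.0f + dbDatum.getScore());
  }

  // Crude keyword-hit relevance in [0,1] -- replace with a real classifier.
  private float relevance(String text) {
    if (text == null) return 0.0f;
    String lower = text.toLowerCase();
    int hits = 0;
    for (int i = 0; i < KEYWORDS.length; i++) {
      if (lower.indexOf(KEYWORDS[i]) >= 0) hits++;
    }
    return (float) hits / KEYWORDS.length;
  }
}

To try something like this, the class would need to be packaged as a plugin (a plugin.xml declaring an extension of the org.apache.nutch.scoring.ScoringFilter extension point) and its plugin id added to plugin.includes. Note that all active scoring filters are chained, so if scoring-opic stays in plugin.includes both filters will be updating the same scores; for a topical crawl it is probably simplest to drop scoring-opic, or at least be deliberate about the order (there is a scoring.filter.order property for that, if I remember right). One caveat about this particular sketch: a page's own relevance only influences the crawldb entries of the pages it links to; pulling it back into the page's own crawldb entry would presumably go through the adjust datum that distributeScoreToOutlink() can return, but I have not tried that.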
