Carl Cerecke wrote:
> Carl Cerecke wrote:
>> Andrzej Bialecki wrote:
>>> Carl Cerecke wrote:
>> I've given this a crack and it mostly seems to work, except I'm not
>> sure how to get the score back into the crawldb. After reading the
>> Javadoc, I figured that passScoreAfterParsing() was the method I need
>> to implement. All others are just simple one-liners for this case.
>> Unfortunately, passScoreAfterParsing() is alone in not having a
>> CrawlDatum argument, so I can't call datum.setScore(); I did notice
>> that OPICScoringFilter does this in passScoreAfterParsing:
>> parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...); and I
>> tried that in my own scoring filter, but just get the zero from
>> datum.setScore(0.0f) in initialScore().

Nutch.SCORE_KEY is only used to pass the score value to outlinks.

>>
>> Couple of questions then:
>> 1. Does it make sense to put the relevancy scoring code into
>> passScoreAfterParsing()?
>> 2. If so, how do I get the score into the crawldb?
>>
>> I'm a bit vague on how all these bits connect together under the hood
>> at the moment.....
>
> Spent all day on this, but no luck. I'm sure I'm missing something
> obvious. Glad for any pointers in the right direction.

The somewhat awkward API for ScoringFilter comes from the fact that
different data is available at different steps, and similarly different
output data is updated at different steps. When passScoreAfterParsing
executes, we don't update the db.

The only way to update the original db entry is a bit indirect: first,
you need to create an "adjust" value (using CrawlDatum.STATUS_LINKED) in
distributeScoreToOutlinks, and then detect this "adjust" value in
updateDbScore among the other inlinks, and update the CrawlDatum datum
with the new score value.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
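
For readers following the thread, the indirect route described in the reply (create an "adjust" value at parse time, then pick it out among the inlinks at update time) might look roughly like the sketch below. It is a self-contained illustration and not Nutch code: Datum is a stand-in for CrawlDatum, the status constants are placeholder values, and the ADJUST_MARKER key, the method names createAdjust and applyAdjust, and the policy of overwriting the stored score are all assumptions made for the example. In a real ScoringFilter the first step would presumably live in distributeScoreToOutlinks and the second in updateDbScore.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Nutch's CrawlDatum (hypothetical; only the state this sketch needs).
final class Datum {
    static final int STATUS_DB_FETCHED = 1; // placeholder value, not Nutch's constant
    static final int STATUS_LINKED = 2;     // placeholder value, not Nutch's constant

    final int status;
    float score;
    final Map<String, String> meta = new HashMap<>();

    Datum(int status, float score) {
        this.status = status;
        this.score = score;
    }
}

final class AdjustScoreSketch {
    // Hypothetical metadata key used to tell the "adjust" value apart from real inlinks.
    static final String ADJUST_MARKER = "sketch.relevancy.adjust";

    // Parse-time step: wrap the relevancy score in a STATUS_LINKED datum
    // that travels to the db update alongside the ordinary outlink data.
    static Datum createAdjust(float relevancy) {
        Datum adjust = new Datum(Datum.STATUS_LINKED, relevancy);
        adjust.meta.put(ADJUST_MARKER, "true");
        return adjust;
    }

    // Update-time step: scan the inlinked values, and if one of them carries
    // the marker, write its score into the db entry. Ordinary inlinks are
    // ignored here; a real filter would also decide how to combine them.
    static void applyAdjust(Datum dbEntry, List<Datum> inlinked) {
        for (Datum in : inlinked) {
            if (in.status == Datum.STATUS_LINKED && in.meta.containsKey(ADJUST_MARKER)) {
                dbEntry.score = in.score; // assumed policy: overwrite the old score
            }
        }
    }

    public static void main(String[] args) {
        Datum dbEntry = new Datum(Datum.STATUS_DB_FETCHED, 0.0f);

        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(new Datum(Datum.STATUS_LINKED, 0.1f)); // an ordinary inlink
        inlinked.add(createAdjust(0.75f));                  // the page's own adjust value

        applyAdjust(dbEntry, inlinked);
        System.out.println("score after update: " + dbEntry.score);
    }
}
```

The marker in the metadata is only one possible way to recognise the adjust value; the point of the sketch is that the score is carried as data between the two steps rather than set directly in passScoreAfterParsing.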
