Carl Cerecke wrote:
> Carl Cerecke wrote:
>> Andrzej Bialecki wrote:
>>> Carl Cerecke wrote:

>> I've given this a crack and it mostly seems to work, except I'm not 
>> sure how to get the score back into the crawldb. After reading the 
>> Javadoc, I figured that passScoreAfterParsing() was the method I need 
>> to implement. All others are just simple one-liners for this case. 
>> Unfortunately, passScoreAfterParsing() is alone in not having a 
>> CrawlDatum argument, so I can't call datum.setScore(); I did notice 
>> that OPICScoringFilter does this in passScoreAfterParsing: 
>> parse.getData().getContentMeta().set(Nutch.SCORE_KEY,  ...); and I 
>> tried that in my own scoring filter, but just get the zero from 
>> datum.setScore(0.0f) in initialScore().


Nutch.SCORE_KEY is only used to pass the score value to outlinks.


>>
>> Couple of questions then:
>> 1. Does it make sense to put the relevancy scoring code into 
>> passScoreAfterParsing()
>> 2. If so, how do I get the score into the crawldb?
>>
>> I'm a bit vague on how all these bits connect together under the hood 
>> at the moment.....
> 
> Spent all day on this, but no luck. I'm sure I'm missing something 
> obvious. Glad for any pointers in the right direction.

The ScoringFilter API is somewhat awkward because different data is 
available at each step, and different output data is updated at each 
step. The db is not updated when passScoreAfterParsing executes.

The only way to update the original db entry is a bit indirect: 
first, create an "adjust" value (using CrawlDatum.STATUS_LINKED) in 
distributeScoreToOutlinks; then, in updateDbScore, detect this "adjust" 
value among the other inlinks and update the CrawlDatum datum with the 
new score value.
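
To make the two-step flow concrete, here is a minimal, self-contained 
sketch. It deliberately does not use the real Nutch classes: Datum, 
makeAdjust, the boolean "adjust" marker and the status constants are 
all hypothetical stand-ins invented for illustration, and the real 
ScoringFilter method signatures differ. How you actually mark the 
"adjust" datum so it can be recognised later is up to your filter.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjustScoreSketch {

    // Stand-ins for CrawlDatum status codes; the numeric values are arbitrary.
    static final int STATUS_LINKED = 1;
    static final int STATUS_FETCHED = 2;

    // Hypothetical, stripped-down stand-in for CrawlDatum.
    static final class Datum {
        final int status;
        float score;
        final boolean adjust; // assumed marker that tells "adjust" apart from real inlinks

        Datum(int status, float score, boolean adjust) {
            this.status = status;
            this.score = score;
            this.adjust = adjust;
        }
    }

    // Parse-time step (the role played by distributeScoreToOutlinks):
    // wrap the computed relevancy score in an "adjust" datum with STATUS_LINKED.
    static Datum makeAdjust(float relevancy) {
        return new Datum(STATUS_LINKED, relevancy, true);
    }

    // Update-time step (the role played by updateDbScore):
    // look for the "adjust" datum among the inlinks and copy its score
    // into the db entry. Ordinary inlinks are ignored in this sketch.
    static void updateDbScore(Datum datum, List<Datum> inlinked) {
        for (Datum in : inlinked) {
            if (in.status == STATUS_LINKED && in.adjust) {
                datum.score = in.score;
                return;
            }
        }
    }

    public static void main(String[] args) {
        Datum dbEntry = new Datum(STATUS_FETCHED, 0.0f, false);

        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(new Datum(STATUS_LINKED, 0.1f, false)); // an ordinary inlink
        inlinked.add(makeAdjust(0.75f));                     // the page's own relevancy

        updateDbScore(dbEntry, inlinked);
        System.out.println("score after update: " + dbEntry.score);
    }
}
```

In this sketch the relevancy value simply replaces the stored score. 
Whether to replace it or combine it with contributions from real 
inlinks is a design choice for your own filter.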

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general