[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372580 ] 

Andrzej Bialecki  commented on NUTCH-240:
-----------------------------------------

> First, I hope my critical remarks were not taken personally. I am thankful 
> for this and all of your contributions. 

Not at all. We're not quarreling, just debating; we both want to find the best 
solution.

Re: generate. Yes, that's a nice way out; it would satisfy the requirement I 
described above, without this awkward step.

Re: passScore*: let me explain the requirements that led me to this. In 
some cases multiple metadata values (not just a single primitive value) 
drive the score, i.e. the final "score" and its distribution may depend on 
many values in CrawlDatum metadata (e.g. URL classification, expert evaluation, 
users' feedback, white/black-lists, etc). The passScore* API allows you to copy 
this arbitrary metadata from CrawlDatum-s (coming from CrawlDb -> 
crawl_generate) down to the parsing process and the score distribution step to 
outlinks. The distributeScore API would pick up this value (or these values) 
and base its score distribution decisions on them.

This API just mimics what was already there (only now you can use arbitrary 
metadata for scoring), and now we can plainly see it's an ugly way to do this. 
:) But the proper solution should allow passing arbitrary metadata from CrawlDb 
to the page scoring steps after parsing, and to the outlink score distribution 
process.
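
To make the requirement concrete, here is a minimal sketch of passing 
scoring-related metadata from one record to the next. It is not the Nutch 
API: the class name, the method name passScoringMetadata, the key names, and 
the use of plain Map objects in place of CrawlDatum metadata are all 
assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical stand-in for the passScore* idea; not the actual Nutch API. */
class MetadataPassingSketch {
    /** Metadata keys a scoring plugin might care about (assumed names). */
    static final String[] SCORING_KEYS = {"url.class", "expert.eval", "user.feedback"};

    /**
     * Copies only the scoring-related entries from the source metadata
     * (e.g. a CrawlDb record) into the target metadata (e.g. parse data),
     * leaving all other entries behind.
     */
    static void passScoringMetadata(Map<String, String> from, Map<String, String> to) {
        for (String key : SCORING_KEYS) {
            String value = from.get(key);
            if (value != null) {
                to.put(key, value);
            }
        }
    }
}
```

A later step, such as score distribution to outlinks, could then read these 
keys from the target map without needing access to the original record.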

Another issue: the reason for returning an "adjust" value from 
distributeScoreToOutlink is that in some algorithms (among others OPIC - but we 
don't implement this part now...) the fact that a certain score was distributed 
to an outlink should affect the score of the page that is the source of this 
link.
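
The role of the returned "adjust" value can be sketched as follows. This is 
only an illustration under stated assumptions: the signature of 
distributeScoreToOutlink, the Target class, the equal-share rule, and the 
choice of "adjust = minus the share given away" are hypothetical and are not 
taken from the patch.

```java
import java.util.List;

/** Hypothetical illustration of an "adjust" value; not the actual Nutch API. */
class AdjustSketch {
    /** Assumed outlink target that accumulates the score it receives. */
    static class Target {
        float score;
    }

    /**
     * Gives one outlink an equal share of the source score and returns the
     * adjustment for the source page (here minus the share, an assumption).
     */
    static float distributeScoreToOutlink(float sourceScore, int outlinkCount, Target target) {
        float share = sourceScore / outlinkCount;
        target.score += share;
        return -share;
    }

    /** Distributes to every outlink and returns the adjusted source score. */
    static float distributeAll(float sourceScore, List<Target> outlinks) {
        float adjust = 0.0f;
        for (Target t : outlinks) {
            adjust += distributeScoreToOutlink(sourceScore, outlinks.size(), t);
        }
        return sourceScore + adjust;
    }
}
```

With this rule the caller sums the adjustments over all outlinks and applies 
them to the source page once, so the source score reflects what was handed out.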

> Scoring API: extension point, scoring filters and an OPIC plugin
> ----------------------------------------------------------------
>
>          Key: NUTCH-240
>          URL: http://issues.apache.org/jira/browse/NUTCH-240
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki 
>  Attachments: patch.txt
>
> This patch refactors all places where Nutch manipulates page scores, into a 
> plugin-based API. Using this API it's possible to implement different scoring 
> algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to 
> URLFilters.
> Included is also an OPICScoringFilter plugin, which contains the current 
> implementation of the scoring algorithm. Together with the scoring API it 
> provides a fully backward-compatible scoring.




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
