[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372580 ]
Andrzej Bialecki commented on NUTCH-240:
-----------------------------------------

> First, I hope my critical remarks were not taken personally. I am thankful
> for this and all of your contributions.

Not at all, we're not quarrelling but debating: we both want to find the best solution.

Re: generate. Yes, that's a nice way out. It would satisfy the requirement I described above, without this awkward step.

Re: passScore*: let me explain a bit the requirements that led me to this. In some cases there will be multiple metadata values (not just a single primitive value) that drive the score, i.e. the final "score" and its distribution may depend on many values in CrawlDatum metadata (e.g. URL classification, expert evaluation, users' feedback, white/black-lists, etc.). The passScore* API allows you to copy this arbitrary metadata from CrawlDatum-s (coming from CrawlDb -> crawl_generate) down to the parsing process and to the step that distributes score to outlinks. The distributeScore API would pick up this value (or these values) and base its score distribution decisions on them.

This API just mimics what was already there (only now you can use arbitrary metadata for scoring), and now we can plainly see it's an ugly way to do this. :) But the proper solution should allow passing arbitrary metadata from CrawlDb to the page scoring steps after parsing, and to the outlink score distribution process.

Another issue: the reason for returning an "adjust" value from distributeScoreToOutlink is that in some algorithms (among others OPIC - but we don't implement this part now...) the fact that a certain score was distributed to an outlink should affect the score of the page that is the source of this link.
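
To make the two points above concrete, here is a minimal, self-contained sketch of the idea. It is not the code in patch.txt: the Datum class, the ScoringFilter interface as written here, the "url.class" and "score" keys, and the weighting rule are all hypothetical stand-ins. It only shows metadata being copied from the CrawlDb side to the parse side, and an "adjust" value returned per outlink being applied back to the source page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoringSketch {

    /** Hypothetical stand-in for a CrawlDatum: a score plus arbitrary metadata. */
    static final class Datum {
        float score;
        final Map<String, String> meta = new HashMap<>();
        Datum(float score) { this.score = score; }
    }

    /** Hypothetical filter interface; method names only echo the discussion. */
    interface ScoringFilter {
        /** Copy whatever drives scoring from the CrawlDb datum to parse-side metadata. */
        void passScore(Datum fromDb, Map<String, String> parseMeta);

        /** Set the outlink's score; return an adjustment for the source page. */
        float distributeScoreToOutlink(Map<String, String> parseMeta,
                                       Datum outlink, int outlinkCount);
    }

    /** Illustrative rule only (not the OPIC formula): weight the share by a URL class. */
    static final class ClassWeightedFilter implements ScoringFilter {
        static final String URL_CLASS = "url.class"; // assumed metadata key
        static final String SCORE = "score";         // assumed metadata key

        @Override
        public void passScore(Datum fromDb, Map<String, String> parseMeta) {
            parseMeta.put(SCORE, Float.toString(fromDb.score));
            String urlClass = fromDb.meta.get(URL_CLASS);
            if (urlClass != null) {
                parseMeta.put(URL_CLASS, urlClass);
            }
        }

        @Override
        public float distributeScoreToOutlink(Map<String, String> parseMeta,
                                              Datum outlink, int outlinkCount) {
            float score = Float.parseFloat(parseMeta.get(SCORE));
            float weight = "trusted".equals(parseMeta.get(URL_CLASS)) ? 1.0f : 0.5f;
            float share = weight * score / outlinkCount;
            outlink.score = share;
            return -share; // the source page gives up what it handed out
        }
    }

    /** Runs both steps for one page and applies the summed adjustments to it. */
    static List<Datum> distributeAll(ScoringFilter filter, Datum page, int outlinkCount) {
        Map<String, String> parseMeta = new HashMap<>();
        filter.passScore(page, parseMeta);
        List<Datum> outlinks = new ArrayList<>();
        float adjust = 0.0f;
        for (int i = 0; i < outlinkCount; i++) {
            Datum outlink = new Datum(0.0f);
            adjust += filter.distributeScoreToOutlink(parseMeta, outlink, outlinkCount);
            outlinks.add(outlink);
        }
        page.score += adjust;
        return outlinks;
    }

    public static void main(String[] args) {
        Datum page = new Datum(1.0f);
        page.meta.put(ClassWeightedFilter.URL_CLASS, "untrusted");
        List<Datum> outlinks = distributeAll(new ClassWeightedFilter(), page, 4);
        System.out.println("outlink score: " + outlinks.get(0).score);
        System.out.println("page score after adjust: " + page.score);
    }
}
```

The point of returning the adjustment rather than mutating the source page inside the filter is that the caller decides when and how to apply it; whether that matches what an OPIC implementation would need is left open here.
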
> Scoring API: extension point, scoring filters and an OPIC plugin
> ----------------------------------------------------------------
>
>          Key: NUTCH-240
>          URL: http://issues.apache.org/jira/browse/NUTCH-240
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>  Attachments: patch.txt
>
> This patch refactors all places where Nutch manipulates page scores, into a
> plugin-based API. Using this API it's possible to implement different scoring
> algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to
> URLFilters.
> Included is also an OPICScoringFilter plugin, which contains the current
> implementation of the scoring algorithm. Together with the scoring API it
> provides a fully backward-compatible scoring.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
