Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links in the original version of a
page. This is not perfect, but considerably better.
But then we would miss any new links from that page. I don't think that's
acceptable: think e.g. of news sites, where the links on the same page
change on a daily or even hourly basis.
Good point. Then maybe we should add a new status just for this,
STATUS_REFRESH_LINK. If this is the only datum for a page, then the
page would be added with its inherited score; but if the page is
already known, the score increment would be ignored. That way the scores
of existing pages would not change due to recrawling, while new pages
would still be added with a score influenced by the pages that link to
them. Still not perfect, but better.
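To make the proposed merge rule concrete, here is a minimal sketch of the update logic Doug describes. STATUS_REFRESH_LINK comes from this thread; the class shapes, status byte values, and the merge() helper are invented for illustration and are not Nutch's actual code:

```java
// Hypothetical sketch: a refresh-link datum seeds new pages with an
// inherited score but never bumps the score of an already known page.
import java.util.List;

public class RefreshLinkMerge {
    // Status constant is illustrative, not Nutch's actual value.
    static final byte STATUS_REFRESH_LINK = 9;

    static class CrawlDatum {
        byte status;
        float score;
        CrawlDatum(byte status, float score) { this.status = status; this.score = score; }
    }

    /**
     * Merge the datums collected for one URL during a crawldb update.
     * @param existing the datum already in the crawldb, or null for a new page
     * @param incoming refresh-link datums emitted while parsing refetched pages
     * @return the datum to store
     */
    static CrawlDatum merge(CrawlDatum existing, List<CrawlDatum> incoming) {
        if (existing != null) {
            // Known page: ignore refresh-link score increments entirely.
            return existing;
        }
        // New page: inherit the summed score contributions of its linkers.
        float inherited = 0f;
        for (CrawlDatum d : incoming) {
            if (d.status == STATUS_REFRESH_LINK) inherited += d.score;
        }
        return new CrawlDatum((byte) 0 /* "unfetched" */, inherited);
    }

    public static void main(String[] args) {
        CrawlDatum known = new CrawlDatum((byte) 2, 1.5f);
        CrawlDatum merged = merge(known, List.of(new CrawlDatum(STATUS_REFRESH_LINK, 0.3f)));
        System.out.println(merged.score); // 1.5 -- existing score unchanged

        CrawlDatum fresh = merge(null, List.of(new CrawlDatum(STATUS_REFRESH_LINK, 0.3f)));
        System.out.println(fresh.score);  // 0.3 -- inherited from the linking page
    }
}
```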
If you remember, some time ago I proposed a different solution: to
involve the linkdb in score calculations, and to store these partial OPIC
score values in Inlink. This would allow us to track score contributions
per source/target pair. Newly discovered links would get their initial
partial score value from the originating page, and we could update these
values when the originating page changes (e.g. its number of outlinks
increases, or its score is updated).
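The per-pair bookkeeping could look roughly like the following sketch. The class, map layout, and updateLink() method are made up for illustration; the point is that storing each inlink's last-credited contribution lets an update apply only the delta, so recrawls adjust rather than inflate the target's score:

```java
// Sketch: store each inlink's partial OPIC contribution keyed by
// source/target pair, and apply only score deltas on later updates.
import java.util.HashMap;
import java.util.Map;

public class PartialOpic {
    // Maps a "source -> target" pair to the contribution last credited.
    static Map<String, Float> contributions = new HashMap<>();
    static Map<String, Float> pageScore = new HashMap<>();

    /** Credit target with source's score split across its outlinks. */
    static void updateLink(String source, String target, float sourceScore, int outlinkCount) {
        String key = source + " -> " + target;
        float newPart = sourceScore / outlinkCount;
        float oldPart = contributions.getOrDefault(key, 0f);
        // Apply only the change, so recrawling the source page does not
        // repeatedly add the same contribution to the target.
        pageScore.merge(target, newPart - oldPart, Float::sum);
        contributions.put(key, newPart);
    }

    public static void main(String[] args) {
        // First crawl: page A (score 1.0, 2 outlinks) links to B.
        updateLink("A", "B", 1.0f, 2);
        System.out.println(pageScore.get("B")); // 0.5
        // Recrawl: A now has 4 outlinks; B's credit shrinks instead of growing.
        updateLink("A", "B", 1.0f, 4);
        System.out.println(pageScore.get("B")); // 0.25
    }
}
```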
Involving the linkdb in score calculations means that the linkdb is
involved in crawldb updates, which makes crawldb updates much slower,
since the linkdb generally has many times more entries than the crawldb.
The linkdb is not required for batch crawling and OPIC scoring, a
common case. So if we wish to implement things this way we should make
it optional. For example, an initial crawl could be done using the
current algorithm while subsequent crawls could use a slower,
incrementally updating algorithm.
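If the incremental mode were made optional, it might be toggled by a configuration property along these lines (the property name and values are invented for illustration; this is not an actual Nutch setting):

```xml
<!-- Hypothetical nutch-site.xml fragment: choose between the current
     batch scoring and a slower, incrementally updating algorithm. -->
<property>
  <name>db.score.update.mode</name>
  <value>batch</value>
  <description>Either "batch" (current algorithm, no linkdb needed)
  or "incremental" (uses the linkdb to apply per-link score deltas).
  </description>
</property>
```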
BTW: I've been toying with some patches to implement pluggable scoring
mechanisms; it would be easy to provide hooks for custom scoring
implementations. Scores are just float values, so they would be
sufficient for a wide range of scoring mechanisms; for others, the newly
added CrawlDatum.metadata could be used.
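One possible shape for such hooks is sketched below. The interface and method names are illustrative only, not an actual Nutch API; the idea is simply that injection-time and link-propagation scoring become swappable strategies:

```java
// Sketch of a pluggable-scoring hook: scoring decisions live behind an
// interface so alternative implementations can be dropped in.
public class PluggableScoring {
    interface ScoringFilter {
        /** Initial score for a newly injected URL. */
        float injectedScore(String url);
        /** Score passed from a parent page to each of its outlinks. */
        float outlinkScore(String url, float parentScore, int outlinkCount);
    }

    /** OPIC-style implementation: a page's "cash" is split evenly among outlinks. */
    static class OpicScoring implements ScoringFilter {
        public float injectedScore(String url) { return 1.0f; }
        public float outlinkScore(String url, float parentScore, int outlinkCount) {
            return parentScore / outlinkCount;
        }
    }

    public static void main(String[] args) {
        ScoringFilter filter = new OpicScoring();
        float parent = filter.injectedScore("http://example.com/");
        System.out.println(filter.outlinkScore("http://example.com/a", parent, 4)); // 0.25
    }
}
```

Since a float is the only thing that crosses the interface boundary, any scoring scheme whose state fits in that value works unchanged; schemes needing more state would stash it in CrawlDatum.metadata as described above.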
+1
Doug
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general