Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links in the original version of a
page. This is not perfect, but considerably better.
But then we would miss any new links from that page. I don't think that's
acceptable: think e.g. of news sites, where the links on the same page
change on a daily or even hourly basis.
Good point. Then maybe we should add a new status just for this,
STATUS_REFRESH_LINK. If this is the only datum for a page, then the
page would be added with its inherited score; but if the page is
already known, the score increment would be ignored. That way the scores
of existing pages would not change due to recrawling, while new pages
would still be added with a score influenced by the pages that link to
them. Still not perfect, but better.
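To make the proposed merge rule concrete, here is a minimal sketch of the update logic Doug describes. STATUS_REFRESH_LINK comes from this thread; the class shapes, status byte values, and the merge() helper are invented for illustration and are not Nutch's actual code:

```java
// Hypothetical sketch: a refresh-link datum seeds new pages with an
// inherited score but never bumps the score of an already known page.
import java.util.List;

public class RefreshLinkMerge {
    // Status constant is illustrative, not Nutch's actual value.
    static final byte STATUS_REFRESH_LINK = 9;

    static class CrawlDatum {
        byte status;
        float score;
        CrawlDatum(byte status, float score) { this.status = status; this.score = score; }
    }

    /**
     * Merge the datums collected for one URL during a crawldb update.
     * @param existing the datum already in the crawldb, or null for a new page
     * @param incoming refresh-link datums emitted while parsing refetched pages
     * @return the datum to store
     */
    static CrawlDatum merge(CrawlDatum existing, List<CrawlDatum> incoming) {
        if (existing != null) {
            // Known page: ignore refresh-link score increments entirely.
            return existing;
        }
        // New page: inherit the summed score contributions of its linkers.
        float inherited = 0f;
        for (CrawlDatum d : incoming) {
            if (d.status == STATUS_REFRESH_LINK) inherited += d.score;
        }
        return new CrawlDatum((byte) 0 /* "unfetched" */, inherited);
    }

    public static void main(String[] args) {
        CrawlDatum known = new CrawlDatum((byte) 2, 1.5f);
        CrawlDatum merged = merge(known, List.of(new CrawlDatum(STATUS_REFRESH_LINK, 0.3f)));
        System.out.println(merged.score); // 1.5 -- existing score unchanged

        CrawlDatum fresh = merge(null, List.of(new CrawlDatum(STATUS_REFRESH_LINK, 0.3f)));
        System.out.println(fresh.score);  // 0.3 -- inherited from the linking page
    }
}
```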
If you remember, some time ago I proposed a different solution: to
involve the linkdb in score calculations, and to store these partial OPIC
score values in Inlink. This would allow us to track score contributions
per source/target pair. Newly discovered links would get their initial
partial score value from the originating page, and we could update these
values when the originating page changes (e.g. its number of outlinks
increases, or its score is updated).
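The per-pair bookkeeping could look roughly like the following sketch. The class, map layout, and updateLink() method are made up for illustration; the point is that storing each inlink's last-credited contribution lets an update apply only the delta, so recrawls adjust rather than inflate the target's score:

```java
// Sketch: store each inlink's partial OPIC contribution keyed by
// source/target pair, and apply only score deltas on later updates.
import java.util.HashMap;
import java.util.Map;

public class PartialOpic {
    // Maps a "source -> target" pair to the contribution last credited.
    static Map<String, Float> contributions = new HashMap<>();
    static Map<String, Float> pageScore = new HashMap<>();

    /** Credit target with source's score split across its outlinks. */
    static void updateLink(String source, String target, float sourceScore, int outlinkCount) {
        String key = source + " -> " + target;
        float newPart = sourceScore / outlinkCount;
        float oldPart = contributions.getOrDefault(key, 0f);
        // Apply only the change, so recrawling the source page does not
        // repeatedly add the same contribution to the target.
        pageScore.merge(target, newPart - oldPart, Float::sum);
        contributions.put(key, newPart);
    }

    public static void main(String[] args) {
        // First crawl: page A (score 1.0, 2 outlinks) links to B.
        updateLink("A", "B", 1.0f, 2);
        System.out.println(pageScore.get("B")); // 0.5
        // Recrawl: A now has 4 outlinks; B's credit shrinks instead of growing.
        updateLink("A", "B", 1.0f, 4);
        System.out.println(pageScore.get("B")); // 0.25
    }
}
```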
Involving the linkdb in score calculations means that the linkdb is
involved in crawldb updates, which makes crawldb updates much slower,
since the linkdb generally has many times more entries than the crawldb.
The linkdb is not required for batch crawling and OPIC scoring, a
common case. So if we wish to implement things this way we should make
it optional. For example, an initial crawl could be done using the
current algorithm while subsequent crawls could use a slower,
incrementally updating algorithm.
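If the incremental mode were made optional, it might be toggled by a configuration property along these lines (the property name and values are invented for illustration; this is not an actual Nutch setting):

```xml
<!-- Hypothetical nutch-site.xml fragment: choose between the current
     batch scoring and a slower, incrementally updating algorithm. -->
<property>
  <name>db.score.update.mode</name>
  <value>batch</value>
  <description>Either "batch" (current algorithm, no linkdb needed)
  or "incremental" (uses the linkdb to apply per-link score deltas).
  </description>
</property>
```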
BTW: I've been toying with some patches to implement pluggable scoring
mechanisms; it would be easy to provide hooks for custom scoring
implementations. Scores are just float values, so they would be
sufficient for a wide range of scoring mechanisms; for others, the newly
added CrawlDatum.metadata could be used.
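One possible shape for such hooks is sketched below. The interface and method names are illustrative only, not an actual Nutch API; the idea is simply that injection-time and link-propagation scoring become swappable strategies:

```java
// Sketch of a pluggable-scoring hook: scoring decisions live behind an
// interface so alternative implementations can be dropped in.
public class PluggableScoring {
    interface ScoringFilter {
        /** Initial score for a newly injected URL. */
        float injectedScore(String url);
        /** Score passed from a parent page to each of its outlinks. */
        float outlinkScore(String url, float parentScore, int outlinkCount);
    }

    /** OPIC-style implementation: a page's "cash" is split evenly among outlinks. */
    static class OpicScoring implements ScoringFilter {
        public float injectedScore(String url) { return 1.0f; }
        public float outlinkScore(String url, float parentScore, int outlinkCount) {
            return parentScore / outlinkCount;
        }
    }

    public static void main(String[] args) {
        ScoringFilter filter = new OpicScoring();
        float parent = filter.injectedScore("http://example.com/");
        System.out.println(filter.outlinkScore("http://example.com/a", parent, 4)); // 0.25
    }
}
```

Since a float is the only thing that crosses the interface boundary, any scoring scheme whose state fits in that value works unchanged; schemes needing more state would stash it in CrawlDatum.metadata as described above.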
+1
Doug
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general