Andrzej Bialecki wrote:
> * CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s from
>   crawl_parse with the same URL, which means that we get:
>
>    * the original CrawlDatum
>    * optionally, a CrawlDatum that contains just a Signature
>    * all CrawlDatum.LINKED entries pointing to our URL, generated by
>      outlinks from other pages

> Based on this information, a new score is calculated by adding the original
> score and all scores from incoming links.
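The aggregation described above can be sketched as follows; the Datum class and status constants here are simplified stand-ins for Nutch's CrawlDatum, not its real API:

```java
// A minimal sketch of the reduce step: sum all LINKED contributions
// onto the original entry's score. Datum and the status constants are
// invented stand-ins for Nutch's CrawlDatum.
import java.util.List;

public class ReduceSketch {
    public static final int STATUS_DB = 0;        // existing CrawlDb entry
    public static final int STATUS_LINKED = 1;    // inlink from a fetched page
    public static final int STATUS_SIGNATURE = 2; // signature-only datum

    public static class Datum {
        public final int status;
        public final float score;
        public Datum(int status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Keep the original entry's score as the base and add every LINKED
    // score; signature-only datums carry no score.
    public static float reduce(List<Datum> values) {
        float base = 0f;
        float increment = 0f;
        for (Datum d : values) {
            if (d.status == STATUS_LINKED) {
                increment += d.score;
            } else if (d.status == STATUS_DB) {
                base = d.score;
            }
        }
        return base + increment;
    }
}
```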

> HOWEVER... and here's where I suspect the current code is wrong: since we
> are processing just one segment, the incoming link information is very
> incomplete, because it comes only from the outlinks discovered by fetching
> this segment's fetchlist, and not from the complete LinkDB.

I think the code is correct. OPIC is an incremental algorithm, designed to be calculated while crawling: as each new link is seen, it increments the score of the page it links to. OPIC is thus much simpler and faster to calculate than PageRank. (It also provides a good approximation of PageRank, yet prioritizes better when crawling: an incrementally calculated PageRank is not as good as OPIC at fetching high-PageRank pages sooner.)
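To illustrate the incremental update OPIC performs (a sketch with invented names, not Nutch code):

```java
// Illustration of OPIC's incremental step: a fetched page's "cash" is
// split equally among its outlinks and credited to the targets
// immediately, with no global iteration over the link graph.
import java.util.Map;

public class OpicSketch {
    // Distribute 'cash' from a just-fetched page across its outlinks,
    // incrementing each target's accumulated score in place.
    public static Map<String, Float> credit(float cash, String[] outlinks,
                                            Map<String, Float> scores) {
        float share = cash / outlinks.length; // equal share per outlink
        for (String target : outlinks) {
            scores.merge(target, share, Float::sum);
        }
        return scores;
    }
}
```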

> One mitigating factor could be that we already accounted for incoming links
> from other segments when processing those other segments - so our initial
> score already includes the inlink information from other segments. But this
> assumes that we never generate and process more than one segment in
> parallel, i.e. that we finish updating from all previous segments before we
> update from the current segment (otherwise we wouldn't know the updated
> initial score).

I think all that you're saying is that we should not run two CrawlDB updates at once, right? But there are lots of reasons we cannot do that besides the OPIC calculation.

> Also, the "cash value" of those outlinks that point to URLs not in the
> current fetchlist will be dropped, because they won't be collected anywhere.

No, every cash value is used. The input to a CrawlDb update includes a CrawlDatum for every known URL, including those just linked to. If the only CrawlDatum for a URL is a new outlink from a page that was crawled, then the score for that page is 1.0 plus the score of that outlink.

> I think a better option would be to add the LinkDB as an input dir to
> CrawlDB.update(), so that we have access to all previously collected
> inlinks.

That would be a lot slower, and it would not compute OPIC.

> And a final note: CrawlDB.update() uses the initial score value recorded in
> the segment, and NOT the value that is actually found in CrawlDB at the
> time of the update. This means that if there was another update in the
> meantime, your new score in CrawlDB will be overwritten with a score based
> on an older initial value. This is counter-intuitive - I think
> CrawlDB.update() should always use the latest score value found in the
> current CrawlDB. I.e. in CrawlDBReducer, instead of doing:
>
>      result.setScore(result.getScore() + scoreIncrement);
>
> we should do:
>
>      result.setScore(old.getScore() + scoreIncrement);
>
> The change is not quite that simple, since 'old' is sometimes null. So
> perhaps we need to add a 'score' variable that is set to old.score when
> old != null, and to 1.0 otherwise (for newly linked pages).
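That null handling could look like the following sketch (simplified and outside the real CrawlDBReducer; NEW_PAGE_SCORE and the free-standing method are my own framing):

```java
// Sketch of the change under discussion: prefer the score currently in
// the CrawlDb over the one recorded in the segment, defaulting to 1.0
// for newly linked pages that have no CrawlDb entry yet.
public class ProposedFix {
    public static final float NEW_PAGE_SCORE = 1.0f; // assumed default

    public static float newScore(Float oldScore, float scoreIncrement) {
        // oldScore is null when there is no existing CrawlDb entry,
        // i.e. the URL was only just discovered via an outlink.
        float base = (oldScore != null) ? oldScore : NEW_PAGE_SCORE;
        return base + scoreIncrement;
    }
}
```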

The reason I didn't do it that way was to permit the Fetcher to modify scores, since I was thinking of the Fetcher as the actor whose actions are being processed here, and of the CrawlDb as the passive thing acted on. But indeed, if you have another process that's updating a CrawlDb while a Fetcher is running, this may not be the case. So if we want to switch things so that the Fetcher is not permitted to adjust scores, then this seems like a reasonable change.

Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
