[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

Andrzej Bialecki (JIRA) Tue, 09 May 2006 14:20:52 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378755 ]


Andrzej Bialecki  commented on NUTCH-267:
-----------------------------------------

I would argue that what Nutch implements now shouldn't be called OPIC, because 
it has little to do with the algorithm described in the OPIC paper. Either we 
fix it, or we should rename it. Let me explain:

* the paper uses a "cash flow" concept, where nodes not only receive score 
contributions, but also give them away, thus _reducing_ their available score. 
This is not implemented in Nutch, which leads to scores growing without bound. 
It also makes the score dependent on the number of fetch cycles, i.e. the 
scores of two pages with exactly the same inlinks will differ if one of them 
went through more refresh cycles than the other. So the fundamental premise 
of the algorithm - that scores converge to certain values as a result of 
cash flow balance - does not hold.

* the paper uses a concept of "virtual nodes" that give away cash to 
disconnected nodes in the current graph. In reality, these nodes are probably 
connected, but the current graph is not complete enough to track it. The Nutch 
implementation doesn't use this, but only because it doesn't give away "cash".

* finally, the paper argues that the OPIC score and other scores should be 
combined as a sum of logarithms, i.e. "log(opic) + log(docSimilarity)". 
Nutch uses the formula "sqrt(opic) * docSimilarity" (through document boosting).
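
To illustrate the first point, here is a minimal sketch that contrasts an 
accumulate-only update with a cash-flow update on a tiny graph. It is neither 
the Nutch code nor the exact algorithm from the paper (which processes one 
page at a time and keeps a cash history); it is a simplified, synchronous 
model under stated assumptions. The class name, the 3-page graph and the 
method names are all hypothetical. Every page in the graph has outlinks, so no 
virtual node is needed here.

```java
import java.util.Arrays;

public class CashFlowSketch {
    // Hypothetical 3-page graph: OUTLINKS[i] lists the pages that page i links to.
    static final int[][] OUTLINKS = { {1, 2}, {2}, {0} };

    // Accumulate-only model (assumption): a page adds score/outdegree to each
    // target but keeps its own score, so nothing is ever subtracted.
    static double[] accumulateOnly(int cycles) {
        double[] score = new double[OUTLINKS.length];
        Arrays.fill(score, 1.0);
        for (int c = 0; c < cycles; c++) {
            double[] next = score.clone(); // every page keeps what it had
            for (int i = 0; i < OUTLINKS.length; i++) {
                for (int target : OUTLINKS[i]) {
                    next[target] += score[i] / OUTLINKS[i].length;
                }
            }
            score = next;
        }
        return score;
    }

    // Cash-flow model (simplified): a page gives away all of its cash,
    // so the total amount of cash in the graph is conserved.
    static double[] cashFlow(int cycles) {
        double[] cash = new double[OUTLINKS.length];
        Arrays.fill(cash, 1.0);
        for (int c = 0; c < cycles; c++) {
            double[] next = new double[OUTLINKS.length]; // every page starts the cycle empty
            for (int i = 0; i < OUTLINKS.length; i++) {
                for (int target : OUTLINKS[i]) {
                    next[target] += cash[i] / OUTLINKS[i].length;
                }
            }
            cash = next;
        }
        return cash;
    }

    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int cycles : new int[] {1, 10, 20}) {
            System.out.printf("cycles=%d accumulate-only total=%.1f cash-flow total=%.1f%n",
                    cycles, sum(accumulateOnly(cycles)), sum(cashFlow(cycles)));
        }
    }
}
```

In this graph the accumulate-only total doubles with every cycle, so the 
result depends on how many cycles were run, while the cash-flow total stays 
at 3 regardless of the number of cycles.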

I'm going to commit the scoring API soon; this should make it easier to 
experiment with different scoring models.
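
As an example of the kind of experiment I mean, the sketch below compares the 
two combination formulas mentioned above. It does not use the scoring API; the 
class name, method names and the sample values for the two pages are 
hypothetical, and both inputs are assumed to be positive. Note that 
log(opic) + log(docSimilarity) ranks pages the same way as 
opic * docSimilarity, whereas sqrt(opic) * docSimilarity ranks them like 
0.5 * log(opic) + log(docSimilarity), i.e. it halves the weight of the link 
score.

```java
public class ScoreCombinationSketch {
    // Combination suggested in the paper: a sum of logarithms.
    static double sumOfLogs(double opic, double docSimilarity) {
        return Math.log(opic) + Math.log(docSimilarity);
    }

    // Combination described for Nutch: sqrt of the page score used as a boost.
    static double sqrtBoost(double opic, double docSimilarity) {
        return Math.sqrt(opic) * docSimilarity;
    }

    public static void main(String[] args) {
        // Hypothetical pages: A is well linked but a weak match,
        // B is less linked but a better match.
        double opicA = 16.0, simA = 0.25;
        double opicB = 4.0, simB = 0.6;

        System.out.println("sum of logs: A=" + sumOfLogs(opicA, simA)
                + " B=" + sumOfLogs(opicB, simB));
        System.out.println("sqrt boost:  A=" + sqrtBoost(opicA, simA)
                + " B=" + sqrtBoost(opicB, simB));
    }
}
```

With these sample values the sum of logarithms ranks A above B, while the 
sqrt-based boost ranks B above A, so the choice of formula can change the 
order of results.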

> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
>          Key: NUTCH-267
>          URL: http://issues.apache.org/jira/browse/NUTCH-267
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Chris Schneider
>     Priority: Minor
>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the number of inbound links:
>     if (boostByLinkCount)
>       res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some effect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?

