[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ]

Doug Cutting commented on NUTCH-267:
------------------------------------

Andrzej: your analysis is correct, but it mostly applies only when re-crawling.
In an initial crawl, where each url is fetched only once, I think we implement
the OPIC "Greedy" strategy.  The question of what to do when re-crawling has
not been adequately answered, but, glancing at the paper, it seems that
resetting a url's score to zero each time it is fetched might be the best thing
to do, so that it can start accumulating more cash.
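
A rough sketch of the reset-on-fetch idea, for illustration only. It is not
Nutch's OPICScoringFilter: the class name, the in-memory map, and the even
split of cash among outlinks are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Nutch.
public class OpicCashSketch {
    // In-memory stand-in for per-url cash (assumption: Nutch keeps scores in the crawl db).
    private final Map<String, Double> cash = new HashMap<>();

    /** Give a url some cash, e.g. when injecting seeds. */
    public void seed(String url, double amount) {
        cash.merge(url, amount, Double::sum);
    }

    public double cashOf(String url) {
        return cash.getOrDefault(url, 0.0);
    }

    /**
     * Fetch a page: reset its cash to zero, then split what it had
     * evenly among its outlinks. A page with no outlinks simply drops
     * its cash in this simplified sketch.
     */
    public void fetch(String url, List<String> outlinks) {
        double available = cashOf(url);
        // Reset first, so a self-link starts accumulating again immediately.
        cash.put(url, 0.0);
        if (outlinks.isEmpty()) {
            return;
        }
        double share = available / outlinks.size();
        for (String target : outlinks) {
            cash.merge(target, share, Double::sum);
        }
    }

    public static void main(String[] args) {
        OpicCashSketch sketch = new OpicCashSketch();
        sketch.seed("a", 1.0);
        sketch.fetch("a", Arrays.asList("b", "c"));
        System.out.println("a=" + sketch.cashOf("a")
            + " b=" + sketch.cashOf("b")
            + " c=" + sketch.cashOf("c"));
    }
}
```

With the reset in place, a re-fetched url only passes on the cash it has
gathered since its previous fetch, rather than its whole history.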

When ranking, summing logs is the same as multiplying, no?
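
A small check of that identity, as a sketch only. The class and method names
are made up for this example, and it assumes every factor is strictly
positive so that the logarithm is defined.

```java
// Hypothetical illustration; not part of Nutch.
public class LogSumSketch {
    /** Sum of the natural logs of the factors. */
    static double sumOfLogs(double[] factors) {
        double sum = 0.0;
        for (double f : factors) {
            sum += Math.log(f);
        }
        return sum;
    }

    /** Natural log of the product of the factors. */
    static double logOfProduct(double[] factors) {
        double product = 1.0;
        for (double f : factors) {
            product *= f;
        }
        return Math.log(product);
    }

    public static void main(String[] args) {
        double[] factors = {2.0, 3.5, 0.25};
        // The two values agree up to floating-point rounding.
        System.out.println(sumOfLogs(factors));
        System.out.println(logOfProduct(factors));
    }
}
```

Since log is monotonically increasing, ordering documents by the sum of logs
gives the same ranking as ordering them by the product of the factors.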

> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
>          Key: NUTCH-267
>          URL: http://issues.apache.org/jira/browse/NUTCH-267
>      Project: Nutch
>         Type: Bug

>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Chris Schneider
>     Priority: Minor

>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the # of inbound links:
>     if (boostByLinkCount)
>       res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some effect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?
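
For comparison, the two boost calculations described above might be sketched
as follows. This is an illustration, not the indexer's actual code: the class
name, method names, and parameter names are assumptions, and only the
log-based expression is taken from the snippet quoted in the issue.

```java
// Hypothetical illustration; not the Nutch indexer.
public class BoostSketch {
    /** Older behaviour: scale the boost by the log of the inbound link count. */
    static float linkCountBoost(float base, int linkCount, boolean boostByLinkCount) {
        float res = base;
        if (boostByLinkCount) {
            res *= (float) Math.log(Math.E + linkCount);
        }
        return res;
    }

    /** Newer behaviour: the page score raised to scorePower (0.5 is a square root). */
    static float scorePowerBoost(float pageScore, float scorePower) {
        return (float) Math.pow(pageScore, scorePower);
    }

    public static void main(String[] args) {
        // The first value depends on the link count; the second ignores it.
        System.out.println(linkCountBoost(1.0f, 100, true));
        System.out.println(scorePowerBoost(4.0f, 0.5f));
    }
}
```

The second method takes no link count at all, which is the gap the issue
describes.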

