[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]

Ken Krugler commented on NUTCH-230:
-----------------------------------

So Doug beat me to this comment :)

I was going to describe the two cases we'd run into...

1. There's a great page, but most of its links are queries, which we currently 
skip. They aren't "bad" links, just links that we don't yet handle. Still, the 
value of the page gets diluted, because the pages referenced by the few 
non-query links are each "given" only a very small share of its OPIC score.

2. There's a great blog post, but spam software added bogus links to adult 
sites. We blacklist them, but as with #1, the pages referenced by good links on 
the page suffer the consequences.

The way I think about the OPIC score is that the set of pages we've fetched so 
far has an energy level (the sum of the page scores), and OPIC redistributes 
this energy to better account for link info when determining page fetch order. 
So the current code effectively loses some of this energy via bad links.
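
To put rough numbers on that, here is a small sketch in plain Java. It is not 
Nutch code; the class and method names are made up for illustration. It 
compares how much of a page's score actually reaches other pages when the 
divisor is the total link count versus the approved link count.

```java
public class OpicEnergy {
    /**
     * Score that actually reaches outlinked pages when only `approved`
     * of `total` links survive filtering. Hypothetical helper, not Nutch API.
     */
    static double distributed(double pageScore, int total, int approved,
                              boolean divideByApproved) {
        int divisor = divideByApproved ? approved : total;
        if (divisor == 0) {
            return 0.0; // nothing to hand out
        }
        double perLink = pageScore / divisor;
        // Only approved links end up in the crawl output.
        return perLink * approved;
    }

    public static void main(String[] args) {
        double score = 1.0;
        // 10 outlinks, 8 of them filtered (queries, blacklisted, ...).
        System.out.println("divide by total:    " + distributed(score, 10, 2, false));
        System.out.println("divide by approved: " + distributed(score, 10, 2, true));
    }
}
```

With these assumed numbers, dividing by the total passes on only about a fifth 
of the page's score, while dividing by the approved count passes on all of it.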

Anyway, I was also going to propose a config setting if Andrzej or others felt 
strongly that pages should be penalized for filtered links. Otherwise always 
using the count of "approved" (maybe that's a better term than good/bad) links 
to divide up the page score makes sense to me.
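
A minimal sketch of what that could look like, assuming a boolean setting that 
chooses the divisor. Everything here is hypothetical: the class, the 
`penalizeFiltered` flag, and the use of a `Predicate` to stand in for the 
normalize/filter step are illustration only, not the actual ParseOutputFormat 
code or an existing Nutch config property. It also uses java.util.function, 
which is newer than the Java that Nutch 0.8-dev targets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkScoring {
    /**
     * Collects the approved links into `out` and returns the score each
     * of them should receive.
     *
     * @param penalizeFiltered hypothetical config flag: if true, divide by
     *                         the total link count (filtered links still
     *                         "cost" the page); if false, divide by the
     *                         approved count only.
     */
    static double perLinkScore(double pageScore, List<String> links,
                               Predicate<String> approved,
                               boolean penalizeFiltered, List<String> out) {
        // First pass: build the list of post-filter links.
        for (String url : links) {
            if (approved.test(url)) {
                out.add(url);
            }
        }
        int divisor = penalizeFiltered ? links.size() : out.size();
        return divisor == 0 ? 0.0 : pageScore / divisor;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://example.com/a",
            "http://example.com/search?q=x",
            "http://example.com/search?q=y",
            "http://example.com/b");
        // Stand-in filter: skip query URLs, as in case 1 above.
        Predicate<String> noQueries = url -> !url.contains("?");

        List<String> valid = new ArrayList<>();
        double perLink = perLinkScore(1.0, links, noQueries, false, valid);
        // Second pass: emit each approved link with its share of the score.
        for (String url : valid) {
            System.out.println(url + " " + perLink);
        }
    }
}
```

Collecting the approved links first costs an extra list, but it means the 
divisor is known before any score is written to the crawl output.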

> OPIC score for outlinks should be based on # of valid links, not total # of 
> links.
> ----------------------------------------------------------------------------------
>
>          Key: NUTCH-230
>          URL: http://issues.apache.org/jira/browse/NUTCH-230
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Ken Krugler
>     Priority: Minor
>
> In ParseOutputFormat.java, the write() method currently divides the page 
> score by the # of outlinks:
>           score /= links.length;
> It then loops over the links, and any that pass the normalize/filter gauntlet 
> get added to the crawl output.
> But this means that any filtered links result in some amount of the page's 
> OPIC score being "lost".
> For Nutch 0.7, I built a list of valid (post-filter) links, and then used 
> that to determine the per-link OPIC score, after which I iterated over the 
> list, adding entries to the crawl output.

