[ 
https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066982#comment-13066982
 ] 

Julien Nioche commented on NUTCH-1044:
--------------------------------------

I can confirm the issue. The solution is not straightforward and needs a bit 
of thinking.

{quote}
The new CrawlDatum's score isn't set anywhere after the creation so it's 1.0f 
as can be seen on the line 122 of CrawlDatum.java 
(http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
{quote}

The score is set by the initialScore() method of the ScoringFilters; see line 
81 of OPICScoringFilter, which sets it to 0 by default because it expects the 
score to be modified later by the contributions from the inlinks. 

There are several ways in which a URL can get a score:
* specifying the param 'db.score.injected' when injecting (default value = 1.0)
* passing it in the seed list for each individual URL as a value of the 
metadata 'nutch.score'
* from inlinks (depends on the score of the source, number of links etc...)
* from redirection, which is currently broken

The default value of the score in CrawlDatum is 1.0 but this could be changed 
to 0.0. It also has a constructor 

{code}
CrawlDatum(int status, int fetchInterval, float score) 
{code}

which allows its score to be specified. This constructor is used by the 
Fetcher when the redirs are refetched immediately; however, the calls to 
initialScore() currently reset the score to 0 straight away.

We should probably change initialScore() in OPICScoringFilter so that by 
default it leaves the existing scores as they are and change the default value 
in CrawlDatum to 0.0. Using the CrawlDatum constructor above with the score of 
the source of the redir in the code of the Fetcher would fix the issue.
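
As a rough sketch of the idea above, and not the actual Nutch code: the class 
below uses simplified stand-ins (Datum, initialScoreCurrent, 
initialScoreProposed, redirectTarget and PLACEHOLDER_STATUS are all 
hypothetical names) to compare the current behaviour with the proposed one 
for a redirect target.

```java
// Illustrative sketch only. None of these types are the real Nutch classes.
public class RedirectScoreSketch {

    // Hypothetical placeholder, not a real CrawlDatum status constant.
    static final int PLACEHOLDER_STATUS = 1;

    /** Minimal stand-in for CrawlDatum, with the default score at 0.0 as proposed. */
    static final class Datum {
        final int status;
        final int fetchInterval;
        float score = 0.0f;

        Datum(int status, int fetchInterval) {
            this.status = status;
            this.fetchInterval = fetchInterval;
        }

        // Mirrors the shape of CrawlDatum(int status, int fetchInterval, float score).
        Datum(int status, int fetchInterval, float score) {
            this(status, fetchInterval);
            this.score = score;
        }
    }

    /** Behaviour described in the comment: the score is reset to 0. */
    static void initialScoreCurrent(Datum datum) {
        datum.score = 0.0f;
    }

    /** Proposed behaviour: leave the existing score as it is. */
    static void initialScoreProposed(Datum datum) {
        // Intentionally empty: the score set by the caller is kept.
    }

    /** Builds the datum for a redirect target, passing on the score of the source. */
    static Datum redirectTarget(Datum source, boolean proposed) {
        Datum target = new Datum(PLACEHOLDER_STATUS, source.fetchInterval, source.score);
        if (proposed) {
            initialScoreProposed(target);
        } else {
            initialScoreCurrent(target);
        }
        return target;
    }

    public static void main(String[] args) {
        Datum source = new Datum(PLACEHOLDER_STATUS, 2592000, 0.75f);
        System.out.println("current:  " + redirectTarget(source, false).score); // 0.0
        System.out.println("proposed: " + redirectTarget(source, true).score);  // 0.75
    }
}
```

With the current behaviour the score handed to the constructor is lost, 
whereas with the proposed one the redirect target keeps the score of its 
source. Whether the score should be passed on unchanged or divided as for a 
normal link is a separate question that this sketch does not address.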

I will need to look into this to make sure that it has no negative effect, and 
to check the cases where the redirection is obtained from a meta refresh tag in 
the code.

Thanks for reporting it. 

> Redirected URLs and possibly all of their outlinked URLs have invalid scores.
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-1044
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1044
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, parser
>    Affects Versions: 1.3
>            Reporter: Nutch User - 1
>             Fix For: 1.4
>
>
> 1.: 
> http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
> 2.: 
> http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
> Please note that also URLs redirected by meta refresh redirection do have 
> invalid scores. For such URLs a CrawlDatum is created on the lines 157-177 of 
> ParseOutputFormat.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup).
>  The new CrawlDatum's score isn't set anywhere after the creation so it's 
> 1.0f as can be seen on the line 122 of CrawlDatum.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
> It's another question whether the redirected URL's score should be just 
> passed to the new URL or should the redirection be considered as a link in 
> which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' 
> + 1).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
