[ 
https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660394#action_12660394
 ] 

Dennis Kubes commented on NUTCH-572:
------------------------------------

I would like to close this issue.  Redirect handling has undergone significant 
changes since this issue was opened and we still need to take a hard look at 
redirects and possibly how scores are represented.  However, the newer scoring 
and indexing frameworks do work around this issue.

> Scoring and redirected Urls
> ---------------------------
>
>                 Key: NUTCH-572
>                 URL: https://issues.apache.org/jira/browse/NUTCH-572
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>
> When a redirect is found for a given url, the new or end url is stored as the
> content page and the old CrawlDatum gets one of a few redirect codes.  The
> page that gets indexed in Nutch is the end page and it gets indexed under the
> end url.  Many times a site will have a significant number of links pointing
> to the start page and very few pointing to the redirected end page.  This is
> especially true for external links.  OPIC scores do not get transferred to the
> end page but stay with the start page (the one doing the redirecting).  But
> the start page doesn't get indexed.  Hence the end page will show up in the
> index but with a usually much reduced score.  A good example of this is
> cnn.com:
> URL: http://www.cnn.com/
> Version: 6
> Status: 5 (db_redir_perm)
> Fetch time: Tue Dec 04 11:02:09 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 51.19438
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> which redirects to http://www.cnn.com/?refresh=1
> URL: http://www.cnn.com/?refresh=1
> Version: 6
> Status: 2 (db_fetched)
> Fetch time: Tue Dec 04 11:02:11 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> Now cnn, which should be one of the highest, if not the highest, ranking site
> in the index for keywords such as news, in fact doesn't show up in the index
> and its redirected end page appears much farther down in search results.  My
> proposal is we somehow make OPIC scores follow redirects.  To do this we
> would most likely need to store a start and end url for redirected urls.
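
As a rough illustration of the proposal quoted above, here is a minimal Python sketch of scores following redirects once a start url records its end url. Everything in it is an assumption made for illustration: the Entry class, the redirect_to field, and both function names are hypothetical and are not Nutch's CrawlDatum or any Nutch API. Adding the start page's score to the end page's score is also just one possible policy, not something the issue specifies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a crawl db record (not Nutch's CrawlDatum)."""
    status: str
    score: float
    redirect_to: Optional[str] = None  # the stored "end url", if this page redirects


def resolve_end_url(db: Dict[str, Entry], url: str, max_hops: int = 5) -> str:
    """Follow stored redirect targets until a non-redirecting entry, a loop,
    an unknown url, or the hop limit is reached."""
    seen = {url}
    for _ in range(max_hops):
        entry = db.get(url)
        if entry is None or entry.redirect_to is None:
            break
        nxt = entry.redirect_to
        if nxt in seen:  # redirect loop: stop where we are
            break
        seen.add(nxt)
        url = nxt
    return url


def effective_scores(db: Dict[str, Entry]) -> Dict[str, float]:
    """Return, for every non-redirecting page, its own score plus the scores
    of all pages that redirect to it (assumed policy: simple addition)."""
    result = {u: e.score for u, e in db.items() if e.redirect_to is None}
    for url, entry in db.items():
        if entry.redirect_to is None:
            continue
        end = resolve_end_url(db, url)
        # Only credit a known, non-redirecting end page.
        if end in result:
            result[end] += entry.score
    return result


# The two records from the cnn.com example in the issue description.
db = {
    "http://www.cnn.com/": Entry(
        "db_redir_perm", 51.19438, redirect_to="http://www.cnn.com/?refresh=1"
    ),
    "http://www.cnn.com/?refresh=1": Entry("db_fetched", 1.0),
}
scores = effective_scores(db)
```

With the two cnn.com records, the end page would be ranked using both its own score and the score accumulated by the start page, rather than 1.0 alone. The hop limit and loop check are there only so that redirect chains and cycles cannot make the lookup run forever.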

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
