[ https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660394#action_12660394 ]
Dennis Kubes commented on NUTCH-572:
------------------------------------

I would like to close this issue. Redirect handling has undergone significant changes since this issue was opened, and we still need to take a hard look at redirects and possibly at how scores are represented. However, the newer scoring and indexing frameworks do work around this issue.

> Scoring and redirected Urls
> ---------------------------
>
>                 Key: NUTCH-572
>                 URL: https://issues.apache.org/jira/browse/NUTCH-572
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>
> When a redirect is found for a given url, the new or end url is stored as the
> content page and the old CrawlDatum gets one of a few redirect codes. The
> page that gets indexed in Nutch is the end page and it gets indexed under the
> end url. Many times a site will have a significant number of links pointing
> to the start page and very few pointing to the redirected end page. This is
> especially true for external links. OPIC scores do not get transferred to the
> end page but stay with the start page (the one doing the redirecting). But
> the start page doesn't get indexed. Hence the end page will show up in the
> index but under a usually much reduced score.
> A good example of this is cnn.com:
>
> URL: http://www.cnn.com/
> Version: 6
> Status: 5 (db_redir_perm)
> Fetch time: Tue Dec 04 11:02:09 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 51.19438
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
>
> which redirects to http://www.cnn.com/?refresh=1
>
> URL: http://www.cnn.com/?refresh=1
> Version: 6
> Status: 2 (db_fetched)
> Fetch time: Tue Dec 04 11:02:11 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
>
> Now cnn, which should be one of the highest, if not the highest, ranking sites
> in the index for keywords such as news, in fact doesn't show up in the index,
> and its redirected end page appears much farther down in search results. My
> proposal is we somehow make OPIC scores follow redirects. To do this we
> would most likely need to store a start and end url for redirected urls.
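
The proposal in the quoted issue (make OPIC scores follow redirects by recording a start and end url) could be sketched roughly as below. This is only an illustration under stated assumptions, not Nutch code: the function name follow_redirects, the two plain dictionaries, and the choice to add the start page's score to the end page's score are all hypothetical. The only values taken from the issue are the two cnn.com urls and their scores.

```python
# Hypothetical sketch, not Nutch code. The data layout and the additive
# merge policy are assumptions made for illustration only.

def follow_redirects(scores, redirects):
    """Return a new dict in which each redirecting url's score is
    credited to the url at the end of its redirect chain."""

    def resolve(url):
        # Walk the start -> end mapping. The seen set only guarantees
        # that the walk terminates if the redirects form a loop.
        seen = set()
        while url in redirects and url not in seen:
            seen.add(url)
            url = redirects[url]
        return url

    merged = {}
    for url, score in scores.items():
        end = resolve(url)
        # Assumed policy: sum the scores that land on the same end url.
        merged[end] = merged.get(end, 0.0) + score
    return merged


# Scores and urls from the cnn.com example in the issue.
scores = {
    "http://www.cnn.com/": 51.19438,
    "http://www.cnn.com/?refresh=1": 1.0,
}
# Start url -> end url, the extra information the proposal asks to store.
redirects = {"http://www.cnn.com/": "http://www.cnn.com/?refresh=1"}

merged = follow_redirects(scores, redirects)
print(merged)
```

With this policy the indexed end page carries the score accumulated by the start page instead of the default 1.0. Whether scores should be summed, replaced, or handled some other way is left open by the issue.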