[ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540869 ]
Doğacan Güney commented on NUTCH-547:
-------------------------------------

Dennis, thanks for the update and testing. I have been a bit unresponsive lately; I hope that this will change (very) soon.

I actually was going to commit it a while back, but Andrzej said that he was working on adding alias support, etc. (as outlined in his earlier email to nutch-dev), which included this patch as part of his effort. It probably fell off his radar during his move and everything.

Anyway, I feel that lately we have been too reluctant to commit new stuff. So I am also +1 for committing it. If we break something or miss something, we can always fix it later :)

So I am going to give this one a day or two (say, until Sunday) and commit it if there are no objections. Is this OK with you, Dennis?

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
> Key: NUTCH-547
> URL: https://issues.apache.org/jira/browse/NUTCH-547
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: NUTCH-547-3.patch, redirect_draft.patch, redirect_draft_v2.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See
> http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
> for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/", which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
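
For readers following the example in the issue description, the flow (fetch a redirect, record the representative form in the target page's datum metadata, then rewrite the "url" field at index time) can be sketched roughly as below. This is only an illustration in Python, not code from the attached patches: the dict-based crawl db, the `repr_url` metadata key, and the function names `record_redirect` and `build_index_doc` are all hypothetical. The choice of the representative URL is taken as an input here; the 'choosing' algorithm itself is not reproduced.

```python
# Hypothetical metadata key; the name is an assumption, not taken from the patch.
REPR_URL_KEY = "repr_url"


def record_redirect(crawl_db, target_url, representative_url):
    """Store the chosen representative URL on the redirect target's datum."""
    datum = crawl_db.setdefault(target_url, {"metadata": {}})
    datum["metadata"][REPR_URL_KEY] = representative_url
    return datum


def build_index_doc(url, datum, content):
    """Build an index document, preferring the recorded representative URL."""
    repr_url = datum["metadata"].get(REPR_URL_KEY, url)
    return {"url": repr_url, "content": content}


source = "http://www.milliyet.com.tr/"
target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39"

crawl_db = {}  # stand-in for the real crawl db: maps URL -> datum
# In the example from the issue, the source URL is the representative form.
datum = record_redirect(crawl_db, target, representative_url=source)
doc = build_index_doc(target, datum, content="<html>...</html>")
print(doc["url"])
```

The content fetched from the second page is kept; only the "url" field of the indexed document is replaced. A page with no recorded representative form falls back to its own URL.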