[ 
https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526258
 ] 

Andrzej Bialecki  commented on NUTCH-547:
-----------------------------------------

> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from 
> > fetchDatum with the value from dbDatum [..]

I'm still not sure about this issue - could you please clarify?

> Perhaps we can add reprUrl to a "repr" field instead?

Shouldn't this be the other way around? The idea of your patch is to put the 
data under the reprUrl, so in order to minimize code changes you replace the 
original url with reprUrl. This way we lose the value of the original url, so 
it seems to me that if we want to preserve it, we should add it to an "orig" 
field.
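
To make the suggestion concrete, here is a minimal sketch in plain Java. It 
does not use the actual Indexer code or any Nutch API: the class and method 
names are hypothetical, and only the "url" and "orig" field names come from 
the discussion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReprUrlFields {

    /**
     * Hypothetical helper (not part of Nutch): builds the URL-related index
     * fields for one page. If a representative URL is known and differs from
     * the fetched URL, it goes into "url" and the fetched URL is preserved
     * in "orig". Otherwise only "url" is set.
     */
    static Map<String, String> buildFields(String fetchedUrl, String reprUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (reprUrl != null && !reprUrl.equals(fetchedUrl)) {
            fields.put("url", reprUrl);     // data is indexed under the representative form
            fields.put("orig", fetchedUrl); // the original url is not lost
        } else {
            fields.put("url", fetchedUrl);  // no redirect information: keep the url as-is
        }
        return fields;
    }

    public static void main(String[] args) {
        // URLs taken from the example in the issue description.
        System.out.println(buildFields(
                "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39",
                "http://www.milliyet.com.tr/"));
    }
}
```

With this layout the rest of the code keeps reading "url" as before, and 
anything that needs the fetched address can look at "orig".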

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described
> in Yahoo's help page. It makes no attempt to handle aliases in any way. (See
> http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
> for the discussion about alias handling.)
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/", which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
