[jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm

JIRA Tue, 04 Sep 2007 04:43:17 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524693 ]


Doğacan Güney commented on NUTCH-547:
-------------------------------------

Thanks a lot for the quick review, Andrzej.

> * the patch uses a strange diff format ... the first lines of context diffs 
> appear on the same lines as chunk coordinates. 

Sorry about that. I am using git-svn (which, by the way, is an awesome tool) to 
develop Nutch, so I may have forgotten to use "svn diff" for the patch.

> * in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is 
> the same as the urlString? We should skip the 
> redirect then. 

We don't follow reprUrl; we follow newUrl, which is tested for equality with 
urlString. However, we should probably avoid writing reprUrl to the crawldatum 
metadata if it is the same as urlString.
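
A minimal sketch of the two checks described above, not the patch's actual 
code. The class name, the method names, the metadata key "_repr_" and the use 
of a plain Map in place of the crawldatum metadata are all assumptions made 
for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the redirect handling discussed above;
// it does not use any Nutch classes.
final class RedirectSketch {
    // Hypothetical metadata key; the draft patch may use a different name.
    static final String REPR_URL_KEY = "_repr_";

    // The fetcher follows newUrl (not reprUrl), and only when it differs
    // from the URL that was just fetched.
    static boolean shouldFollow(String urlString, String newUrl) {
        return newUrl != null && !newUrl.equals(urlString);
    }

    // Store reprUrl only when it differs from urlString, so the metadata
    // never records a URL as its own representative form.
    static void recordReprUrl(Map<String, String> metadata,
                              String urlString, String reprUrl) {
        if (reprUrl != null && !reprUrl.equals(urlString)) {
            metadata.put(REPR_URL_KEY, reprUrl);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        recordReprUrl(metadata, "http://example.org/", "http://example.org/");
        System.out.println(metadata.isEmpty());
    }
}
```

Keeping the two checks separate reflects the point made above: the equality 
test on newUrl decides whether to follow, while the test on reprUrl only 
decides whether the metadata entry is worth writing.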

> * the repeating parsing of refreshTime should be hidden in a utility method 
> in ParseStatus - although the proper way to 
> support this would be to extend ParseStatus to store this int value if 
> necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we
> would have to bump the version number, too).

Good point. Will look into that.

> * minimum refreshTime should be at least a constant, or configurable, and not 
> a literal. Similarly the redirType should be a 
> constant. 

This patch is only a rough draft. I will fix all such issues in a later patch.
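
For illustration, the two cleanups requested above (one shared parsing helper, 
and a named minimum instead of a literal) might look roughly like this. It is 
only a sketch: the class, the method names and the default value of 1 are 
hypothetical, and it assumes the refresh time arrives as a string.

```java
// Hypothetical helper; in the real code this logic would live in or near
// ParseStatus, as suggested in the review.
final class RefreshTimeSketch {
    // Hypothetical default minimum; the draft's actual literal is not
    // given in this thread. A configurable value would override this.
    static final int DEFAULT_MIN_REFRESH_TIME = 1;

    // Parse the refresh time once, in one place, instead of repeating the
    // parsing at every call site. Returns the fallback when the value is
    // missing or not a valid integer.
    static int parseRefreshTime(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Compare against a named (and potentially configurable) minimum
    // rather than a literal scattered through the code.
    static boolean isBelowMinimum(int refreshTime, int minRefreshTime) {
        return refreshTime < minRefreshTime;
    }

    public static void main(String[] args) {
        int t = parseRefreshTime(" 5 ", 0);
        System.out.println(isBelowMinimum(t, DEFAULT_MIN_REFRESH_TIME));
    }
}
```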

> * if we change the "url" field in BasicIndexingFilter, shouldn't we also 
> change the "site" and "host" fields? [...]

Wow, can't believe I missed that. 
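
A rough sketch of the fix, assuming "site" and "host" are both derived from 
the host part of whichever URL ends up in the "url" field. The class and 
method names are hypothetical, and a plain Map stands in for the index 
document.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not BasicIndexingFilter's actual code.
final class IndexFieldsSketch {
    // Build "url", "host" and "site" from the same URL so the three
    // fields cannot disagree after "url" is rewritten to reprUrl.
    static Map<String, String> fieldsFor(String indexedUrl) {
        String host;
        try {
            host = new URI(indexedUrl).getHost();
        } catch (URISyntaxException e) {
            host = null; // leave host/site unset for unparsable URLs
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", indexedUrl);
        if (host != null) {
            fields.put("host", host);
            fields.put("site", host);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("http://www.milliyet.com.tr/"));
    }
}
```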

> [..] We could also consider adding reprUrl as an additional value for the 
> same "url" field - this way we would get hits both on
>  the original url and the reprUrl. 

This may cause problems with dedup, which assumes that the "url" field has a 
single value. Also, it may be difficult to decide which value of "url" to show 
in the web UI. I also like the fact that "url" is like a UNIQUE KEY for the 
document. If we allow "url" to have multiple values, we lose that. 

Perhaps we can add reprUrl to a "repr" field instead?
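
Something along these lines, as a sketch only: "url" keeps exactly one value 
and reprUrl goes into its own "repr" field. The class name and the Map-based 
document are assumptions for illustration, not the indexer's real API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the "repr" field idea.
final class ReprFieldSketch {
    static Map<String, String> buildFields(String url, String reprUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        // "url" stays single-valued, so dedup and the web UI can keep
        // treating it as the unique key of the document.
        fields.put("url", url);
        // reprUrl is searchable through its own field; it is skipped
        // when it would just duplicate "url".
        if (reprUrl != null && !reprUrl.equals(url)) {
            fields.put("repr", reprUrl);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(buildFields(
            "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39",
            "http://www.milliyet.com.tr/"));
    }
}
```

With this layout a query can search both "url" and "repr" to get hits on 
either form, without "url" ever holding more than one value.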

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described 
> in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See 
> http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
>  for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
