[ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129850#comment-13129850
 ] 

Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

I think this makes sense, and yes, number two above is important: the 
duplicate problem grows with crawl job size, which leads to big index 
maintenance problems. 

Radim, can you please update your patch? It does not apply cleanly to the 
most recent trunk. Once this has been done we can make a decision. Thank you.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The basic URL normalizer lacks 2 important features:
> 1. Encoding spaces in URLs as %20, to unbreak httpclient and possibly others 
> that do not expect a space inside a URL.
> 2. The ability to decode percent-escapes such as %33 in URLs. This is 
> important for avoiding duplicates.
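
For illustration, the two features requested in the issue could be sketched 
in Java roughly as follows. This is not the attached nutch.diff and not 
Nutch's actual normalizer code; the class name UrlEscapeSketch and the method 
normalizeEscapes are hypothetical. It also assumes that only the RFC 3986 
"unreserved" characters (letters, digits, '-', '.', '_', '~') are safe to 
decode, since decoding an escape such as %2F would change the URL's meaning.

```java
// Hypothetical sketch, not the Nutch implementation.
final class UrlEscapeSketch {

    private UrlEscapeSketch() {}

    // Returns the value of a hex digit, or -1 if c is not a hex digit.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Assumption: only RFC 3986 "unreserved" characters are decoded.
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String normalizeEscapes(String url) {
        StringBuilder out = new StringBuilder(url.length() + 8);
        int i = 0;
        while (i < url.length()) {
            char ch = url.charAt(i);
            if (ch == ' ') {
                // Feature 1: a literal space becomes %20.
                out.append("%20");
                i++;
                continue;
            }
            if (ch == '%' && i + 2 < url.length()) {
                int hi = hexValue(url.charAt(i + 1));
                int lo = hexValue(url.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        // Feature 2: decode escapes of unreserved characters.
                        out.append((char) value);
                    } else {
                        // Keep all other escapes exactly as written.
                        out.append(url, i, i + 3);
                    }
                    i += 3;
                    continue;
                }
            }
            // Ordinary character, or a '%' not followed by two hex digits.
            out.append(ch);
            i++;
        }
        return out.toString();
    }
}
```

With this sketch, "http://example.com/%33" and "http://example.com/3" 
normalize to the same string, which is the duplicate case the issue 
describes, while "http://example.com/a%2Fb" is left unchanged.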

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
