[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129850#comment-13129850 ]
Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

I think this makes sense, but yes, number two above is important: the problem scales with crawl job size, which means big problems for index maintenance.

Radim, can you please update your patch? It does not apply cleanly to the most recent trunk. Once this has been done we can make a decision.

Thank you

> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do
> not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding
> duplicates
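
For illustration, the two features requested in the issue description (encoding spaces as %20, and decoding escapes such as %33 so that equivalent URLs stop appearing as duplicates) could be sketched roughly as below. This is a minimal Python sketch, not the code in nutch.diff and not Nutch's actual normalizer. The names `normalize_url`, `UNRESERVED` and `ESCAPE` are hypothetical. Restricting decoding to the RFC 3986 "unreserved" characters, and upper-casing the hex digits of escapes that are kept, are assumptions made here rather than anything stated in the issue.

```python
import re

# Assumption: only escapes of RFC 3986 "unreserved" characters are decoded,
# because decoding those never changes what the URL refers to.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# Matches one percent-escape such as %33 or %2f.
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _decode_if_unreserved(match):
    """Decode a single escape if it stands for an unreserved character."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    # Keep the escape, but upper-case the hex digits so that %2f and %2F
    # normalize to the same string (an extra step assumed for this sketch).
    return "%" + match.group(1).upper()


def normalize_url(url):
    """Hypothetical helper: encode spaces, then decode safe escapes."""
    # Feature 1: a literal space becomes %20.
    url = url.replace(" ", "%20")
    # Feature 2: escapes of unreserved characters are decoded.
    return ESCAPE.sub(_decode_if_unreserved, url)
```

With this sketch, `http://example.com/page%33.html` and `http://example.com/page3.html` normalize to the same string, which is the duplicate-avoidance effect the second feature is after. Escapes of reserved characters such as `%2F` are deliberately left encoded, since decoding them could change the meaning of the URL.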