[ https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-18.
------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
>                 Key: NUTCH-18
>                 URL: https://issues.apache.org/jira/browse/NUTCH-18
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Stefan Groschupf
>            Priority: Minor
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
>
> While spidering our intranet, I found that IIS may include
> illegal characters in URLs -- specifically, characters with
> the high bit set to produce non-English letters. In
> addition, both Firefox and IE will accept URLs with
> high-bit characters, but Java won't.
>
> While this may not be Nutch's (or Java's) fault, it would
> help if high-bit characters (and other illegal characters)
> in URLs could be escaped (using percent-hex notation)
> as part of the URL fix-up process, probably right after
> the hostname lower-case conversion.
>
> Example document name in Portuguese (with high-bit
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
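
The escaping step requested in the issue could look roughly like the sketch below. This is not Nutch code: the class name `UrlEscaper` and the method `escapeIllegal` are hypothetical, and the set of characters left untouched (ASCII letters and digits, the RFC 3986 reserved and unreserved punctuation, and `%` so that existing escapes such as `%20` are not double-encoded) is an assumption. The charset is a parameter because the issue's example (`%e7%e3`) matches ISO-8859-1 bytes, whereas a server using UTF-8 would expect different escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: percent-escapes characters
// that are not legal in a URL, leaving existing %XX escapes alone.
final class UrlEscaper {

    // ASCII punctuation passed through unchanged (assumed set):
    // RFC 3986 unreserved and reserved characters, plus '%'.
    private static final String SAFE = "-._~:/?#[]@!$&'()*+,;=%";

    static String escapeIllegal(String url, Charset charset) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            boolean asciiSafe = cp < 0x80
                    && (Character.isLetterOrDigit(cp) || SAFE.indexOf(cp) >= 0);
            if (asciiSafe) {
                out.append((char) cp);
            } else {
                // Encode the character in the chosen charset and emit
                // each resulting byte as a lower-case %xx escape.
                byte[] bytes = url.substring(i, i + len).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02x", b & 0xff));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Alteração" written with Unicode escapes so the source file's
        // own encoding does not matter.
        String name = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        // ISO-8859-1 reproduces the escaped form quoted in the issue:
        // Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
        System.out.println(escapeIllegal(name, StandardCharsets.ISO_8859_1));
    }
}
```

Passing `StandardCharsets.UTF_8` instead would yield `%c3%a7%c3%a3` for the same two letters. Which form the server resolves depends on how it decodes request paths, so the charset would have to be chosen per site or made configurable.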