[ https://issues.apache.org/jira/browse/NUTCH-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel closed NUTCH-1685.
----------------------------------
Resolution: Duplicate

You are right, [~markus17].

> URLUtil.toUNICODE fails on IDNs
> -------------------------------
>
> Key: NUTCH-1685
> URL: https://issues.apache.org/jira/browse/NUTCH-1685
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7, 2.2.1
> Environment: Java 7, OpenJDK 64-Bit, 1.7.0_25
> Reporter: Sebastian Nagel
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1685-2x-test.patch
>
>
> URLUtil.toUNICODE() fails on IDNs and returns null instead of the Unicode
> URL. The constructor of URI obviously does not accept IDN host names. For
> {{http://www.xn--evir-zoa.com/}} the constructor IDN() throws the exception:
> {code}
> java.net.URISyntaxException: Illegal character in hostname at index 11:
> http://www.çevir.com/
> {code}
> Principally, IDN.toUnicode() can convert URLs (not only domain or host
> names). However, it does not convert URLs with host part consisting of only
> two parts: {{http://xn--uni-tbingen-xhb.de/}}. Is that the reason why we need
> URLUtil.toUNICODE() ?
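
For illustration, here is a minimal sketch of one way to avoid both problems described in the report: parse the ASCII (Punycode) URL first, convert only the host with IDN.toUnicode(), and rebuild the result by string concatenation instead of passing the Unicode host back into a URI constructor. The class IdnUrlSketch and the helper toUnicodeUrl() are hypothetical names used only for this example; this is not the code of Nutch's URLUtil.toUNICODE() nor the fix applied in the duplicate issue.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

final class IdnUrlSketch {

    /**
     * Hypothetical helper (not Nutch's URLUtil.toUNICODE()): returns the URL
     * with its host converted from Punycode to Unicode. Assumption for this
     * sketch: if the input cannot be parsed or has no host, it is returned
     * unchanged.
     */
    static String toUnicodeUrl(String asciiUrl) {
        URI uri;
        try {
            // The Punycode form is plain ASCII, so parsing it is unproblematic.
            uri = new URI(asciiUrl);
        } catch (URISyntaxException e) {
            return asciiUrl;
        }
        String host = uri.getHost();
        if (uri.getScheme() == null || host == null) {
            return asciiUrl;
        }
        // Rebuild by concatenation; the Unicode host never goes through
        // a URI constructor again.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        // Only the host name is handed to IDN.toUnicode(), not the whole URL.
        sb.append(IDN.toUnicode(host));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two URLs mentioned in the issue description.
        System.out.println(toUnicodeUrl("http://www.xn--evir-zoa.com/"));
        System.out.println(toUnicodeUrl("http://xn--uni-tbingen-xhb.de/"));
    }
}
```

Passing only the host to IDN.toUnicode() may also sidestep the two-part-host case from the report. A plausible explanation, not verified against the JDK sources here, is that IDN.toUnicode() works label by label on dot-separated input, so for a whole URL such as http://xn--uni-tbingen-xhb.de/ the first label would be "http://xn--uni-tbingen-xhb", which does not start with the "xn--" prefix.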