[ https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407981#comment-16407981 ]

Markus Jelsma commented on NUTCH-2541:
--------------------------------------

This is probably not a 1.14 problem; we fixed it some versions ago in the BasicURLNormalizer.

> Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2541
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Major
>
> As reported on [1].
> When trying to crawl some URLs with Arabic characters, Nutch will complain due to an {{InvalidArgumentException}}. This happens because the HTTP client library internally uses {{java.net.URI}}, which does not support these characters unless they are properly escaped.
> [1] https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
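
To illustrate the kind of escaping the issue description asks for, here is a minimal Java sketch that percent-encodes every non-ASCII character of a URL as its UTF-8 octets. It is not the code used by BasicURLNormalizer or by protocol-httpclient; the class name UrlEscaper, the method escapeNonAscii, and the example.com URL are hypothetical, and the sketch assumes the ASCII part of the URL is already valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: escapes only non-ASCII characters.
final class UrlEscaper {

    // Replaces each non-ASCII code point with the %XX form of its UTF-8 bytes.
    // ASCII characters (including existing %XX escapes) are copied unchanged.
    public static String escapeNonAscii(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp); // 2 for surrogate pairs, else 1
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                byte[] utf8 = url.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0645 (ARABIC LETTER MEEM) is the two UTF-8 bytes D9 85.
        System.out.println(escapeNonAscii("http://example.com/\u0645"));
        // prints http://example.com/%D9%85
    }
}
```

A URL escaped this way contains only ASCII characters, which is the form the issue says the HTTP client library expects. Whether the escaping belongs in a normalizer or in the protocol plugin is a design choice this sketch leaves open.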