[ https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408075#comment-16408075 ]
Jorge Luis Betancourt Gonzalez commented on NUTCH-2541:
-------------------------------------------------------

I've tested against master and I still see the same issue:

{code}
➜ local (master) ✔ bin/nutch parsechecker "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
Fetch failed with protocol status: exception(16), lastModified=0: java.lang.IllegalArgumentException: Invalid uri 'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html': escaped absolute path not valid
{code}

> Arabic characters in the URL path are not properly escaped by the
> protocol-httpclient plugin
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2541
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Major
>
> As reported on [1].
> When trying to crawl some URLs with Arabic characters, Nutch fails with an
> {{IllegalArgumentException}}. This happens because the HTTP client
> library internally uses {{java.net.URI}}, which does not support these
> characters unless they are properly escaped.
>
> [1]
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
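
For illustration only, here is a minimal Java sketch of the kind of escaping the issue description refers to: each non-ASCII character is replaced by the percent-encoded form of its UTF-8 bytes before the URL reaches the HTTP client. The class {{UrlEscapeSketch}} and the method {{escapeNonAscii}} are hypothetical names, not part of Nutch or of the protocol-httpclient plugin, and this is not necessarily how the issue was or will be fixed. The sketch assumes the host name is already ASCII (an internationalized host would need IDN handling instead) and that the input is not already percent-encoded in a different charset.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class UrlEscapeSketch {

    // Hypothetical helper, not part of Nutch: percent-encodes every non-ASCII
    // character as its UTF-8 bytes and leaves all ASCII characters untouched,
    // so existing escapes such as %20 are not encoded a second time.
    static String escapeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length() * 3);
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(v >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html";
        String escaped = escapeNonAscii(raw);
        // URI.create throws IllegalArgumentException if the string is not a valid URI.
        System.out.println(URI.create(escaped));
    }
}
```

Because only bytes outside the ASCII range are touched, reserved characters such as {{/}}, {{?}} and {{#}} keep their meaning; characters that are ASCII but still illegal in a URI (a space, for example) are deliberately out of scope for this sketch.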