[ https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219602#comment-16219602 ]
Hiran Chaudhuri commented on NUTCH-2452:
----------------------------------------

It seems I am able to fix the problem with this line in the constructor org.apache.nutch.protocol.ftp.FtpResponse(URL, CrawlDatum, Ftp, Configuration):

{{path = java.net.URLDecoder.decode(path, "UTF-8");}}

> Problem retrieving encoded URLs via FTP?
> ----------------------------------------
>
>                 Key: NUTCH-2452
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2452
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>   Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
>                      OpenJDK 1.8.0_131
>                      nutch 1.14-SNAPSHOT
>                      Synology RS816
>            Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not included
> in Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl
> ftp://nas.
> The results vary and seem to point to problems within Nutch. However, this
> may need further evaluation.
> As some files could not be downloaded and I could not see a good error
> message, I changed the method
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not
> only return the protocol status but also send the full exception and stack
> trace to the logs:
> {{
>     } catch (Exception e) {
>       LOG.warn("Could not get {}", url, e);
>       return new ProtocolOutput(null, new ProtocolStatus(e));
>     }
> }}
> With this modification I now see messages like these in the logfile:
> {{
> 2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> 2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
>     at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> }}
> Please note that the URL was not configured by me. Instead it was obtained by
> crawling my NAS. Also, the URL looks perfectly fine to me. Moreover, opening
> the same URL in Firefox with the same authentication data displays the
> directory successfully. Therefore I suspect the FTP client does not decode
> the URL into a form the FTP server understands.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
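
For illustration, the one-line fix from the comment at the top can be tried in isolation. This is only a minimal sketch and not the actual Nutch code: the class name FtpPathDecodeSketch and the helper decodePath are hypothetical, and the sample URL is the one from the log above. It assumes the path handed to the FTP client comes from java.net.URL.getPath(), which returns the path still percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLDecoder;

// Hypothetical stand-alone class; it only illustrates the decoding step
// and is not part of org.apache.nutch.protocol.ftp.
class FtpPathDecodeSketch {

    // Hypothetical helper: turns a percent-encoded URL path into the
    // literal path an FTP server would expect (e.g. "%20" becomes " ").
    static String decodePath(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/");

        // getPath() does not decode: the "%20" is still present here.
        String path = url.getPath();
        System.out.println(path);

        // After decoding, the directory name contains a real space.
        System.out.println(decodePath(path));
    }
}
```

One caveat with this approach: java.net.URLDecoder implements HTML form decoding, so it also turns a literal "+" into a space. A file or directory whose name contains "+" might therefore still be requested under the wrong name, and a decoder that handles only percent escapes may be the safer choice.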