[ 
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219602#comment-16219602
 ] 

Hiran Chaudhuri commented on NUTCH-2452:
----------------------------------------

It seems I am able to fix the problem by adding this line to the constructor 
org.apache.nutch.protocol.ftp.FtpResponse(URL, CrawlDatum, Ftp, Configuration):

{{path = java.net.URLDecoder.decode(path, "UTF-8");}}
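A minimal, standalone sketch of what that line does to the path from the log in the issue description. The class FtpPathDecodeSketch and the helper decodePath are hypothetical names for illustration only and are not part of Nutch; the sketch only assumes that the path reaches the FTP client still percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Nutch.
class FtpPathDecodeSketch {

    // Percent-decodes a raw URL path, mirroring the one-line fix above.
    static String decodePath(String rawPath) {
        try {
            // Same call as in the fix; this overload is available on Java 8.
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM must support, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // URL taken from the log output in the issue description.
        URI uri = URI.create(
            "ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/");
        String raw = uri.getRawPath();   // still contains "%20"
        System.out.println(raw);
        System.out.println(decodePath(raw)); // "%20" replaced by a space
    }
}
```

One caveat with this approach: java.net.URLDecoder implements form decoding, so it also turns a literal '+' into a space. A directory or file name containing '+' might therefore need different handling.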



> Problem retrieving encoded URLs via FTP?
> ----------------------------------------
>
>                 Key: NUTCH-2452
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2452
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>            Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not supported 
> by Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The results vary, which seems to point to problems within Nutch. However, 
> this may need further evaluation.
> As some files could not be downloaded and I could not see a useful error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also write the full exception and stack 
> trace to the logs:
> {{
> } catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I now see messages like these in the log file:
> {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
> ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> 2017-10-25 14:14:37,512 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> }}
> Please note that the URL was not configured by me. Instead it was obtained by 
> crawling my NAS. Also, the URL looks perfectly fine to me. What is more, 
> Firefox displays the directory successfully with the same authentication data 
> and the same URL. Therefore I suspect the FTP client does not decode the URL 
> into a form the FTP server understands.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
