[ https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241866#comment-16241866 ]
Hiran Chaudhuri edited comment on NUTCH-2451 at 11/7/17 11:44 AM:
------------------------------------------------------------------

Let's assume no suitable URLStreamHandler is registered. The PluginRepository, as it carries my proposed changes from NUTCH-2429, is registered as URLStreamHandlerFactory. So it definitely should be involved when the ftp:// URL is constructed. Here it either finds a suitable URLStreamHandler provided by a plugin, or it falls back to the JVM default handlers, which definitely can handle ftp:// URLs.

That a suitable URLStreamHandler is found, either by the URLStreamHandlerFactory or by the JVM, is evident: I only provided the ftp://nas URL, and Nutch crawled successfully until it found the offending URL ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so. That would not have worked if FTP support were missing completely.

*Therefore I believe the assumption is wrong. A suitable URLStreamHandler is available at runtime.*

Upon further analysis I find that the stack trace points to org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145), which boils down to
{{u = new URL(response.getHeader("Location"));}}
This means the URL being constructed is not the FTP URL we see in the log output but the value of a header, which may not have been set by the protocol-ftp plugin.

*Therefore I do not agree that NUTCH-2429 could be related to, or even be the cause of, this problem.*


was (Author: hiranchaudhuri):
    Let's assume no suitable URLStreamHandler is registered. The PluginRepository, as it carries my proposed changes from NUTCH-2429, is registered as URLStreamHandlerFactory. So it definitely should be involved when the ftp:// URL is constructed. Here it either finds a suitable URLStreamHandler provided by a plugin, or it falls back to the JVM default handlers, which definitely can handle ftp:// URLs.

The fact that a suitable URLStreamHandler is found, either by the URLStreamHandlerFactory or by the JVM, is evident: I only provided the ftp://nas URL, and Nutch crawled successfully until it found the offending URL ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so. That would not have worked if FTP support were missing completely.

*Therefore I believe the assumption is wrong. A suitable URLStreamHandler is available at runtime.*

Upon further analysis I find that the stack trace, pointing to org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145), which boils down to
{{u = new URL(response.getHeader("Location"));}}
means the URL being constructed is not the FTP URL we see in the output but the value of a header, which may not have been set by the protocol-ftp plugin.

*Therefore I do not agree that NUTCH-2429 could be related to, or even be the cause of, this problem.*

> MalformedURLExceptions on perfectly looking URLs?
> -------------------------------------------------
>
> Key: NUTCH-2451
> URL: https://issues.apache.org/jira/browse/NUTCH-2451
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
> Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As SMB protocol is not contained in Nutch, I turned on FTP service on the NAS and configured Nutch to crawl ftp://nas.
> The experience gives me varying results which seem to point to problems within Nutch. However this may need further evaluation.
> As some files could not be downloaded and I could not see a good error message, I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not only return the protocol status but also send the full exception and stack trace to the logs:
> {{
>     } catch (Exception e) {
>       LOG.warn("Could not get {}", url, e);
>       return new ProtocolOutput(null, new ProtocolStatus(e));
>     }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
>         at java.net.URL.<init>(URL.java:627)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> }}
> Please mind the URL was not configured by me. Instead it was obtained by crawling my NAS. Also the URL looks perfectly fine to me. Even if the file did not exist I would not expect a MalformedURLException to occur. Even more, using Firefox and the same authentication data on the same URL retrieves the file successfully.
> How come Nutch cannot get the file?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
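
The failure mode argued in the comment can be reproduced outside Nutch. The following is a minimal sketch, assuming the "Location" header is simply absent so that response.getHeader("Location") returns null (the comment only says the header may not have been set). The names NullLocationDemo and describe are hypothetical and not part of Nutch. On the JDKs I am aware of, java.net.URL wraps the resulting NullPointerException in a MalformedURLException, which would match the shape of the stack trace in the issue description, while the ftp:// URL itself parses with the JVM's built-in handler.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
class NullLocationDemo {

    /** Reports what new URL(spec) does for the given string (which may be null). */
    static String describe(String spec) {
        try {
            // Same call shape as new URL(response.getHeader("Location")),
            // with spec standing in for the header value (assumption).
            URL u = new URL(spec);
            return "ok: " + u.getProtocol();
        } catch (MalformedURLException e) {
            // For a null spec the constructor catches the NullPointerException
            // internally and attaches it as the cause.
            Throwable cause = e.getCause();
            return "MalformedURLException, cause: "
                + (cause == null ? "none" : cause.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        // Missing header value: the URL logged by the fetcher is not involved at all.
        System.out.println(describe(null));
        // The crawled URL itself is well-formed and the ftp scheme is known to the JVM.
        System.out.println(describe("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so"));
    }
}
```

If this matches what happens at Ftp.java:145, a null check on the header value before constructing the URL would presumably avoid the misleading exception, but that is a guess until the plugin code is inspected.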