[ 
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241866#comment-16241866
 ] 

Hiran Chaudhuri edited comment on NUTCH-2451 at 11/7/17 11:44 AM:
------------------------------------------------------------------

Let's assume no suitable URLStreamHandler is registered. The PluginRepository, 
as it carries my proposed changes from NUTCH-2429, is registered as the 
URLStreamHandlerFactory, so it should definitely be involved when the ftp:// URL 
is constructed. At that point it either finds a suitable URLStreamHandler 
provided by a plugin, or it falls back to the JVM default handlers, 
which definitely can handle ftp:// URLs. That a suitable 
URLStreamHandler is found, either by the URLStreamHandlerFactory or by the JVM, 
is evident: I provided only the ftp://nas URL, and Nutch crawled successfully 
until it found the offending URL ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so. 
This would not have worked if FTP support were missing completely.
*Therefore I believe the assumption is wrong. A suitable URLStreamHandler is 
available at runtime.*
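
The lookup order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NUTCH-2429 code: the classes FallbackFactoryDemo and PluginFirstFactory and the handler map are hypothetical stand-ins for the PluginRepository. The sketch relies on the documented behaviour that java.net.URL continues with the JVM's built-in handlers when the registered factory returns null.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.HashMap;
import java.util.Map;

public class FallbackFactoryDemo {

    // Hypothetical stand-in for a plugin-aware factory; NOT the Nutch PluginRepository.
    static class PluginFirstFactory implements URLStreamHandlerFactory {
        // Handlers contributed by plugins, keyed by protocol name (empty in this sketch).
        private final Map<String, URLStreamHandler> pluginHandlers = new HashMap<>();

        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            // Returning null lets java.net.URL continue with the JVM's built-in handler lookup.
            return pluginHandlers.get(protocol);
        }
    }

    // Returns true if a URL object can be constructed from the given string.
    @SuppressWarnings("deprecation")
    static boolean canConstruct(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        PluginFirstFactory factory = new PluginFirstFactory();
        // No plugin handler is registered for ftp, so the factory defers to the JVM.
        System.out.println("factory handler for ftp: " + factory.createURLStreamHandler("ftp"));
        // The factory can be set at most once per JVM.
        URL.setURLStreamHandlerFactory(factory);
        System.out.println("ftp URL constructed: "
                + canConstruct("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so"));
    }
}
```

Constructing the URL does not contact the host, so the sketch only shows that a handler for ftp is resolved even when the factory itself offers none.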

Upon further analysis I find that the stack trace points to 
org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145), which boils 
down to
{{u = new URL(response.getHeader("Location"));}}
This means the URL being constructed is not the FTP URL we see in the log output 
but the value of a header, which may not have been set by the protocol-ftp 
plugin.
*Therefore I do not agree that NUTCH-2429 could be related to, or even be the 
cause of, this problem.*
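
A minimal sketch of that hypothesis, assuming the "Location" header is simply absent so that getHeader returns null: passing null to the URL constructor yields a MalformedURLException whose cause is a NullPointerException, which matches the shape of the logged trace. The class NullLocationDemo and the method constructFrom are hypothetical names used only for this illustration; whether the header really is null in the failing fetch is an assumption, not something the log confirms.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NullLocationDemo {

    // Mimics new URL(response.getHeader("Location")) for a given header value.
    // Returns the exception thrown by the constructor, or null if construction succeeded.
    @SuppressWarnings("deprecation")
    static MalformedURLException constructFrom(String headerValue) {
        try {
            new URL(headerValue);
            return null;
        } catch (MalformedURLException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // Assumption: the header was never set, so its value is null.
        MalformedURLException e = constructFrom(null);
        System.out.println("exception: " + e.getClass().getName());
        System.out.println("cause: " + e.getCause().getClass().getName());
    }
}
```

If this is what happens at Ftp.java:145, the FTP URL from the log never reaches the URL constructor at that line, and a null check on the header value before constructing the URL would be one way to confirm it.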




> MalformedURLExceptions on perfectly looking URLs?
> -------------------------------------------------
>
>                 Key: NUTCH-2451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2451
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>            Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As Nutch does not include the SMB 
> protocol, I turned on the FTP service on the NAS and configured Nutch to crawl 
> ftp://nas.
> The results vary and seem to point to problems 
> within Nutch. However, this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {{    } catch (Exception e) {
>       LOG.warn("Could not get {}", url, e);
>       return new ProtocolOutput(null, new ProtocolStatus(e));
>     }
> }}
> With this modification I now see messages like these in the log file:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
> ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
>       at java.net.URL.<init>(URL.java:627)
>       at java.net.URL.<init>(URL.java:490)
>       at java.net.URL.<init>(URL.java:439)
>       at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
>       at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> }}
> Please note that the URL was not configured by me. Instead it was obtained by 
> crawling my NAS. Also, the URL looks perfectly fine to me. Even if the file 
> did not exist, I would not expect a MalformedURLException to occur. Moreover, 
> using Firefox with the same authentication data on the same URL retrieves the 
> file successfully.
> How come Nutch cannot get the file?



