[ https://issues.apache.org/jira/browse/NUTCH-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2459:
-----------------------------------
    Fix Version/s: 1.15

> Nutch cannot download/parse some files via FTP
> ----------------------------------------------
>
>                 Key: NUTCH-2459
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2459
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
>                      OpenJDK 1.8.0_131
>                      nutch 1.14-SNAPSHOT
>                      Synology RS816
>            Reporter: Hiran Chaudhuri
>             Fix For: 1.15
>
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not supported
> by Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl
> ftp://nas.
> The results vary in a way that seems to point to problems within Nutch.
> However, this may need further evaluation.
> As some files could not be downloaded and I could not see a good error
> message, I changed the method
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not
> only return the protocol status but also send the full exception and stack
> trace to the logs:
> {{
>     } catch (Exception e) {
>       LOG.warn("Could not get {}", url, e);
>       return new ProtocolOutput(null, new ProtocolStatus(e));
>     }
> }}
> With this modification I now see messages like these in the logfile:
> {{2017-11-09 23:44:56,135 WARN org.apache.nutch.protocol.ftp.Ftp - Error:
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>       at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>       at java.util.LinkedList.get(LinkedList.java:476)
>       at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
>       at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:267)
>       at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
>       at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> 2017-11-09 23:44:56,135 ERROR org.apache.nutch.protocol.ftp.Ftp - Could not
> get protocol output for ftp://nas/MediaPC/boot/memtest86+.elf
> org.apache.nutch.protocol.ftp.FtpException:
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>       at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:309)
>       at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
>       at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>       at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>       at java.util.LinkedList.get(LinkedList.java:476)
>       at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
> }}
> I cannot tell what the URLs showing this problem have in common. They seem
> to be regular files, yet many other regular files are fetched and parsed
> successfully. As far as I understand the source code, at least one
> outgoing link is expected:
> {{
> FTPFile ftpFile = (FTPFile) list.get(0);
> }}
> Can this be safely assumed for all files? Or should there rather be a check
> whether any outgoing links were found?
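
The check asked about above could look roughly like the following. This is only a minimal sketch, not the actual Nutch fix: the class `ListingGuard` and the method `firstEntry` are hypothetical names, and a generic element type stands in for `FTPFile` so that the example runs without the FTP client library. The idea is to test for an empty listing before calling `list.get(0)`, so the caller can report a clear status instead of an `IndexOutOfBoundsException`.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Nutch. */
public class ListingGuard {

  /**
   * Returns the first entry of a listing, or an empty Optional when the
   * listing is null or has no entries. This avoids the
   * IndexOutOfBoundsException that list.get(0) throws on an empty list.
   */
  public static <T> Optional<T> firstEntry(List<T> list) {
    if (list == null || list.isEmpty()) {
      return Optional.empty();
    }
    return Optional.ofNullable(list.get(0));
  }

  public static void main(String[] args) {
    // An empty listing, as in the failing case from the stack trace.
    List<String> empty = new LinkedList<>();

    // A listing with one entry (a plain string stands in for FTPFile).
    List<String> oneFile = new LinkedList<>();
    oneFile.add("memtest86+.elf");

    System.out.println(firstEntry(empty).isPresent());
    System.out.println(firstEntry(oneFile).orElse("<none>"));
  }
}
```

In the real code path, the empty case would presumably be mapped to a "not found" style response rather than an exception, but which status is appropriate is a decision for the Nutch maintainers.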