[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239943#comment-16239943 ] ASF GitHub Bot commented on NUTCH-2442: --- Omkar20895 commented on a change in pull request #239: NUTCH-2442 Injector to stop if job fails to avoid loss of CrawlDb URL: https://github.com/apache/nutch/pull/239#discussion_r148997590
## File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java ##
@@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if(!success){
Review comment: @sebastian-nagel I did not understand; the formatting looks good to me. Can you please elaborate on what I am missing here? Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Injector to stop if job fails to avoid loss of CrawlDb > -- > > Key: NUTCH-2442 > URL: https://issues.apache.org/jira/browse/NUTCH-2442 > Project: Nutch > Issue Type: Bug > Components: injector >Affects Versions: 1.13 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.14 > > > Injector does not check whether the MapReduce job is successful. Even if the > job fails, it > - installs the CrawlDb > -- moves current/ to old/ > -- replaces current/ with an empty or potentially incomplete version > - exits with code 0 so that scripts running the crawl workflow cannot detect > the failure -- if Injector is run a second time the CrawlDb is lost (both > current/ and old/ are empty or corrupted) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
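The change under review boils down to fail-fast semantics: check the job's return status and leave the existing CrawlDb untouched on failure. Below is a minimal, self-contained sketch of that pattern using plain java.nio.file directory moves; runJob, the class name, and the db layout (current/, old/, tmp/) are simplified stand-ins for Hadoop's Job.waitForCompletion and Nutch's CrawlDb.install, not the actual Nutch code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SafeCrawlDbInstall {

  // Stand-in for Job.waitForCompletion(true): returns whether the
  // (simulated) MapReduce job succeeded.
  static boolean runJob(boolean simulateFailure) {
    return !simulateFailure;
  }

  // Install tmp/ as the new current/, keeping the previous current/ as
  // old/ -- but only if the job succeeded. On failure nothing is moved,
  // so a second run cannot wipe out both current/ and old/.
  static boolean install(Path db, boolean jobSucceeded) throws IOException {
    if (!jobSucceeded) {
      return false; // fail fast: leave the existing CrawlDb intact
    }
    Path current = db.resolve("current");
    Path old = db.resolve("old");
    Path tmp = db.resolve("tmp");
    if (Files.exists(old)) {
      deleteRecursively(old);
    }
    if (Files.exists(current)) {
      Files.move(current, old);
    }
    Files.move(tmp, current);
    return true;
  }

  // Delete a directory tree bottom-up so non-empty directories go too.
  static void deleteRecursively(Path p) throws IOException {
    try (Stream<Path> walk = Files.walk(p)) {
      walk.sorted(Comparator.reverseOrder()).forEach(q -> q.toFile().delete());
    }
  }
}
```

A driving script can then exit non-zero when install returns false, which is the observable behavior the pull request is after.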
[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239740#comment-16239740 ] Hudson commented on NUTCH-2443: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See [https://builds.apache.org/job/Nutch-trunk/3465/]) NUTCH-2443 add source tag to the parse-html and parse-tika outlink (jorge-luis.betancourt: [https://github.com/apache/nutch/commit/d34a002b25a770369ad6a5a20475c7072d8fa02b]) * (edit) src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java * (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java * (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java > Extract links from the video tag with the parse-html plugin > --- > > Key: NUTCH-2443 > URL: https://issues.apache.org/jira/browse/NUTCH-2443 > Project: Nutch > Issue Type: Improvement > Components: parser, plugin >Affects Versions: 1.13 >Reporter: Jorge Luis Betancourt Gonzalez >Assignee: Jorge Luis Betancourt Gonzalez >Priority: Minor > Fix For: 1.14 > > > At the moment the {{parse-html}} extracts links from the tags {{a, area, > form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow > extracting links to binary files (images) extracting links also from the > {{video}} tag should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
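The commit above extends the tag-to-attribute map in DOMContentUtils so that <source> children of <video> elements are treated as outlinks. The sketch below mirrors that map with plain strings and adds a toy regex extractor for illustration; the class and helper names are hypothetical, and the real plugin walks a parsed DOM rather than matching raw markup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkTags {

  // Tag -> attribute carrying the link, mirroring the linkParams map in
  // DOMContentUtils; the "source" entry is the one NUTCH-2443 adds so
  // that <video><source src="..."> children become outlinks.
  static Map<String, String> linkAttrs() {
    Map<String, String> m = new HashMap<>();
    m.put("a", "href");
    m.put("area", "href");
    m.put("form", "action");
    m.put("frame", "src");
    m.put("iframe", "src");
    m.put("script", "src");
    m.put("link", "href");
    m.put("img", "src");
    m.put("source", "src"); // new in NUTCH-2443
    return m;
  }

  // Toy extractor: finds tag/attribute pairs with a regex and keeps only
  // the attribute registered for that tag.
  static List<String> extract(String html) {
    List<String> out = new ArrayList<>();
    Map<String, String> attrs = linkAttrs();
    Matcher m = Pattern
        .compile("<(\\w+)[^>]*?\\s(\\w+)=\"([^\"]+)\"")
        .matcher(html);
    while (m.find()) {
      String wanted = attrs.get(m.group(1).toLowerCase());
      if (wanted != null && wanted.equalsIgnoreCase(m.group(2))) {
        out.add(m.group(3));
      }
    }
    return out;
  }
}
```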
[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?
[ https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239741#comment-16239741 ] Hudson commented on NUTCH-2452: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See [https://builds.apache.org/job/Nutch-trunk/3465/]) NUTCH-2452 Allow nutch to retrieve Ftp URLs that contain UrlEncoded (snagel: [https://github.com/apache/nutch/commit/517dbdf3261d42e90883d07320b7991ff8e2bcf8]) * (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java > Problem retrieving encoded URLs via FTP? > > > Key: NUTCH-2452 > URL: https://issues.apache.org/jira/browse/NUTCH-2452 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > Fix For: 1.14 > > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{ } catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > 2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404 > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > }} > Please note that the URL was not configured by me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even more, using > Firefox and the same authentication data on the same URL displays the > directory successfully. Therefore I suspect the FTP client is unable to > decode the URL such that the FTP server would understand it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (NUTCH-2452) Problem retrieving encoded URLs via FTP?
[ https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2452. Resolution: Fixed Fix Version/s: 1.14 Picked [61e0ae7|https://github.com/apache/nutch/pull/237/commits/61e0ae700c32ce1c2fb3deadcf41bb655d5a6e6c] from pull-request [#237|https://github.com/apache/nutch/pull/237]. Thanks, [~hiranchaudhuri]! > Problem retrieving encoded URLs via FTP? > > > Key: NUTCH-2452 > URL: https://issues.apache.org/jira/browse/NUTCH-2452 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > Fix For: 1.14 > > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{ } catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > 2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404 > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > }} > Please mind the URL was not configured from me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even more, using > Firefox and the same authentication data on the same URL displays the > directory successfully. Therefore I suspect the FTP client is unable to > decode the URL such that the FTP server would understand it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2443. Resolution: Fixed Merged. Thanks, [~jorgelbg]! > Extract links from the video tag with the parse-html plugin > --- > > Key: NUTCH-2443 > URL: https://issues.apache.org/jira/browse/NUTCH-2443 > Project: Nutch > Issue Type: Improvement > Components: parser, plugin >Affects Versions: 1.13 >Reporter: Jorge Luis Betancourt Gonzalez >Assignee: Jorge Luis Betancourt Gonzalez >Priority: Minor > Fix For: 1.14 > > > At the moment the {{parse-html}} extracts links from the tags {{a, area, > form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow > extracting links to binary files (images) extracting links also from the > {{video}} tag should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239723#comment-16239723 ] ASF GitHub Bot commented on NUTCH-2443: --- sebastian-nagel closed pull request #230: NUTCH-2443 add source tag to the parse-html/tika outlink extractor URL: https://github.com/apache/nutch/pull/230 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java index 909da7ef4..4527dd7b4 100644 --- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java +++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java @@ -86,6 +86,7 @@ public void setConf(Configuration conf) { linkParams.put("script", new LinkParams("script", "src", 0)); linkParams.put("link", new LinkParams("link", "href", 0)); linkParams.put("img", new LinkParams("img", "src", 0)); +linkParams.put("source", new LinkParams("source", "src", 0)); // remove unwanted link tags from the linkParams map String[] ignoreTags = conf.getStrings("parser.html.outlinks.ignore_tags"); diff --git a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java index 15725aee6..0faa013e9 100644 --- a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java +++ b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java @@ -127,7 +127,11 @@ + "" + " " + " " - + ""), }; + + ""), + new String(" " + " " + + " " + + "" + + "" + ""), }; private static int SKIP = 9; @@ -137,7 +141,8 @@ 
"http://www.nutch.org/maps/";, "http://www.nutch.org/whitespace/";, "http://www.nutch.org//";, "http://www.nutch.org/";, "http://www.nutch.org/";, "http://www.nutch.org/";, - "http://www.nutch.org/;something";, "http://www.nutch.org/"; }; + "http://www.nutch.org/;something";, "http://www.nutch.org/";, + "http://www.nutch.org/"; }; private static final DocumentFragment testDOMs[] = new DocumentFragment[testPages.length]; @@ -157,11 +162,11 @@ + "one two two three three four put some text here and there. " + "End this madness ! . . . .", "ignore ignore", "test1 test2", "test1 test2", "title anchor1 anchor2 anchor3", - "title anchor1 anchor2 anchor3 anchor4 anchor5", "title" }; + "title anchor1 anchor2 anchor3 anchor4 anchor5", "title", "" }; private static final String[] answerTitle = { "title", "title", "", "my title", "my title", "my title", "my title", "", "", "", "title", - "title", "title" }; + "title", "title", "" }; // note: should be in page-order private static Outlink[][] answerOutlinks; @@ -231,7 +236,8 @@ public void setup() { { new Outlink("http://www.nutch.org/g";, ""), new Outlink("http://www.nutch.org/g1";, ""), new Outlink("http://www.nutch.org/g2";, "bla bla"), - new Outlink("http://www.nutch.org/test.gif";, "bla bla"), } }; + new Outlink("http://www.nutch.org/test.gif";, "bla bla"), }, + { new Outlink("http://www.nutch.org/movie.mp4";, "") } }; } catch (MalformedURLException e) { diff --git a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java index e5dbd16a9..af85480bc 100644 --- a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java +++ b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java @@ -90,6 +90,7 @@ public void setConf(Configuration conf) { linkParams.put("script", new LinkParams("script", "src", 0)); linkParams.put("link", new LinkParams("link", "href", 0)); 
linkParams.put("img", new LinkParams("img", "src", 0)); +linkParams.put("source", new LinkParams("source", "src", 0)); // remove unwanted link tags from the linkParams map String[] ignoreTags = conf.getStrings("parser.html.outlinks.ignore_tags"); diff --git a/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java b/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java index 96029a6b4..2159b9d5a 100644 --- a/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java +++ b/src/plugin/parse-tika/src/test/org/apache/nu
[jira] [Commented] (NUTCH-2033) parse-tika skips valid documents.
[ https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239722#comment-16239722 ] Sebastian Nagel commented on NUTCH-2033: Should this be fixed inside Nutch? Which composite types are supported is known only to Tika; it would be painful to update this list every time the Tika dependency is upgraded. But we could implement this as a fall-back: if no parser is found, retry as "application/xml". > parse-tika skips valid documents. > - > > Key: NUTCH-2033 > URL: https://issues.apache.org/jira/browse/NUTCH-2033 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.10 >Reporter: Luis Lopez >Assignee: Lewis John McGibbney > Labels: mime-type, parse-tika, parser, tika > Fix For: 1.14 > > > If we run: > {code} > bin/nutch parsechecker -dumpText > http://ngdc.noaa.gov/geoportal/openSearchDescription > {code} > we’ll get: > {code} > Status: failed(2,0): Can't retrieve Tika parser for mime-type > application/opensearchdescription+xml > {code} > the same occurs for: > {code} > bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json > {code} > Both perfectly valid documents if they were returned as "application/xml" and > "text/plain" respectively. > This happens because parse-tika uses the mime type to retrieve a suitable > parser; some composite mime types are not included in this list even though > they are perfectly valid and parsable documents. This is not taking into account > that servers often return incorrect mime types for the documents requested. > We created a helper class as a workaround for this issue. The class uses > regex expressions to define synonyms. In the first case any mime type that > matches "application/(.*)\+xml" will be replaced by "application/xml". This > way parse-tika will parse the document just fine. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
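The workaround described in the last paragraph can be sketched as a small synonym table: any composite type matching a pattern is mapped to a base type Tika is known to handle, and as a fall-back this would run only after the regular parser lookup failed. The class name and the +json rule below are illustrative assumptions, not the reporter's actual helper:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class MimeSynonyms {

  // Composite mime types mapped onto a base type that a Tika parser is
  // registered for; first matching pattern wins.
  private static final Map<Pattern, String> SYNONYMS = new LinkedHashMap<>();
  static {
    SYNONYMS.put(Pattern.compile("application/.+\\+xml"), "application/xml");
    SYNONYMS.put(Pattern.compile("application/.+\\+json"), "application/json");
  }

  // Return the synonym if one matches, otherwise the type unchanged.
  static String normalize(String mimeType) {
    for (Map.Entry<Pattern, String> e : SYNONYMS.entrySet()) {
      if (e.getKey().matcher(mimeType).matches()) {
        return e.getValue();
      }
    }
    return mimeType;
  }
}
```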
[jira] [Commented] (NUTCH-2453) FTP protocol seems to have issues running multithreaded
[ https://issues.apache.org/jira/browse/NUTCH-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239719#comment-16239719 ] Sebastian Nagel commented on NUTCH-2453: Hi [~hiran], protocol-ftp is one of the oldest plugins (and now one of those used rarely). It may be the case that it's not thread-safe. Thanks for reporting this issue! > FTP protocol seems to have issues running multithreaded > --- > > Key: NUTCH-2453 > URL: https://issues.apache.org/jira/browse/NUTCH-2453 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. Also I wanted to increase crawl speed and thus configured > fetcher.threads.per.queue=10 in nutch-site.xml. > As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{ } catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this setup I saw such messages in the logs: > {{2017-10-25 22:52:54,699 WARN org.apache.nutch.protocol.ftp.Ftp - > ftp.client.login() failed: nas/192.168.178.43 > 2017-10-25 22:52:54,718 WARN org.apache.nutch.protocol.ftp.Ftp - Error: > java.net.SocketException: Socket closed > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at 
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read(BufferedReader.java:182) > at > org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58) > at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310) > at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290) > at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479) > at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552) > at org.apache.commons.net.ftp.FTP.user(FTP.java:698) > at org.apache.nutch.protocol.ftp.Client.login(Client.java:294) > at > org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190) > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > 2017-10-25 22:52:54,721 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/silver-sda2/home/hiran/Desktop/Segelclub.txt~ > org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket > closed > at > org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308) > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > Caused by: java.net.SocketException: Socket closed > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at 
java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read(BufferedReader.java:182) > at > org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58) > at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310) > at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290) > at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479) > at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552) > at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
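If the plugin's Client really is not thread-safe, a conventional mitigation is to give each fetcher thread its own instance, for example via ThreadLocal. The ToyClient below is a made-up stand-in for a stateful protocol client, not org.apache.nutch.protocol.ftp.Client; the sketch only shows the isolation mechanism.

```java
public class PerThreadFtpClient {

  // Stand-in for a stateful, non-thread-safe protocol client.
  static class ToyClient {
    private boolean loggedIn;
    void login() { loggedIn = true; }
    boolean isLoggedIn() { return loggedIn; }
  }

  // One client instance per thread: fetcher threads never share login
  // state or an underlying control connection.
  static final ThreadLocal<ToyClient> CLIENT =
      ThreadLocal.withInitial(ToyClient::new);
}
```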
[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?
[ https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239718#comment-16239718 ] Sebastian Nagel commented on NUTCH-2452: Thanks, this should be fixed. > Problem retrieving encoded URLs via FTP? > > > Key: NUTCH-2452 > URL: https://issues.apache.org/jira/browse/NUTCH-2452 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. > As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{ } catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > 2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/ > org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404 > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > }} > Please mind the URL was not configured from me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. 
Even more, using > Firefox and the same authentication data on the same URL displays the > directory successfully. Therefore I suspect the FTP client is unable to > decode the URL such that the FTP server would understand it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2451) MalformedURLExceptions on perfectly looking URLs?
[ https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239717#comment-16239717 ] Sebastian Nagel commented on NUTCH-2451: This problem resembles those discussed in NUTCH-2429: for some reason (maybe a race condition or a class path issue) there is no ftp URLStreamHandler registered at this point. There must have been one if crawling over ftp succeeded so far (pages fetched, new ftp:// URLs found). > MalformedURLExceptions on perfectly looking URLs? > - > > Key: NUTCH-2451 > URL: https://issues.apache.org/jira/browse/NUTCH-2451 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.13 > Environment: Ubuntu 16.04.3 LTS > OpenJDK 1.8.0_131 > nutch 1.14-SNAPSHOT > Synology RS816 >Reporter: Hiran Chaudhuri > > I tried running Nutch on my Synology NAS. As SMB protocol is not contained in > Nutch, I turned on FTP service on the NAS and configured Nutch to crawl > ftp://nas. > The experience gives me varying results which seem to point to problems > within Nutch. However this may need further evaluation. 
> As some files could not be downloaded and I could not see a good error > message I changed the method > org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, CrawlDatum) to not > only return protocol status but send the full exception and stack trace to > the logs: > {{} catch (Exception e) { > LOG.warn("Could not get {}", url, e); > return new ProtocolOutput(null, new ProtocolStatus(e)); > } > }} > With this modification I suddenly see such messages in the logfile: > {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching > ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > 2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not > get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so > java.net.MalformedURLException > at java.net.URL.<init>(URL.java:627) > at java.net.URL.<init>(URL.java:490) > at java.net.URL.<init>(URL.java:439) > at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145) > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340) > Caused by: java.lang.NullPointerException > }} > Please note that the URL was not configured by me. Instead it was obtained by > crawling my NAS. Also the URL looks perfectly fine to me. Even if the file > did not exist I would not expect a MalformedURLException to occur. Even more, > using Firefox and the same authentication data on the same URL retrieves the > file successfully. > How come Nutch cannot get the file? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
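If the JVM-wide handler registry is indeed the culprit, one way to make URL construction independent of registration order is to pass a URLStreamHandler explicitly to the java.net.URL constructor, which bypasses the registry lookup. The sketch uses a made-up myftp scheme and a parse-only handler to show the mechanism; it is not how Nutch actually wires its protocol plugins.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class ExplicitHandlerUrl {

  // Parse-only handler: lets URL syntax checking succeed even when no
  // handler for the scheme is registered JVM-wide. openConnection is
  // deliberately unimplemented.
  static final URLStreamHandler HANDLER = new URLStreamHandler() {
    @Override
    protected URLConnection openConnection(URL u) {
      throw new UnsupportedOperationException("parse-only handler");
    }
  };

  static URL parse(String spec) throws MalformedURLException {
    // The explicit handler argument avoids the MalformedURLException that
    // new URL(spec) throws for schemes with no registered handler.
    return new URL(null, spec, HANDLER);
  }
}
```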
[jira] [Commented] (NUTCH-2450) Remove FixMe in ParseOutputFormat
[ https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239702#comment-16239702 ] ASF GitHub Bot commented on NUTCH-2450: --- sebastian-nagel commented on a change in pull request #235: Fix for NUTCH-2450 by Kenneth McFarland URL: https://github.com/apache/nutch/pull/235#discussion_r146774541 ## File path: src/java/org/apache/nutch/parse/ParseOutputFormat.java ## @@ -379,15 +378,16 @@ public static String filterNormalize(String fromUrl, String toUrl, if (ignoreInternalLinks) { if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) { String toDomain = URLUtil.getDomainName(targetURL).toLowerCase(); - //FIXME: toDomain will never be null, correct? if (toDomain == null || toDomain.equals(origin)) { return null; // skip it } } else { String toHost = targetURL.getHost().toLowerCase(); - //FIXME: toDomain will never be null, correct? if (toHost == null || toHost.equals(origin)) { -return null; // skip it +if (exemptionFilters == null // check if it is exempted? Review comment: The [exemption filter](https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/urlfilter/ignoreexempt/package-summary.html) is to define exemption for otherwise skipped external links. Should not be in the branch which handles internal links, otherwise we would have to redefine the exemption filter interface. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Remove FixMe in ParseOutputFormat > - > > Key: NUTCH-2450 > URL: https://issues.apache.org/jira/browse/NUTCH-2450 > Project: Nutch > Issue Type: Bug > Environment: master branch >Reporter: Kenneth McFarland >Assignee: Kenneth McFarland >Priority: Minor > > ParseOutputFormat contains a few FixMe's that I've looked at. 
If a valid url > is created, it will always return valid results. There is a spot in the code > where the try catch is already done, so the predicate is satisfied and there > is no need to keep checking it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
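The review point above is that the exemption filter applies only to external links that would otherwise be skipped, never inside the branch handling internal links. That control flow can be sketched with a hypothetical predicate-based signature (this is an illustration of the intended ordering, not ParseOutputFormat's actual code, and it omits the bydomain mode and ignoreInternalLinks handling):

```java
import java.util.function.Predicate;

public class OutlinkDecision {

  // Decide whether to keep an outlink from fromHost to toHost.
  static boolean keep(String fromHost, String toHost,
                      boolean ignoreExternalLinks,
                      Predicate<String> exemptionFilter) {
    boolean internal = fromHost.equalsIgnoreCase(toHost);
    if (internal) {
      return true; // internal link: the exemption filter plays no role here
    }
    if (!ignoreExternalLinks) {
      return true; // external links are not being filtered at all
    }
    // External and normally skipped: the exemption filter may rescue it.
    return exemptionFilter != null && exemptionFilter.test(toHost);
  }
}
```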