[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239943#comment-16239943
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

Omkar20895 commented on a change in pull request #239: NUTCH-2442 Injector to 
stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r148997590
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
  @sebastian-nagel I did not understand; the formatting looks good to me. Can 
you please elaborate on what I am missing here? Thanks. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails, it
> - installs the CrawlDb:
> -- moves current/ to old/
> -- replaces current/ with an empty or potentially incomplete version
> - exits with code 0, so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time, the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)
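The guard the issue asks for can be sketched as follows. This is a minimal, self-contained illustration, not the actual Injector code: the hypothetical JobRunner interface stands in for Hadoop's Job#waitForCompletion(true).

```java
public class InjectorGuardSketch {

  // Stand-in for org.apache.hadoop.mapreduce.Job#waitForCompletion(true).
  interface JobRunner {
    boolean waitForCompletion() throws Exception;
  }

  /**
   * Fail fast when the MapReduce job is unsuccessful, *before* the new
   * CrawlDb is installed, so current/ and old/ are never replaced with
   * empty or incomplete data. The caller should exit non-zero.
   */
  static boolean runGuarded(JobRunner job) throws Exception {
    boolean success = job.waitForCompletion();
    if (!success) {
      throw new RuntimeException("Injector job failed; CrawlDb left untouched");
    }
    return true; // only now is it safe to install the new CrawlDb
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runGuarded(() -> true));
    try {
      runGuarded(() -> false);
    } catch (RuntimeException expected) {
      System.out.println("failed fast");
    }
  }
}
```

The key point is ordering: the success check must come before CrawlDb.install(), and a failure must surface as a non-zero exit code to the crawl script.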



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239740#comment-16239740
 ] 

Hudson commented on NUTCH-2443:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See 
[https://builds.apache.org/job/Nutch-trunk/3465/])
NUTCH-2443 add source tag to the parse-html and parse-tika outlink 
(jorge-luis.betancourt: 
[https://github.com/apache/nutch/commit/d34a002b25a770369ad6a5a20475c7072d8fa02b])
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java


> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} plugin extracts links from the tags {{a, 
> area, form}} (configurable), {{frame, iframe, script, link, img}}. Since we 
> already allow extracting links to binary files (images), extracting links 
> from the {{video}} tag should also be supported.
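The tag set above can be pruned via configuration. A rough sketch of the pruning behaviour, inferred from the parser.html.outlinks.ignore_tags property used by parse-html; the map values and method name here are illustrative, not the plugin's actual LinkParams type:

```java
import java.util.Map;

public class IgnoreTagsSketch {

  /**
   * Remove unwanted tags from the outlink extractor's tag map, mirroring
   * how parse-html applies parser.html.outlinks.ignore_tags. The map is
   * tag name -> attribute holding the link (simplified stand-in).
   */
  static Map<String, String> applyIgnoreTags(Map<String, String> linkParams,
                                             String[] ignoreTags) {
    if (ignoreTags != null) {
      for (String tag : ignoreTags) {
        linkParams.remove(tag);
      }
    }
    return linkParams;
  }
}
```

With this scheme, a site that registers {{source}} (as the merged change does) can still opt out by listing it in the ignore-tags property.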





[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?

2017-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239741#comment-16239741
 ] 

Hudson commented on NUTCH-2452:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See 
[https://builds.apache.org/job/Nutch-trunk/3465/])
NUTCH-2452 Allow nutch to retrieve Ftp URLs that contain UrlEncoded (snagel: 
[https://github.com/apache/nutch/commit/517dbdf3261d42e90883d07320b7991ff8e2bcf8])
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java


> Problem retrieving encoded URLs via FTP?
> 
>
> Key: NUTCH-2452
> URL: https://issues.apache.org/jira/browse/NUTCH-2452
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not 
> supported by Nutch, I turned on the FTP service on the NAS and configured 
> Nutch to crawl ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However, this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {code}
> } catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> {code}
> With this modification I suddenly see such messages in the logfile:
> {code}
> 2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> 2017-10-25 14:14:37,512 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
> org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
>   at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
>   at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> {code}
> Please note the URL was not configured by me; it was obtained by crawling my 
> NAS. Also, the URL looks perfectly fine to me. Moreover, using Firefox with 
> the same authentication data on the same URL displays the directory 
> successfully. Therefore I suspect the FTP client is unable to decode the URL 
> in a way the FTP server understands.
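The symptom is consistent with the client sending the still percent-encoded path (%20) to the FTP server. A minimal sketch of the kind of decoding the merged fix applies, not the actual FtpResponse.java change:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FtpPathDecodeSketch {

  /**
   * Decode a percent-encoded URL path before issuing FTP commands, so that
   * "Kenya%20Pics" is requested as "Kenya Pics". Note: URLDecoder also maps
   * '+' to a space, which a real fix must account for in path segments.
   */
  static String decodePath(String encodedPath) {
    try {
      return URLDecoder.decode(encodedPath, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }
}
```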





[jira] [Resolved] (NUTCH-2452) Problem retrieving encoded URLs via FTP?

2017-11-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2452.

   Resolution: Fixed
Fix Version/s: 1.14

Picked 
[61e0ae7|https://github.com/apache/nutch/pull/237/commits/61e0ae700c32ce1c2fb3deadcf41bb655d5a6e6c]
 from pull-request [#237|https://github.com/apache/nutch/pull/237]. Thanks, 
[~hiranchaudhuri]!

> Problem retrieving encoded URLs via FTP?
> 
>
> Key: NUTCH-2452
> URL: https://issues.apache.org/jira/browse/NUTCH-2452
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>





[jira] [Resolved] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-11-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2443.

Resolution: Fixed

Merged. Thanks, [~jorgelbg]!

> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>





[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-11-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239723#comment-16239723
 ] 

ASF GitHub Bot commented on NUTCH-2443:
---

sebastian-nagel closed pull request #230: NUTCH-2443 add source tag to the 
parse-html/tika outlink extractor
URL: https://github.com/apache/nutch/pull/230
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
index 909da7ef4..4527dd7b4 100644
--- 
a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
+++ 
b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
@@ -86,6 +86,7 @@ public void setConf(Configuration conf) {
 linkParams.put("script", new LinkParams("script", "src", 0));
 linkParams.put("link", new LinkParams("link", "href", 0));
 linkParams.put("img", new LinkParams("img", "src", 0));
+linkParams.put("source", new LinkParams("source", "src", 0));
 
 // remove unwanted link tags from the linkParams map
 String[] ignoreTags = conf.getStrings("parser.html.outlinks.ignore_tags");
diff --git 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
index 15725aee6..0faa013e9 100644
--- 
a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
+++ 
b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
@@ -127,7 +127,11 @@
   + ""
   + "   "
   + "   "
-  + ""), };
+  + ""),
+  new String(" " + " "
+  + " "
+  + ""
+  + "" + ""), };
 
   private static int SKIP = 9;
 
@@ -137,7 +141,8 @@
   "http://www.nutch.org/maps/";, "http://www.nutch.org/whitespace/";,
   "http://www.nutch.org//";, "http://www.nutch.org/";,
   "http://www.nutch.org/";, "http://www.nutch.org/";,
-  "http://www.nutch.org/;something";, "http://www.nutch.org/"; };
+  "http://www.nutch.org/;something";, "http://www.nutch.org/";,
+  "http://www.nutch.org/"; };
 
   private static final DocumentFragment testDOMs[] = new 
DocumentFragment[testPages.length];
 
@@ -157,11 +162,11 @@
   + "one two two three three four put some text here and there. "
   + "End this madness ! . . . .", "ignore ignore", "test1 test2",
   "test1 test2", "title anchor1 anchor2 anchor3",
-  "title anchor1 anchor2 anchor3 anchor4 anchor5", "title" };
+  "title anchor1 anchor2 anchor3 anchor4 anchor5", "title", "" };
 
   private static final String[] answerTitle = { "title", "title", "",
   "my title", "my title", "my title", "my title", "", "", "", "title",
-  "title", "title" };
+  "title", "title", "" };
 
   // note: should be in page-order
   private static Outlink[][] answerOutlinks;
@@ -231,7 +236,8 @@ public void setup() {
   { new Outlink("http://www.nutch.org/g";, ""),
   new Outlink("http://www.nutch.org/g1";, ""),
   new Outlink("http://www.nutch.org/g2";, "bla bla"),
-  new Outlink("http://www.nutch.org/test.gif";, "bla bla"), } };
+  new Outlink("http://www.nutch.org/test.gif";, "bla bla"), },
+  { new Outlink("http://www.nutch.org/movie.mp4";, "") } };
 
 } catch (MalformedURLException e) {
 
diff --git 
a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 
b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
index e5dbd16a9..af85480bc 100644
--- 
a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
+++ 
b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
@@ -90,6 +90,7 @@ public void setConf(Configuration conf) {
 linkParams.put("script", new LinkParams("script", "src", 0));
 linkParams.put("link", new LinkParams("link", "href", 0));
 linkParams.put("img", new LinkParams("img", "src", 0));
+linkParams.put("source", new LinkParams("source", "src", 0));
 
 // remove unwanted link tags from the linkParams map
 String[] ignoreTags = conf.getStrings("parser.html.outlinks.ignore_tags");
diff --git 
a/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java 
b/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java
index 96029a6b4..2159b9d5a 100644
--- 
a/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestDOMContentUtils.java
+++ 
b/src/plugin/parse-tika/src/test/org/apache/nu

[jira] [Commented] (NUTCH-2033) parse-tika skips valid documents.

2017-11-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239722#comment-16239722
 ] 

Sebastian Nagel commented on NUTCH-2033:


Should this be fixed inside Nutch? Which composite types are supported is known 
only to Tika, and it would be painful to update such a list every time the Tika 
dependency is upgraded. But we could implement this as a fall-back: if no parser 
is found, retry as "application/xml".

> parse-tika skips valid documents.
> -
>
> Key: NUTCH-2033
> URL: https://issues.apache.org/jira/browse/NUTCH-2033
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>  Labels: mime-type, parse-tika, parser, tika
> Fix For: 1.14
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText 
> http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type 
> application/opensearchdescription+xml
> {code}
> the same occurs  for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both perfectly valid documents if they were returned as "application/xml" and 
> "text/plain" respectively. 
> This happens because parse-tika uses the mime type to retrieve a suitable 
> parser, some composite mime types are not included in this list even though 
> they are perfectly valid and parsable documents. This not taking into account 
> that servers often return incorrect mime types for the documents requested.
> We created a helper class as a workaround for this issue. The class uses 
> regex expressions to define synonyms. In the first case any mime type that 
> matches "application/(.*)\+xml" will be replaced by "application/xml". This 
> way parse-tika will parse the document just fine.
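The workaround described above can be sketched as a small normalizer. The class and method names here are illustrative, not the reporter's actual helper class; the regex is the one quoted in the description:

```java
import java.util.regex.Pattern;

public class MimeTypeFallbackSketch {

  // Any composite XML mime type, e.g. application/opensearchdescription+xml.
  private static final Pattern COMPOSITE_XML =
      Pattern.compile("application/.+\\+xml");

  /**
   * Map composite XML mime types to plain application/xml so that a suitable
   * Tika parser can be found; leave every other type untouched.
   */
  static String normalize(String mimeType) {
    if (COMPOSITE_XML.matcher(mimeType).matches()) {
      return "application/xml";
    }
    return mimeType;
  }
}
```

This matches Sebastian's fall-back suggestion: only rewrite the type when the original lookup would find no parser.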





[jira] [Commented] (NUTCH-2453) FTP protocol seems to have issues running multithreaded

2017-11-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239719#comment-16239719
 ] 

Sebastian Nagel commented on NUTCH-2453:


Hi [~hiran], protocol-ftp is one of the oldest plugins (and now one of those 
used rarely). It may be the case that it's not thread-safe. Thanks for 
reporting this issue!
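One conventional way to sidestep a non-thread-safe client is to give each fetcher thread its own instance. A hedged sketch under that assumption; FtpClientStub is an illustrative stand-in, and ThreadLocal stands in for whatever per-thread isolation a real fix would choose:

```java
public class PerThreadFtpClientSketch {

  // Illustrative stand-in for a non-thread-safe FTP client such as the one
  // wrapped by protocol-ftp: one control connection per instance.
  static class FtpClientStub {
    final long ownerThread = Thread.currentThread().getId();
  }

  // One client per fetcher thread, so concurrent fetches (e.g. with
  // fetcher.threads.per.queue=10) never share the same control connection.
  private static final ThreadLocal<FtpClientStub> CLIENT =
      ThreadLocal.withInitial(FtpClientStub::new);

  static FtpClientStub client() {
    return CLIENT.get();
  }
}
```

The trade-off is more simultaneous logins to the server, which some FTP servers throttle; a pooled-clients design would bound that.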

> FTP protocol seems to have issues running multithreaded
> ---
>
> Key: NUTCH-2453
> URL: https://issues.apache.org/jira/browse/NUTCH-2453
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not 
> supported by Nutch, I turned on the FTP service on the NAS and configured 
> Nutch to crawl ftp://nas. Also, I wanted to increase the crawl speed and thus 
> configured fetcher.threads.per.queue=10 in nutch-site.xml.
> As some files could not be downloaded and I could not see a good error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {code}
> } catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> {code}
> With this setup I saw such messages in the logs:
> {{2017-10-25 22:52:54,699 WARN  org.apache.nutch.protocol.ftp.Ftp - 
> ftp.client.login() failed: nas/192.168.178.43
> 2017-10-25 22:52:54,718 WARN  org.apache.nutch.protocol.ftp.Ftp - Error:
> java.net.SocketException: Socket closed
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.read(BufferedReader.java:182)
> at 
> org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
> at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
> at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
> at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
> at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
> at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
> at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
> at 
> org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
> at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> 2017-10-25 22:52:54,721 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not 
> get ftp://nas/silver-sda2/home/hiran/Desktop/Segelclub.txt~
> org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket 
> closed
> at 
> org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
> at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.net.SocketException: Socket closed
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.read(BufferedReader.java:182)
> at 
> org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
> at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
> at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
> at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
> at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
> at org.apache.commons.net.ftp.FTP.user(FTP.java:698)

[jira] [Commented] (NUTCH-2452) Problem retrieving encoded URLs via FTP?

2017-11-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239718#comment-16239718
 ] 

Sebastian Nagel commented on NUTCH-2452:


Thanks, this should be fixed.

> Problem retrieving encoded URLs via FTP?
> 
>
> Key: NUTCH-2452
> URL: https://issues.apache.org/jira/browse/NUTCH-2452
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
>





[jira] [Commented] (NUTCH-2451) MalformedURLExceptions on perfectly looking URLs?

2017-11-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239717#comment-16239717
 ] 

Sebastian Nagel commented on NUTCH-2451:


This problem resembles those discussed in NUTCH-2429: for some reason (maybe a 
race condition or a class path issue) there is no ftp URLStreamHandler 
registered at this point. There must have been one if crawling over ftp 
succeeded so far (pages fetched, new ftp:// URLs found).
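One way to make the handler lookup explicit, rather than relying on the JVM's global URLStreamHandler registry, is to pass the handler to the URL constructor. A sketch only, not the Nutch fix; DummyHandler is illustrative:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class ExplicitHandlerSketch {

  // Minimal handler: when a handler is supplied explicitly, URL parsing no
  // longer depends on a registered handler for the scheme, so a missing or
  // raced registration cannot turn a valid spec into MalformedURLException.
  static class DummyHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      throw new IOException("connection not implemented in this sketch");
    }
  }

  static URL parseWithHandler(String spec) throws Exception {
    // new URL(URL context, String spec, URLStreamHandler handler)
    return new URL(null, spec, new DummyHandler());
  }
}
```

This does not explain why the registered handler vanished, but it demonstrates that the parse failure is about handler resolution, not about the URL's syntax.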

> MalformedURLExceptions on perfectly looking URLs?
> -
>
> Key: NUTCH-2451
> URL: https://issues.apache.org/jira/browse/NUTCH-2451
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not 
> supported by Nutch, I turned on the FTP service on the NAS and configured 
> Nutch to crawl ftp://nas.
> The experience gives me varying results which seem to point to problems 
> within Nutch. However, this may need further evaluation.
> As some files could not be downloaded and I could not see a good error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {code}
> } catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> {code}
> With this modification I suddenly see such messages in the logfile:
> {code}
> 2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
>   at java.net.URL.<init>(URL.java:627)
>   at java.net.URL.<init>(URL.java:490)
>   at java.net.URL.<init>(URL.java:439)
>   at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
>   at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> {code}
> Please note the URL was not configured by me; it was obtained by crawling my 
> NAS. Also, the URL looks perfectly fine to me. Even if the file did not 
> exist, I would not expect a MalformedURLException. Moreover, using Firefox 
> with the same authentication data on the same URL retrieves the file 
> successfully.
> How come Nutch cannot get the file?





[jira] [Commented] (NUTCH-2450) Remove FixMe in ParseOutputFormat

2017-11-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239702#comment-16239702
 ] 

ASF GitHub Bot commented on NUTCH-2450:
---

sebastian-nagel commented on a change in pull request #235: Fix for NUTCH-2450 
by Kenneth McFarland
URL: https://github.com/apache/nutch/pull/235#discussion_r146774541
 
 

 ##
 File path: src/java/org/apache/nutch/parse/ParseOutputFormat.java
 ##
 @@ -379,15 +378,16 @@ public static String filterNormalize(String fromUrl, 
String toUrl,
   if (ignoreInternalLinks) {
 if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
   String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
-  //FIXME: toDomain will never be null, correct?
   if (toDomain == null || toDomain.equals(origin)) {
 return null; // skip it
   }
 } else {
   String toHost = targetURL.getHost().toLowerCase();
-  //FIXME: toDomain will never be null, correct?
   if (toHost == null || toHost.equals(origin)) {
-return null; // skip it
+if (exemptionFilters == null // check if it is exempted?
 
 Review comment:
  The [exemption 
filter](https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/urlfilter/ignoreexempt/package-summary.html)
 defines exemptions for otherwise skipped external links. It should not be in 
the branch which handles internal links; otherwise we would have to redefine 
the exemption filter interface.
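The point can be shown as simplified control flow. This is an illustrative sketch, not the actual ParseOutputFormat code; the ExemptionFilter interface and filter() signature are hypothetical stand-ins:

```java
public class ExternalLinkFilterSketch {

  // Hypothetical stand-in for Nutch's URLExemptionFilters.
  interface ExemptionFilter {
    boolean isExempted(String fromUrl, String toUrl);
  }

  /**
   * Returns null when the outlink should be skipped. The exemption filter
   * is consulted only for links pointing *outside* the origin host; internal
   * links never go through it, mirroring the review comment above.
   */
  static String filter(String toHost, String origin, String fromUrl,
                       String toUrl, ExemptionFilter exemptions) {
    boolean external = !toHost.equalsIgnoreCase(origin);
    if (!external) {
      return toUrl; // internal link: not routed through exemption filters
    }
    if (exemptions != null && exemptions.isExempted(fromUrl, toUrl)) {
      return toUrl; // external but explicitly exempted from skipping
    }
    return null; // external and not exempted: skip it
  }
}
```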




> Remove FixMe in ParseOutputFormat
> -
>
> Key: NUTCH-2450
> URL: https://issues.apache.org/jira/browse/NUTCH-2450
> Project: Nutch
>  Issue Type: Bug
> Environment: master branch
>Reporter: Kenneth McFarland
>Assignee: Kenneth McFarland
>Priority: Minor
>
> ParseOutputFormat contains a few FixMe's that I've looked at. If a valid URL 
> is created, it will always return valid results. There is a spot in the code 
> where the try/catch is already done, so the predicate is satisfied and there 
> is no need to keep checking it.


