[jira] [Commented] (NUTCH-2192) Get rid of oro

2018-10-09 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643927#comment-16643927
 ] 

Markus Jelsma commented on NUTCH-2192:
--

Nice! I had completely forgotten about these ancient issues. Thanks!

> Get rid of oro
> --
>
> Key: NUTCH-2192
> URL: https://issues.apache.org/jira/browse/NUTCH-2192
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 2.4, 1.16
>
> Attachments: NUTCH-2192.patch
>
>
> A couple of classes still rely on ORO; we should get rid of it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2192) Get rid of oro

2018-10-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643772#comment-16643772
 ] 

ASF GitHub Bot commented on NUTCH-2192:
---

lewismc commented on issue #389: NUTCH-2192 Migrate from Apache ORO to 
java.util.regex
URL: https://github.com/apache/nutch/pull/389#issuecomment-428278133
 
 
   Hi @sebastian-nagel any performance metrics between ORO and j.u.regex?
   I remember Markus working on this a long time ago and actually thought it 
was merged into master... thank you for reviving it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2192) Get rid of oro

2018-10-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643739#comment-16643739
 ] 

ASF GitHub Bot commented on NUTCH-2192:
---

sebastian-nagel opened a new pull request #389: NUTCH-2192 Migrate from Apache 
ORO to java.util.regex
URL: https://github.com/apache/nutch/pull/389
 
 
   (also fixes NUTCH-1678, NUTCH-1014 and NUTCH-1021)
   - apply Markus' patch of NUTCH-2192
   - finish migration of parse-js
   - remove oro dependency
   - correct pointer to Java regex syntax (instead of "Perl5")
   - NUTCH-1063 "OutlinkExtractor test generates an exception but does not 
fail" is fixed by adding null-check (required anyway by java.util.regex classes)
   - adds a JUnit test for parse-js (NUTCH-1121) ported from 2.x and extended
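
For readers unfamiliar with the java.util.regex side of such a migration, here is a minimal sketch of the matching loop together with the kind of null-check mentioned above. It is not code from the pull request: the class name `RegexLinkSketch`, the method `extractUrls` and the simplified URL pattern are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {

  // Hypothetical, simplified URL pattern; NOT the pattern used by Nutch.
  // Compiled once: Pattern instances are immutable and thread-safe,
  // Matcher instances are not.
  private static final Pattern URL_PATTERN =
      Pattern.compile("\\bhttps?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

  /** Returns all substrings of text that match URL_PATTERN, in order. */
  public static List<String> extractUrls(String text) {
    List<String> urls = new ArrayList<>();
    if (text == null) {
      // Pattern.matcher(null) throws a NullPointerException,
      // so null input is handled explicitly.
      return urls;
    }
    Matcher matcher = URL_PATTERN.matcher(text);
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    String js = "var u = 'http://example.org/a.js'; // see https://example.com/b";
    System.out.println(extractUrls(js));
    System.out.println(extractUrls(null));
  }
}
```

The pattern is compiled into a static field so that the compilation cost is paid once, while a fresh Matcher is created per input.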




[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643468#comment-16643468
 ] 

Hudson commented on NUTCH-2648:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3559 (See 
[https://builds.apache.org/job/Nutch-trunk/3559/])
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by 
(snagel: 
[https://github.com/apache/nutch/commit/3f64083ed38c500e06b88ae406798b205cffeeb5])
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by 
(snagel: 
[https://github.com/apache/nutch/commit/58ea01f179b2cdd5965fd8b12d6e9fd608509e3b])
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttp.java


> Make configurable whether TLS/SSL certificates are checked by protocol plugins
> --
>
> Key: NUTCH-2648
> URL: https://issues.apache.org/jira/browse/NUTCH-2648
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> (see discussion in NUTCH-2647)
> It should be possible to enable/disable TLS/SSL certificate validation 
> centrally for all http/https protocol plugins by a single configuration 
> property.
> Some use cases (e.g., crawling a site to detect insecure pages) may require that 
> TLS/SSL certificates are checked. Also, a broader, unrestricted web crawl may 
> skip sites with invalid certificates, as this can be an indicator of the 
> quality of a site.





[jira] [Resolved] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2648.

Resolution: Fixed

Merged. Thanks, [~markus17] for the review!

And thanks, [~jnioche]! If time permits, I'll have a look at SC's httpclient protocol.



[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643378#comment-16643378
 ] 

ASF GitHub Bot commented on NUTCH-2648:
---

sebastian-nagel closed pull request #388:  NUTCH-2648 Make configurable whether 
TLS/SSL certificates are checked by protocol plugins
URL: https://github.com/apache/nutch/pull/388
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 9f57af26e..065ed86fe 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -249,6 +249,18 @@
   </description>
 </property>
 
+<property>
+  <name>http.tls.certificates.check</name>
+  <value>false</value>
+  <description>
+    Whether to check the TLS/SSL server certificates for validity.
+    If true invalid (e.g., self-signed or expired) certificates are
+    rejected and the https connection is failed.  If false insecure
+    TLS/SSL connections are allowed.  Note that this property is
+    currently not supported by all http/https protocol plugins.
+  </description>
+</property>
+
 <property>
   <name>http.proxy.host</name>
   <value></value>
diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
index 42f479312..a5c0a90f1 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
@@ -157,6 +157,9 @@
   /** Skip page if Crawl-Delay longer than this value. */
   protected long maxCrawlDelay = -1L;
 
+  /** Whether to check TLS/SSL certificates */
+  protected boolean tlsCheckCertificate = false;
+
   /** Which TLS/SSL protocols to support */
   protected Set<String> tlsPreferredProtocols;
 
@@ -206,6 +209,8 @@ public void setConf(Configuration conf) {
     // backward-compatible default setting
     this.useHttp11 = conf.getBoolean("http.useHttp11", true);
     this.useHttp2 = conf.getBoolean("http.useHttp2", false);
+    this.tlsCheckCertificate = conf.getBoolean("http.tls.certificates.check",
+        false);
     this.responseTime = conf.getBoolean("http.store.responsetime", true);
     this.storeIPAddress = conf.getBoolean("store.ip.address", false);
     this.storeHttpRequest = conf.getBoolean("store.http.request", false);
@@ -496,6 +501,10 @@ public boolean getUseHttp11() {
     return useHttp11;
   }
 
+  public boolean isTlsCheckCertificates() {
+    return tlsCheckCertificate;
+  }
+
   public Set<String> getTlsPreferredCipherSuites() {
     return tlsPreferredCipherSuites;
   }
diff --git a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
index 95ae35248..b4d3fbcb9 100644
--- a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
+++ b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
@@ -355,10 +355,17 @@ public Metadata getHeaders() {
* -
*/
 
-  private SSLSocket getSSLSocket(Socket socket, String sockHost, int sockPort) throws Exception {
-    SSLContext sslContext = SSLContext.getInstance("TLS");
-    sslContext.init(null, new TrustManager[]{new DummyX509TrustManager(null)}, null);
-    SSLSocketFactory factory = sslContext.getSocketFactory();
+  private SSLSocket getSSLSocket(Socket socket, String sockHost, int sockPort)
+      throws Exception {
+    SSLSocketFactory factory;
+    if (http.isTlsCheckCertificates()) {
+      factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
+    } else {
+      SSLContext sslContext = SSLContext.getInstance("TLS");
+      sslContext.init(null,
+          new TrustManager[] { new DummyX509TrustManager(null) }, null);
+      factory = sslContext.getSocketFactory();
+    }
 
     SSLSocket sslsocket = (SSLSocket) factory
       .createSocket(socket, sockHost, sockPort, true);
diff --git a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
index c185f9bdc..2cd29d3e9 100644
--- a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
+++ b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
@@ -47,7 +47,7 @@
 import org.apache.commons.httpclient.params.HttpConnectionManagerParams;
 import org.apache.commons.httpclient.protocol.Protocol;
 import org.apache.commons.httpclient.protocol.ProtocolSocketFactory;
-
+import org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory;
 import org.apache.commons.lang.StringUtils;
 import org.apache.nutch.crawl.CrawlDatum;
 import or

[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643035#comment-16643035
 ] 

Julien Nioche commented on NUTCH-2648:
--

[~wastl-nagel]

?? (code borrowed 
[storm-crawler#615|https://github.com/DigitalPebble/storm-crawler/issues/615], 
thanks [~jnioche]!)??

You are welcome. Would be fab if you could find the time to add the same 
behaviour to the httpclient protocol in SC :)



[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642971#comment-16642971
 ] 

Markus Jelsma commented on NUTCH-2648:
--

I misread the patch regarding the other plugins. So +1, no further comments.



[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642968#comment-16642968
 ] 

Sebastian Nagel commented on NUTCH-2648:


Hi [~markus17], it should work for protocol-http, protocol-httpclient and 
protocol-okhttp. I've tested it using parsechecker for all three plugins, here 
for httpclient:
{noformat}
% bin/nutch parsechecker -Dhttp.tls.certificates.check=true 
-Dplugin.includes='protocol-httpclient|parse-tika' 
https://ingevd.waarbenjij.nu/kaart/5000179/dag-4
...
Fetch failed with protocol status: exception(16), lastModified=0: 
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: 
PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target

% bin/nutch parsechecker -Dhttp.tls.certificates.check=false 
-Dplugin.includes='protocol-httpclient|parse-tika' 
...
Status: success(1,0)
...{noformat}
Regarding the protocol-htmlunit and the two selenium-based protocol plugins: 
it's now tracked in NUTCH-2649.

?? Maybe it's also an idea to add a dummy trust manager in Nutch's base code??

Yes, or in lib-http. While implementing this for protocol-okhttp, I've thought 
about trying to bundle the DummyTrustManager functionalities. But the code 
overlaps only partially, so I was lazy here.
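
As a rough illustration of what a shared helper could look like, here is a minimal sketch that picks an SSLSocketFactory depending on a boolean flag, along the lines of the protocol-http change. It uses only standard javax.net.ssl classes. The names `TlsFactorySketch`, `TrustAllManager` and `socketFactory` are hypothetical and exist neither in Nutch nor in lib-http; the trust-all manager stands in for DummyX509TrustManager and is insecure by design.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TlsFactorySketch {

  /** Hypothetical trust manager accepting every certificate chain (insecure). */
  static final class TrustAllManager implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
      // intentionally empty: no validation is performed
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
      // intentionally empty: no validation is performed
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
      return new X509Certificate[0];
    }
  }

  /**
   * Returns the JVM default factory (certificates validated against the
   * default trust store) if checkCertificates is true, otherwise a factory
   * that trusts any server certificate.
   */
  public static SSLSocketFactory socketFactory(boolean checkCertificates)
      throws GeneralSecurityException {
    if (checkCertificates) {
      return (SSLSocketFactory) SSLSocketFactory.getDefault();
    }
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null, new TrustManager[] { new TrustAllManager() }, null);
    return context.getSocketFactory();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(socketFactory(true).getClass().getName());
    System.out.println(socketFactory(false).getClass().getName());
  }
}
```

A plugin could call such a helper with the value of http.tls.certificates.check; how the okhttp and httpclient plugins would consume the resulting factory is left open here, since their code overlaps only partially.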



[jira] [Created] (NUTCH-2649) Optionally skip TLS/SSL certificate validation for protocol-selenium and protocol-htmlunit

2018-10-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2649:
--

 Summary: Optionally skip TLS/SSL certificate validation for 
protocol-selenium and protocol-htmlunit
 Key: NUTCH-2649
 URL: https://issues.apache.org/jira/browse/NUTCH-2649
 Project: Nutch
  Issue Type: Improvement
  Components: protocol
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


NUTCH-2648 adds a property to enable/disable TLS/SSL certificate validation 
for protocol-http, protocol-httpclient and protocol-okhttp. It should also be 
supported by the remaining protocol plugins:
* protocol-selenium,
* protocol-interactiveselenium and
* protocol-htmlunit


