[jira] [Commented] (NUTCH-2192) Get rid of oro
[ https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643927#comment-16643927 ]

Markus Jelsma commented on NUTCH-2192:
--------------------------------------

Nice! I completely forgot these ancient issues. Thanks!

> Get rid of oro
> --------------
>
>                 Key: NUTCH-2192
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2192
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 2.4, 1.16
>
>         Attachments: NUTCH-2192.patch
>
>
> Couple of classes still rely on oro, we should get rid of it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (NUTCH-2192) Get rid of oro
[ https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643772#comment-16643772 ]

ASF GitHub Bot commented on NUTCH-2192:
---------------------------------------

lewismc commented on issue #389: NUTCH-2192 Migrate from Apache ORO to java.util.regex
URL: https://github.com/apache/nutch/pull/389#issuecomment-428278133

Hi @sebastian-nagel any performance metrics between ORO and j.u.regex?
I remember Markus working on this a long time ago and actually thought it was merged into master... thank you for reviving it.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (NUTCH-2192) Get rid of oro
[ https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643739#comment-16643739 ]

ASF GitHub Bot commented on NUTCH-2192:
---------------------------------------

sebastian-nagel opened a new pull request #389: NUTCH-2192 Migrate from Apache ORO to java.util.regex
URL: https://github.com/apache/nutch/pull/389

(also fixes NUTCH-1678, NUTCH-1014 and NUTCH-1021)
- apply Markus' patch of NUTCH-2192
- finish migration of parse-js
- remove oro dependency
- correct pointer to Java regex syntax (instead of "Perl5")
- NUTCH-1063 "OutlinkExtractor test generates an exception but does not fail" is fixed by adding null-check (required anyway by java.util.regex classes)
- adds a JUnit test for parse-js (NUTCH-1121) ported from 2.x and extended
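The null-check mentioned in pull request #389 can be illustrated with a minimal sketch. This is not the code from the patch: the class name `RegexOutlinkSketch`, the method `extractUrls` and the deliberately simple URL pattern are all assumptions made up for illustration. It only shows why a `java.util.regex`-based extractor has to guard against `null` input, since `Pattern.matcher(null)` throws a `NullPointerException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the Nutch OutlinkExtractor.
final class RegexOutlinkSketch {

  // Simplified URL pattern (assumption): scheme, host, optional path/query.
  private static final Pattern URL_PATTERN = Pattern.compile(
      "\\bhttps?://[\\w.-]+(?:/[\\w./%?&=#-]*)?", Pattern.CASE_INSENSITIVE);

  /** Returns all URL-like substrings of the text, or an empty list for null. */
  static List<String> extractUrls(String plainText) {
    List<String> urls = new ArrayList<>();
    if (plainText == null) {
      // Without this guard, URL_PATTERN.matcher(null) would throw a
      // NullPointerException.
      return urls;
    }
    Matcher matcher = URL_PATTERN.matcher(plainText);
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(
        extractUrls("see https://nutch.apache.org/ and http://example.com/a?b=c"));
    System.out.println(extractUrls(null));
  }
}
```

Compiling the pattern once into a static field is the usual `java.util.regex` idiom, because `Pattern` instances are immutable and safe to share between threads, while each `Matcher` is created per input.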
[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643468#comment-16643468 ]

Hudson commented on NUTCH-2648:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3559 (See [https://builds.apache.org/job/Nutch-trunk/3559/])
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by (snagel: [https://github.com/apache/nutch/commit/3f64083ed38c500e06b88ae406798b205cffeeb5])
* (edit) src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
* (edit) conf/nutch-default.xml
* (edit) src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by (snagel: [https://github.com/apache/nutch/commit/58ea01f179b2cdd5965fd8b12d6e9fd608509e3b])
* (edit) src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttp.java

> Make configurable whether TLS/SSL certificates are checked by protocol plugins
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2648
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2648
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> (see discussion in NUTCH-2647)
> It should be possible to enable/disable TLS/SSL certificate validation
> centrally for all http/https protocol plugins by a single configuration
> property.
> Some use cases (e.g. crawl a site to detect insecure pages) may require that
> TLS/SSL certificates are checked. Also a broader, unrestricted web crawl may
> skip sites with invalid certificates as this can be an indicator for the
> quality of a site.
[jira] [Resolved] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2648.
-----------------------------------
    Resolution: Fixed

Merged. Thanks, [~markus17], for the review! And thanks, [~jnioche]! If time permits, I'll have a look at SC's httpclient protocol.
[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643378#comment-16643378 ]

ASF GitHub Bot commented on NUTCH-2648:
---------------------------------------

sebastian-nagel closed pull request #388: NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol plugins
URL: https://github.com/apache/nutch/pull/388

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 9f57af26e..065ed86fe 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -249,6 +249,18 @@
+<property>
+  <name>http.tls.certificates.check</name>
+  <value>false</value>
+  <description>
+    Whether to check the TLS/SSL server certificates for validity.
+    If true invalid (e.g., self-signed or expired) certificates are
+    rejected and the https connection is failed. If false insecure
+    TLS/SSL connections are allowed. Note that this property is
+    currently not supported by all http/https protocol plugins.
+  </description>
+</property>
+
 <property>
   <name>http.proxy.host</name>
diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
index 42f479312..a5c0a90f1 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
@@ -157,6 +157,9 @@
   /** Skip page if Crawl-Delay longer than this value. */
   protected long maxCrawlDelay = -1L;
 
+  /** Whether to check TLS/SSL certificates */
+  protected boolean tlsCheckCertificate = false;
+
   /** Which TLS/SSL protocols to support */
   protected Set<String> tlsPreferredProtocols;
 
@@ -206,6 +209,8 @@ public void setConf(Configuration conf) {
     // backward-compatible default setting
     this.useHttp11 = conf.getBoolean("http.useHttp11", true);
     this.useHttp2 = conf.getBoolean("http.useHttp2", false);
+    this.tlsCheckCertificate = conf.getBoolean("http.tls.certificates.check",
+        false);
     this.responseTime = conf.getBoolean("http.store.responsetime", true);
     this.storeIPAddress = conf.getBoolean("store.ip.address", false);
     this.storeHttpRequest = conf.getBoolean("store.http.request", false);
@@ -496,6 +501,10 @@ public boolean getUseHttp11() {
     return useHttp11;
   }
 
+  public boolean isTlsCheckCertificates() {
+    return tlsCheckCertificate;
+  }
+
   public Set<String> getTlsPreferredCipherSuites() {
     return tlsPreferredCipherSuites;
   }
diff --git a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
index 95ae35248..b4d3fbcb9 100644
--- a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
+++ b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
@@ -355,10 +355,17 @@ public Metadata getHeaders() {
    * -
    */
-  private SSLSocket getSSLSocket(Socket socket, String sockHost, int sockPort) throws Exception {
-    SSLContext sslContext = SSLContext.getInstance("TLS");
-    sslContext.init(null, new TrustManager[]{new DummyX509TrustManager(null)}, null);
-    SSLSocketFactory factory = sslContext.getSocketFactory();
+  private SSLSocket getSSLSocket(Socket socket, String sockHost, int sockPort)
+      throws Exception {
+    SSLSocketFactory factory;
+    if (http.isTlsCheckCertificates()) {
+      factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
+    } else {
+      SSLContext sslContext = SSLContext.getInstance("TLS");
+      sslContext.init(null,
+          new TrustManager[] { new DummyX509TrustManager(null) }, null);
+      factory = sslContext.getSocketFactory();
+    }
     SSLSocket sslsocket = (SSLSocket) factory
         .createSocket(socket, sockHost, sockPort, true);
diff --git a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
index c185f9bdc..2cd29d3e9 100644
--- a/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
+++ b/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
@@ -47,7 +47,7 @@
 import org.apache.commons.httpclient.params.HttpConnectionManagerParams;
 import org.apache.commons.httpclient.protocol.Protocol;
 import org.apache.commons.httpclient.protocol.ProtocolSocketFactory;
-
+import org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory;
 import org.apache.commons.lang.StringUtils;
 import org.apache.nutch.crawl.CrawlDatum;
 import or
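The factory selection in the `HttpResponse.java` hunk of the diff above can be tried in isolation with the standalone sketch below. It is an approximation under stated assumptions, not the Nutch code: `TlsFactorySketch` and `TrustAllManager` are hypothetical names, `TrustAllManager` merely stands in for the role `DummyX509TrustManager` plays in the diff, and a JVM system property replaces the Nutch `Configuration` lookup of `http.tls.certificates.check`.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration of the check/no-check switch.
final class TlsFactorySketch {

  /** Stand-in trust manager (assumption): accepts every certificate chain. */
  static final class TrustAllManager implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
      // intentionally empty: no validation
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
      // intentionally empty: no validation
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
      return new X509Certificate[0];
    }
  }

  /**
   * Returns the JVM default factory (which validates certificates) when
   * checks are enabled, otherwise a factory backed by the trust-all manager.
   */
  static SSLSocketFactory socketFactory(boolean checkCertificates)
      throws GeneralSecurityException {
    if (checkCertificates) {
      return (SSLSocketFactory) SSLSocketFactory.getDefault();
    }
    SSLContext sslContext = SSLContext.getInstance("TLS");
    sslContext.init(null, new TrustManager[] { new TrustAllManager() }, null);
    return sslContext.getSocketFactory();
  }

  public static void main(String[] args) throws Exception {
    // Assumption: a system property replaces the Nutch Configuration lookup.
    boolean check = Boolean.parseBoolean(
        System.getProperty("http.tls.certificates.check", "false"));
    SSLSocketFactory factory = socketFactory(check);
    System.out.println("certificate check enabled: " + check
        + ", factory: " + factory.getClass().getName());
  }
}
```

A factory built on a trust-all manager accepts self-signed and expired certificates alike, so this path only suits crawls where insecure connections are acceptable, which matches the property description in the diff.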
[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643035#comment-16643035 ]

Julien Nioche commented on NUTCH-2648:
--------------------------------------

[~wastl-nagel]

?? (code borrowed [storm-crawler#615|https://github.com/DigitalPebble/storm-crawler/issues/615], thanks [~jnioche]!)??

You are welcome. Would be fab if you could find the time to add the same behaviour to the httpclient protocol in SC :)
[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642971#comment-16642971 ]

Markus Jelsma commented on NUTCH-2648:
--------------------------------------

I misread the patch regarding the other plugins. So +1, no further comments.
[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642968#comment-16642968 ]

Sebastian Nagel commented on NUTCH-2648:
----------------------------------------

Hi [~markus17],

it should work for protocol-http, protocol-httpclient and protocol-okhttp. I've tested it using parsechecker for all three plugins, here for httpclient:
{noformat}
% bin/nutch parsechecker -Dhttp.tls.certificates.check=true -Dplugin.includes='protocol-httpclient|parse-tika' https://ingevd.waarbenjij.nu/kaart/5000179/dag-4
...
Fetch failed with protocol status: exception(16), lastModified=0: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

% bin/nutch parsechecker -Dhttp.tls.certificates.check=false -Dplugin.includes='protocol-httpclient|parse-tika' ...
Status: success(1,0)
...
{noformat}

Regarding the protocol-htmlunit and the two selenium-based protocol plugins: it's now tracked in NUTCH-2649.

?? Maybe it's also an idea to add a dummy trust manager in Nutch' base code??

Yes, or in lib-http. While implementing this for protocol-okhttp, I've thought about trying to bundle the DummyTrustManager functionalities. But the code overlaps only partially, so I was lazy here.
[jira] [Created] (NUTCH-2649) Optionally skip TLS/SSL certificate validation for protocol-selenium and protocol-htmlunit
Sebastian Nagel created NUTCH-2649:
--------------------------------------

             Summary: Optionally skip TLS/SSL certificate validation for protocol-selenium and protocol-htmlunit
                 Key: NUTCH-2649
                 URL: https://issues.apache.org/jira/browse/NUTCH-2649
             Project: Nutch
          Issue Type: Improvement
          Components: protocol
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.16

NUTCH-2648 adds a property to enable/disable the TLS/SSL certificate validation for protocol-http, protocol-httpclient and protocol-okhttp. It should also be supported by the remaining protocol plugins:
* protocol-selenium,
* protocol-interactiveselenium and
* protocol-htmlunit