The issue is now tracked in https://issues.apache.org/jira/browse/NUTCH-2907
On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over https. There are two points which need (at a
> first glance) a rework:
>
> 1. the protocol tries to establish a TLS/SSL connection to the proxy if
>    the URL to be crawled is a https:// URL. There might be some proxies
>    which can do this, but the proxies I'm aware of expect a HTTP CONNECT
>    [1] for HTTPS proxying.
>
> 2. probably also the browser / driver needs to be configured to
>    use the same proxy. Afaics, this isn't done but is a requirement
>    if the proxy is required for accessing web content. However, it
>    might be possible by setting environment variables.
>
> Sorry again. Feel free to open a Jira issue to get this fixed.
>
> Best,
> Sebastian
>
> [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
>
>
> On 10/28/21 11:45, sw.l...@quandatics.com wrote:
>> Hi there,
>>
>> Good day!
>>
>> We would like to crawl the web data by executing the Nutch with Selenium
>> plugin with the following command:
>>
>>   $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>>
>> However, it failed with the following error message:
>>
>>   2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = xxxx
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = true
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 10000
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>   2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = true
>>   2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
>>   javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
>>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>>     at sun.security.ssl.SSL
>>
>> FYI, we have tried the following approaches but the issues persisted.
>>
>> 1. Set the http.tls.certificates.check to false
>> 2. Import the website's certificates to our java truststores
>> 3. Our Nutch is configured with proxy
>>
>> Kindly advise. Thanks in advance!
>>
>> Best Regards,
>> Shi Wei
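
For readers following the thread, point 1 of the reply (a proxy that expects HTTP CONNECT before any TLS handshake) can be illustrated with a minimal sketch in plain Java. This is not the Nutch code and not necessarily how NUTCH-2907 is or will be fixed: the class ConnectTunnel and its methods are hypothetical names invented for this example, and the sketch assumes a proxy that needs no authentication. It speaks plain HTTP to the proxy first and only starts TLS, addressed to the target host, once the proxy has answered with a 2xx status.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical helper (not part of Nutch): TLS to a target host through an HTTP proxy. */
final class ConnectTunnel {

  /** Builds the plain-text CONNECT request that is sent to the proxy. */
  static String buildConnectRequest(String host, int port) {
    String target = host + ":" + port;
    return "CONNECT " + target + " HTTP/1.1\r\n"
        + "Host: " + target + "\r\n"
        + "\r\n";
  }

  /** Returns true if the proxy's status line carries a 2xx status code. */
  static boolean isTunnelEstablished(String statusLine) {
    if (statusLine == null) {
      return false;
    }
    String[] parts = statusLine.trim().split("\\s+");
    return parts.length >= 2
        && parts[0].startsWith("HTTP/")
        && parts[1].length() == 3
        && parts[1].charAt(0) == '2';
  }

  /** Reads one line byte by byte so no bytes after the header block are consumed. */
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      if (b != '\r') {
        sb.append((char) b);
      }
    }
    return sb.toString();
  }

  /** Opens a tunnel via the proxy, then performs the TLS handshake with the target host. */
  static SSLSocket open(String proxyHost, int proxyPort, String host, int port)
      throws IOException {
    Socket plain = new Socket(proxyHost, proxyPort); // plain TCP to the proxy, no TLS yet
    try {
      OutputStream out = plain.getOutputStream();
      out.write(buildConnectRequest(host, port).getBytes(StandardCharsets.US_ASCII));
      out.flush();

      InputStream in = plain.getInputStream();
      String status = readLine(in);
      if (!isTunnelEstablished(status)) {
        throw new IOException("Proxy refused CONNECT: " + status);
      }
      while (!readLine(in).isEmpty()) {
        // skip the remaining response headers up to the empty line
      }

      // Layer TLS over the established tunnel; the handshake is with the target, not the proxy.
      SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
      SSLSocket tls = (SSLSocket) factory.createSocket(plain, host, port, true);
      tls.startHandshake();
      return tls;
    } catch (IOException e) {
      plain.close();
      throw e;
    }
  }
}
```

A call such as ConnectTunnel.open(proxyHost, proxyPort, "cwiki.apache.org", 443) would return a socket on which an ordinary HTTP request can be written. Point 2 of the reply (making the browser or driver use the same proxy) is a separate matter and is not covered by this sketch.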