Hi Shi Wei,

sorry, but it looks like the Selenium protocol plugin has never been used with a proxy over https. There are two points which, at first glance, need a rework:

1. the protocol tries to establish a TLS/SSL connection to the proxy
   if the URL to be crawled is a https:// URL. There might be some
   proxies which can do this, but the proxies I'm aware of expect
   a HTTP CONNECT [1] for HTTPS proxying.

2. probably also the browser / driver needs to be configured to use
   the same proxy. Afaics, this isn't done but is a requirement
   if the proxy is required for accessing web content. However, it
   might be possible by setting environment variables.

Sorry again. Feel free to open a Jira issue to get this fixed.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method


On 10/28/21 11:45, sw.l...@quandatics.com wrote:
> Hi there,
>
> Good day!
>
> We would like to crawl the web data by executing the Nutch with Selenium
> plugin with the following command:
>
> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>
> However, it failed with the following error message:
>
> 2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = xxxx
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = true
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 10000
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = true
> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
> javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>     at sun.security.ssl.SSL
>
> FYI, we have tried the following approaches but the issues persisted.
>
> 1. Set the http.tls.certificates.check to false
> 2. Import the website's certificates to our java truststores
> 3. Our Nutch is configured with proxy
>
> Kindly advise. Thanks in advance!
>
> Best Regards,
> Shi Wei
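
To illustrate point 1 of the reply: a minimal Java sketch of how a client would usually ask a proxy for an HTTPS tunnel. It sends a plain-text CONNECT request first and starts TLS with the target host only after the proxy answers with a 2xx status. This is not the plugin's code and not a proposed patch. The class name ConnectTunnelSketch and its method names are hypothetical, and proxy authentication, timeouts and Nutch configuration are deliberately left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical illustration only; not part of protocol-selenium.
public class ConnectTunnelSketch {

    /** Builds the plain-text CONNECT request that is sent to the proxy. */
    public static String buildConnectRequest(String targetHost, int targetPort) {
        String authority = targetHost + ":" + targetPort;
        return "CONNECT " + authority + " HTTP/1.1\r\n"
             + "Host: " + authority + "\r\n"
             + "\r\n";
    }

    /** Returns true if the proxy's status line carries a 2xx status code. */
    public static boolean isTunnelEstablished(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2
            && parts[0].startsWith("HTTP/")
            && parts[1].matches("2\\d\\d");
    }

    /** Opens a TLS connection to the target host, tunnelled through the proxy. */
    public static SSLSocket openTunnel(String proxyHost, int proxyPort,
                                       String targetHost, int targetPort)
            throws IOException {
        // Step 1: plain TCP connection to the proxy (no TLS at this point).
        Socket plain = new Socket(proxyHost, proxyPort);
        try {
            OutputStream out = plain.getOutputStream();
            out.write(buildConnectRequest(targetHost, targetPort)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Step 2: read the proxy's response head up to the blank line.
            InputStream in = plain.getInputStream();
            StringBuilder head = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                head.append((char) b);
                int n = head.length();
                if (n >= 4 && head.substring(n - 4).equals("\r\n\r\n")) {
                    break;
                }
            }
            String statusLine = head.toString().split("\r\n", 2)[0];
            if (!isTunnelEstablished(statusLine)) {
                throw new IOException("Proxy refused CONNECT: " + statusLine);
            }

            // Step 3: layer TLS over the tunnel; the handshake is with the
            // target host, not with the proxy.
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            SSLSocket tls = (SSLSocket) factory.createSocket(
                    plain, targetHost, targetPort, true);
            tls.startHandshake();
            return tls;
        } catch (IOException e) {
            plain.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        // Prints the request only; no network access is needed for this.
        System.out.print(buildConnectRequest("cwiki.apache.org", 443));
    }
}
```

If the proxy only accepts plain HTTP on its port, attempting the TLS handshake directly against it (instead of after step 2) would plausibly produce the kind of SSLHandshakeException shown in the log above.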