Hi Shi Wei,

Sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over HTTPS. At first glance, two points need
rework:

1. The protocol plugin tries to establish a TLS/SSL connection to the
proxy itself if the URL to be crawled is an https:// URL. There might
be some proxies which can do this, but the proxies I'm aware of expect
an HTTP CONNECT request [1] for HTTPS proxying.

2. The browser / driver probably also needs to be configured to
use the same proxy. As far as I can see, this isn't done, but it is
required if web content can only be reached through the proxy.
However, it might be possible by setting environment variables.
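
To illustrate point 1, here is a rough sketch of the usual CONNECT
handshake: send a plain-text CONNECT request to the proxy, wait for a
2xx status line, and only then start TLS over the same socket. This is
not the plugin's code. The class and method names are made up for this
example, and it ignores proxy authentication and timeouts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not part of Nutch or the protocol-selenium plugin.
class ConnectTunnelSketch {

  /** Builds the plain-text CONNECT request that is sent to the proxy. */
  static String buildConnectRequest(String targetHost, int targetPort) {
    String authority = targetHost + ":" + targetPort;
    return "CONNECT " + authority + " HTTP/1.1\r\n"
        + "Host: " + authority + "\r\n"
        + "\r\n";
  }

  /** Returns true if the proxy's status line carries a 2xx status code. */
  static boolean isTunnelEstablished(String statusLine) {
    if (statusLine == null) {
      return false;
    }
    String[] parts = statusLine.trim().split("\\s+");
    return parts.length >= 2
        && parts[0].startsWith("HTTP/")
        && parts[1].length() == 3
        && parts[1].charAt(0) == '2';
  }

  /** Reads one line byte by byte so that no bytes beyond it are consumed. */
  private static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      if (b != '\r') {
        sb.append((char) b);
      }
    }
    return sb.toString();
  }

  /** Opens a tunnel through the proxy, then layers TLS on top of it. */
  static SSLSocket openTunnel(String proxyHost, int proxyPort,
      String targetHost, int targetPort) throws IOException {
    Socket plain = new Socket(proxyHost, proxyPort); // no TLS to the proxy
    try {
      OutputStream out = plain.getOutputStream();
      out.write(buildConnectRequest(targetHost, targetPort)
          .getBytes(StandardCharsets.ISO_8859_1));
      out.flush();

      InputStream in = plain.getInputStream();
      String status = readLine(in);
      if (!isTunnelEstablished(status)) {
        throw new IOException("Proxy refused CONNECT: " + status);
      }
      // Skip the remaining response headers up to the empty line.
      while (!readLine(in).isEmpty()) {
        // nothing to do
      }

      // The TLS handshake is done with the target host, through the tunnel.
      SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
      SSLSocket tls =
          (SSLSocket) factory.createSocket(plain, targetHost, targetPort, true);
      tls.startHandshake();
      return tls;
    } catch (IOException e) {
      plain.close();
      throw e;
    }
  }
}
```

With a proxy that expects CONNECT, the TLS handshake must only start
after the 2xx response, and it is negotiated with the target host, not
with the proxy.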
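
For point 2, one possible way (for Chromium-based browsers only) is the
--proxy-server command-line switch. The sketch below only builds that
argument from a proxy host and port, such as the values of
http.proxy.host and http.proxy.port. The helper name is hypothetical,
and how the argument would be passed on to the driver inside the plugin
is an assumption, not something the plugin does today.

```java
// Hypothetical helper, not part of Nutch or the protocol-selenium plugin.
class BrowserProxySketch {

  /** Builds the Chromium switch that routes browser traffic via a proxy. */
  static String chromeProxyArgument(String proxyHost, int proxyPort) {
    if (proxyHost == null || proxyHost.isEmpty()) {
      throw new IllegalArgumentException("proxy host must be set");
    }
    if (proxyPort < 1 || proxyPort > 65535) {
      throw new IllegalArgumentException("invalid proxy port: " + proxyPort);
    }
    return "--proxy-server=" + proxyHost + ":" + proxyPort;
  }

  public static void main(String[] args) {
    // Assumption: with Selenium on the classpath, the result could be handed
    // to ChromeOptions.addArguments(...) before the driver is created.
    System.out.println(chromeProxyArgument("proxy.example.org", 3128));
  }
}
```

Other browsers are configured differently, so this would need a
per-driver variant.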

Sorry again. Feel free to open a Jira issue to get this fixed.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method


On 10/28/21 11:45, sw.l...@quandatics.com wrote:
> Hi there,
>
> Good day!
>
> We would like to crawl the web data by executing the Nutch with Selenium
> plugin with the following command:
>
> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>
> However, it failed with the following error message:
>
> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = xxxx
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list = true
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 10000
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header = true
> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
> javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>         at sun.security.ssl.SSL
>
> FYI, we have tried the following approaches but the issues persisted.
>
> 1. Set the http.tls.certificates.check to false
> 2. Import the website's certificates to our java truststores
> 3. Our Nutch is configured with proxy
>
> Kindly advise. Thanks in advance!
>
> Best Regards,
> Shi Wei
