I've been unable to crawl https://www.phoenix.gov using
protocol-httpclient. For some reason that site restricts TLS to the older
TLSv1, and this causes Apache HttpClient to fail with the error:

"fetch of https://www.phoenix.gov/ failed with: javax.net.ssl.SSLException:
Received fatal alert: protocol_version"

I've even tried many variations of -D options, such as

"bin/nutch fetch ... -Dhttps.protocols=SSLv3,TLSv1,TLSv1.1,TLSv1.2 ..."

only to receive the same error.
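
One way to narrow this down is to check, from the same JVM, which protocol
versions the server will actually negotiate. The following is a minimal
sketch and not part of Nutch: the class name TlsProbe and its two methods are
hypothetical, and it uses only the standard javax.net.ssl API. It assumes the
host is passed as the first argument and that HTTPS is served on port 443.

```java
import java.io.IOException;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not a Nutch class.
class TlsProbe {

    // Keep only the requested protocol names that this JVM's default
    // SSLContext reports as supported.
    static List<String> jvmSupported(List<String> wanted) {
        try {
            String[] supported = SSLContext.getDefault()
                    .getSupportedSSLParameters().getProtocols();
            List<String> result = new ArrayList<>(wanted);
            result.retainAll(Arrays.asList(supported));
            return result;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Attempt a handshake with exactly one protocol version enabled.
    // A false result means the handshake did not complete: the server may
    // have rejected the version, the JVM's security policy may disable it,
    // or the connection itself may have failed.
    static boolean handshakes(String host, int port, String protocol) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.setEnabledProtocols(new String[] { protocol });
            socket.startHandshake();
            return true;
        } catch (IOException | IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args[0]; // assumption: host name given on the command line
        List<String> wanted = Arrays.asList("SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2");
        for (String protocol : jvmSupported(wanted)) {
            System.out.println(protocol + ": "
                    + (handshakes(host, 443, protocol) ? "handshake ok" : "handshake failed"));
        }
    }
}
```

If only TLSv1 completes a handshake here, the failure is likely in which
protocol versions the client offers rather than in the fetcher itself.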

Per Markus' comment, maybe I should be using protocol-http even for SSL/TLS
sites?

Scott

On Tue, Mar 8, 2016 at 8:31 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hmm, this was true before we had decent URL normalization. It should run
> fine although you can encounter SSL issues. But those SSL issues might also
> be in protocol-http, which now also supports SSL. You should be fine with
> either plugin.
> Markus
>
> -----Original message-----
> > From: Joseph Naegele <jnaeg...@grierforensics.com>
> > Sent: Tuesday 8th March 2016 16:27
> > To: user@nutch.apache.org
> > Subject: protocol-http or protocol-httpclient?
> >
> > I'm using Nutch 1.11. The "plugin.includes" section of nutch-default.xml
> > still states that the protocol-httpclient plugin may present intermittent
> > problems. Is this still the case? What are the problems?
> >
> > There doesn't appear to be any problem crawling HTTPS using the
> > protocol-http plugin. Why do I need to use protocol-httpclient for
> > crawling via HTTPS?
> >
> > In short, I want to use the "correct" plugin because I am extending it to
> > perform a bit of extra work. "Correct" in this case means:
> > - The "recommended" of the two
> > - Whichever can crawl both HTTP and HTTPS connections
> > - Whichever performs better
> >
> > Thanks,
> > Joe
> >
> >
>
