I've been unable to crawl the https://www.phoenix.gov site using protocol-httpclient. For some reason that site has limited TLS to the older TLSv1, and this causes the Apache HttpClient to fail with the following error:
"fetch of https://www.phoenix.gov/ failed with: javax.net.ssl.SSLException: Received fatal alert: protocol_version" I've even tried many variations of -D options like "bin/nutch fetch ... -Dhttps.protocols=SSLv3,TLSv1,TLSv1.1,TLSv1.2 ..." only to receive the same error. per Markus' comment maybe I should be using protocol-http even with SSL/TLS sites? Scott On Tue, Mar 8, 2016 at 8:31 AM, Markus Jelsma <markus.jel...@openindex.io> wrote: > Hmm, this was true before we had decent URL normalization. It should run > fine although you can encounter SSL issues. But those SSL issues might also > be in protocol-http, which now also supports SSL. You should be fine with > either plugin. > Markus > > -----Original message----- > > From:Joseph Naegele <jnaeg...@grierforensics.com> > > Sent: Tuesday 8th March 2016 16:27 > > To: user@nutch.apache.org > > Subject: protocol-http or protocol-httpclient? > > > > I'm using Nutch 1.11. The "plugin.includes" section of nutch-default.xml > > still states that the protocol-httpclient plugin may present intermittent > > problems. Is this still the case? What are the problems? > > > > There doesn't appear to be any problem crawling HTTPS using the > > protocol-http plugin. Why do I need to use protocol-httpclient for > crawling > > via HTTPS? > > > > In short, I want to use the "correct" plugin because I am extending it to > > perform a bit of extra work. "Correct" in this case means: > > - The "recommended" of the two > > - Whichever can crawl both HTTP and HTTPS connections > > - Whichever performs better > > > > Thanks, > > Joe > > > > >