Markus, so to crawl both https and http URLs successfully, we just need to switch to a newer version of Nutch, i.e. higher than Nutch 1.10?

On 8/5/16, 12:47 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

>Hello - see inline.
>Markus
>
>-----Original message-----
>> From: Arora, Madhvi <mar...@automationdirect.com>
>> Sent: Friday 5th August 2016 18:03
>> To: user@nutch.apache.org
>> Subject: Protocol change to https
>>
>> Hi,
>>
>> We are using Nutch 1.10 and Solr 5. We have around 10 different web sites
>> that are crawled regularly. We are changing the protocol of a few websites
>> from http to https, so we will have a mixed bag of http and https protocols.
>> I checked the Nutch user mail archive and gathered that we need to change
>> protocol-http to protocol-httpclient.
>> 1: I wanted to find out the best way to handle this.
>
>You can still use protocol-http; in some recent version we added TLS support
>to it.
>
>> 2: What are the issues with using protocol-httpclient? There were
>> previous references to issues with the use of protocol-httpclient.
>
>It does not allow unencoded URLs, but in recent Nutch versions we improved the
>basic normalizer to fix it for you.
>
>> 3: Steps that need to be taken to update the Solr index. I think that I will
>> need to delete the old http URLs from the Solr index, then re-crawl and index
>> the URLs that need to be switched to https.
>
>Yes, just delete and recrawl and reindex everything. And consider upgrading to
>1.12.
>
>>
>> I will be grateful for any guidance or suggestions.
>>
>> Thanks,
>> Madhvi
>>
>>
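
For the "delete the old http urls" step discussed above, here is a minimal sketch of what a Solr delete-by-query body might look like. It assumes the Solr `id` field holds the page URL, which should be checked against the schema actually in use. The host name `www.example.com` and the helper names are placeholders for illustration, not anything from this thread. The code only builds the JSON body; it would then be POSTed to the core's `/update` handler, followed by a commit.

```python
import json
import re

# Characters with special meaning in Solr's standard query parser.
_SOLR_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr_term(text):
    """Backslash-escape query-parser special characters in a term."""
    return _SOLR_SPECIAL.sub(r'\\\1', text)


def build_delete_payload(host):
    """Build a JSON delete-by-query body matching every http:// URL on `host`.

    Assumption: the `id` field stores the full URL of each document.
    """
    prefix = escape_solr_term("http://" + host + "/")
    # Trailing * is left unescaped so it acts as a prefix wildcard.
    return json.dumps({"delete": {"query": "id:" + prefix + "*"}})


if __name__ == "__main__":
    # Hypothetical host; substitute each site that moved to https.
    print(build_delete_payload("www.example.com"))
```

Running the delete for one host first and checking the document count before committing is a cautious way to confirm the query matches only the old http entries.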