Markus, so to crawl both https and http URLs successfully, we just need to 
switch to a newer version of Nutch, i.e. higher than Nutch 1.10?



On 8/5/16, 12:47 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

>Hello - see inline.
>Markus 
> 
>-----Original message-----
>> From:Arora, Madhvi <mar...@automationdirect.com>
>> Sent: Friday 5th August 2016 18:03
>> To: user@nutch.apache.org
>> Subject: Protocol change to https
>> 
>> Hi,
>> 
>> We are using Nutch 1.10 and Solr 5. We have around 10 different web sites 
>> that are crawled regularly. We are changing the protocol of a few websites 
>> from http to https, so we will have a mixed bag of http and https protocols.
>> I checked the Nutch user mailing list archive and gathered that we need to 
>> change protocol-http to protocol-httpclient.
>> 1: I wanted to find out the best way to handle this.
>
>You can still use protocol-http; we added TLS support to it in a recent 
>version.
>
>> 2: What are the issues with using protocol-httpclient? There were previous 
>> references on the list to problems with it.
>
>It does not allow unencoded URLs, but in recent Nutch versions we improved the 
>basic normalizer to fix that for you.
>
>> 3: What steps need to be taken to update the Solr index? I think that I will 
>> need to delete the old http URLs from the Solr index, then re-crawl and index 
>> the URLs that need to be switched to https.
>
>Yes, just delete, recrawl, and reindex everything. And consider upgrading to 
>1.12.
>
>> 
>> I will be grateful for any guidance or suggestions.
>> 
>> Thanks,
>> Madhvi
>> 
>> 

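For the delete step discussed under point 3, here is a rough sketch of how the old http documents for one site might be removed from Solr with a delete-by-query before recrawling. It is only an illustration under assumptions that are not from the thread: the host name and update URL in the comments are hypothetical, and it assumes the page URL is stored in a field named `id`, so check your own Solr schema first.

```python
import json
import re
import urllib.request

# Characters with special meaning in the Solr standard query parser.
_SOLR_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr_term(text: str) -> str:
    """Backslash-escape query-parser special characters in a literal term."""
    return _SOLR_SPECIAL.sub(r"\\\1", text)


def build_delete_payload(host: str, field: str = "id") -> str:
    """Build a JSON delete-by-query body matching http:// documents of one host.

    `field` is an assumption: it must be the field that holds the page URL.
    """
    prefix = escape_solr_term(f"http://{host}/")
    return json.dumps({"delete": {"query": f"{field}:{prefix}*"}})


def send_delete(update_url: str, payload: str) -> int:
    """POST the payload to a Solr update handler, commit, and return the HTTP status.

    `update_url` is hypothetical, e.g. "http://localhost:8983/solr/mycore/update".
    """
    request = urllib.request.Request(
        update_url + "?commit=true",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Print the request body for inspection; nothing is sent to Solr here.
    print(build_delete_payload("www.example.com"))
```

Printing the payload first makes it possible to run the same query as a normal search in Solr and confirm that it matches only the old http documents before calling `send_delete`.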