As per the suggestion below, I am trying to upgrade to Nutch 1.12. I am using Solr 5.3.1. Crawling went very well with respect to:
1: https crawling
2: boilerplate (Boilerpipe) extraction through Tika

The only problem so far is an IOException; please see below. I searched and found an existing JIRA issue, NUTCH-2269 <https://issues.apache.org/jira/browse/NUTCH-2269>. I get the same error if I try to clean via the old command:

    bin/nutch solrclean crawl-adc/crawldb http://localhost:8983/solr/nutch

But cleaning through the linkdb worked, as described in the JIRA issue, i.e.:

    bin/nutch solrclean crawl-adc/linkdb http://localhost:8983/solr/nutch

I just want to know if there is a fix or an alternate way of cleaning, whether cleaning via the linkdb is okay, and what the repercussions of cleaning via the linkdb might be.

Exception from logs:

java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
	at org.apache.http.util.Asserts.check(Asserts.java:34)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
	at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
	at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
	at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
	at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
	at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
	at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-08-16 15:27:47,794 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)

By the way, I am replying on the same thread; maybe I should have started a new one, but this is related to what I need.

On 8/5/16, 2:18 PM, "Arora, Madhvi" <mar...@automationdirect.com> wrote:

>Thank you very much!
>
>On 8/5/16, 2:13 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>
>>I am not sure in which version it was added; you'd have to check CHANGES.txt,
>>but upgrading is usually a good idea and very simple.
>>Markus
>>
>>-----Original message-----
>>> From: Arora, Madhvi <mar...@automationdirect.com>
>>> Sent: Friday 5th August 2016 19:53
>>> To: user@nutch.apache.org
>>> Subject: Re: Protocol change to https
>>>
>>> Markus, so to crawl https and http URLs successfully we just need to switch
>>> to a newer version of Nutch, i.e. higher than Nutch 1.10?
>>>
>>> On 8/5/16, 12:47 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>>>
>>> >Hello - see inline.
>>> >Markus
>>> >
>>> >-----Original message-----
>>> >> From: Arora, Madhvi <mar...@automationdirect.com>
>>> >> Sent: Friday 5th August 2016 18:03
>>> >> To: user@nutch.apache.org
>>> >> Subject: Protocol change to https
>>> >>
>>> >> Hi,
>>> >>
>>> >> We are using Nutch 1.10 and Solr 5. We have around 10 different websites
>>> >> that are crawled regularly. We are changing the protocol of a few
>>> >> websites from http to https, so we will have a mixed bag of http and
>>> >> https protocols.
>>> >> I checked the nutch-user mail archive and gather that we need to change
>>> >> protocol-http to protocol-httpclient.
>>> >> 1: I wanted to find out the best way to handle this.
>>> >
>>> >You can still use protocol-http; in some recent version we added TLS
>>> >support to it.
>>> >
>>> >> 2: What are the issues with using protocol-httpclient, i.e. there were
>>> >> previous references to issues with the use of protocol-httpclient.
>>> >
>>> >It does not allow unencoded URLs, but in recent Nutch releases we improved
>>> >the basic normalizer to fix that for you.
>>> >
>>> >> 3: Steps that need to be taken to update the Solr index. I think that I
>>> >> will need to delete the old http URLs from the Solr index, then re-crawl
>>> >> and index the URLs that need to be switched to https.
>>> >
>>> >Yes, just delete, recrawl, and reindex everything. And consider upgrading
>>> >to 1.12.
>>> >
>>> >> I will be grateful for any guidance or suggestions.
>>> >>
>>> >> Thanks,
>>> >> Madhvi
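P.S. While waiting for a fix to the CleaningJob, one alternate way of cleaning is to bypass Nutch entirely and issue a delete-by-query against Solr's update handler. The sketch below is only an assumption on my part: it assumes the default Nutch Solr schema where the `id` field holds the page URL, and the same core URL as the commands above. The actual curl call is commented out so nothing gets deleted by accident; adjust the query before running it for real.

```shell
#!/bin/sh
# Sketch (untested assumption): clean stale http:// documents from Solr
# with a delete-by-query, instead of running bin/nutch solrclean.
# Assumes the default Nutch schema, where the "id" field is the URL.
SOLR_URL="http://localhost:8983/solr/nutch"

# Lucene query syntax requires ':' and '/' to be escaped inside the query.
# This would match every document whose id (URL) starts with http://
PAYLOAD='<delete><query>id:http\:\/\/*</query></delete>'

echo "Would POST to ${SOLR_URL}/update?commit=true"
echo "Payload: ${PAYLOAD}"

# Uncomment to actually run the deletion:
# curl "${SOLR_URL}/update?commit=true" \
#      -H 'Content-Type: text/xml' --data-binary "${PAYLOAD}"
```

This obviously does not touch the crawldb/linkdb state, so Nutch may still think those pages are indexed; it only clears them out of Solr before a recrawl/reindex.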