[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423492#comment-15423492 ]
Jose-Marcio Martins commented on NUTCH-2269:
--------------------------------------------

Hello,

This is from a message I posted on the nutch-users discussion list on Jun 07, 2016. Nobody answered. I tried with older Solr releases, but the problem remains. So I tried to rebuild the crawl data (and the Solr data too) from scratch, incrementally, to see at what point the problem appears.

I copy here the content of my message to the list...

Well, to find which "thing" could trigger the problem on "clean", I worked incrementally, and I found that the problem is triggered when Nutch tries to clean the following URLs from Solr:

********************************************************************************************
[nutch@crawler crawldb]$ ../../../../devel/show-urls part-00000 | grep gone
db_gone http://www.armines.net/0.85
db_gone http://www.armines.net/1.8
db_gone http://www.armines.net/agenda/3%C3%A8me-a%C3%A9rogels
db_gone http://www.armines.net/agenda/chercheurs-3d
db_gone http://www.armines.net/agenda/rencontres-2016
db_gone http://www.armines.net/association-armines/chiffres-dactivit%C3%A9
db_gone http://www.armines.net/associations-reseaux
db_gone http://www.armines.net/carnot-mines-tv/sciences-mat%C3%A9riaux/extinguo
db_gone http://www.armines.net/centres-thematiques/%C3%A9conomie-management-soci%C3%A9t%C3%A9
db_gone http://www.armines.net/centres-thematiques/%C3%A9nerg%C3%A9tique-proc%C3%A9d%C3%A9s
db_gone http://www.armines.net/centres-thematiques/math%C3%A9matiques-9
db_gone http://www.armines.net/centres-thematiques/sciences-lenvironnement
db_gone http://www.armines.net/centres-thematiques/sciences-mat%C3%A9riaux
db_gone http://www.armines.net/domaines-dapplication/energie-durable
db_gone http://www.armines.net/domaines-dapplication/transformation-mati%C3%A8re
db_gone http://www.armines.net/fr/grid4eu-solutions
db_gone http://www.armines.net/text/javascript
[nutch@crawler crawldb]$

Is it possible that the problem comes from the encoded URLs (with %XY)?

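One way to probe that question is to split the db_gone list into URLs whose path contains percent-escapes and URLs whose path does not, then see whether only one group is involved when the clean fails. Below is a minimal sketch in Python; the helper names `has_percent_escapes` and `split_by_encoding` are hypothetical and not part of Nutch, and the sample list is just a few of the URLs shown above. It only sorts the list; how each subset gets fed back to the clean job is left open.

```python
from urllib.parse import unquote, urlsplit

# A few of the db_gone URLs from the listing above (sample only).
GONE_URLS = [
    "http://www.armines.net/0.85",
    "http://www.armines.net/agenda/3%C3%A8me-a%C3%A9rogels",
    "http://www.armines.net/agenda/chercheurs-3d",
    "http://www.armines.net/text/javascript",
]


def has_percent_escapes(url):
    """Return True if decoding the URL path changes it, i.e. it holds %XY escapes."""
    path = urlsplit(url).path
    return unquote(path) != path


def split_by_encoding(urls):
    """Partition URLs into (with escapes, without escapes), preserving order."""
    encoded = [u for u in urls if has_percent_escapes(u)]
    plain = [u for u in urls if not has_percent_escapes(u)]
    return encoded, plain


if __name__ == "__main__":
    encoded, plain = split_by_encoding(GONE_URLS)
    print("with %XY escapes:")
    for u in encoded:
        # Show the decoded path next to the raw URL to make the escapes readable.
        print("  ", u, "->", unquote(urlsplit(u).path))
    print("without escapes:")
    for u in plain:
        print("  ", u)
```

Note that several of the listed URLs (for example `/0.85` and `/agenda/chercheurs-3d`) carry no escapes at all, so the split at least shows that the db_gone set is not made up of encoded URLs only.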
> Clean not working after crawl
> -----------------------------
>
>                 Key: NUTCH-2269
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2269
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.12
>         Environment: Vagrant, Ubuntu, Java 8, Solr 4.10
>            Reporter: Francesco Capponi
>             Fix For: 1.13
>
>
> I have been having this problem for a while, and I had to roll back to using the old solr clean instead of the newer version.
> Once it has correctly inserted/updated every document in Nutch, when it tries to clean, it returns error 255:
> {quote}
> 2016-05-30 10:13:04,992 WARN  output.FileOutputCommitter - Output Path is null in setupJob()
> 2016-05-30 10:13:07,284 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: content dest: content
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: title dest: title
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: host dest: host
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: segment dest: segment
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: boost dest: boost
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2016-05-30 10:13:08,133 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 15/15 documents
> 2016-05-30 10:13:08,919 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
> 2016-05-30 10:13:08,937 WARN  mapred.LocalJobRunner - job_local662730477_0001
> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>       at org.apache.http.util.Asserts.check(Asserts.java:34)
>       at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
>       at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
>       at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
>       at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
>       at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>       at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>       at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>       at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>       at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
>       at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>       at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>       at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
>       at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
>       at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
>       at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
>       at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
>       at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>       at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
>       at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
>       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> 2016-05-30 10:13:09,299 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>       at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172)
>       at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>       at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206)
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)