SolrDeleteDuplications too slow when using hadoop -------------------------------------------------
Key: NUTCH-739
URL: https://issues.apache.org/jira/browse/NUTCH-739
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.0.0
Environment: hadoop cluster with 3 nodes; Map Task Capacity: 6; Reduce Task Capacity: 6; Indexer: one instance of Solr server (on one of the slave nodes)
Reporter: Dmitry Lihachev
Fix For: 1.1

In my environment the dedup step always produces many warnings like this:

{noformat}
Task attempt_200905270022_0212_r_000003_0 failed to report status for 600 seconds. Killing!
{noformat}

Solr logs:

{noformat}
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 173599
May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2ad9ac58 main
May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
....
{noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()): because the job runs several reduce tasks, each of them tries to optimize the Solr index before closing. The simplest way to avoid this bug is to remove that line and send an "<optimize/>" message directly to the Solr server once, after the job finishes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
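A minimal sketch of the suggested workaround, assuming a Solr 1.x instance reachable at solr-host:8983 (hypothetical host/port, not taken from the report): post a single `<optimize/>` to Solr's XML update handler after the dedup job completes, instead of letting every reduce task call solr.optimize() in close().

```shell
#!/bin/sh
# Hypothetical host/port -- substitute the address of your Solr instance.
SOLR_URL="http://solr-host:8983/solr/update"

# One optimize for the whole index, issued once by the job driver,
# not once per reduce task. This is the standard XML update message
# that Solr's /update handler accepts.
curl "$SOLR_URL" \
  -H "Content-Type: text/xml" \
  --data-binary "<optimize/>"
```

Since optimize rewrites the index into a single segment, it can take minutes on a large index (QTime=173741 ms in the logs above); issuing it exactly once avoids both the redundant work and the 600-second task timeouts.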