[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714086#action_12714086 ]
Otis Gospodnetic commented on NUTCH-739:
----------------------------------------

I think there are a few issues here.
# multiple tasks trying to optimize the same index (I'm assuming you are correct about this) -- yes, this should not be happening
# tasks timing out -- not sure how to handle that, since one never knows how long the optimize call will take
# your patch simply removed the optimize call -- but now where/how is the index going to get optimized after dups are deleted?

> SolrDeleteDuplications too slow when using hadoop
> -------------------------------------------------
>
>                 Key: NUTCH-739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-739
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: hadoop cluster with 3 nodes
>                      Map Task Capacity: 6
>                      Reduce Task Capacity: 6
>                      Indexer: one instance of solr server (on one of the slave nodes)
>            Reporter: Dmitry Lihachev
>             Fix For: 1.1
>
>         Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch
>
>
> In my environment I always see many warnings like this during the dedup step:
> {noformat}
> Task attempt_200905270022_0212_r_000003_0 failed to report status for 600 seconds. Killing!
> {noformat}
> solr logs:
> {noformat}
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
> May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {optimize=} 0 173599
> May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
> May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing searc...@2ad9ac58 main
> May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
> WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
> org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
> ....
> {noformat}
> So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()): because there are several job tasks, each one tries to optimize the Solr index before closing.
> The simplest way to avoid this bug is to remove that line and send an "<optimize/>" message directly to the Solr server after the dedup step.
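For what it's worth, a minimal sketch of what the reporter suggests: drop the per-task solr.optimize() call and issue a single optimize from the job driver after the dedup job completes. This assumes the SolrJ 1.x client bundled with Nutch at the time (CommonsHttpSolrServer); the class name and URL below are illustrative, not part of the attached patch.

{code:java}
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Hypothetical driver-side helper, not from the attached patch.
public class OptimizeAfterDedup {

  public static void main(String[] args) throws Exception {
    // Illustrative URL -- use whatever Solr URL the dedup job was given.
    String solrUrl = args.length > 0 ? args[0] : "http://localhost:8983/solr";

    CommonsHttpSolrServer solr = new CommonsHttpSolrServer(solrUrl);

    // Equivalent to POSTing "<optimize/>" to /update, but issued exactly
    // once from the driver, so concurrent reduce tasks no longer pile
    // optimize requests onto the same index, and no task sits silent for
    // 600+ seconds waiting on an optimize (which is what gets it killed).
    solr.optimize();
  }
}
{code}

Moving the call out of the reducers' close() would also sidestep the timeout question in point 2: the long-running optimize would happen outside any task's progress-reporting window.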