[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

Eric Bus (JIRA) Thu, 09 Jan 2014 07:17:20 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866696#comment-13866696
 ]


Eric Bus commented on SOLR-5579:
--------------------------------

Just a quick update: the leader stopped working again, and I had to restart 
the cluster to get everything running. The script I set up to check the 
status did not work, so unfortunately I have no additional information from 
the logs. When I do, I'll report back here.
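
For anyone who wants to run a similar status check, here is a minimal sketch of 
the stall test, not the script mentioned above. It assumes the children of 
/overseer/collection-queue-work are sequential ZooKeeper nodes with names like 
"qn-0000000003" (the "qn-" prefix is an assumption), and that the child names 
are sampled periodically with whatever ZooKeeper client is at hand. The 
function names and the min_samples parameter are hypothetical.

```python
def oldest_item(children):
    """Return the queue entry with the lowest sequence number, or None if empty."""
    if not children:
        return None
    # Sequential znodes end in a zero-padded counter after the last dash.
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))


def queue_is_stalled(samples, min_samples=3):
    """Decide whether the queue looks stuck.

    samples is a list of child-name lists, oldest sample first. The queue is
    treated as stalled when it is non-empty and its oldest entry has not
    changed over the last min_samples samples.
    """
    if len(samples) < min_samples:
        return False
    heads = [oldest_item(sample) for sample in samples[-min_samples:]]
    return heads[0] is not None and all(head == heads[0] for head in heads)
```

Comparing the oldest entry rather than the queue length avoids false alarms 
when the queue is busy but still draining.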

> Leader stops processing collection-work-queue after failed collection reload
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-5579
>                 URL: https://issues.apache.org/jira/browse/SOLR-5579
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.5.1
>         Environment: Debian Linux 6.0 running on VMware
> Using the Jetty embedded in Solr.
>            Reporter: Eric Bus
>            Assignee: Mark Miller
>              Labels: collections, queue
>
> I've run into the same problem a few times now. The leader registered at 
> /overseer_elect/leader stops processing the collection queue at 
> /overseer/collection-queue-work. The queue builds up and triggers an alert 
> in my monitoring tool.
> I haven't been able to pinpoint why the leader stops, but I usually kill the 
> leader node to trigger a leader election. The new leader then picks up the 
> queue, and this is where the problems start.
> When the new leader processes the queue and picks up a reload for a shard 
> without an active leader, the queue stops. It keeps repeating the message 
> that there is no active leader for the shard, but a new leader is never 
> elected:
> {quote}
> ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
> INFO  - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
> {quote}
> Is the leader election in some way connected to the collection queue? If so, 
> could this be a deadlock, with no leader being elected until the reload is 
> complete?
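
A note on the retry timing in the quoted log: a wait of 256.0 seconds at 
attempt (8) is consistent with a delay that doubles on each attempt 
(2^8 = 256). The sketch below only illustrates that reading of the log. It is 
not taken from the Solr source, and the function name and the optional cap 
are hypothetical.

```python
def recovery_wait_seconds(attempt, cap=None):
    """Return the wait before the given retry attempt, doubling each time.

    cap is a hypothetical upper bound in seconds; None means no bound.
    """
    wait = float(2 ** attempt)
    return wait if cap is None else min(wait, float(cap))
```

Under this reading, a core that keeps failing to find a shard leader retries 
less and less often, so the error can persist for a long time without much 
log output.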



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

