[jira] [Created] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

Eric Bus (JIRA) Tue, 24 Dec 2013 06:28:03 -0800

Eric Bus created SOLR-5579:
------------------------------

             Summary: Leader stops processing collection-work-queue after failed collection reload
                 Key: SOLR-5579
                 URL: https://issues.apache.org/jira/browse/SOLR-5579
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.5.1
         Environment: Debian Linux 6.0 running on VMware, using the embedded Solr Jetty.
            Reporter: Eric Bus



I have run into the same problem several times now. The leader in 
/overseer_elect/leader stops processing the collection queue at 
/overseer/collection-queue-work. The queue then builds up and triggers an 
alert in my monitoring tool.
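
A minimal sketch of such a backlog check, for illustration only. It assumes a
hypothetical `list_children` callable that returns the child node names of a
ZooKeeper path (for example `KazooClient.get_children` from the kazoo
library); the threshold of 10 is an arbitrary illustrative value.

```python
from typing import Callable, List

# Queue path taken from the report above.
QUEUE_PATH = "/overseer/collection-queue-work"


def backlog_exceeds(list_children: Callable[[str], List[str]],
                    threshold: int = 10) -> bool:
    """Return True when more than `threshold` items are pending in the queue.

    `list_children` is a hypothetical callable returning the child node names
    of a ZooKeeper path; `threshold` is an arbitrary illustrative value.
    """
    return len(list_children(QUEUE_PATH)) > threshold
```

With kazoo this could be called as `backlog_exceeds(zk.get_children)` on a
started `KazooClient`.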

I haven't been able to pinpoint why the leader stops. My usual workaround is 
to kill the leader node to trigger a leader election, after which the new 
leader picks up the queue. This is where the problems start.

When the new leader processes the queue and picks up a reload for a shard 
that has no active leader, the queue stops again. The log keeps repeating the 
message that there is no active leader for the shard, but a new shard leader 
is never elected:

{quote}
ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)

ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
INFO  - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
{quote}
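
The wait in the last log line (256.0 seconds on attempt 8) is consistent with
a delay that doubles on every retry. A minimal sketch of that schedule,
inferred only from the log output and not taken from the Solr source; the
function name and the 600-second cap are assumptions for illustration.

```python
def recovery_wait_seconds(attempt: int, cap: float = 600.0) -> float:
    """Delay before retry number `attempt`, doubling each time up to `cap`.

    Assumption: the 2**attempt pattern is inferred from the log line
    "Wait 256.0 seconds ... (8)"; the cap value is hypothetical.
    """
    return min(2.0 ** attempt, cap)


print(recovery_wait_seconds(8))  # 256.0, matching the log line above
```

Under this assumption the retries alone account for several minutes of
apparent inactivity between log messages at this attempt count.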

Is the shard leader election connected in some way to the collection queue? 
If so, could this be a deadlock, because the shard won't elect a leader until 
the reload is complete?




