[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2014-10-29 Thread Ryan Cooke (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189269#comment-14189269
 ] 

Ryan Cooke commented on SOLR-5579:
--

Pretty sure we are also encountering this issue. The collection reload HTTP 
requests issued through the core admin are timing out, and a corresponding 
message is sitting in the collection-work-queue. Reloading cores with the 
reload button in the admin GUI does successfully reload the local collection, 
however. Issuing the reload HTTP request with the parameter async=true seems to 
behave the same way (the request times out).
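
For anyone trying to reproduce this, the reload request described above can be sketched roughly as follows. This is a minimal sketch, assuming the Collections API endpoint /solr/admin/collections with action=RELOAD. The host, port, collection name, and request id are hypothetical placeholders, and the code only builds the URL without sending anything. As far as the Solr reference guide describes it, the async parameter carries a caller-chosen request id rather than a boolean, so the sketch passes an id instead of true.

```python
from urllib.parse import urlencode


def build_reload_url(base_url, collection, async_id=None):
    """Build a Collections API RELOAD URL; nothing is sent over the network."""
    params = {"action": "RELOAD", "name": collection, "wt": "json"}
    if async_id is not None:
        # Assumption: async takes a request id chosen by the caller.
        params["async"] = async_id
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical host and request id, for illustration only.
url = build_reload_url("http://localhost:8983/solr", "magento_349",
                       async_id="reload-1")
print(url)
```

Comparing the synchronous form (no async_id) with the asynchronous one should make it easier to tell whether the timeout comes from the HTTP call itself or from the queued work item never being processed.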

 Leader stops processing collection-work-queue after failed collection reload
 

 Key: SOLR-5579
 URL: https://issues.apache.org/jira/browse/SOLR-5579
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.5.1
 Environment: Debian Linux 6.0 running on VMWare
 Using embedded SOLR Jetty.
Reporter: Eric Bus
Assignee: Mark Miller
  Labels: collections, queue

 I've been experiencing the same problem a few times now. My leader in 
 /overseer_elect/leader stops processing the collection queue at 
 /overseer/collection-queue-work. The queue will build up and it will trigger 
 an alert in my monitoring tool.
 I haven't been able to pinpoint the reason that the leader stops, but usually 
 I kill the leader node to trigger a leader election. The new node will pick 
 up the queue. And this is where the problems start.
 When the new leader is processing the queue and picks up a reload for a shard 
 without an active leader, the queue stops. It keeps repeating the message 
 that there is no active leader for the shard. But a new leader is never 
 elected:
 {quote}
 ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
     at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
 ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
 INFO  - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
 {quote}
 Is the leader election in some way connected to the collection queue? If so, 
 can this be a deadlock, because it won't elect until the reload is complete?
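
 The retry counter and wait time in the log above are consistent with a doubling backoff (2^8 = 256). Below is a minimal sketch for pulling those two numbers out of such log lines and checking that reading. The regular expression and function names are illustrative only, and the doubling rule is an inference from this single log line, not a statement about the RecoveryStrategy source.

```python
import re

# Matches the "Wait N seconds ... (retry)" message quoted in the log above.
WAIT_RE = re.compile(
    r"Wait (\d+(?:\.\d+)?) seconds before trying to recover again \((\d+)\)"
)


def parse_wait(line):
    """Return (wait_seconds, retry_number) from a recovery log line, or None."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))


def doubling_wait(retry):
    """Assumed backoff rule: 2**retry seconds (inferred, not taken from Solr)."""
    return float(2 ** retry)


line = ("INFO  - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; "
        "Wait 256.0 seconds before trying to recover again (8)")
wait, retry = parse_wait(line)
print(wait, retry, wait == doubling_wait(retry))
```

 If the doubling reading holds, a core stuck in this loop retries less and less often, which would fit the observation that the queue appears to stop rather than fail loudly.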



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2014-02-21 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908116#comment-13908116
 ] 

Markus Jelsma commented on SOLR-5579:
-

Not sure if I'm struck by this too, but a cluster started to fail after a 
successful collection reload.



[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2014-02-21 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908161#comment-13908161
 ] 

Markus Jelsma commented on SOLR-5579:
-

Also, I am now sure SOLR-4260 is a problem again: after the collection became 
unusable, I now have shards out of sync. The strange thing was that while the 
cluster was unusable (could not query), we did continue sending updates, without 
errors.



[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2014-01-09 Thread Eric Bus (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866696#comment-13866696
 ] 

Eric Bus commented on SOLR-5579:


Just a quick update: the leader again stopped working. I had to restart the 
cluster to get everything working again. The script that is running to check 
the status did not work, so unfortunately I don't have additional information 
from the logs. When I do, I'll report back here.



[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2013-12-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856929#comment-13856929
 ] 

Mark Miller commented on SOLR-5579:
---

Have to look for the issue, but may have been fixed in 4.6. 



[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload

2013-12-26 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857195#comment-13857195
 ] 

Mark Miller commented on SOLR-5579:
---

Nope, scratch the above comment. Will need to look into this.
