[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189269#comment-14189269 ]

Ryan Cooke commented on SOLR-5579:

Pretty sure we are also encountering this issue: the collection reload HTTP requests issued through the core admin are timing out, and a corresponding message is sitting in the collection-work-queue. However, reloading cores using the reload button in the admin GUI does successfully reload the local collection. Issuing the reload HTTP request with the parameter async=true seems to behave the same way (the request times out).

Leader stops processing collection-work-queue after failed collection reload

Key: SOLR-5579
URL: https://issues.apache.org/jira/browse/SOLR-5579
Project: Solr
Issue Type: Bug
Affects Versions: 4.5.1
Environment: Debian Linux 6.0 running on VMWare. Using embedded SOLR Jetty.
Reporter: Eric Bus
Assignee: Mark Miller
Labels: collections, queue

I've been experiencing the same problem a few times now. My leader in /overseer_elect/leader stops processing the collection queue at /overseer/collection-queue-work. The queue builds up and triggers an alert in my monitoring tool. I haven't been able to pinpoint the reason that the leader stops, but usually I kill the leader node to trigger a leader election. The new node then picks up the queue. And this is where the problems start.

When the new leader is processing the queue and picks up a reload for a shard without an active leader, the queue stops. It keeps repeating the message that there is no active leader for the shard, but a new leader is never elected:

{quote}
ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
{quote}

Is the leader election in some way connected to the collection queue? If so, can this be a deadlock, because it won't elect until the reload is complete?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
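The recovery log quoted in the issue description reports a 256.0 second wait before attempt 8, which happens to equal 2^8. The sketch below parses that wait line and compares it with a simple doubling backoff. This is a minimal sketch and not Solr's code: the function names, the doubling model, and the 600-second cap are all assumptions made for illustration, and a single log line is consistent with doubling but does not confirm it.

```python
import re

# Matches the RecoveryStrategy wait line quoted in the issue description.
WAIT_LINE = re.compile(
    r"Wait (?P<seconds>\d+(?:\.\d+)?) seconds before trying to recover again "
    r"\((?P<attempt>\d+)\)"
)


def parse_wait(line):
    """Return (wait_seconds, attempt) from a wait log line, or None if absent."""
    match = WAIT_LINE.search(line)
    if match is None:
        return None
    return float(match.group("seconds")), int(match.group("attempt"))


def doubling_backoff(attempt, cap_seconds=600.0):
    """Hypothetical model: wait 2**attempt seconds, capped (the cap is an assumption)."""
    return min(float(2 ** attempt), cap_seconds)


log_line = (
    "INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; "
    "Wait 256.0 seconds before trying to recover again (8)"
)
seconds, attempt = parse_wait(log_line)
# True when the logged wait matches the assumed doubling model.
matches_model = doubling_backoff(attempt) == seconds
print(seconds, attempt, matches_model)
```

If the waits do grow like this, a replica that cannot find a shard leader retries less and less often, so the "no registered leader" messages become sparse while the shard stays leaderless.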
[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908116#comment-13908116 ]

Markus Jelsma commented on SOLR-5579:

Not sure I'm struck by this too, but a cluster started to fail after a successful collection reload.
[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908161#comment-13908161 ]

Markus Jelsma commented on SOLR-5579:

Also, I am now sure SOLR-4260 is a problem again: after the collection became unusable, I now have shards out of sync. The strange thing was that while the cluster was unusable (could not query), we did continue sending updates, without errors.
[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866696#comment-13866696 ]

Eric Bus commented on SOLR-5579:

Just a quick update: the leader stopped working again. I had to restart the cluster to get everything working again. The script that is running to check the status did not work, so unfortunately I don't have additional information from the logs. When I do, I'll report back here.
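A status check for the symptom described in this issue (the queue at /overseer/collection-queue-work keeps growing while nothing is consumed) could compare two samples of the queue's child node names. The following is a minimal sketch under stated assumptions: how the child names are read from ZooKeeper is left to the caller, the function names are hypothetical, and the `qn-` names with a zero-padded sequence suffix are illustrative rather than taken from the issue.

```python
def head_of_queue(children):
    """Return the oldest entry, assuming names end in '-<sequence number>'."""
    if not children:
        return None
    return min(children, key=lambda name: int(name.rsplit("-", 1)[-1]))


def looks_stalled(previous_children, current_children):
    """True if the queue is non-empty and its oldest entry did not change."""
    head_before = head_of_queue(previous_children)
    head_now = head_of_queue(current_children)
    return head_now is not None and head_now == head_before


# Two samples taken some minutes apart (illustrative node names).
first_sample = ["qn-0000000012", "qn-0000000013"]
second_sample = ["qn-0000000012", "qn-0000000013", "qn-0000000014"]
third_sample = ["qn-0000000014"]

print(looks_stalled(first_sample, second_sample))  # head unchanged, queue grew
print(looks_stalled(second_sample, third_sample))  # head advanced
```

This is only a heuristic: a single long-running task can also keep the head of the queue in place, so the sampling interval would need to be longer than the slowest expected operation.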
[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856929#comment-13856929 ]

Mark Miller commented on SOLR-5579:

I'll have to look for the issue, but this may have been fixed in 4.6.
[jira] [Commented] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857195#comment-13857195 ]

Mark Miller commented on SOLR-5579:

Nope, scratch the above comment. Will need to look into this.