[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202844#comment-14202844 ]
James Hardwick commented on SOLR-6707:
--------------------------------------

This is an excerpt from the logs immediately following the original exception (shown above):

{noformat}
2014-11-03 11:13:58,488 [zkCallback-2-thread-86] INFO cloud.ElectionContext - I am going to be the leader xxx.xxx.xxx.109:8081_app-search
2014-11-03 11:13:58,489 [zkCallback-2-thread-86] INFO cloud.SolrZkClient - makePath: /overseer_elect/leader
2014-11-03 11:13:58,489 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,489 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,496 [zkCallback-2-thread-86] INFO cloud.Overseer - Overseer (id=92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188) starting
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - Checking if I (core=appindex,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was down, I won't be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746105
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - My last published State was Active, it's okay to be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2014-11-03 11:13:58,504 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,506 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=true
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,508 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,521 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,560 [zkCallback-2-thread-86] INFO cloud.OverseerAutoReplicaFailoverThread - Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 autoReplicaFailoverWaitAfterExpiration=30000 autoReplicaFailoverBadNodeExpiration=60000
2014-11-03 11:13:58,575 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,580 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746107
2014-11-03 11:13:58,583 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,584 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,585 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,588 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,599 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746109
2014-11-03 11:13:58,609 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,609 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,610 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,615 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,621 [OverseerCollectionProcessor-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.OverseerCollectionProcessor - Process current queue of collection creations
2014-11-03 11:13:58,623 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,626 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - Starting to work on the main queue
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746111
2014-11-03 11:13:58,639 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,640 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,641 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,645 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - Update state numShards=null message={ "operation":"state", "state":"down", "base_url":"http://xxx.xxx.xxx.160:8081/app-search", "core":"appindex", "node_name":"xxx.xxx.xxx.160:8081_app-search", "shard":"shard1", "collection":"appindex"}
2014-11-03 11:13:58,645 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - node=core_node3 is already registered
2014-11-03 11:13:58,646 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,646 [zkCallback-2-thread-86] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,658 [zkCallback-2-thread-88] INFO cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
2014-11-03 11:13:58,660 [zkCallback-2-thread-86] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,667 [zkCallback-2-thread-86] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746113
2014-11-03 11:13:58,669 [zkCallback-2-thread-86] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,670 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,671 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,676 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,676 [zkCallback-2-thread-88] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,687 [zkCallback-2-thread-88] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,689 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - Update state numShards=0 message={ "operation":"state", "core_node_name":"core_node1", "numShards":"0", "shard":"shard1", "roles":null, "state":"recovery_failed", "core":"app-analytics", "collection":"app-analytics", "node_name":"xxx.xxx.xxx.109:8081_app-search", "base_url":"http://xxx.xxx.xxx.109:8081/app-search"}
2014-11-03 11:13:58,704 [zkCallback-2-thread-88] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,705 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - Update state numShards=0 message={ "operation":"state", "core_node_name":"core_node2", "numShards":"0", "shard":"shard1", "roles":null, "state":"recovery_failed", "core":"app-analytics", "collection":"app-analytics", "node_name":"xxx.xxx.xxx.154:8081_app-search", "base_url":"http://xxx.xxx.xxx.154:8081/app-search"}
2014-11-03 11:13:58,719 [zkCallback-2-thread-88] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,720 [OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188] INFO cloud.Overseer - Update state numShards=0 message={ "operation":"state", "core_node_name":"core_node1", "numShards":"0", "shard":"shard1", "roles":null, "state":"recovery_failed", "core":"app-analytics", "collection":"app-analytics", "node_name":"xxx.xxx.xxx.109:8081_app-search", "base_url":"http://xxx.xxx.xxx.109:8081/app-search"}
2014-11-03 11:13:58,724 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,732 [zkCallback-2-thread-86] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746115
2014-11-03 11:13:58,734 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,735 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,736 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,741 [zkCallback-2-thread-86] INFO cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
{noformat}

> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> ------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6707
> URL: https://issues.apache.org/jira/browse/SOLR-6707
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10
> Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:
> - Running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it's not currently in use.
> - Solr experienced an "internal server error", supposedly because of "No space left on device", even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, and subsequent leader election for each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire re-attempting recovery of said core, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.
>
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
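
To make the cycle in the quoted description concrete, here is a minimal, hypothetical Java sketch of that feedback loop. The class and variable names are invented for illustration and are not Solr's real code; the assumption it encodes is the one visible in the logs above: a core with no UpdateLog fails recovery instantly, publishes state=recovery_failed (one more message for the Overseer), and is sent straight back into leader election and recovery with no backoff in between.

{code:java}
// Hypothetical illustration only - names invented, not Solr's real classes.
public class RecoveryLoopSketch {

    public static void main(String[] args) {
        boolean hasUpdateLog = false;   // the broken core never has an UpdateLog configured
        int overseerQueueGrowth = 0;    // stand-in for entries added to /overseer/queue

        // Simulate roughly one second of ZooKeeper callbacks; the logs above show
        // the real cluster cycling about 20-30 times per second.
        for (int attempt = 0; attempt < 25; attempt++) {
            // ShardLeaderElectionContext: "My last published State was recovery_failed,
            // I won't be the leader." -> "going back into recovery"
            if (!hasUpdateLog) {
                // RecoveryStrategy: "No UpdateLog found - cannot recover." /
                // "Recovery failed - I give up."
                // ZkController then publishes state=recovery_failed, which adds one
                // more message to /overseer/queue for the Overseer to process.
                overseerQueueGrowth++;
                // No sleep or backoff here, so the next election/recovery cycle
                // starts within milliseconds of this one.
                continue;
            }
            // A healthy core would run peer-sync / replication recovery here instead.
        }

        System.out.println("/overseer/queue grew by " + overseerQueueGrowth
                + " entries in ~1s for a core that can never recover");
    }
}
{code}

Under that assumption, it is the missing delay between "Recovery failed - I give up." and the next "Running the leader process for shard shard1" that lets the queue grow by tens of entries per second, which matches the clogging described above.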