[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202844#comment-14202844 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

This is an excerpt from the logs immediately following the original 
exception (shown above):

{noformat}
2014-11-03 11:13:58,488 [zkCallback-2-thread-86] INFO  cloud.ElectionContext  - 
I am going to be the leader xxx.xxx.xxx.109:8081_app-search
2014-11-03 11:13:58,489 [zkCallback-2-thread-86] INFO  cloud.SolrZkClient  - 
makePath: /overseer_elect/leader
2014-11-03 11:13:58,489 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,489 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,496 [zkCallback-2-thread-86] INFO  cloud.Overseer  - 
Overseer (id=92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188) 
starting
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=appindex,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was down, I won't 
be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746105
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - My last published State was Active, it's 
okay to be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - I may be the new leader - try and sync
2014-11-03 11:13:58,504 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,506 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=true
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,508 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,521 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,560 [zkCallback-2-thread-86] INFO  
cloud.OverseerAutoReplicaFailoverThread  - Starting 
OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 
autoReplicaFailoverWaitAfterExpiration=30000 
autoReplicaFailoverBadNodeExpiration=60000
2014-11-03 11:13:58,575 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,580 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746107
2014-11-03 11:13:58,583 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,584 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,585 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,588 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,599 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,606 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746109
2014-11-03 11:13:58,609 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,609 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,610 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,610 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,615 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,621 
[OverseerCollectionProcessor-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.OverseerCollectionProcessor  - Process current queue of collection 
creations
2014-11-03 11:13:58,623 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,626 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - Starting to work on the main queue
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,630 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746111
2014-11-03 11:13:58,639 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,640 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,641 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,641 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,645 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - Update state numShards=null message={
  "operation":"state",
  "state":"down",
  "base_url":"http://xxx.xxx.xxx.160:8081/app-search",
  "core":"appindex",
  "node_name":"xxx.xxx.xxx.160:8081_app-search",
  "shard":"shard1",
  "collection":"appindex"}
2014-11-03 11:13:58,645 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - node=core_node3 is already registered
2014-11-03 11:13:58,646 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,646 [zkCallback-2-thread-86] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,658 [zkCallback-2-thread-88] INFO  cloud.ZkStateReader  - A 
cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
2014-11-03 11:13:58,660 [zkCallback-2-thread-86] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,666 [zkCallback-2-thread-86] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,667 [zkCallback-2-thread-86] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746113
2014-11-03 11:13:58,669 [zkCallback-2-thread-86] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,670 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,671 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,671 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,676 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,676 [zkCallback-2-thread-88] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,687 [zkCallback-2-thread-88] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,689 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - Update state numShards=0 message={
  "operation":"state",
  "core_node_name":"core_node1",
  "numShards":"0",
  "shard":"shard1",
  "roles":null,
  "state":"recovery_failed",
  "core":"app-analytics",
  "collection":"app-analytics",
  "node_name":"xxx.xxx.xxx.109:8081_app-search",
  "base_url":"http://xxx.xxx.xxx.109:8081/app-search"}
2014-11-03 11:13:58,704 [zkCallback-2-thread-88] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,705 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - Update state numShards=0 message={
  "operation":"state",
  "core_node_name":"core_node2",
  "numShards":"0",
  "shard":"shard1",
  "roles":null,
  "state":"recovery_failed",
  "core":"app-analytics",
  "collection":"app-analytics",
  "node_name":"xxx.xxx.xxx.154:8081_app-search",
  "base_url":"http://xxx.xxx.xxx.154:8081/app-search"}
2014-11-03 11:13:58,719 [zkCallback-2-thread-88] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,720 
[OverseerStateUpdate-92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_0000000188]
 INFO  cloud.Overseer  - Update state numShards=0 message={
  "operation":"state",
  "core_node_name":"core_node1",
  "numShards":"0",
  "shard":"shard1",
  "roles":null,
  "state":"recovery_failed",
  "core":"app-analytics",
  "collection":"app-analytics",
  "node_name":"xxx.xxx.xxx.109:8081_app-search",
  "base_url":"http://xxx.xxx.xxx.109:8081/app-search"}
2014-11-03 11:13:58,724 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,732 [zkCallback-2-thread-86] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,732 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746115
2014-11-03 11:13:58,734 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,735 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,736 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,736 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,741 [zkCallback-2-thread-86] INFO  cloud.DistributedQueue  
- LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
{noformat}

> Recovery/election for invalid core results in rapid-fire re-attempts until 
> /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production solr server 
> down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is 
> perpetually down because its configs are not yet completely updated for 
> SolrCloud. This was thought to be harmless since it's not currently in use. 
> - Solr experienced an "internal server error" supposedly because of "No space 
> left on device" even though we appeared to have ~10GB free. 
> - Solr immediately went into recovery, and subsequent leader election for 
> each shard of each core. 
> - Our primary core recovered immediately. Our additional core, which was 
> never active in the first place, attempted to recover but of course couldn't 
> due to the improper configs. 
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe 
> 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster 
> coordination can no longer play out, and Solr topples over. 
> I know this is a bit of an unusual circumstance due to us keeping the dead 
> core around, and our quick solution has been to remove said core. However, I 
> can see other potential scenarios that might cause the same issue to arise. 
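The feedback loop described in the steps above can be sketched as a minimal, self-contained model. This is hypothetical illustration code, not Solr's actual RecoveryStrategy/ElectionContext implementation: each failed recovery publishes a state change back onto the overseer queue, which triggers another election and another doomed recovery attempt, and with no retry cap or backoff the queue never drains.

{noformat}
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the reported failure loop: a core with no UpdateLog
// fails recovery, publishes "recovery_failed" to the overseer queue, and is
// immediately re-entered into leader election, which fails again.
public class RecoveryLoopSketch {
    // Returns the number of recovery attempts made before the loop stops.
    static int simulate(int maxEvents, boolean withRetryCap, int retryCap) {
        Deque<String> overseerQueue = new ArrayDeque<>();
        int attempts = 0;
        overseerQueue.add("state:down");          // initial trigger
        while (!overseerQueue.isEmpty() && attempts < maxEvents) {
            overseerQueue.poll();                 // overseer consumes the event
            attempts++;
            if (withRetryCap && attempts >= retryCap) {
                break;                            // give up after N tries
            }
            // Recovery fails ("No UpdateLog found"); the node publishes a new
            // state change, re-enqueueing work and restarting the election.
            overseerQueue.add("state:recovery_failed");
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Without a cap, the loop only stops at the external cutoff:
        // the queue is refilled on every iteration and never drains.
        System.out.println(simulate(10000, false, 0));
        // With a retry cap (or a terminal state), it terminates quickly.
        System.out.println(simulate(10000, true, 5));
    }
}
{noformat}

A fix along these lines (a terminal state after repeated failures, or exponential backoff between attempts) would keep one broken core from flooding /overseer/queue for the whole cluster.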



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
