[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035520#comment-14035520 ]
ASF subversion and git services commented on SOLR-6095: ------------------------------------------------------- Commit 1603383 from [~noble.paul] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1603383 ] SOLR-6095 SolrCloud cluster can end up without an overseer with overseer roles > SolrCloud cluster can end up without an overseer with overseer roles > --------------------------------------------------------------------- > > Key: SOLR-6095 > URL: https://issues.apache.org/jira/browse/SOLR-6095 > Project: Solr > Issue Type: Bug > Components: SolrCloud > Affects Versions: 4.8 > Reporter: Shalin Shekhar Mangar > Assignee: Noble Paul > Fix For: 5.0, 4.10 > > Attachments: SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch, > SOLR-6095.patch > > > We have a large cluster running on ec2 which occasionally ends up without an > overseer after a rolling restart. We always restart our overseer nodes at the > very last otherwise we end up with a large number of shards that can't > recover properly. > This cluster is running a custom branch forked from 4.8 and has SOLR-5473, > SOLR-5495 and SOLR-5468 applied. We have a large number of small collections > (120 collections each with approx 5M docs) on 16 Solr nodes. We are also > using the overseer roles feature to designate two specified nodes as > overseers. However, I think the problem that we're seeing is not specific to > the overseer roles feature. > As soon as the overseer was shutdown, we saw the following on the node which > was next in line to become the overseer: > {code} > 2014-05-20 09:55:39,261 [main-EventThread] INFO solr.cloud.ElectionContext > - I am going to be the leader ec2-xxxxxxxxxx.compute-1.amazonaws.com:8987_solr > 2014-05-20 09:55:39,265 [main-EventThread] WARN solr.cloud.LeaderElector - > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /overseer_elect/leader > at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) > at > org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) > at > org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429) > at > org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386) > at > org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373) > at > org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551) > at > org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142) > at > org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110) > at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55) > at > org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > {code} > When the overseer leader node is gracefully shutdown, we get the following in > the logs: > {code} > 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer - Exception in > Overseer main queue loop > org.apache.solr.common.SolrException: Could not load collection from ZK:sm12 > at > org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778) > at > org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553) > at > org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:503) > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) > at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223) > at > org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767) > ... 4 more > 2014-05-20 09:55:39,254 [Thread-63] INFO solr.cloud.Overseer - Overseer > Loop exiting : ec2-xxxxxxxxxx.compute-1.amazonaws.com:8986_solr > 2014-05-20 09:55:39,256 [main-EventThread] WARN common.cloud.ZkStateReader > - ZooKeeper watch triggered, but Solr cannot talk to ZK > 2014-05-20 09:55:39,259 [ShutdownMonitor] INFO server.handler.ContextHandler > - stopped > o.e.j.w.WebAppContext{/solr,file:/vol0/cloud86/solr-webapp/webapp/},/vol0/cloud86/webapps/solr.war > {code} > Notice how the overseer kept on running almost till the last point i.e. until > the jetty context stopped. On some runs, we got the following on the overseer > leader node on graceful shutdown: > {code} > 2014-05-19 21:33:43,657 [Thread-75] ERROR solr.cloud.Overseer - Exception in > Overseer main queue loop > org.apache.solr.common.SolrException: Could not load collection from ZK:sm71 > at > org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778) > at > org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553) > at > org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:503) > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) > at > org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274) > at > org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:769) > ... 4 more > 2014-05-19 21:33:43,662 [main-EventThread] WARN common.cloud.ZkStateReader > - ZooKeeper watch triggered, but Solr cannot talk to ZK > 2014-05-19 21:33:43,663 [Thread-75] INFO solr.cloud.Overseer - Overseer > Loop exiting : ec2-xxxxxxxxxxxx.compute-1.amazonaws.com:8987_solr > 2014-05-19 21:33:43,664 [OverseerExitThread] ERROR solr.cloud.Overseer - > could not read the data > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired for /overseer_elect/leader > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) > at > org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:329) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:85) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:293) > 2014-05-19 21:33:43,665 [ShutdownMonitor] INFO server.handler.ContextHandler > - stopped > o.e.j.w.WebAppContext{/solr,file:/vol0/cloud87/solr-webapp/webapp/},/vol0/cloud87/webapps/solr.war > {code} > Again the overseer was clinging on till the last moment and by the time it > exited the ZK session expired and it couldn't delete the > /overseer_elect/leader node. The exception on the next-in-line node was the > same i.e. NodeExists for /overseer_elect/leader. > In both cases, we are left with no overseers after restart. I can easily > reproduce this problem by just restarting overseer leader nodes repeatedly. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org