[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394640#comment-14394640 ]
Timothy Potter commented on SOLR-7338: -------------------------------------- Hi [~markrmil...@gmail.com], do you think anything else needs to be done on this one? I'd actually like to get this into the 5.1 release - patch looks good to me. If you're comfortable with the unit test I posted, I can combine them and commit. Thanks. > A reloaded core will never register itself as active after a ZK session > expiration > ---------------------------------------------------------------------------------- > > Key: SOLR-7338 > URL: https://issues.apache.org/jira/browse/SOLR-7338 > Project: Solr > Issue Type: Bug > Components: SolrCloud > Reporter: Timothy Potter > Assignee: Mark Miller > Attachments: SOLR-7338.patch, SOLR-7338_test.patch > > > If a collection gets reloaded, then a core's isReloaded flag is always true. > If a core experiences a ZK session expiration after a reload, then it won't > ever be able to set itself to active because of the check in > {{ZkController#register}}: > {code} > UpdateLog ulog = core.getUpdateHandler().getUpdateLog(); > if (!core.isReloaded() && ulog != null) { > // disable recovery in case shard is in construction state (for > shard splits) > Slice slice = getClusterState().getSlice(collection, shardId); > if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) { > Future<UpdateLog.RecoveryInfo> recoveryFuture = > core.getUpdateHandler().getUpdateLog().recoverFromLog(); > if (recoveryFuture != null) { > log.info("Replaying tlog for " + ourUrl + " during startup... > NOTE: This can take a while."); > recoveryFuture.get(); // NOTE: this could potentially block for > // minutes or more! > // TODO: public as recovering in the mean time? > // TODO: in the future we could do peersync in parallel with > recoverFromLog > } else { > log.info("No LogReplay needed for core=" + core.getName() + " > baseURL=" + baseUrl); > } > } > boolean didRecovery = checkRecovery(coreName, desc, > recoverReloadedCores, isLeader, cloudDesc, > collection, coreZkNodeName, shardId, leaderProps, core, cc); > if (!didRecovery) { > publish(desc, ZkStateReader.ACTIVE); > } > } > {code} > I can easily simulate this on trunk by doing: > {code} > bin/solr -c -z localhost:2181 > bin/solr create -c foo > bin/post -c foo example/exampledocs/*.xml > curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo" > kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID> > {code} > Where <PID> is the process ID of the Solr node. Here are the logs after the > CONT command. As you can see below, the core never gets to setting itself as > active again. I think the bug is that the isReloaded flag needs to get set > back to false once the reload is successful, but I don't understand what this > flag is needed for anyway??? > {code} > INFO - 2015-04-01 17:28:50.962; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:Disconnected type:None path:null path:null type:None > INFO - 2015-04-01 17:28:50.963; > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected > INFO - 2015-04-01 17:28:51.107; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:Expired type:None path:null path:null type:None > INFO - 2015-04-01 17:28:51.107; > org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper > session was expired. Attempting to reconnect to recover relationship with > ZooKeeper... > INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer > (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing > INFO - 2015-04-01 17:28:51.108; > org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for > /configs/foo > INFO - 2015-04-01 17:28:51.108; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : > 192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.109; > org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I > (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a > leader. > INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.109; > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - > starting a new one... > INFO - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.110; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; > OverseerAutoReplicaFailoverThread had an error in its thread work > loop.:org.apache.solr.common.SolrException: Error reading cluster properties > at > org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772) > at > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150) > at > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76) > at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308) > at > org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765) > ... 3 more > WARN - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; > listener throws error > org.apache.solr.common.SolrException: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired for /configs/foo/params.json > at > org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160) > at > org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907) > at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503) > at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351) > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /configs/foo/params.json > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) > at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291) > at > org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150) > ... 3 more > ERROR - 2015-04-01 17:28:51.110; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired for /overseer_elect/leader > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) > at > org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259) > INFO - 2015-04-01 17:28:51.114; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:SyncConnected type:None path:null path:null type:None > INFO - 2015-04-01 17:28:51.115; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > INFO - 2015-04-01 17:28:51.115; > org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper > reestablished. > INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; > ZooKeeper session re-connected ... refreshing core states after session > expiration. > INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; > publishing core=foo_shard1_replica1 state=down collection=foo > INFO - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; > canceling election > /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 > WARN - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; > cancelElection did not find election node to remove > /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 > INFO - 2015-04-01 17:28:51.121; > org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader > 192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; > makePath: /overseer_elect/leader > INFO - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer > (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting > INFO - 2015-04-01 17:28:51.128; > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting > OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 > autoReplicaFailoverWaitAfterExpiration=30000 > autoReplicaFailoverBadNodeExpiration=60000 > INFO - 2015-04-01 17:28:51.128; > org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of > collection creations > INFO - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; > Updating cluster state from ZooKeeper... > INFO - 2015-04-01 17:28:51.129; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the > main queue > INFO - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; > addZkWatch foo > INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; > Updating collection state at /collections/foo/state.json from ZooKeeper... > INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 4 > INFO - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register > node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; > makePath: /live_nodes/192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register > replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr > collection:foo shard:shard1 > INFO - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 4 > INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; > canceling election > /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 > WARN - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; > cancelElection did not find election node to remove > /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 > INFO - 2015-04-01 17:28:51.134; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "core":"foo_shard1_replica1", > "core_node_name":"core_node1", > "roles":null, > "base_url":"http://192.168.1.2:8983/solr", > "node_name":"192.168.1.2:8983_solr", > "numShards":"1", > "state":"down", > "shard":"shard1", > "collection":"foo", > "operation":"state"} current state version: 1 > INFO - 2015-04-01 17:28:51.135; > org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1 > message={ > "core":"foo_shard1_replica1", > "core_node_name":"core_node1", > "roles":null, > "base_url":"http://192.168.1.2:8983/solr", > "node_name":"192.168.1.2:8983_solr", > "numShards":"1", > "state":"down", > "shard":"shard1", > "collection":"foo", > "operation":"state"} > INFO - 2015-04-01 17:28:51.136; > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process > for shard shard1 > INFO - 2015-04-01 17:28:51.136; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 4 > INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The > last leader attempt started 198060ms ago. > INFO - 2015-04-01 17:28:51.137; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 5 > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to > continue. > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - > try and sync > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync > replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/ > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync > Success - now sync replicas to me > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; > http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas > INFO - 2015-04-01 17:28:51.139; > org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: > http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1 > INFO - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; > makePath: /collections/foo/leaders/shard1 > INFO - 2015-04-01 17:28:51.139; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "operation":"leader", > "shard":"shard1", > "collection":"foo"} current state version: 1 > INFO - 2015-04-01 17:28:51.140; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 5 > INFO - 2015-04-01 17:28:51.141; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.142; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 6 > INFO - 2015-04-01 17:28:51.144; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "operation":"leader", > "shard":"shard1", > "collection":"foo", > "base_url":"http://192.168.1.2:8983/solr", > "core":"foo_shard1_replica1", > "state":"active"} current state version: 1 > INFO - 2015-04-01 17:28:51.145; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 6 > INFO - 2015-04-01 17:28:51.145; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 7 > INFO - 2015-04-01 17:28:51.147; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are > http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is > http://192.168.1.2:8983/solr/foo_shard1_replica1/ > INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; > In register, core.isReloaded? true > INFO - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 7 > INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch > zkdir /configs/foo > INFO - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; > Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema > INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; > Current schema version 5 is already the latest > INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; > Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema > INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current > version of requestparams : 0 > INFO - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; > Current schema version 5 is already the latest > INFO - 2015-04-01 17:28:51.201; > org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to > ZooKeeper > INFO - 2015-04-01 17:28:51.202; > org.apache.solr.common.cloud.ConnectionManager; Connected:true > INFO - 2015-04-01 17:28:51.211; > org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for > /configs/foo > INFO - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current > version of requestparams : 0 > INFO - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; > [admin] webapp=null path=/admin/cores > params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 > INFO - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; > [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} > status=0 QTime=11 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org