[
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394640#comment-14394640
]
Timothy Potter commented on SOLR-7338:
--------------------------------------
Hi [[email protected]], do you think anything else needs to be done on
this one? I'd actually like to get this into the 5.1 release - patch looks good
to me. If you're comfortable with the unit test I posted, I can combine them
and commit. Thanks.
> A reloaded core will never register itself as active after a ZK session
> expiration
> ----------------------------------------------------------------------------------
>
> Key: SOLR-7338
> URL: https://issues.apache.org/jira/browse/SOLR-7338
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Timothy Potter
> Assignee: Mark Miller
> Attachments: SOLR-7338.patch, SOLR-7338_test.patch
>
>
> If a collection gets reloaded, then a core's isReloaded flag is always true.
> If a core experiences a ZK session expiration after a reload, then it won't
> ever be able to set itself to active because of the check in
> {{ZkController#register}}:
> {code}
> UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
> if (!core.isReloaded() && ulog != null) {
> // disable recovery in case shard is in construction state (for
> shard splits)
> Slice slice = getClusterState().getSlice(collection, shardId);
> if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
> Future<UpdateLog.RecoveryInfo> recoveryFuture =
> core.getUpdateHandler().getUpdateLog().recoverFromLog();
> if (recoveryFuture != null) {
> log.info("Replaying tlog for " + ourUrl + " during startup...
> NOTE: This can take a while.");
> recoveryFuture.get(); // NOTE: this could potentially block for
> // minutes or more!
> // TODO: public as recovering in the mean time?
> // TODO: in the future we could do peersync in parallel with
> recoverFromLog
> } else {
> log.info("No LogReplay needed for core=" + core.getName() + "
> baseURL=" + baseUrl);
> }
> }
> boolean didRecovery = checkRecovery(coreName, desc,
> recoverReloadedCores, isLeader, cloudDesc,
> collection, coreZkNodeName, shardId, leaderProps, core, cc);
> if (!didRecovery) {
> publish(desc, ZkStateReader.ACTIVE);
> }
> }
> {code}
> I can easily simulate this on trunk by doing:
> {code}
> bin/solr -c -z localhost:2181
> bin/solr create -c foo
> bin/post -c foo example/exampledocs/*.xml
> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
> kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
> {code}
> Where <PID> is the process ID of the Solr node. Here are the logs after the
> CONT command. As you can see below, the core never gets to setting itself as
> active again. I think the bug is that the isReloaded flag needs to get set
> back to false once the reload is successful, but I don't understand what this
> flag is needed for anyway???
> {code}
> INFO - 2015-04-01 17:28:50.962;
> org.apache.solr.common.cloud.ConnectionManager; Watcher
> org.apache.solr.common.cloud.ConnectionManager@5519dba0
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
> INFO - 2015-04-01 17:28:50.963;
> org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> INFO - 2015-04-01 17:28:51.107;
> org.apache.solr.common.cloud.ConnectionManager; Watcher
> org.apache.solr.common.cloud.ConnectionManager@5519dba0
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent
> state:Expired type:None path:null path:null type:None
> INFO - 2015-04-01 17:28:51.107;
> org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper
> session was expired. Attempting to reconnect to recover relationship with
> ZooKeeper...
> INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer
> (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing
> INFO - 2015-04-01 17:28:51.108;
> org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for
> /configs/foo
> INFO - 2015-04-01 17:28:51.108;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting :
> 192.168.1.2:8983_solr
> INFO - 2015-04-01 17:28:51.109;
> org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I
> (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a
> leader.
> INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4;
> Running listeners for /configs/foo
> INFO - 2015-04-01 17:28:51.109;
> org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired -
> starting a new one...
> INFO - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config
> update listener called for core foo_shard1_replica1
> INFO - 2015-04-01 17:28:51.110;
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect
> to ZooKeeper
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException;
> OverseerAutoReplicaFailoverThread had an error in its thread work
> loop.:org.apache.solr.common.SolrException: Error reading cluster properties
> at
> org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
> at
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
> at
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
> at
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
> at
> org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
> ... 3 more
> WARN - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4;
> listener throws error
> org.apache.solr.common.SolrException:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /configs/foo/params.json
> at
> org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
> at
> org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
> at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
> at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /configs/foo/params.json
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
> at
> org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
> ... 3 more
> ERROR - 2015-04-01 17:28:51.110;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /overseer_elect/leader
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
> INFO - 2015-04-01 17:28:51.114;
> org.apache.solr.common.cloud.ConnectionManager; Watcher
> org.apache.solr.common.cloud.ConnectionManager@5519dba0
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent
> state:SyncConnected type:None path:null path:null type:None
> INFO - 2015-04-01 17:28:51.115;
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to
> ZooKeeper
> INFO - 2015-04-01 17:28:51.115;
> org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper
> reestablished.
> INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1;
> ZooKeeper session re-connected ... refreshing core states after session
> expiration.
> INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController;
> publishing core=foo_shard1_replica1 state=down collection=foo
> INFO - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext;
> canceling election
> /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> WARN - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext;
> cancelElection did not find election node to remove
> /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> INFO - 2015-04-01 17:28:51.121;
> org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader
> 192.168.1.2:8983_solr
> INFO - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient;
> makePath: /overseer_elect/leader
> INFO - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer
> (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting
> INFO - 2015-04-01 17:28:51.128;
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting
> OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000
> autoReplicaFailoverWaitAfterExpiration=30000
> autoReplicaFailoverBadNodeExpiration=60000
> INFO - 2015-04-01 17:28:51.128;
> org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of
> collection creations
> INFO - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader;
> Updating cluster state from ZooKeeper...
> INFO - 2015-04-01 17:28:51.129;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the
> main queue
> INFO - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader;
> addZkWatch foo
> INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader;
> Updating collection state at /collections/foo/state.json from ZooKeeper...
> INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 4
> INFO - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register
> node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
> INFO - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient;
> makePath: /live_nodes/192.168.1.2:8983_solr
> INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register
> replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr
> collection:foo shard:shard1
> INFO - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 4
> INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext;
> canceling election
> /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> WARN - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext;
> cancelElection did not find election node to remove
> /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> INFO - 2015-04-01 17:28:51.134;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage:
> queueSize: 1, message = {
> "core":"foo_shard1_replica1",
> "core_node_name":"core_node1",
> "roles":null,
> "base_url":"http://192.168.1.2:8983/solr",
> "node_name":"192.168.1.2:8983_solr",
> "numShards":"1",
> "state":"down",
> "shard":"shard1",
> "collection":"foo",
> "operation":"state"} current state version: 1
> INFO - 2015-04-01 17:28:51.135;
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1
> message={
> "core":"foo_shard1_replica1",
> "core_node_name":"core_node1",
> "roles":null,
> "base_url":"http://192.168.1.2:8983/solr",
> "node_name":"192.168.1.2:8983_solr",
> "numShards":"1",
> "state":"down",
> "shard":"shard1",
> "collection":"foo",
> "operation":"state"}
> INFO - 2015-04-01 17:28:51.136;
> org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process
> for shard shard1
> INFO - 2015-04-01 17:28:51.136;
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection
> /collections/foo/state.json version: 4
> INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The
> last leader attempt started 198060ms ago.
> INFO - 2015-04-01 17:28:51.137;
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change:
> WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/collections/foo/state.json for collection foo has occurred -
> updating... (live nodes size: 1)
> INFO - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 5
> INFO - 2015-04-01 17:28:51.138;
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged
> fired on path /overseer/queue state SyncConnected
> INFO - 2015-04-01 17:28:51.138;
> org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
> continue.
> INFO - 2015-04-01 17:28:51.138;
> org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
> try and sync
> INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync
> replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync
> Success - now sync replicas to me
> INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy;
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas
> INFO - 2015-04-01 17:28:51.139;
> org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
> INFO - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient;
> makePath: /collections/foo/leaders/shard1
> INFO - 2015-04-01 17:28:51.139;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage:
> queueSize: 1, message = {
> "operation":"leader",
> "shard":"shard1",
> "collection":"foo"} current state version: 1
> INFO - 2015-04-01 17:28:51.140;
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection
> /collections/foo/state.json version: 5
> INFO - 2015-04-01 17:28:51.141;
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change:
> WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/collections/foo/state.json for collection foo has occurred -
> updating... (live nodes size: 1)
> INFO - 2015-04-01 17:28:51.142;
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged
> fired on path /overseer/queue state SyncConnected
> INFO - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 6
> INFO - 2015-04-01 17:28:51.144;
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage:
> queueSize: 1, message = {
> "operation":"leader",
> "shard":"shard1",
> "collection":"foo",
> "base_url":"http://192.168.1.2:8983/solr",
> "core":"foo_shard1_replica1",
> "state":"active"} current state version: 1
> INFO - 2015-04-01 17:28:51.145;
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection
> /collections/foo/state.json version: 6
> INFO - 2015-04-01 17:28:51.145;
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change:
> WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/collections/foo/state.json for collection foo has occurred -
> updating... (live nodes size: 1)
> INFO - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 7
> INFO - 2015-04-01 17:28:51.147;
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged
> fired on path /overseer/queue state SyncConnected
> INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is
> http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController;
> In register, core.isReloaded? true
> INFO - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader;
> Updating data for foo to ver 7
> INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch
> zkdir /configs/foo
> INFO - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader;
> Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4;
> Running listeners for /configs/foo
> INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config
> update listener called for core foo_shard1_replica1
> INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader;
> Current schema version 5 is already the latest
> INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader;
> Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current
> version of requestparams : 0
> INFO - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader;
> Current schema version 5 is already the latest
> INFO - 2015-04-01 17:28:51.201;
> org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to
> ZooKeeper
> INFO - 2015-04-01 17:28:51.202;
> org.apache.solr.common.cloud.ConnectionManager; Connected:true
> INFO - 2015-04-01 17:28:51.211;
> org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for
> /configs/foo
> INFO - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4;
> Running listeners for /configs/foo
> INFO - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config
> update listener called for core foo_shard1_replica1
> INFO - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current
> version of requestparams : 0
> INFO - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter;
> [admin] webapp=null path=/admin/cores
> params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0
> INFO - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter;
> [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099}
> status=0 QTime=11
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]