[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391503#comment-14391503 ]
Timothy Potter commented on SOLR-7338: -------------------------------------- bq. The flag is "has this core been reloaded before" - how is that a temporary state? I'm not saying it's temporary, I'm saying I can't see what possible value knowing that provides long term? I can see how that would be useful for reporting if it were a timestamp of some sort, such as last reloaded on, but a simple boolean makes no sense to me, nor do I see it being used anywhere in the code other than determining if things like tlog replay should be skipped while the core is initializing. Once it is fully active, the flag seems useless from either a reporting perspective or a state management perspective. But I think we've wasted enough time on this one ... > A reloaded core will never register itself as active after a ZK session > expiration > ---------------------------------------------------------------------------------- > > Key: SOLR-7338 > URL: https://issues.apache.org/jira/browse/SOLR-7338 > Project: Solr > Issue Type: Bug > Components: SolrCloud > Reporter: Timothy Potter > Assignee: Timothy Potter > > If a collection gets reloaded, then a core's isReloaded flag is always true. > If a core experiences a ZK session expiration after a reload, then it won't > ever be able to set itself to active because of the check in > {{ZkController#register}}: > {code} > UpdateLog ulog = core.getUpdateHandler().getUpdateLog(); > if (!core.isReloaded() && ulog != null) { > // disable recovery in case shard is in construction state (for > shard splits) > Slice slice = getClusterState().getSlice(collection, shardId); > if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) { > Future<UpdateLog.RecoveryInfo> recoveryFuture = > core.getUpdateHandler().getUpdateLog().recoverFromLog(); > if (recoveryFuture != null) { > log.info("Replaying tlog for " + ourUrl + " during startup... > NOTE: This can take a while."); > recoveryFuture.get(); // NOTE: this could potentially block for > // minutes or more! > // TODO: public as recovering in the mean time? > // TODO: in the future we could do peersync in parallel with > recoverFromLog > } else { > log.info("No LogReplay needed for core=" + core.getName() + " > baseURL=" + baseUrl); > } > } > boolean didRecovery = checkRecovery(coreName, desc, > recoverReloadedCores, isLeader, cloudDesc, > collection, coreZkNodeName, shardId, leaderProps, core, cc); > if (!didRecovery) { > publish(desc, ZkStateReader.ACTIVE); > } > } > {code} > I can easily simulate this on trunk by doing: > {code} > bin/solr -c -z localhost:2181 > bin/solr create -c foo > bin/post -c foo example/exampledocs/*.xml > curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo" > kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID> > {code} > Where <PID> is the process ID of the Solr node. Here are the logs after the > CONT command. As you can see below, the core never gets to setting itself as > active again. I think the bug is that the isReloaded flag needs to get set > back to false once the reload is successful, but I don't understand what this > flag is needed for anyway??? > {code} > INFO - 2015-04-01 17:28:50.962; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:Disconnected type:None path:null path:null type:None > INFO - 2015-04-01 17:28:50.963; > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected > INFO - 2015-04-01 17:28:51.107; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:Expired type:None path:null path:null type:None > INFO - 2015-04-01 17:28:51.107; > org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper > session was expired. Attempting to reconnect to recover relationship with > ZooKeeper... > INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer > (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing > INFO - 2015-04-01 17:28:51.108; > org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for > /configs/foo > INFO - 2015-04-01 17:28:51.108; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : > 192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.109; > org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I > (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a > leader. > INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.109; > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - > starting a new one... > INFO - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.110; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; > OverseerAutoReplicaFailoverThread had an error in its thread work > loop.:org.apache.solr.common.SolrException: Error reading cluster properties > at > org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772) > at > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150) > at > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76) > at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308) > at > org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765) > ... 3 more > WARN - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; > listener throws error > org.apache.solr.common.SolrException: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired for /configs/foo/params.json > at > org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160) > at > org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907) > at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503) > at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351) > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /configs/foo/params.json > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294) > at > org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) > at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291) > at > org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150) > ... 3 more > ERROR - 2015-04-01 17:28:51.110; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired for /overseer_elect/leader > at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) > at > org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83) > at > org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259) > INFO - 2015-04-01 17:28:51.114; > org.apache.solr.common.cloud.ConnectionManager; Watcher > org.apache.solr.common.cloud.ConnectionManager@5519dba0 > name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent > state:SyncConnected type:None path:null path:null type:None > INFO - 2015-04-01 17:28:51.115; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > INFO - 2015-04-01 17:28:51.115; > org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper > reestablished. > INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; > ZooKeeper session re-connected ... refreshing core states after session > expiration. > INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; > publishing core=foo_shard1_replica1 state=down collection=foo > INFO - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; > canceling election > /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 > WARN - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; > cancelElection did not find election node to remove > /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 > INFO - 2015-04-01 17:28:51.121; > org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader > 192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; > makePath: /overseer_elect/leader > INFO - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer > (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting > INFO - 2015-04-01 17:28:51.128; > org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting > OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 > autoReplicaFailoverWaitAfterExpiration=30000 > autoReplicaFailoverBadNodeExpiration=60000 > INFO - 2015-04-01 17:28:51.128; > org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of > collection creations > INFO - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; > Updating cluster state from ZooKeeper... > INFO - 2015-04-01 17:28:51.129; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the > main queue > INFO - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; > addZkWatch foo > INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; > Updating collection state at /collections/foo/state.json from ZooKeeper... > INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 4 > INFO - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register > node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; > makePath: /live_nodes/192.168.1.2:8983_solr > INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register > replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr > collection:foo shard:shard1 > INFO - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 4 > INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; > canceling election > /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 > WARN - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; > cancelElection did not find election node to remove > /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 > INFO - 2015-04-01 17:28:51.134; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "core":"foo_shard1_replica1", > "core_node_name":"core_node1", > "roles":null, > "base_url":"http://192.168.1.2:8983/solr", > "node_name":"192.168.1.2:8983_solr", > "numShards":"1", > "state":"down", > "shard":"shard1", > "collection":"foo", > "operation":"state"} current state version: 1 > INFO - 2015-04-01 17:28:51.135; > org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1 > message={ > "core":"foo_shard1_replica1", > "core_node_name":"core_node1", > "roles":null, > "base_url":"http://192.168.1.2:8983/solr", > "node_name":"192.168.1.2:8983_solr", > "numShards":"1", > "state":"down", > "shard":"shard1", > "collection":"foo", > "operation":"state"} > INFO - 2015-04-01 17:28:51.136; > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process > for shard shard1 > INFO - 2015-04-01 17:28:51.136; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 4 > INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The > last leader attempt started 198060ms ago. > INFO - 2015-04-01 17:28:51.137; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 5 > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to > continue. > INFO - 2015-04-01 17:28:51.138; > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - > try and sync > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync > replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/ > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync > Success - now sync replicas to me > INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; > http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas > INFO - 2015-04-01 17:28:51.139; > org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: > http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1 > INFO - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; > makePath: /collections/foo/leaders/shard1 > INFO - 2015-04-01 17:28:51.139; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "operation":"leader", > "shard":"shard1", > "collection":"foo"} current state version: 1 > INFO - 2015-04-01 17:28:51.140; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 5 > INFO - 2015-04-01 17:28:51.141; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.142; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 6 > INFO - 2015-04-01 17:28:51.144; > org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: > queueSize: 1, message = { > "operation":"leader", > "shard":"shard1", > "collection":"foo", > "base_url":"http://192.168.1.2:8983/solr", > "core":"foo_shard1_replica1", > "state":"active"} current state version: 1 > INFO - 2015-04-01 17:28:51.145; > org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection > /collections/foo/state.json version: 6 > INFO - 2015-04-01 17:28:51.145; > org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: > WatchedEvent state:SyncConnected type:NodeDataChanged > path:/collections/foo/state.json for collection foo has occurred - > updating... (live nodes size: 1) > INFO - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 7 > INFO - 2015-04-01 17:28:51.147; > org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged > fired on path /overseer/queue state SyncConnected > INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are > http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is > http://192.168.1.2:8983/solr/foo_shard1_replica1/ > INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; > In register, core.isReloaded? true > INFO - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; > Updating data for foo to ver 7 > INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch > zkdir /configs/foo > INFO - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; > Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema > INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; > Current schema version 5 is already the latest > INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; > Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema > INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current > version of requestparams : 0 > INFO - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; > Current schema version 5 is already the latest > INFO - 2015-04-01 17:28:51.201; > org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to > ZooKeeper > INFO - 2015-04-01 17:28:51.202; > org.apache.solr.common.cloud.ConnectionManager; Connected:true > INFO - 2015-04-01 17:28:51.211; > org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for > /configs/foo > INFO - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; > Running listeners for /configs/foo > INFO - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config > update listener called for core foo_shard1_replica1 > INFO - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current > version of requestparams : 0 > INFO - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; > [admin] webapp=null path=/admin/cores > params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 > INFO - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; > [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} > status=0 QTime=11 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org