[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484013#comment-14484013 ] Mark Miller commented on SOLR-7338: --- Same status as yesterday - I'll look into this today.

A reloaded core will never register itself as active after a ZK session expiration
--
Key: SOLR-7338
URL: https://issues.apache.org/jira/browse/SOLR-7338
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Timothy Potter
Assignee: Mark Miller
Fix For: Trunk, 5.1
Attachments: SOLR-7338.patch, SOLR-7338_test.patch

If a collection gets reloaded, then a core's isReloaded flag is always true. If a core experiences a ZK session expiration after a reload, it will never be able to set itself to active because of the check in {{ZkController#register}}:

{code}
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && ulog != null) {
  // disable recovery in case shard is in construction state (for shard splits)
  Slice slice = getClusterState().getSlice(collection, shardId);
  if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
    Future<UpdateLog.RecoveryInfo> recoveryFuture =
        core.getUpdateHandler().getUpdateLog().recoverFromLog();
    if (recoveryFuture != null) {
      log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
      recoveryFuture.get(); // NOTE: this could potentially block for
                            // minutes or more!
      // TODO: public as recovering in the mean time?
      // TODO: in the future we could do peersync in parallel with recoverFromLog
    } else {
      log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
    }
  }
  boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
      collection, coreZkNodeName, shardId, leaderProps, core, cc);
  if (!didRecovery) {
    publish(desc, ZkStateReader.ACTIVE);
  }
}
{code}

I can easily simulate this on trunk by doing:

{code}
bin/solr -c -z localhost:2181
bin/solr create -c foo
bin/post -c foo example/exampledocs/*.xml
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
kill -STOP PID
sleep PAUSE_SECONDS
kill -CONT PID
{code}

Where PID is the process ID of the Solr node. Here are the logs after the CONT command. As you can see below, the core never gets to setting itself as active again. I think the bug is that the isReloaded flag needs to get set back to false once the reload is successful, but I don't understand what this flag is needed for anyway???

{code}
INFO - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
INFO - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None
INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_00) closing
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 192.168.1.2:8983_solr
INFO - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_00) am no longer a leader.
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo
INFO - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one...
INFO - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1
INFO - 2015-04-01 17:28:51.110; org.apache.solr.common.cloud.ConnectionManager; Waiting for
{code}
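The effect of the guard quoted above can be illustrated with a minimal, self-contained simulation (the class and method names here are hypothetical, not Solr code): once a core's isReloaded flag is true, the branch that ends in {{publish(desc, ZkStateReader.ACTIVE)}} is skipped on every later register call, so the core can never publish itself as active again after a session expiration.

```java
// Hypothetical stand-in for the ZkController#register control flow.
// It models only the boolean gate, not tlog replay or recovery.
public class ReloadedFlagDemo {

    // Mirrors the guard: the recovery/publish path only runs for
    // non-reloaded cores that have an update log.
    static boolean registerPublishesActive(boolean isReloaded, boolean hasUpdateLog) {
        boolean publishedActive = false;
        if (!isReloaded && hasUpdateLog) {
            // ... tlog replay and checkRecovery(...) would happen here ...
            publishedActive = true; // i.e. publish(desc, ZkStateReader.ACTIVE)
        }
        return publishedActive;
    }

    public static void main(String[] args) {
        // A fresh core, then the same core after a RELOAD + ZK session expiration:
        System.out.println(registerPublishesActive(false, true)); // true
        System.out.println(registerPublishesActive(true, true));  // false: stuck, never active
    }
}
```

Resetting the flag after a successful reload, as the description suggests, would make the second call behave like the first again.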
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484005#comment-14484005 ] Timothy Potter commented on SOLR-7338: -- [~markrmil...@gmail.com] I don't think the RecoveryZkTest failure is due to this ticket as it failed in a similar fashion prior to that commit: https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/2888/ I've actually beast'd that test with 5.1 100 times w/o failure, so I'd like to move forward with the RC I've already built and staged unless you advise otherwise ;-)
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481921#comment-14481921 ] Timothy Potter commented on SOLR-7338: -- good catch ... going to try to get the failure to reproduce locally now
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482127#comment-14482127 ] Mark Miller commented on SOLR-7338: --- A lot of my ChaosMonkey test runs on my local jenkins machine started failing after that commit. I have not had a chance to dig into the logs yet, though.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482068#comment-14482068 ] Timothy Potter commented on SOLR-7338: -- hmmm ... no failures with -Dbeast.iters=20 either ... not much jumping out at me in the logs either. The replica with missing data definitely received the update that is being reported at the end of the test as missing, for instance:

{code}
## Only in http://127.0.0.1:10050/p_/z/collection1: [{_version_=1497723893329166337, id=2-472}]
{code}

but earlier in the logs, we see:

{code}
[junit4]   2> 949616 T6637 C:control_collection S:shard1 R:core_node1 c:collection1 C1136 P10043 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&CONTROL=TRUE&wt=javabin} {add=[2-472 (1497723893326020609)]} 0 0
[junit4]   2> 949621 T6686 C:collection1 S:shard1 R:core_node2 c:collection1 C1134 P10059 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 0
[junit4]   2> 949622 T6670 C:collection1 S:shard1 R:core_node1 c:collection1 C1135 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 3
{code}

I used the same seed as the failed build as well - FBFBECE5D5AABA29. Do you see anything on your side, [~markrmil...@gmail.com]?
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482033#comment-14482033 ] Timothy Potter commented on SOLR-7338: -- running beast with 20 iters ... haven't been able to reproduce with 10 ... the logs from the failed test make it seem like the recovery process succeeded OK:

{code}
[junit4]   2> added docs:1022 with 0 fails deletes:510
[junit4]   2> 956835 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut Wait for recoveries to finish - wait 120 for each attempt
[junit4]   2> 956836 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Wait for recoveries to finish - collection: collection1 failOnTimeout:true timeout (sec):120
[junit4]   2> 959842 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - recoveringAfterStartup=true
[junit4]   2> 959843 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
[junit4]   2> 959845 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync WARN no frame of reference to tell if we've missed updates
[junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery PeerSync Recovery was not successful - trying replication. core=collection1
[junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Recovery was cancelled
[junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Finished recovery process. core=collection1
[junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions The last recovery attempt started 7015ms ago.
[junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions Throttling recovery attempts - waiting for 2984ms
[junit4]   2> 959847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 QTime=1
[junit4]   2> 962833 T6739 C1144 P10059 oasc.RecoveryStrategy.run Starting recovery process. core=collection1 recoveringAfterStartup=false
[junit4]   2> 962834 T6739 C1144 P10059 oasc.RecoveryStrategy.doRecovery Publishing state of core collection1 as recovering, leader is http://127.0.0.1:10050/p_/z/collection1/ and I am http://127.0.0.1:10059/p_/z/collection1/
[junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish publishing core=collection1 state=recovering collection=collection1
[junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
[junit4]   2> 962836 T6649 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
[junit4]   2> 962837 T6650 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
[junit4]   2>   "collection":"collection1",
[junit4]   2>   "core_node_name":"core_node2",
[junit4]   2>   "state":"recovering",
[junit4]   2>   "shard":"shard1",
[junit4]   2>   "base_url":"http://127.0.0.1:10059/p_/z",
[junit4]   2>   "roles":null,
[junit4]   2>   "node_name":"127.0.0.1:10059_p_%2Fz",
[junit4]   2>   "core":"collection1",
[junit4]   2>   "operation":"state",
[junit4]   2>   "numShards":"1"} current state version: 5
[junit4]   2> 962837 T6650 oasco.ReplicaMutator.updateState Update state numShards=1 message={
[junit4]   2>   "collection":"collection1",
[junit4]   2>   "core_node_name":"core_node2",
[junit4]   2>   "state":"recovering",
[junit4]   2>   "shard":"shard1",
[junit4]   2>   "base_url":"http://127.0.0.1:10059/p_/z",
[junit4]   2>   "roles":null,
[junit4]   2>   "node_name":"127.0.0.1:10059_p_%2Fz",
[junit4]   2>   "core":"collection1",
[junit4]   2>   "operation":"state",
[junit4]   2>   "numShards":"1"}
[junit4]   2> 962837 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.sendPrepRecoveryCmd Sending prep recovery command to http://127.0.0.1:10050/p_/z; WaitForState: action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A10059_p_%252Fz&coreNodeName=core_node2&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
[junit4]   2> 962838 T6650 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/collection1/state.json version: 10
[junit4]
{code}
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482399#comment-14482399 ] Mark Miller commented on SOLR-7338: --- May have been coincidental fails (still, bad new(ish) replicas out of sync fails) - I'll spend some time tomorrow looking closer and post what I find or close this issue again.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482493#comment-14482493 ] Timothy Potter commented on SOLR-7338:
---

Ok cool - I have an RC built for 5.1 with this fix in, but will do it if needed ... can't get it to reproduce with 50 beasts!

{code}
ant beast -Dbeast.iters=50 -Dtestcase=RecoveryZkTest -Dtests.method=test -Dtests.seed=FBFBECE5D5AABA29 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ar_SD -Dtests.timezone=America/Marigot -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
{code}
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481319#comment-14481319 ] ASF subversion and git services commented on SOLR-7338:
---

Commit 1671554 from [~thelabdude] in branch 'dev/trunk' [ https://svn.apache.org/r1671554 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481345#comment-14481345 ] ASF subversion and git services commented on SOLR-7338:
---

Commit 1671562 from [~thelabdude] in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1671562 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481364#comment-14481364 ] ASF subversion and git services commented on SOLR-7338:
---

Commit 1671570 from [~thelabdude] in branch 'dev/branches/lucene_solr_5_1' [ https://svn.apache.org/r1671570 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
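The committed patch itself is not quoted in this thread, so the following is a hypothetical illustration only, a grossly simplified toy model of the failure mode the reporter describes. All names are invented and none of this is Solr's actual API; the real {{ZkController#register}} also has a recovery path for reloaded cores, which the toy omits. It models a sticky reloaded flag gating the publish-ACTIVE branch, and one possible fix shape: clearing the flag after a successful reload so a later ZK session expiration behaves like a normal startup.

```java
// Toy model of the SOLR-7338 registration problem.
// All names are invented for illustration; this is not Solr code.
public class ReloadFlagDemo {
    static class Core {
        boolean reloaded;          // analogue of SolrCore#isReloaded()
        boolean publishedActive;   // did the last register() publish ACTIVE?
    }

    // Mirrors the guard quoted in the issue description: the
    // recovery/publish branch only runs for cores that were NOT reloaded.
    static void register(Core core) {
        core.publishedActive = false;
        if (!core.reloaded) {
            // ... tlog replay and checkRecovery() would happen here ...
            core.publishedActive = true; // publish(desc, ZkStateReader.ACTIVE)
        }
    }

    // Buggy reload: the flag is set and never cleared, so every
    // subsequent register() (e.g. after ZK session expiration) skips
    // publishing ACTIVE.
    static void reloadBuggy(Core core) {
        core.reloaded = true;
        register(core);
    }

    // Hypothetical fix shape: clear the flag once the post-reload
    // registration completes.
    static void reloadFixed(Core core) {
        core.reloaded = true;
        register(core);
        core.reloaded = false;
    }

    public static void main(String[] args) {
        Core buggy = new Core();
        reloadBuggy(buggy);
        register(buggy); // simulate re-register after session expiration
        System.out.println("buggy core active after expiration: " + buggy.publishedActive);

        Core fixed = new Core();
        reloadFixed(fixed);
        register(fixed); // same expiration scenario
        System.out.println("fixed core active after expiration: " + fixed.publishedActive);
    }
}
```

Running the toy, the buggy core stays inactive after the simulated expiration while the fixed one publishes ACTIVE again, which is the behavior difference the logs above demonstrate.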
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394840#comment-14394840 ] David Smiley commented on SOLR-7338:
---

Here's a question for you [~markrmil...@gmail.com]: If every core were to be reloaded, would that change anything? What if I went and did that to all my cores? Can we just assume that all cores may have been reloaded at some point in the past? If we do assume that, do we lose anything? -- other than complexity :-)
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394640#comment-14394640 ] Timothy Potter commented on SOLR-7338:
---

Hi [~markrmil...@gmail.com], do you think anything else needs to be done on this one? I'd actually like to get this into the 5.1 release - patch looks good to me. If you're comfortable with the unit test I posted, I can combine them and commit. Thanks.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395200#comment-14395200 ] Mark Miller commented on SOLR-7338:
---

bq. I can combine them and commit.

Go ahead. I think that's the right current fix and it also addresses SOLR-6583.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392770#comment-14392770 ] Mark Miller commented on SOLR-7338:
---
Anyway, we can spin that off into a new issue if someone wants to try and refactor it. I just don't yet get the impetus for it. Onlookers, feel free to chime in here until/unless a new issue is made.

bq. but here's the unit test I started working on

Cool.

A reloaded core will never register itself as active after a ZK session expiration
--
Key: SOLR-7338
URL: https://issues.apache.org/jira/browse/SOLR-7338
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Timothy Potter
Assignee: Mark Miller
Attachments: SOLR-7338.patch, SOLR-7338_test.patch

If a collection gets reloaded, then a core's isReloaded flag is always true. If a core experiences a ZK session expiration after a reload, then it won't ever be able to set itself to active because of the check in {{ZkController#register}}:

{code}
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && ulog != null) {
  // disable recovery in case shard is in construction state (for shard splits)
  Slice slice = getClusterState().getSlice(collection, shardId);
  if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
    Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
    if (recoveryFuture != null) {
      log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
      recoveryFuture.get(); // NOTE: this could potentially block for
                            // minutes or more!
      // TODO: public as recovering in the mean time?
      // TODO: in the future we could do peersync in parallel with recoverFromLog
    } else {
      log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
    }
  }
  boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
      collection, coreZkNodeName, shardId, leaderProps, core, cc);
  if (!didRecovery) {
    publish(desc, ZkStateReader.ACTIVE);
  }
}
{code}

I can easily simulate this on trunk by doing:

{code}
bin/solr -c -z localhost:2181
bin/solr create -c foo
bin/post -c foo example/exampledocs/*.xml
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
kill -STOP PID
sleep PAUSE_SECONDS
kill -CONT PID
{code}

Where PID is the process ID of the Solr node. Here are the logs after the CONT command. As you can see below, the core never gets to setting itself as active again. I think the bug is that the isReloaded flag needs to get set back to false once the reload is successful, but I don't understand what this flag is needed for anyway???

{code}
INFO - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
INFO - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None
INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_00) closing
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 192.168.1.2:8983_solr
INFO - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_00) am no longer a leader.
INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo
INFO - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one...
INFO - 2015-04-01 17:28:51.109;
{code}
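To make the failure mode above concrete, here is a minimal, self-contained toy model of the buggy guard. None of this is real Solr code; the class and field names (RegisterModel, everReloaded, publishedState) are invented for illustration, with everReloaded standing in for the sticky SolrCore.isReloaded() flag:

```java
// Toy model of why a sticky isReloaded flag suppresses the ACTIVE publish
// on every register attempt after a reload. Illustrative only, not Solr code.
public class RegisterModel {
    // Sticky flag, like SolrCore.isReloaded(): set on reload, never cleared.
    boolean everReloaded = false;
    String publishedState = "down";

    // Called on startup AND again after a ZK session expiration.
    void register() {
        if (!everReloaded) {            // buggy guard: reads the sticky flag
            // ...tlog replay and recovery checks would run here...
            publishedState = "active";  // i.e. publish(desc, ZkStateReader.ACTIVE)
        }
        // else: the whole block, including the ACTIVE publish, is skipped
    }

    public static void main(String[] args) {
        RegisterModel core = new RegisterModel();
        core.register();                         // initial startup
        System.out.println(core.publishedState); // active

        core.everReloaded = true;                // collection RELOAD happened
        core.publishedState = "down";            // ZK session expired; must re-register
        core.register();
        System.out.println(core.publishedState); // down - stuck, never active again
    }
}
```

The second register call models the post-expiration re-registration from the logs above: because the guard never becomes true again, the ACTIVE publish is unreachable.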
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392763#comment-14392763 ] Mark Miller commented on SOLR-7338:
---
bq. isReloaded feels more like isReloading

What's the gain, what's the point, what's the alternative? I don't get that at all. It tells you if the core has been reloaded. This is often useful in things that happen on creating a new SolrCore. Who cares about isReloading? I'm lost. Is it just too difficult to understand what isReloaded means? I'd be more confused with this temporary isReloading call - seems so easy for that to be tricky. isReloaded is so permanent and easy to understand. The core has been reloaded or it hasn't. How the heck is trying to track exactly when the core is actually in the process of a reload more useful? Anyone else?
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392603#comment-14392603 ] Yonik Seeley commented on SOLR-7338:
---
Without looking at the code/patches ;-) I understand what Tim is saying, and agree. isReloaded feels more like isReloading (i.e. it's state that is only temporarily used to make other decisions during initialization.) I don't know how hard it would be to factor out though... prob not worth it.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391284#comment-14391284 ] Mark Miller commented on SOLR-7338:
---
bq. I can't see a reason why after a successful reload is complete, that flag should stay == true.

Why not? A reloaded core should never do various things like replay its transaction log in register. I don't see how it makes any sense if it doesn't stay true. The core has either been reloaded before or it has not.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391292#comment-14391292 ] Mark Miller commented on SOLR-7338:
---
bq. looks like some broken refactoring or something.

Or an incomplete attempt at a bug fix. We also do not want to recover on a reload, so this is probably an incomplete attempt at fixing that. It just didn't take into account that we still need that ACTIVE publish.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391301#comment-14391301 ] Mark Miller commented on SOLR-7338:
---
Okay, so, total story looks like: The isReloaded call was originally just used to prevent tlog replay other than on first startup. Recovery was always done, even on a reload. Later, it was noticed we should not be recovering on reload and the recovery check was also brought under isReloaded. This was not okay - the recovery check, unlike the tlog replay check, needs to know if this register attempt is the result of a reload - not if the core has ever been reloaded. We need a test that can catch this.
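That separation can be sketched out in a toy model. This is not the actual SOLR-7338 patch; RegisterSketch, everReloaded, and afterReload are invented names, with everReloaded standing in for the permanent SolrCore.isReloaded() flag and afterReload for a per-call "this register is the result of a reload" signal:

```java
// Sketch of the separation described above: tlog replay keyed off the
// permanent flag, recovery/publish keyed off this specific register call.
// Illustrative only, not Solr code.
public class RegisterSketch {
    boolean everReloaded = false;   // permanent, like SolrCore.isReloaded()
    String publishedState = "down";

    void register(boolean afterReload) {
        if (!everReloaded) {
            // tlog replay: only ever on the very first startup of the core
        }
        // checkRecovery(...) is where afterReload (not everReloaded) should
        // decide whether recovery is skipped for a freshly reloaded core
        boolean didRecovery = false;
        if (!didRecovery) {
            publishedState = "active"; // ACTIVE publish happens on every register
        }
    }

    public static void main(String[] args) {
        RegisterSketch core = new RegisterSketch();
        core.everReloaded = true;       // core was reloaded at some point
        core.publishedState = "down";   // then the ZK session expired
        core.register(false);           // re-register after expiration
        System.out.println(core.publishedState); // active
    }
}
```

Unlike the buggy guard, a re-registration after session expiration still reaches the ACTIVE publish even though the core was reloaded earlier.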
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391258#comment-14391258 ] Timothy Potter commented on SOLR-7338:
---
bq. but I don't understand what this flag is needed for anyway???

Sorry, I wasn't clear ... I get the idea of having a flag to tell that a core is in the process of reloading, so that we can make various decisions about what to do with the tlogs, etc. However, I think that flag should be set back to false after a core is fully reloaded and active again, no? I can't see a reason why, after a successful reload is complete, that flag should stay == true.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391279#comment-14391279 ] Mark Miller commented on SOLR-7338:
---
Looks like some broken refactoring or something. Look at what it used to be:

{code}
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && ulog != null) {
  Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler()
      .getUpdateLog().recoverFromLog();
  if (recoveryFuture != null) {
    recoveryFuture.get(); // NOTE: this could potentially block for
                          // minutes or more!
    // TODO: public as recovering in the mean time?
    // TODO: in the future we could do peersync in parallel with recoverFromLog
  } else {
    log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
  }
}
boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
    collection, coreZkNodeName, shardId, leaderProps, core, cc);
if (!didRecovery) {
  publish(desc, ZkStateReader.ACTIVE);
}
{code}
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391543#comment-14391543 ] Mark Miller commented on SOLR-7338:
---
bq. I'm saying I can't see what possible value knowing that provides long term?

For example, the value it provides in the specific code we are talking about? If the core has been reloaded and it's not afterExpiration, this is a core-reload register call. I don't understand how it doesn't provide value...

bq. I'm not saying it's temporary,

But you said that a couple of times...
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391552#comment-14391552 ] Timothy Potter commented on SOLR-7338:
---
whatever ... I also said several times I'm talking about after register runs, not during ... of course I see value up until register happens, but you win ;-)
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391557#comment-14391557 ] Mark Miller commented on SOLR-7338:
---
Also keep in mind, a reloaded SolrCore has differences from a non-reloaded core as well - for example, a reloaded core did not create its index writer or SolrCoreState like a new core does - it inherits them.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391559#comment-14391559 ] Mark Miller commented on SOLR-7338:
---
bq. I also said several times I'm talking about after register runs, not during

I guess I just don't understand that at all. Being able to tell if a core is a reloaded core has absolutely nothing to do with register. Register just happens to use that call because it allows register to tell if the register is coming from a new core or a reloaded one. I am unable to spot the issue.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391567#comment-14391567 ] Mark Miller commented on SOLR-7338:
---
If you do a call hierarchy on the SolrCore#isReloaded call, you can find a couple of other uses for it as well. It's just the kind of info we have to know. I guess we could use a different method of figuring out what we want in register, but this call works, is available, and is generally necessary.
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391475#comment-14391475 ] Mark Miller commented on SOLR-7338:
---
A related bug is SOLR-6583. We should be skipping tlog replay on 'afterExpiration==true' as well.
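The extra condition suggested in that comment can be sketched as a pure predicate (illustrative only; the {{afterExpiration}} name follows the discussion here, not necessarily the committed patch):

```java
public class ReplayGuardSketch {
    // Sketch of the extended register guard: skip startup tlog replay when
    // the core was reloaded OR when this register call follows a ZK session
    // expiration (the SOLR-6583 case). Parameter names are illustrative.
    public static boolean shouldReplayTlog(boolean isReloaded,
                                           boolean afterExpiration,
                                           boolean hasUpdateLog) {
        return !isReloaded && !afterExpiration && hasUpdateLog;
    }

    public static void main(String[] args) {
        // Fresh startup with an update log present: replay the tlog.
        System.out.println(shouldReplayTlog(false, false, true));
        // Reloaded core, or re-register after session expiration: skip replay.
        System.out.println(shouldReplayTlog(true, false, true));
        System.out.println(shouldReplayTlog(false, true, true));
    }
}
```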
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391503#comment-14391503 ] Timothy Potter commented on SOLR-7338:
---
bq. The flag is has this core been reloaded before - how is that a temporary state?

I'm not saying it's temporary, I'm saying I can't see what possible value knowing that provides long term. I can see how that would be useful for reporting if it were a timestamp of some sort, such as "last reloaded on", but a simple boolean makes no sense to me, nor do I see it being used anywhere in the code other than determining if things like tlog replay should be skipped while the core is initializing. Once it is fully active, the flag seems useless from either a reporting perspective or a state-management perspective. But I think we've wasted enough time on this one ...
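The "timestamp instead of a boolean" alternative floated in that comment could look roughly like this (a purely hypothetical API, not anything in Solr's codebase):

```java
public class ReloadInfoSketch {
    // Hypothetical alternative to the sticky boolean: record *when* the core
    // was last reloaded. Still answers "was this core ever reloaded?" for
    // register-time checks, but stays meaningful for reporting afterwards.
    private volatile long lastReloadMillis = -1L;  // -1 means never reloaded

    public void onReload() { lastReloadMillis = System.currentTimeMillis(); }
    public boolean wasEverReloaded() { return lastReloadMillis >= 0; }
    public long lastReloadMillis() { return lastReloadMillis; }

    public static void main(String[] args) {
        ReloadInfoSketch core = new ReloadInfoSketch();
        System.out.println(core.wasEverReloaded());  // false before any reload
        core.onReload();
        System.out.println(core.wasEverReloaded());  // true afterwards
    }
}
```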
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391470#comment-14391470 ] Mark Miller commented on SOLR-7338: --- The flag means "has this core ever been reloaded" - how is that a temporary state? That whole change shown above was indeed a bug. You can see that checkRecovery is where we deal with skipping the recovery on a reloaded core; the calling code just got mangled.
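To make the "mangled calling code" point concrete, here is a minimal, self-contained sketch (not the actual Solr patch; all names are illustrative stand-ins, not Solr APIs) of the intended control flow in {{ZkController#register}}: the reloaded-core check should gate only the tlog replay, while checkRecovery and the ACTIVE publish run unconditionally. The buggy version nested all three inside the {{!core.isReloaded()}} check, so a reloaded core could never publish itself as active again.

{code}
import java.util.ArrayList;
import java.util.List;

public class RegisterFlowSketch {
    // Returns the sequence of register-time actions taken for a core.
    static List<String> register(boolean isReloaded, boolean hasUpdateLog) {
        List<String> actions = new ArrayList<>();
        // Tlog replay is only for freshly started (non-reloaded) cores.
        if (!isReloaded && hasUpdateLog) {
            actions.add("recoverFromLog");
        }
        // In the buggy code, the two steps below were nested inside the
        // check above, so a reloaded core never reached publish(ACTIVE).
        boolean didRecovery = false; // stand-in for checkRecovery(...)
        if (!didRecovery) {
            actions.add("publish ACTIVE");
        }
        return actions;
    }

    public static void main(String[] args) {
        // A reloaded core skips tlog replay but still publishes ACTIVE.
        System.out.println(register(true, true));
        // A freshly started core replays the tlog first.
        System.out.println(register(false, true));
    }
}
{code}

With this shape, a reloaded core that loses its ZK session can still re-register as active after reconnect, which is exactly the behavior the logs above show going missing.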
[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
[ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391329#comment-14391329 ] Timothy Potter commented on SOLR-7338: -- Thanks for digging that up. I still don't understand why we keep track of whether a core has ever been reloaded as part of its long-term state. Using a flag to decide that the tlog doesn't need replaying while a reloaded core is initializing is one thing, but what's the point of remembering that core A was reloaded at some point in the past, after core A has fully initialized and become active? My point is that it seems like a temporary state, not part of the long-term state of a core. But that's orthogonal to this issue anyway, so I'll fix the register code and add a test for this.
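As a hypothetical sketch of the kind of test described above (illustrative stand-ins only, not Solr's actual APIs): the buggy nesting leaves a reloaded core stuck after session expiry, while the fixed flow publishes ACTIVE either way.

{code}
public class ReloadRegisterTestSketch {
    // Buggy: everything, including publish(ACTIVE), sits inside the
    // !isReloaded check, so a reloaded core never becomes active again.
    static String buggyRegister(boolean isReloaded) {
        if (!isReloaded) {
            return "ACTIVE"; // replay tlog, checkRecovery, publish
        }
        return "DOWN"; // publish never reached after session expiry
    }

    // Fixed: only the tlog replay is gated on the reload flag.
    static String fixedRegister(boolean isReloaded) {
        if (!isReloaded) {
            // tlog replay for freshly started cores would happen here
        }
        return "ACTIVE"; // checkRecovery + publish run unconditionally
    }

    public static void main(String[] args) {
        System.out.println(buggyRegister(true));  // the reported bug
        System.out.println(fixedRegister(true));  // after the fix
    }
}
{code}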