[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484013#comment-14484013
 ] 

Mark Miller commented on SOLR-7338:
---

Same status as yesterday - I'll look into this today.

 A reloaded core will never register itself as active after a ZK session 
 expiration
 --

 Key: SOLR-7338
 URL: https://issues.apache.org/jira/browse/SOLR-7338
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Reporter: Timothy Potter
Assignee: Mark Miller
 Fix For: Trunk, 5.1

 Attachments: SOLR-7338.patch, SOLR-7338_test.patch


 If a collection gets reloaded, then a core's isReloaded flag is always true. 
 If a core experiences a ZK session expiration after a reload, then it won't 
 ever be able to set itself to active because of the check in 
 {{ZkController#register}}:
 {code}
 UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
 if (!core.isReloaded() && ulog != null) {
   // disable recovery in case shard is in construction state (for shard splits)
   Slice slice = getClusterState().getSlice(collection, shardId);
   if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
     Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
     if (recoveryFuture != null) {
       log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
       recoveryFuture.get(); // NOTE: this could potentially block for
       // minutes or more!
       // TODO: public as recovering in the mean time?
       // TODO: in the future we could do peersync in parallel with recoverFromLog
     } else {
       log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
     }
   }
   boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
       collection, coreZkNodeName, shardId, leaderProps, core, cc);
   if (!didRecovery) {
     publish(desc, ZkStateReader.ACTIVE);
   }
 }
 {code}
 I can easily simulate this on trunk by doing:
 {code}
 bin/solr -c -z localhost:2181
 bin/solr create -c foo
 bin/post -c foo example/exampledocs/*.xml
 curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
 kill -STOP PID && sleep PAUSE_SECONDS && kill -CONT PID
 {code}
 Where PID is the process ID of the Solr node. Here are the logs after the 
 CONT command. As you can see below, the core never gets to setting itself as 
 active again. I think the bug is that the isReloaded flag needs to get set 
 back to false once the reload is successful, but I don't understand what this 
 flag is needed for anyway???
 {code}
 INFO  - 2015-04-01 17:28:50.962; 
 org.apache.solr.common.cloud.ConnectionManager; Watcher 
 org.apache.solr.common.cloud.ConnectionManager@5519dba0 
 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
 state:Disconnected type:None path:null path:null type:None
 INFO  - 2015-04-01 17:28:50.963; 
 org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
 INFO  - 2015-04-01 17:28:51.107; 
 org.apache.solr.common.cloud.ConnectionManager; Watcher 
 org.apache.solr.common.cloud.ConnectionManager@5519dba0 
 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
 state:Expired type:None path:null path:null type:None
 INFO  - 2015-04-01 17:28:51.107; 
 org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper 
 session was expired. Attempting to reconnect to recover relationship with 
 ZooKeeper...
 INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer 
 (id=93579450724974592-192.168.1.2:8983_solr-n_00) closing
 INFO  - 2015-04-01 17:28:51.108; 
 org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
 /configs/foo
 INFO  - 2015-04-01 17:28:51.108; 
 org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 
 192.168.1.2:8983_solr
 INFO  - 2015-04-01 17:28:51.109; 
 org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I 
 (id=93579450724974592-192.168.1.2:8983_solr-n_00) am no longer a 
 leader.
 INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; 
 Running listeners for /configs/foo
 INFO  - 2015-04-01 17:28:51.109; 
 org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - 
 starting a new one...
 INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config 
 update listener called for core foo_shard1_replica1
 INFO  - 2015-04-01 17:28:51.110; 
 org.apache.solr.common.cloud.ConnectionManager; Waiting for 

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-07 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484005#comment-14484005
 ] 

Timothy Potter commented on SOLR-7338:
--

[~markrmil...@gmail.com] I don't think the RecoveryZkTest failure is due to 
this ticket, as it failed in a similar fashion prior to that commit:
https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/2888/

I've actually beasted that test with 5.1 100 times without a failure, so I'd 
like to move forward with the RC I've already built and staged unless you 
advise otherwise ;-)


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481921#comment-14481921
 ] 

Timothy Potter commented on SOLR-7338:
--

good catch ... going to try to get the failure to reproduce locally now



[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482127#comment-14482127
 ] 

Mark Miller commented on SOLR-7338:
---

A lot of my ChaosMonkey test runs on my local Jenkins machine started failing 
after that commit. I have not had a chance to dig into the logs yet, though.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482068#comment-14482068
 ] 

Timothy Potter commented on SOLR-7338:
--

hmmm ... no failures with -Dbeast.iters=20 either ... not much jumping out at 
me in the logs. The replica with the missing data definitely received the 
update that the test reports as missing at the end, for instance:

{code}
## Only in http://127.0.0.1:10050/p_/z/collection1: 
[{_version_=1497723893329166337, id=2-472}
{code}

but earlier in the logs, we see:

{code}
[junit4]   2> 949616 T6637 C:control_collection S:shard1 R:core_node1 c:collection1 C1136 P10043 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&CONTROL=TRUE&wt=javabin} {add=[2-472 (1497723893326020609)]} 0 0
   [junit4]   2> 949621 T6686 C:collection1 S:shard1 R:core_node2 c:collection1 C1134 P10059 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 0
   [junit4]   2> 949622 T6670 C:collection1 S:shard1 R:core_node1 c:collection1 C1135 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 3
{code}

I used the same seed as the failed build as well - FBFBECE5D5AABA29

You see anything on your side [~markrmil...@gmail.com]?


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482033#comment-14482033
 ] 

Timothy Potter commented on SOLR-7338:
--

running beast with 20 iters ... haven't been able to reproduce with 10 ... the 
logs from the failed test make it seem like the recovery process succeeded OK:

{code}
[junit4]   2> added docs:1022 with 0 fails deletes:510
   [junit4]   2> 956835 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut Wait for recoveries to finish - wait 120 for each attempt
   [junit4]   2> 956836 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Wait for recoveries to finish - collection: collection1 failOnTimeout:true timeout (sec):120
   [junit4]   2> 959842 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - recoveringAfterStartup=true
   [junit4]   2> 959843 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
   [junit4]   2> 959845 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync WARN no frame of reference to tell if we've missed updates
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery PeerSync Recovery was not successful - trying replication. core=collection1
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Recovery was cancelled
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Finished recovery process. core=collection1
   [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions The last recovery attempt started 7015ms ago.
   [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions Throttling recovery attempts - waiting for 2984ms
   [junit4]   2> 959847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 QTime=1
   [junit4]   2> 962833 T6739 C1144 P10059 oasc.RecoveryStrategy.run Starting recovery process.  core=collection1 recoveringAfterStartup=false
   [junit4]   2> 962834 T6739 C1144 P10059 oasc.RecoveryStrategy.doRecovery Publishing state of core collection1 as recovering, leader is http://127.0.0.1:10050/p_/z/collection1/ and I am http://127.0.0.1:10059/p_/z/collection1/
   [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish publishing core=collection1 state=recovering collection=collection1
   [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
   [junit4]   2> 962836 T6649 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
   [junit4]   2> 962837 T6650 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
   [junit4]   2>   "collection":"collection1",
   [junit4]   2>   "core_node_name":"core_node2",
   [junit4]   2>   "state":"recovering",
   [junit4]   2>   "shard":"shard1",
   [junit4]   2>   "base_url":"http://127.0.0.1:10059/p_/z",
   [junit4]   2>   "roles":null,
   [junit4]   2>   "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>   "core":"collection1",
   [junit4]   2>   "operation":"state",
   [junit4]   2>   "numShards":"1"} current state version: 5
   [junit4]   2> 962837 T6650 oasco.ReplicaMutator.updateState Update state numShards=1 message={
   [junit4]   2>   "collection":"collection1",
   [junit4]   2>   "core_node_name":"core_node2",
   [junit4]   2>   "state":"recovering",
   [junit4]   2>   "shard":"shard1",
   [junit4]   2>   "base_url":"http://127.0.0.1:10059/p_/z",
   [junit4]   2>   "roles":null,
   [junit4]   2>   "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>   "core":"collection1",
   [junit4]   2>   "operation":"state",
   [junit4]   2>   "numShards":"1"}
   [junit4]   2> 962837 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.sendPrepRecoveryCmd Sending prep recovery command to http://127.0.0.1:10050/p_/z; WaitForState: action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A10059_p_%252Fz&coreNodeName=core_node2&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
   [junit4]   2> 962838 T6650 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/collection1/state.json version: 10
   [junit4]

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482399#comment-14482399
 ] 

Mark Miller commented on SOLR-7338:
---

They may have been coincidental failures (still, bad new-ish "replicas out of 
sync" failures) - I'll spend some time tomorrow looking closer and post what I 
find, or close this issue again.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482493#comment-14482493
 ] 

Timothy Potter commented on SOLR-7338:
--

OK cool - I have the RC built for 5.1 with this fix in, but will respin it if 
needed ... I can't get it to reproduce with 50 beast iterations!

{code}
ant beast -Dbeast.iters=50  -Dtestcase=RecoveryZkTest -Dtests.method=test 
-Dtests.seed=FBFBECE5D5AABA29 -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=ar_SD -Dtests.timezone=America/Marigot -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
{code}


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481319#comment-14481319
 ] 

ASF subversion and git services commented on SOLR-7338:
---

Commit 1671554 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1671554 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK 
session expiration, also fixes SOLR-6583


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481345#comment-14481345
 ] 

ASF subversion and git services commented on SOLR-7338:
---

Commit 1671562 from [~thelabdude] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1671562 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK 
session expiration, also fixes SOLR-6583


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481364#comment-14481364
 ] 

ASF subversion and git services commented on SOLR-7338:
---

Commit 1671570 from [~thelabdude] in branch 'dev/branches/lucene_solr_5_1'
[ https://svn.apache.org/r1671570 ]

SOLR-7338: A reloaded core will never register itself as active after a ZK 
session expiration, also fixes SOLR-6583


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-03 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394840#comment-14394840
 ] 

David Smiley commented on SOLR-7338:


Here's a question for you [~markrmil...@gmail.com]: If every core were to be 
reloaded, would that change anything? What if I go and do that to all my 
cores? Can we just assume that all cores may have been reloaded at some point 
in the past? If we do assume that, do we lose anything? -- other than 
complexity :-)


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-03 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394640#comment-14394640
 ] 

Timothy Potter commented on SOLR-7338:
--

Hi [~markrmil...@gmail.com], do you think anything else needs to be done on 
this one? I'd actually like to get this into the 5.1 release - patch looks good 
to me. If you're comfortable with the unit test I posted, I can combine them 
and commit. Thanks.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395200#comment-14395200
 ] 

Mark Miller commented on SOLR-7338:
---

bq. I can combine them and commit.

Go ahead. I think that's the right current fix and it also addresses SOLR-6583.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392770#comment-14392770
 ] 

Mark Miller commented on SOLR-7338:
---

Anyway, we can spin that off into a new issue if someone wants to try and 
refactor it. I just don't yet get the impetus for it. Onlookers feel free to 
chime in here until/unless a new issue is made.

bq. but here's the unit test I started working on

Cool.



[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392763#comment-14392763
 ] 

Mark Miller commented on SOLR-7338:
---

bq. isReloaded feels more like isReloading

What's the gain, what's the point, what's the alternative?

I don't get that at all. It tells you if the core has been reloaded. This is 
often useful in things that happen on creating a new SolrCore.

Who cares about isReloading? I'm lost.

Is it just too difficult to understand what isReloaded means?

I'd be more confused by a temporary isReloading call - it seems so easy for that to 
be tricky. isReloaded is permanent and easy to understand: the core has been 
reloaded or it hasn't. How the heck is trying to track exactly when the core is 
actually in the process of a reload any more useful?

Anyone else?


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392603#comment-14392603
 ] 

Yonik Seeley commented on SOLR-7338:


Without looking at the code/patches ;-) I understand what Tim is saying, and 
agree.  isReloaded feels more like isReloading (i.e. it's state that is 
only temporarily used to make other decisions during initialization.)  I don't 
know how hard it would be to factor out though... prob not worth it.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391284#comment-14391284
 ] 

Mark Miller commented on SOLR-7338:
---

bq.  I can't see a reason why after a successful reload is complete, that flag 
should stay == true.

Why not? A reloaded core should never do various things like replay its 
transaction log in register. I don't see how it makes any sense if it doesn't 
stay true. The core has either been reloaded before or it has not.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391292#comment-14391292
 ] 

Mark Miller commented on SOLR-7338:
---

bq. looks like some broken refactoring or something.

Or an incomplete attempt at a bug fix. We also do not want to recover on a 
reload, so this is probably an incomplete attempt at fixing that. It just 
didn't take into account that we still need that ACTIVE publish.




[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391301#comment-14391301
 ] 

Mark Miller commented on SOLR-7338:
---

Okay, so, total story looks like:

The isReloaded call was originally just used to prevent tlog replay other than 
on first startup.

Recovery was always done, even on a reload.

Later, it was noticed we should not be recovering on reload and the recovery 
check was also brought under isReloaded. 

This was not okay - the recovery check, unlike the tlog replay check, needs to 
know if this register attempt is the result of a reload - not if the core has 
ever been reloaded.

We need a test that can catch this.
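
To make that distinction concrete, here is a minimal, self-contained sketch with 
stand-in types and boolean parameters (it is not Solr's actual ZkController or 
SolrCore code) of the control flow described in this comment: tlog replay stays 
behind the "has this core ever been reloaded" guard, while the recovery check and 
the ACTIVE publish run on every register call, mirroring the pre-refactor code 
quoted elsewhere in this thread.

{code}
// Sketch only: stand-in types, not Solr's API. Tlog replay is guarded by
// "has this core ever been reloaded", but the recovery check and the ACTIVE
// publish happen on every register call.
public class RegisterFlowSketch {

  enum State { ACTIVE, RECOVERING }

  static State register(boolean coreEverReloaded, boolean hasUpdateLog, boolean needsRecovery) {
    if (!coreEverReloaded && hasUpdateLog) {
      // replay the transaction log only on the very first startup of this core
    }
    // NOT nested inside the guard above: a reloaded core must still be able to
    // publish itself as ACTIVE after, e.g., a ZK session expiration.
    boolean didRecovery = needsRecovery; // stand-in for checkRecovery(...)
    return didRecovery ? State.RECOVERING : State.ACTIVE;
  }

  public static void main(String[] args) {
    // A previously reloaded core with nothing to recover still ends up ACTIVE:
    System.out.println(register(true, true, false));  // ACTIVE
    System.out.println(register(false, true, true));  // RECOVERING
  }
}
{code}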


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391258#comment-14391258
 ] 

Timothy Potter commented on SOLR-7338:
--

bq. but I don't understand what this flag is needed for anyway???

Sorry, I wasn't clear ... I get the idea of having a flag to tell that a core is in 
the process of reloading, so that we can make various decisions about what to do 
with the tlogs, etc. However, I think that flag should be set back to false after a 
core is fully reloaded and active again, no? I can't see a reason why, after a 
successful reload is complete, that flag should stay == true.


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391279#comment-14391279
 ] 

Mark Miller commented on SOLR-7338:
---

looks like some broken refactoring or something.

Look at what it used to be:

{code}
  UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
  if (!core.isReloaded() && ulog != null) {
    Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler()
        .getUpdateLog().recoverFromLog();
    if (recoveryFuture != null) {
      recoveryFuture.get(); // NOTE: this could potentially block for
      // minutes or more!
      // TODO: public as recovering in the mean time?
      // TODO: in the future we could do peersync in parallel with recoverFromLog
    } else {
      log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
    }
  }
  boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
      collection, coreZkNodeName, shardId, leaderProps, core, cc);
  if (!didRecovery) {
    publish(desc, ZkStateReader.ACTIVE);
  }
{code}


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391543#comment-14391543
 ] 

Mark Miller commented on SOLR-7338:
---

bq.  I'm saying I can't see what possible value knowing that provides long term?

For example, the value it provides in the specific code we are talking about? 
If the core has been reloaded and it's not afterExpiration, this is a core reload 
register call. I don't understand how it doesn't provide value...

bq. I'm not saying it's temporary,

But you said that a couple times...
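
The distinction being debated can be written down as a one-liner; the sketch below 
is only an illustration with plain boolean parameters (it is not Solr's ZkController 
code, and the method name is made up for this example):

{code}
// Illustrative sketch only (plain booleans, not Solr's fields):
// "has this core ever been reloaded" vs. "is this register call caused by a reload".
public class ReloadRegisterSketch {

  static boolean registerTriggeredByReload(boolean coreEverReloaded, boolean afterExpiration) {
    // Reloaded, and not re-registering because a ZK session expired
    // => this register call is the result of the reload itself.
    return coreEverReloaded && !afterExpiration;
  }

  public static void main(String[] args) {
    System.out.println(registerTriggeredByReload(true, false)); // true: reload-driven register
    System.out.println(registerTriggeredByReload(true, true));  // false: session expiry after a reload
  }
}
{code}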


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391552#comment-14391552
 ] 

Timothy Potter commented on SOLR-7338:
--

whatever ... I also said several times I'm talking about after register runs, 
not during ... of course I see value up until register happens, but you win ;-)


[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391557#comment-14391557
 ] 

Mark Miller commented on SOLR-7338:
---

Also keep in mind, a reloaded SolrCore has differences from a non-reloaded core 
as well - for example, a reloaded core did not create its index writer or 
SolrCoreState like a new core does - it inherits them. 

 A reloaded core will never register itself as active after a ZK session 
 expiration
 --

 Key: SOLR-7338
 URL: https://issues.apache.org/jira/browse/SOLR-7338
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Reporter: Timothy Potter
Assignee: Timothy Potter
 Attachments: SOLR-7338.patch


 If a collection gets reloaded, then a core's isReloaded flag is always true. 
 If a core experiences a ZK session expiration after a reload, then it won't 
 ever be able to set itself to active because of the check in 
 {{ZkController#register}}:
 {code}
 UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
 if (!core.isReloaded() && ulog != null) {
   // disable recovery in case shard is in construction state (for shard splits)
   Slice slice = getClusterState().getSlice(collection, shardId);
   if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
     Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
     if (recoveryFuture != null) {
       log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
       recoveryFuture.get(); // NOTE: this could potentially block for
       // minutes or more!
       // TODO: public as recovering in the mean time?
       // TODO: in the future we could do peersync in parallel with recoverFromLog
     } else {
       log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
     }
   }
   boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
       collection, coreZkNodeName, shardId, leaderProps, core, cc);
   if (!didRecovery) {
     publish(desc, ZkStateReader.ACTIVE);
   }
 }
 {code}
 I can easily simulate this on trunk by doing:
 {code}
 bin/solr -c -z localhost:2181
 bin/solr create -c foo
 bin/post -c foo example/exampledocs/*.xml
 curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
 kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
 {code}
 Where PID is the process ID of the Solr node. Here are the logs after the 
 CONT command. As you can see below, the core never gets to setting itself as 
 active again. I think the bug is that the isReloaded flag needs to get set 
 back to false once the reload is successful, but I don't understand what this 
 flag is needed for anyway???
 {code}
 INFO  - 2015-04-01 17:28:50.962; 
 org.apache.solr.common.cloud.ConnectionManager; Watcher 
 org.apache.solr.common.cloud.ConnectionManager@5519dba0 
 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
 state:Disconnected type:None path:null path:null type:None
 INFO  - 2015-04-01 17:28:50.963; 
 org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
 INFO  - 2015-04-01 17:28:51.107; 
 org.apache.solr.common.cloud.ConnectionManager; Watcher 
 org.apache.solr.common.cloud.ConnectionManager@5519dba0 
 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
 state:Expired type:None path:null path:null type:None
 INFO  - 2015-04-01 17:28:51.107; 
 org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper 
 session was expired. Attempting to reconnect to recover relationship with 
 ZooKeeper...
 INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer 
 (id=93579450724974592-192.168.1.2:8983_solr-n_00) closing
 INFO  - 2015-04-01 17:28:51.108; 
 org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
 /configs/foo
 INFO  - 2015-04-01 17:28:51.108; 
 org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 
 192.168.1.2:8983_solr
 INFO  - 2015-04-01 17:28:51.109; 
 org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I 
 (id=93579450724974592-192.168.1.2:8983_solr-n_00) am no longer a 
 leader.
 INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; 
 Running listeners for /configs/foo
 INFO  - 2015-04-01 17:28:51.109; 
 org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - 
 starting a new one...
 INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config 
 update listener called for core foo_shard1_replica1
 {code}

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391559#comment-14391559
 ] 

Mark Miller commented on SOLR-7338:
---

bq. I also said several times I'm talking about after register runs, not during

I guess I just don't understand that at all. Being able to tell if a core is a 
reloaded core has absolutely nothing to do with register. Register just happens 
to use that call because it lets register tell whether the registration is coming 
from a new core or a reloaded one.

I am unable to spot the issue.

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391567#comment-14391567
 ] 

Mark Miller commented on SOLR-7338:
---

If you do a call hierarchy on the {{SolrCore#isReloaded}} call, you can find a 
couple of other uses for it as well. It's just the kind of info we have to know. I 
guess we could use a different way of figuring out what we want in register, 
but this call works, is available, and is generally necessary.

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391475#comment-14391475
 ] 

Mark Miller commented on SOLR-7338:
---

A related bug is SOLR-6583.

We should be skipping tlog replay on 'afterExpiration==true' as well.
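
A minimal sketch of what that adjusted guard could look like, mirroring the snippet in the issue description ({{afterExpiration}} being the flag register() receives when it runs again after a session expiration). This is just the shape of the idea, not the committed patch:

{code}
// Sketch only - skip tlog replay both for reloaded cores and when register()
// runs again after a ZooKeeper session expiration, since in neither case is the
// core starting cold.
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && !afterExpiration && ulog != null) {
  // ... recoverFromLog() / tlog replay as in the snippet in the description ...
}
{code}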

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391503#comment-14391503
 ] 

Timothy Potter commented on SOLR-7338:
--

bq. The flag is "has this core been reloaded before" - how is that a temporary 
state?

I'm not saying it's temporary; I'm saying I can't see what value knowing that 
provides long term. It could be useful for reporting if it were a timestamp of 
some sort, such as "last reloaded on", but a simple boolean makes no sense to me, 
nor do I see it being used anywhere in the code other than to decide whether 
things like tlog replay should be skipped while the core is initializing. Once 
the core is fully active, the flag seems useless from either a reporting 
perspective or a state management perspective. But I think we've wasted enough 
time on this one ...

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391470#comment-14391470
 ] 

Mark Miller commented on SOLR-7338:
---

The flag is "has this core been reloaded before" - how is that a temporary 
state?

That whole change I show above was indeed a bug. You can see that in 
checkRecovery - that is where we deal with skipping recovery on a reloaded 
core. The calling code just got mangled.
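
In other words, one way to untangle that calling code (a sketch of the direction, not necessarily the exact committed fix): keep tlog replay guarded, but let checkRecovery/publish run for every core so a reloaded core can still go active again after a session expiration.

{code}
// Sketch only - tlog replay stays behind the isReloaded() guard, while
// checkRecovery() (which itself honors recoverReloadedCores) and the
// ACTIVE publish run unconditionally.
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && ulog != null) {
  // ... replay the tlog via recoverFromLog() as in the snippet in the description ...
}

boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
    collection, coreZkNodeName, shardId, leaderProps, core, cc);
if (!didRecovery) {
  publish(desc, ZkStateReader.ACTIVE);
}
{code}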

[jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration

2015-04-01 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391329#comment-14391329
 ] 

Timothy Potter commented on SOLR-7338:
--

Thanks for digging that up. I still don't understand why we have to keep track 
of whether a core has ever been reloaded as part of the long-term state of a 
core. Using a flag to decide that we don't need to replay the tlog while a 
reloaded core is initializing is one thing, but what's the point of remembering 
that core A was reloaded at some point in the past once core A has fully 
initialized and become active? My point is that it seems like a temporary state, 
not part of the long-term state of a core. But that's orthogonal to this issue 
anyway, so I'll fix the register code and add a test for it.
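
A rough outline of what such a test could look like (the helper names here - createCollection, indexSomeDocs, reloadCollection, expireZkSession, waitForReplicaState - are placeholders for whatever the test framework provides, not real utilities):

{code}
// Hypothetical test outline only - helper methods are placeholders.
public void testReloadedCoreGoesActiveAfterZkSessionExpiration() throws Exception {
  createCollection("foo", 1, 1);    // placeholder: create a 1-shard, 1-replica collection
  indexSomeDocs("foo");             // placeholder: add a few documents
  reloadCollection("foo");          // placeholder: Collections API RELOAD
  expireZkSession("foo");           // placeholder: force a ZK session expiration on the hosting node

  // Without the register() fix this wait should time out, because the
  // reloaded core never publishes itself as ACTIVE again.
  waitForReplicaState("foo", "shard1", Replica.State.ACTIVE, 60 /* seconds */);
}
{code}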
