[ 
https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482033#comment-14482033
 ] 

Timothy Potter commented on SOLR-7338:
--------------------------------------

running beast with 20 iters ... haven't been able to reproduce with 10 ... the 
logs from the failed test make it seem like the recovery process succeeded OK:

{code}
[junit4]   2> added docs:1022 with 0 fails deletes:510
   [junit4]   2> 956835 T6623 C:collection1 S:shard1 c:collection1 
oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut Wait for recoveries 
to finish - wait 120 for each attempt
   [junit4]   2> 956836 T6623 C:collection1 S:shard1 c:collection1 
oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Wait for recoveries to 
finish - collection: collection1 failOnTimeout:true timeout (sec):120
   [junit4]   2> 959842 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from 
http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - 
recoveringAfterStartup=true
   [junit4]   2> 959843 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z 
START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
   [junit4]   2> 959845 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.sync WARN no frame of reference to tell if we've missed updates
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery PeerSync Recovery was not successful - trying 
replication. core=collection1
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery Recovery was cancelled
   [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery Finished recovery process. core=collection1
   [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions The 
last recovery attempt started 7015ms ago.
   [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions 
Throttling recovery attempts - waiting for 2984ms
   [junit4]   2> 959847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get 
params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 
QTime=1 
   [junit4]   2> 962833 T6739 C1144 P10059 oasc.RecoveryStrategy.run Starting 
recovery process.  core=collection1 recoveringAfterStartup=false
   [junit4]   2> 962834 T6739 C1144 P10059 oasc.RecoveryStrategy.doRecovery 
Publishing state of core collection1 as recovering, leader is 
http://127.0.0.1:10050/p_/z/collection1/ and I am 
http://127.0.0.1:10059/p_/z/collection1/
   [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 
oasc.ZkController.publish publishing core=collection1 state=recovering 
collection=collection1
   [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 
oasc.ZkController.publish numShards not found on descriptor - reading it from 
system property
   [junit4]   2> 962836 T6649 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
   [junit4]   2> 962837 T6650 oasc.Overseer$ClusterStateUpdater.run 
processMessage: queueSize: 1, message = {
   [junit4]   2>          "collection":"collection1",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "state":"recovering",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "base_url":"http://127.0.0.1:10059/p_/z";,
   [junit4]   2>          "roles":null,
   [junit4]   2>          "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "operation":"state",
   [junit4]   2>          "numShards":"1"} current state version: 5
   [junit4]   2> 962837 T6650 oasco.ReplicaMutator.updateState Update state 
numShards=1 message={
   [junit4]   2>          "collection":"collection1",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "state":"recovering",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "base_url":"http://127.0.0.1:10059/p_/z";,
   [junit4]   2>          "roles":null,
   [junit4]   2>          "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "operation":"state",
   [junit4]   2>          "numShards":"1"}
   [junit4]   2> 962837 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.sendPrepRecoveryCmd Sending prep recovery command to 
http://127.0.0.1:10050/p_/z; WaitForState: 
action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A10059_p_%252Fz&coreNodeName=core_node2&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
   [junit4]   2> 962838 T6650 oasco.ZkStateWriter.writePendingUpdates going to 
update_collection /collections/collection1/state.json version: 10
   [junit4]   2> 962839 T6671 oasha.CoreAdminHandler.handleWaitForStateAction 
Going to wait for coreNodeName: core_node2, state: recovering, checkLive: true, 
onlyIfLeader: true, onlyIfLeaderActive: true
   [junit4]   2> 962839 T6678 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/collection1/state.json for collection collection1 has 
occurred - updating... (live nodes size: 3)
   [junit4]   2> 962839 T6728 C:collection1 S:shard1 c:collection1 
oascc.ZkStateReader$7.process A cluster state change: WatchedEvent 
state:SyncConnected type:NodeDataChanged 
path:/collections/collection1/state.json for collection collection1 has 
occurred - updating... (live nodes size: 3)
   [junit4]   2> 962840 T6728 C:collection1 S:shard1 c:collection1 
oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to 
ver 11 
   [junit4]   2> 962841 T6671 oascc.ZkStateReader.updateWatchedCollection 
Updating data for collection1 to ver 11 
   [junit4]   2> 962841 T6671 oasha.CoreAdminHandler.handleWaitForStateAction 
Will wait a max of 183 seconds to see collection1 (shard1 of collection1) have 
state: recovering
   [junit4]   2> 962842 T6678 oascc.ZkStateReader.updateWatchedCollection 
Updating data for collection1 to ver 11 
   [junit4]   2> 962842 T6671 oasha.CoreAdminHandler.handleWaitForStateAction 
In WaitForState(recovering): collection=collection1, shard=shard1, 
thisCore=collection1, leaderDoesNotNeedRecovery=false, isLeader? true, 
live=true, checkLive=true, currentState=recovering, localState=active, 
nodeName=127.0.0.1:10059_p_%2Fz, coreNodeName=core_node2, 
onlyIfActiveCheckResult=false, nodeProps: 
core_node2:{"state":"recovering","base_url":"http://127.0.0.1:10059/p_/z","node_name":"127.0.0.1:10059_p_%2Fz","core":"collection1"}
   [junit4]   2> 962842 T6671 oasha.CoreAdminHandler.handleWaitForStateAction 
Waited coreNodeName: core_node2, state: recovering, checkLive: true, 
onlyIfLeader: true for: 0 seconds.
   [junit4]   2> 962843 T6671 oass.SolrDispatchFilter.handleAdminRequest 
[admin] webapp=null path=/admin/cores 
params={version=2&onlyIfLeader=true&checkLive=true&coreNodeName=core_node2&state=recovering&nodeName=127.0.0.1:10059_p_%252Fz&onlyIfLeaderActive=true&core=collection1&action=PREPRECOVERY&wt=javabin}
 status=0 QTime=5 
   [junit4]   2> 969843 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from 
http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - 
recoveringAfterStartup=false
   [junit4]   2> 969844 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z 
START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
   [junit4]   2> 969847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get 
params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 
QTime=0 
   [junit4]   2> 969848 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.handleVersions PeerSync: core=collection1 
url=http://127.0.0.1:10059/p_/z  Received 100 versions from 
http://127.0.0.1:10050/p_/z/collection1/
   [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.handleVersions PeerSync: core=collection1 
url=http://127.0.0.1:10059/p_/z  Our versions are newer. 
ourLowThreshold=1497723899508424704 otherHigh=1497723899760082944
   [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z 
DONE. sync succeeded
   [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.DirectUpdateHandler2.commit start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2> 969865 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: commits: num=2
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@e80270f 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7f845138),segFN=segments_1,generation=1}
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@e80270f 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7f845138),segFN=segments_2,generation=2}
   [junit4]   2> 969866 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
   [junit4]   2> 969869 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oass.SolrIndexSearcher.<init> Opening Searcher@46806cf3[collection1] main
   [junit4]   2> 969869 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasu.DirectUpdateHandler2.commit end_commit_flush
   [junit4]   2> 969870 T6730 C:collection1 S:shard1 c:collection1 
oasc.SolrCore.registerSearcher [collection1] Registered new searcher 
Searcher@46806cf3[collection1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C115/43:delGen=1)
 Uninverting(_1(5.2.0):C360/137:delGen=1)))}
   [junit4]   2> 969870 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.RecoveryStrategy.doRecovery PeerSync Recovery was successful - registering 
as Active. core=collection1
   [junit4]   2> 969871 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.ZkController.publish publishing core=collection1 state=active 
collection=collection1
   [junit4]   2> 969871 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 
oasc.ZkController.publish numShards not found on descriptor - reading it from 
system property
   [junit4]   2> 969872 T6649 oasc.DistributedQueue$LatchWatcher.process 
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
   [junit4]   2> 969874 T6650 oasc.Overseer$ClusterStateUpdater.run 
processMessage: queueSize: 1, message = {
   [junit4]   2>          "collection":"collection1",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "state":"active",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "base_url":"http://127.0.0.1:10059/p_/z";,
   [junit4]   2>          "roles":null,
   [junit4]   2>          "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "operation":"state",
   [junit4]   2>          "numShards":"1"} current state version: 5
   [junit4]   2> 969875 T6650 oasco.ReplicaMutator.updateState Update state 
numShards=1 message={
   [junit4]   2>          "collection":"collection1",
   [junit4]   2>          "core_node_name":"core_node2",
   [junit4]   2>          "state":"active",
   [junit4]   2>          "shard":"shard1",
   [junit4]   2>          "base_url":"http://127.0.0.1:10059/p_/z";,
   [junit4]   2>          "roles":null,
   [junit4]   2>          "node_name":"127.0.0.1:10059_p_%2Fz",
   [junit4]   2>          "core":"collection1",
   [junit4]   2>          "operation":"state",
   [junit4]   2>          "numShards":"1"}
   [junit4]   2> 969876 T6650 oasco.ZkStateWriter.writePendingUpdates going to 
update_collection /collections/collection1/state.json version: 11
   [junit4]   2> 969876 T6728 C:collection1 S:shard1 c:collection1 
oascc.ZkStateReader$7.process A cluster state change: WatchedEvent 
state:SyncConnected type:NodeDataChanged 
path:/collections/collection1/state.json for collection collection1 has 
occurred - updating... (live nodes size: 3)
   [junit4]   2> 969876 T6678 oascc.ZkStateReader$7.process A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/collection1/state.json for collection collection1 has 
occurred - updating... (live nodes size: 3)
   [junit4]   2> 969877 T6728 C:collection1 S:shard1 c:collection1 
oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to 
ver 12 
   [junit4]   2> 969877 T6678 oascc.ZkStateReader.updateWatchedCollection 
Updating data for collection1 to ver 12 
   [junit4]   2> 970870 T6623 C:collection1 S:shard1 c:collection1 
oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished - 
collection: collection1
   [junit4]   2> 970873 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oasu.DirectUpdateHandler2.commit start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2> 970920 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oasc.SolrDeletionPolicy.onCommit 
SolrDeletionPolicy.onCommit: commits: num=2
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@25082424 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@397cca7a),segFN=segments_1,generation=1}
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@25082424 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@397cca7a),segFN=segments_2,generation=2}
   [junit4]   2> 970921 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oasc.SolrDeletionPolicy.updateCommits newest commit 
generation = 2
   [junit4]   2> 970924 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oass.SolrIndexSearcher.<init> Opening 
Searcher@1508b423[collection1] main
   [junit4]   2> 970925 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oasu.DirectUpdateHandler2.commit end_commit_flush
   [junit4]   2> 970926 T6654 C:control_collection c:collection1 
oasc.SolrCore.registerSearcher [collection1] Registered new searcher 
Searcher@1508b423[collection1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C1303/622:delGen=1)
 Uninverting(_1(5.2.0):C740/398:delGen=1)))}
   [junit4]   2> 970926 T6638 C:control_collection S:shard1 R:core_node1 
c:collection1 C1146 P10043 oasup.LogUpdateProcessor.finish [collection1] 
webapp=/p_/z path=/update 
params={version=2&waitSearcher=true&softCommit=false&wt=javabin&commit=true} 
{commit=} 0 53
   [junit4]   2> 970931 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasu.DirectUpdateHandler2.commit start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2>  C1144_STATE=coll:collection1 core:collection1 
props:{state=active, base_url=http://127.0.0.1:10059/p_/z, 
node_name=127.0.0.1:10059_p_%2Fz, core=collection1}
   [junit4]   2> 970932 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasu.DirectUpdateHandler2.commit start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2> 970933 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasu.DirectUpdateHandler2.commit No uncommitted changes. Skipping 
IW.commit.
   [junit4]   2> 970933 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasc.SolrCore.openNewSearcher SolrIndexSearcher has not changed - 
not re-opening: org.apache.solr.search.SolrIndexSearcher
   [junit4]   2> 970934 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasu.DirectUpdateHandler2.commit end_commit_flush
   [junit4]   2> 970934 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z 
path=/update 
params={version=2&commit_end_point=true&waitSearcher=true&update.distrib=FROMLEADER&openSearcher=true&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&expungeDeletes=false&softCommit=false&wt=javabin&commit=true}
 {commit=} 0 2
   [junit4]   2> 970978 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: 
commits: num=2
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@5b80c596 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632c29c8),segFN=segments_1,generation=1}
   [junit4]   2>                
commit{dir=MockDirectoryWrapper(RAMDirectory@5b80c596 
lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632c29c8),segFN=segments_2,generation=2}
   [junit4]   2> 970979 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
   [junit4]   2> 970982 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oass.SolrIndexSearcher.<init> Opening 
Searcher@4273e191[collection1] main
   [junit4]   2> 970983 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasu.DirectUpdateHandler2.commit end_commit_flush
   [junit4]   2> 970983 T6680 C:collection1 c:collection1 
oasc.SolrCore.registerSearcher [collection1] Registered new searcher 
Searcher@4273e191[collection1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C861/415:delGen=1)
 Uninverting(_1(5.2.0):C1182/605:delGen=1)))}
   [junit4]   2> 970984 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z 
path=/update 
params={version=2&commit_end_point=true&waitSearcher=true&update.distrib=FROMLEADER&openSearcher=true&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&expungeDeletes=false&softCommit=false&wt=javabin&commit=true}
 {commit=} 0 53
   [junit4]   2> 970986 T6670 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z 
path=/update 
params={version=2&waitSearcher=true&_stateVer_=collection1:6&softCommit=false&wt=javabin&commit=true}
 {commit=} 0 57
   [junit4]   2> 970991 T6666 C:collection1 S:shard1 R:core_node1 c:collection1 
C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/select 
params={version=2&tests=checkShardConsistency&distrib=false&rows=0&q=*:*&wt=javabin}
 hits=1023 status=0 QTime=1 
   [junit4]   2> 970993 T6721 C:collection1 S:shard1 R:core_node2 c:collection1 
C1144 P10059 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/select 
params={version=2&tests=checkShardConsistency&distrib=false&rows=0&q=*:*&wt=javabin}
 hits=295 status=0 QTime=0 
   [junit4]   2> 970994 T6623 C:collection1 S:shard1 c:collection1 
oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut shard inconsistency 
- waiting ...
{code}

> A reloaded core will never register itself as active after a ZK session 
> expiration
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-7338
>                 URL: https://issues.apache.org/jira/browse/SOLR-7338
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Assignee: Mark Miller
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7338.patch, SOLR-7338_test.patch
>
>
> If a collection gets reloaded, then a core's isReloaded flag is always true. 
> If a core experiences a ZK session expiration after a reload, then it won't 
> ever be able to set itself to active because of the check in 
> {{ZkController#register}}:
> {code}
>         UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
>         if (!core.isReloaded() && ulog != null) {
>           // disable recovery in case shard is in construction state (for 
> shard splits)
>           Slice slice = getClusterState().getSlice(collection, shardId);
>           if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
>             Future<UpdateLog.RecoveryInfo> recoveryFuture = 
> core.getUpdateHandler().getUpdateLog().recoverFromLog();
>             if (recoveryFuture != null) {
>               log.info("Replaying tlog for " + ourUrl + " during startup... 
> NOTE: This can take a while.");
>               recoveryFuture.get(); // NOTE: this could potentially block for
>               // minutes or more!
>               // TODO: public as recovering in the mean time?
>               // TODO: in the future we could do peersync in parallel with 
> recoverFromLog
>             } else {
>               log.info("No LogReplay needed for core=" + core.getName() + " 
> baseURL=" + baseUrl);
>             }
>           }
>           boolean didRecovery = checkRecovery(coreName, desc, 
> recoverReloadedCores, isLeader, cloudDesc,
>               collection, coreZkNodeName, shardId, leaderProps, core, cc);
>           if (!didRecovery) {
>             publish(desc, ZkStateReader.ACTIVE);
>           }
>         }
> {code}
> I can easily simulate this on trunk by doing:
> {code}
> bin/solr -c -z localhost:2181
> bin/solr create -c foo
> bin/post -c foo example/exampledocs/*.xml
> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo";
> kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
> {code}
> Where <PID> is the process ID of the Solr node. Here are the logs after the 
> CONT command. As you can see below, the core never gets to setting itself as 
> active again. I think the bug is that the isReloaded flag needs to get set 
> back to false once the reload is successful, but I don't understand what this 
> flag is needed for anyway???
> {code}
> INFO  - 2015-04-01 17:28:50.962; 
> org.apache.solr.common.cloud.ConnectionManager; Watcher 
> org.apache.solr.common.cloud.ConnectionManager@5519dba0 
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:50.963; 
> org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> INFO  - 2015-04-01 17:28:51.107; 
> org.apache.solr.common.cloud.ConnectionManager; Watcher 
> org.apache.solr.common.cloud.ConnectionManager@5519dba0 
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
> state:Expired type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.107; 
> org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper 
> session was expired. Attempting to reconnect to recover relationship with 
> ZooKeeper...
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer 
> (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing
> INFO  - 2015-04-01 17:28:51.108; 
> org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
> /configs/foo
> INFO  - 2015-04-01 17:28:51.108; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 
> 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.109; 
> org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I 
> (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a 
> leader.
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; 
> Running listeners for /configs/foo
> INFO  - 2015-04-01 17:28:51.109; 
> org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - 
> starting a new one...
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config 
> update listener called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.110; 
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
> to ZooKeeper
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; 
> OverseerAutoReplicaFailoverThread had an error in its thread work 
> loop.:org.apache.solr.common.SolrException: Error reading cluster properties
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
>       at 
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
>       at 
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
>       at java.lang.Thread.sleep(Native Method)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
>       ... 3 more
> WARN  - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; 
> listener throws error
> org.apache.solr.common.SolrException: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /configs/foo/params.json
>       at 
> org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
>       at 
> org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
>       at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
>       at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: 
> KeeperErrorCode = Session expired for /configs/foo/params.json
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
>       at 
> org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
>       ... 3 more
> ERROR - 2015-04-01 17:28:51.110; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /overseer_elect/leader
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
> INFO  - 2015-04-01 17:28:51.114; 
> org.apache.solr.common.cloud.ConnectionManager; Watcher 
> org.apache.solr.common.cloud.ConnectionManager@5519dba0 
> name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.115; 
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to 
> ZooKeeper
> INFO  - 2015-04-01 17:28:51.115; 
> org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper 
> reestablished.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; 
> ZooKeeper session re-connected ... refreshing core states after session 
> expiration.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; 
> publishing core=foo_shard1_replica1 state=down collection=foo
> INFO  - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; 
> canceling election 
> /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> WARN  - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; 
> cancelElection did not find election node to remove 
> /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> INFO  - 2015-04-01 17:28:51.121; 
> org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader 
> 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; 
> makePath: /overseer_elect/leader
> INFO  - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer 
> (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting
> INFO  - 2015-04-01 17:28:51.128; 
> org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting 
> OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 
> autoReplicaFailoverWaitAfterExpiration=30000 
> autoReplicaFailoverBadNodeExpiration=60000
> INFO  - 2015-04-01 17:28:51.128; 
> org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of 
> collection creations
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; 
> Updating cluster state from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.129; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the 
> main queue
> INFO  - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; 
> addZkWatch foo
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; 
> Updating collection state at /collections/foo/state.json from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register 
> node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; 
> makePath: /live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register 
> replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr 
> collection:foo shard:shard1
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; 
> canceling election 
> /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> WARN  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; 
> cancelElection did not find election node to remove 
> /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> INFO  - 2015-04-01 17:28:51.134; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr";,
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"} current state version: 1
> INFO  - 2015-04-01 17:28:51.135; 
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1 
> message={
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr";,
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"}
> INFO  - 2015-04-01 17:28:51.136; 
> org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process 
> for shard shard1
> INFO  - 2015-04-01 17:28:51.136; 
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection 
> /collections/foo/state.json version: 4
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The 
> last leader attempt started 198060ms ago.
> INFO  - 2015-04-01 17:28:51.137; 
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: 
> WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/foo/state.json for collection foo has occurred - 
> updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 5 
> INFO  - 2015-04-01 17:28:51.138; 
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged 
> fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.138; 
> org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to 
> continue.
> INFO  - 2015-04-01 17:28:51.138; 
> org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - 
> try and sync
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync 
> replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync 
> Success - now sync replicas to me
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; 
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas
> INFO  - 2015-04-01 17:28:51.139; 
> org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: 
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; 
> makePath: /collections/foo/leaders/shard1
> INFO  - 2015-04-01 17:28:51.139; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo"} current state version: 1
> INFO  - 2015-04-01 17:28:51.140; 
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection 
> /collections/foo/state.json version: 5
> INFO  - 2015-04-01 17:28:51.141; 
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: 
> WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/foo/state.json for collection foo has occurred - 
> updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.142; 
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged 
> fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 6 
> INFO  - 2015-04-01 17:28:51.144; 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo",
>   "base_url":"http://192.168.1.2:8983/solr";,
>   "core":"foo_shard1_replica1",
>   "state":"active"} current state version: 1
> INFO  - 2015-04-01 17:28:51.145; 
> org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection 
> /collections/foo/state.json version: 6
> INFO  - 2015-04-01 17:28:51.145; 
> org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: 
> WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/collections/foo/state.json for collection foo has occurred - 
> updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.147; 
> org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged 
> fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are 
> http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is 
> http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; 
>  In register, core.isReloaded? true 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; 
> Updating data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch 
> zkdir /configs/foo
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; 
> Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; 
> Running listeners for /configs/foo
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config 
> update listener called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; 
> Current schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; 
> Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current 
> version of requestparams : 0
> INFO  - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; 
> Current schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.201; 
> org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to 
> ZooKeeper
> INFO  - 2015-04-01 17:28:51.202; 
> org.apache.solr.common.cloud.ConnectionManager; Connected:true
> INFO  - 2015-04-01 17:28:51.211; 
> org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
> /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; 
> Running listeners for /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config 
> update listener called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current 
> version of requestparams : 0
> INFO  - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; 
> [admin] webapp=null path=/admin/cores 
> params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 
> INFO  - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; 
> [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} 
> status=0 QTime=11 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to