[jira] [Commented] (SOLR-18273) Solr 10.0.0 treats short ZooKeeper leader failover as session-expiration recovery and removes/re-registers live_nodes

Yifei Geng (Jira) Wed, 03 Jun 2026 00:45:09 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085725#comment-18085725
 ]


Yifei Geng commented on SOLR-18273:
-----------------------------------

I looked into the Solr 10.0.0 source code and the observed behavior appears to 
come from how ZooKeeper connection state events are mapped into ZkController 
recovery logic.

The relevant code paths are:

* org.apache.solr.common.cloud.OnDisconnect
* org.apache.solr.common.cloud.OnReconnect
* org.apache.solr.cloud.ZkController

In OnDisconnect, both Curator SUSPENDED and LOST events call 
ZkController.onDisconnect(boolean sessionExpired):

{noformat}
if (newState == ConnectionState.LOST || newState == ConnectionState.SUSPENDED) {
  onDisconnect(newState == ConnectionState.LOST);
}
{noformat}

So SUSPENDED is passed as sessionExpired=false, while LOST is passed as 
sessionExpired=true.

However, OnReconnect does not carry this distinction forward. Any Curator 
RECONNECTED event calls ZkController.onReconnect():

{noformat}
if (ConnectionState.RECONNECTED.equals(newState)) {
  onReconnect();
}
{noformat}
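
To make the loss of information concrete, here is a minimal model of the two listeners quoted above. It is a sketch only: ZkConnState is a hypothetical stand-in for Curator's ConnectionState, ListenerModel is an invented name, and the model records which callback would be invoked instead of calling ZkController.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Curator's ConnectionState (only the states discussed here).
enum ZkConnState { SUSPENDED, LOST, RECONNECTED }

// Simplified model of the OnDisconnect / OnReconnect excerpts. It records the
// callback each event would trigger rather than invoking ZkController.
class ListenerModel {
  final List<String> calls = new ArrayList<>();

  void stateChanged(ZkConnState newState) {
    // Mirrors the OnDisconnect excerpt: the flag is true only for LOST.
    if (newState == ZkConnState.LOST || newState == ZkConnState.SUSPENDED) {
      calls.add("onDisconnect(sessionExpired=" + (newState == ZkConnState.LOST) + ")");
    }
    // Mirrors the OnReconnect excerpt: no argument, so the flag is not carried over.
    if (newState == ZkConnState.RECONNECTED) {
      calls.add("onReconnect()");
    }
  }

  // Feeds a sequence of events through a fresh model and returns the recorded calls.
  static List<String> replay(ZkConnState... events) {
    ListenerModel model = new ListenerModel();
    for (ZkConnState event : events) {
      model.stateChanged(event);
    }
    return model.calls;
  }

  public static void main(String[] args) {
    System.out.println(replay(ZkConnState.SUSPENDED, ZkConnState.RECONNECTED));
    System.out.println(replay(ZkConnState.LOST, ZkConnState.RECONNECTED));
  }
}
```

Both sequences end in the same argument-less onReconnect() call, so whatever runs there cannot tell a SUSPENDED reconnect from a LOST one unless that state is tracked elsewhere.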

Then ZkController.onReconnect() unconditionally executes logic that is 
explicitly written as session-expiration recovery:

{noformat}
log.info("ZooKeeper session re-connected ... refreshing core states after session expiration.");

removeEphemeralLiveNode();
zkStateReader.createClusterStateWatchersAndUpdate();
...
registerAllCoresAsDown(false);
createEphemeralLiveNode();
...
executorService.submit(new RegisterCoreAsync(descriptor, true, true));
{noformat}

The third argument to RegisterCoreAsync is afterExpiration=true, so all local 
cores are re-registered as if the ZooKeeper session had expired.

This explains the observed log sequence:

{noformat}
State change: SUSPENDED
State change: RECONNECTED
ZooKeeper session re-connected ... refreshing core states after session expiration.
Remove node as live in ZooKeeper:/live_nodes/...
Register node as live in ZooKeeper:/live_nodes/...
Registering core ... afterExpiration? true
{noformat}

The important point is that this sequence can happen after a short ZooKeeper 
leader failover where the client session has not actually expired. In our logs, 
ZooKeeper leader election completed in about 2.4 seconds, while the ZooKeeper 
server-side session expiration messages appeared only about four minutes later 
(14:47:44 and 14:47:46). So Solr appears to enter the session-expiration 
recovery path based only on RECONNECTED, not on a confirmed session expiration.
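
As a rough sanity check, the disconnected window can be computed from the SUSPENDED and RECONNECTED timestamps in the Solr logs from the issue description and compared with the configured 30000 ms client timeout. This is only a sketch: OutageWindow is an invented helper, and the comparison is approximate because the ZooKeeper server, not the client, decides when a session expires.

```java
import java.time.Duration;
import java.time.LocalTime;

class OutageWindow {
  // Milliseconds between the SUSPENDED and RECONNECTED log lines (same day assumed).
  static long outageMillis(String suspendedAt, String reconnectedAt) {
    return Duration.between(LocalTime.parse(suspendedAt), LocalTime.parse(reconnectedAt))
        .toMillis();
  }

  public static void main(String[] args) {
    long sessionTimeoutMs = 30_000; // -DzkClientTimeout=30000 from the report

    // Timestamps taken from the Solr node 2 log excerpt in the issue description.
    long outage = outageMillis("14:43:25.581", "14:43:28.518");

    System.out.println("disconnected for " + outage + " ms");
    System.out.println("shorter than session timeout: " + (outage < sessionTimeoutMs));
  }
}
```

For the node 2 excerpt this gives a window of 2937 ms, roughly a tenth of the configured timeout.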

There is also an important subtlety: simply returning early from 
ZkController.onReconnect() when the previous disconnect was only SUSPENDED may 
not be sufficient, because ZkController.onDisconnect(false) already performs 
non-trivial local cleanup:

{noformat}
overseer.close();

for (CoreDescriptor descriptor : descriptors) {
  closeExistingElectionContext(descriptor, sessionExpired);
}

for (CoreDescriptor descriptor : descriptors) {
  descriptor.getCloudDescriptor().setLeader(false);
  descriptor.getCloudDescriptor().setHasRegistered(false);
}
{noformat}

For sessionExpired=false, closeExistingElectionContext() closes the local 
election context but does not remove it from the map, because the ephemeral 
election znodes should still exist:

{noformat}
prevContext.close();

if (sessionExpired) {
  electionContexts.remove(contextKey);
}
{noformat}
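
The effect of that branch can be shown with a small model. This is a sketch under stated assumptions: FakeElectionContext and ElectionContextModel are invented names, and the map is keyed by a plain string rather than the real context key type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an election context; it only remembers whether close() ran.
class FakeElectionContext {
  boolean closed;

  void close() {
    closed = true;
  }
}

class ElectionContextModel {
  final Map<String, FakeElectionContext> electionContexts = new HashMap<>();

  // Same shape as the excerpt above: always close, remove only on session expiration.
  void closeExistingElectionContext(String contextKey, boolean sessionExpired) {
    FakeElectionContext prevContext = electionContexts.get(contextKey);
    if (prevContext == null) {
      return;
    }
    prevContext.close();
    if (sessionExpired) {
      electionContexts.remove(contextKey);
    }
  }

  public static void main(String[] args) {
    ElectionContextModel model = new ElectionContextModel();
    model.electionContexts.put("core1", new FakeElectionContext());

    // SUSPENDED path: sessionExpired=false
    model.closeExistingElectionContext("core1", false);

    FakeElectionContext remaining = model.electionContexts.get("core1");
    System.out.println("still in map: " + (remaining != null));
    System.out.println("closed: " + remaining.closed);
  }
}
```

After the SUSPENDED path the map still holds the context, but it is already closed. A reconnect handler that simply returns early would therefore leave closed election contexts in place, along with the cleared leader and hasRegistered flags.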

So the fix likely needs to treat connection state as an explicit state machine, 
not as two independent callbacks.

A safer approach would be:

* Track the last known ZooKeeper session id.
* Track whether the last disconnect was LOST/session-expired or only SUSPENDED.
* On RECONNECTED:
  * if the previous state was LOST, or if the ZooKeeper session id changed, run the existing full session-expiration recovery path.
  * otherwise, run only a lightweight reconnect path:
    * refresh/recreate watchers if needed
    * refresh cluster state
    * restore local state/election handling without removing /live_nodes
    * do not call registerAllCoresAsDown(false)
    * do not re-register every core with afterExpiration=true

This would preserve the existing full recovery behavior for real session 
expiration, while avoiding unnecessary live_nodes removal and mass core 
re-registration during short ZooKeeper leader failovers.
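
The decision logic proposed above could be sketched roughly as follows. None of these names exist in Solr: ConnEvent, RecoveryAction and ReconnectTracker are hypothetical. The sketch assumes the caller can supply the current session id on each event (for example from the ZooKeeper client's getSessionId()), and it deliberately leaves out what the lightweight path would actually do.

```java
// Hypothetical event and decision types for illustration only.
enum ConnEvent { SUSPENDED, LOST, RECONNECTED }

enum RecoveryAction { NONE, LIGHTWEIGHT_RECONNECT, FULL_SESSION_RECOVERY }

class ReconnectTracker {
  private long lastSessionId;
  private boolean lostSinceLastConnect;

  ReconnectTracker(long initialSessionId) {
    this.lastSessionId = initialSessionId;
  }

  // Returns which recovery path a reconnect handler should take for this event.
  RecoveryAction onEvent(ConnEvent event, long currentSessionId) {
    switch (event) {
      case LOST:
        // Remember the expiration until the next RECONNECTED, even if SUSPENDED follows.
        lostSinceLastConnect = true;
        return RecoveryAction.NONE;
      case SUSPENDED:
        return RecoveryAction.NONE;
      case RECONNECTED: {
        boolean expired = lostSinceLastConnect || currentSessionId != lastSessionId;
        lastSessionId = currentSessionId;
        lostSinceLastConnect = false;
        return expired
            ? RecoveryAction.FULL_SESSION_RECOVERY
            : RecoveryAction.LIGHTWEIGHT_RECONNECT;
      }
      default:
        return RecoveryAction.NONE;
    }
  }

  public static void main(String[] args) {
    long sessionId = 0x301ca123e240004L; // session id from the reported logs
    ReconnectTracker tracker = new ReconnectTracker(sessionId);

    // Short leader failover: same session id, no LOST in between.
    tracker.onEvent(ConnEvent.SUSPENDED, sessionId);
    System.out.println(tracker.onEvent(ConnEvent.RECONNECTED, sessionId));

    // Real expiration: LOST was seen and the session id changed (new id is made up).
    tracker.onEvent(ConnEvent.SUSPENDED, sessionId);
    tracker.onEvent(ConnEvent.LOST, sessionId);
    System.out.println(tracker.onEvent(ConnEvent.RECONNECTED, sessionId + 1));
  }
}
```

Checking the session id in addition to the LOST flag is a defensive choice: it covers the case where the LOST event is missed or arrives out of order but the reconnect still ends up on a new session.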



> Solr 10.0.0 treats short ZooKeeper leader failover as session-expiration 
> recovery and removes/re-registers live_nodes
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18273
>                 URL: https://issues.apache.org/jira/browse/SOLR-18273
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: 10.0
>         Environment: Solr 10.0.0
>            Reporter: Yifei Geng
>            Priority: Critical
>
> ```markdown
> Title: Solr 10.0.0 treats short ZooKeeper leader failover as 
> session-expiration recovery and removes/re-registers live_nodes
> ### Description
> We are observing a SolrCloud stability issue after upgrading/testing with 
> Solr 10.0.0.
> When the ZooKeeper leader node is manually stopped, ZooKeeper elects a new 
> leader successfully within about 2.4 seconds. However, multiple Solr 10 nodes 
> enter a disruptive recovery path after ZooKeeper reconnect: they log 
> `ZooKeeper session re-connected ... refreshing core states after session 
> expiration`, remove their own `/live_nodes` entry, publish themselves as 
> `DOWN`, and then re-register hundreds of cores with `afterExpiration? true`.
> This causes SolrCloud-level instability: `live_nodes` briefly drops from `3 
> -> 2 -> 1 -> 0`, Overseer leadership changes, shard leader state becomes 
> inconsistent, and update requests start failing with 503 errors.
> The same ZooKeeper leader failover scenario did not cause this behavior in 
> our Solr 8 cluster.
> ### Environment
> - Solr version: `10.0.0`
> - SolrCloud cluster: 3 Solr nodes
> - ZooKeeper version: tested with `3.9.4`
> - ZooKeeper ensemble: 3 nodes
> - ZooKeeper settings:
>   - `tickTime=2000`
>   - `minSessionTimeout=4000`
>   - `maxSessionTimeout=40000`
>   - `initLimit=10`
>   - `syncLimit=5`
> - Solr JVM / ZK options:
>   - `-Dsolr.zookeeper.client.timeout=30000`
>   - `-DzkClientTimeout=30000`
>   - `-DzkHost=zk-solr-test1:2181,zk-solr-test2:2181,zk-solr-test3:2181/solr10-test`
> ### Steps to reproduce
> 1. Start a 3-node Solr 10.0.0 SolrCloud cluster.
> 2. Start a 3-node ZooKeeper ensemble.
> 3. Run write/update traffic against collections in SolrCloud.
> 4. Manually stop the current ZooKeeper leader process.
> 5. Observe Solr logs and `/live_nodes`.
> ### Expected behavior
> ZooKeeper leader failover should cause a short `SUSPENDED` period and then 
> Solr should reconnect without treating the event as a full session-expiration 
> recovery, as long as the ZooKeeper session has not actually expired.
> Solr nodes should not remove their own live node entries and re-register all 
> cores after a short ZooKeeper leader election.
> ### Actual behavior
> ZooKeeper leader election completed quickly:
> ```text
> 2026-05-29 14:43:25,444 Disconnected from leader zk-solr-test2
> 2026-05-29 14:43:27,969 LEADING - LEADER ELECTION TOOK - 2417 MS
> ```
> But Solr nodes entered a disruptive recovery path.
> Example from Solr node 2:
> ```text
> 2026-05-29 14:43:25.467 WARN  Session 0x301ca123e240004 ... EndOfStreamException
> 2026-05-29 14:43:25.581 INFO  State change: SUSPENDED
> 2026-05-29 14:43:26.180 WARN  zk-solr-test2:2181 ... Connection refused
> 2026-05-29 14:43:28.518 INFO  State change: RECONNECTED
> 2026-05-29 14:43:28.519 INFO  ZooKeeper session re-connected ... refreshing core states after session expiration.
> 2026-05-29 14:43:28.521 INFO  Remove node as live in ZooKeeper:/live_nodes/solr10-test-2:8983_solr
> 2026-05-29 14:43:29.495 INFO  Publish node=solr10-test-2:8983_solr as DOWN
> 2026-05-29 14:43:30.042 INFO  Register node as live in ZooKeeper: /live_nodes/solr10-test-2:8983_solr
> 2026-05-29 14:43:30.066 INFO  Registering core antibody_shard1_replica_n2 afterExpiration? true
> ```
> During this period, live nodes changed as follows:
> ```text
> Updated live nodes from ZooKeeper... (3) -> (2)
> Updated live nodes from ZooKeeper... (2) -> (1)
> Updated live nodes from ZooKeeper... (1) -> (0)
> Updated live nodes from ZooKeeper... (0) -> (1)
> ```
> Update requests failed during and after this state transition:
> ```text
> Cannot talk to ZooKeeper - Updates are disabled.
> ```
> Later, Solr reported leader state inconsistency:
> ```text
> ClusterState says we are the leader, but locally we don't think so.
> ```
> A second Solr node showed the same pattern:
> ```text
> 2026-05-29 14:43:25.616 INFO  State change: SUSPENDED
> 2026-05-29 14:43:29.583 INFO  State change: RECONNECTED
> 2026-05-29 14:43:29.584 INFO  ZooKeeper session re-connected ... refreshing core states after session expiration.
> 2026-05-29 14:43:29.589 INFO  Remove node as live in ZooKeeper:/live_nodes/solr10-test-3:8983_solr
> 2026-05-29 14:43:31.040 INFO  Publish node=solr10-test-3:8983_solr as DOWN
> ```
> ### Important detail
> ZooKeeper server-side session expiration happened much later than the 
> Solr-side "after session expiration" log.
> ZooKeeper logs:
> ```text
> 2026-05-29 14:47:44,327 Expiring session 0x301ca123e240005, timeout of 30000ms exceeded
> 2026-05-29 14:47:46,327 Expiring session 0x301ca123e240004, timeout of 30000ms exceeded
> ```
> So at `14:43:28/29`, Solr appears to be executing session-expiration recovery 
> even though ZooKeeper had not yet logged actual server-side expiration for 
> those sessions.
> ### Impact
> A short ZooKeeper leader failover is amplified into a SolrCloud-wide 
> instability event:
> - Multiple Solr nodes publish themselves as `DOWN`
> - `/live_nodes` temporarily drops to zero or near-zero
> - Overseer leadership changes
> - Hundreds of cores are re-registered
> - Updates fail with 503
> - Shard leader state becomes inconsistent
> ### Question / suspected issue
> Is Solr 10.0.0 expected to run full session-expiration recovery on 
> `RECONNECTED` after a short ZooKeeper leader failover?
> This looks like Solr may be treating a transient ZooKeeper reconnect as 
> session expiration, or otherwise overreacting during Curator/ZooKeeper 
> reconnect handling, causing unnecessary live node removal and core 
> re-registration.
> Could this be a Solr 10.0.0 regression in ZooKeeper 
> reconnect/session-expiration handling compared with Solr 8?
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

