Yifei Geng created SOLR-18273:
---------------------------------
Summary: Solr 10.0.0 treats short ZooKeeper leader failover as
session-expiration recovery and removes/re-registers live_nodes
Key: SOLR-18273
URL: https://issues.apache.org/jira/browse/SOLR-18273
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: AutoScaling
Affects Versions: 10.0
Environment: Solr 10.0.0
Reporter: Yifei Geng
```markdown
### Description
We are observing a SolrCloud stability issue after upgrading to Solr 10.0.0
and testing with it.
When the ZooKeeper leader node is manually stopped, ZooKeeper elects a new
leader successfully within about 2.4 seconds. However, multiple Solr 10 nodes
enter a disruptive recovery path after ZooKeeper reconnect: they log `ZooKeeper
session re-connected ... refreshing core states after session expiration`,
remove their own `/live_nodes` entry, publish themselves as `DOWN`, and then
re-register hundreds of cores with `afterExpiration? true`.
This causes SolrCloud-level instability: the number of live nodes briefly drops
`3 -> 2 -> 1 -> 0`, Overseer leadership changes, shard leader state becomes
inconsistent, and update requests start failing with 503 errors.
The same ZooKeeper leader failover scenario did not cause this behavior in our
Solr 8 cluster.
### Environment
- Solr version: `10.0.0`
- SolrCloud cluster: 3 Solr nodes
- ZooKeeper version: tested with `3.9.4`
- ZooKeeper ensemble: 3 nodes
- ZooKeeper settings:
  - `tickTime=2000`
  - `minSessionTimeout=4000`
  - `maxSessionTimeout=40000`
  - `initLimit=10`
  - `syncLimit=5`
- Solr JVM / ZK options:
  - `-Dsolr.zookeeper.client.timeout=30000`
  - `-DzkClientTimeout=30000`
  - `-DzkHost=zk-solr-test1:2181,zk-solr-test2:2181,zk-solr-test3:2181/solr10-test`
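For reference, a ZooKeeper server bounds the session timeout a client requests to its configured `minSessionTimeout`/`maxSessionTimeout` range. The following is a minimal sketch of that clamping using the values listed above; the function name `negotiated_timeout_ms` is hypothetical and is not a ZooKeeper API. With these settings, the 30000 ms request should be granted unchanged.

```python
def negotiated_timeout_ms(requested_ms: int, min_ms: int, max_ms: int) -> int:
    """Clamp a client's requested session timeout to the server's bounds."""
    return max(min_ms, min(requested_ms, max_ms))


# Values taken from the environment listed above.
granted = negotiated_timeout_ms(requested_ms=30000, min_ms=4000, max_ms=40000)
print(granted)
```

So a disconnect would need to last on the order of 30 seconds, not a few seconds, before the session could be expected to expire.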
### Steps to reproduce
1. Start a 3-node ZooKeeper ensemble.
2. Start a 3-node Solr 10.0.0 SolrCloud cluster connected to that ensemble.
3. Run write/update traffic against collections in SolrCloud.
4. Manually stop the current ZooKeeper leader process.
5. Observe Solr logs and `/live_nodes`.
### Expected behavior
ZooKeeper leader failover should cause a short `SUSPENDED` period and then Solr
should reconnect without treating the event as a full session-expiration
recovery, as long as the ZooKeeper session has not actually expired.
Solr nodes should not remove their own live node entries and re-register all
cores after a short ZooKeeper leader election.
### Actual behavior
ZooKeeper leader election completed quickly:
```text
2026-05-29 14:43:25,444 Disconnected from leader zk-solr-test2
2026-05-29 14:43:27,969 LEADING - LEADER ELECTION TOOK - 2417 MS
```
But Solr nodes entered a disruptive recovery path.
Example from Solr node 2:
```text
2026-05-29 14:43:25.467 WARN Session 0x301ca123e240004 ... EndOfStreamException
2026-05-29 14:43:25.581 INFO State change: SUSPENDED
2026-05-29 14:43:26.180 WARN zk-solr-test2:2181 ... Connection refused
2026-05-29 14:43:28.518 INFO State change: RECONNECTED
2026-05-29 14:43:28.519 INFO ZooKeeper session re-connected ... refreshing core states after session expiration.
2026-05-29 14:43:28.521 INFO Remove node as live in ZooKeeper:/live_nodes/solr10-test-2:8983_solr
2026-05-29 14:43:29.495 INFO Publish node=solr10-test-2:8983_solr as DOWN
2026-05-29 14:43:30.042 INFO Register node as live in ZooKeeper:/live_nodes/solr10-test-2:8983_solr
2026-05-29 14:43:30.066 INFO Registering core antibody_shard1_replica_n2 afterExpiration? true
```
During this period, live nodes changed as follows:
```text
Updated live nodes from ZooKeeper... (3) -> (2)
Updated live nodes from ZooKeeper... (2) -> (1)
Updated live nodes from ZooKeeper... (1) -> (0)
Updated live nodes from ZooKeeper... (0) -> (1)
```
Update requests failed during and after this state transition:
```text
Cannot talk to ZooKeeper - Updates are disabled.
```
Later, Solr reported leader state inconsistency:
```text
ClusterState says we are the leader, but locally we don't think so.
```
A second Solr node showed the same pattern:
```text
2026-05-29 14:43:25.616 INFO State change: SUSPENDED
2026-05-29 14:43:29.583 INFO State change: RECONNECTED
2026-05-29 14:43:29.584 INFO ZooKeeper session re-connected ... refreshing core states after session expiration.
2026-05-29 14:43:29.589 INFO Remove node as live in ZooKeeper:/live_nodes/solr10-test-3:8983_solr
2026-05-29 14:43:31.040 INFO Publish node=solr10-test-3:8983_solr as DOWN
```
### Important detail
ZooKeeper server-side session expiration happened much later than the Solr-side
"after session expiration" log.
ZooKeeper logs:
```text
2026-05-29 14:47:44,327 Expiring session 0x301ca123e240005, timeout of 30000ms exceeded
2026-05-29 14:47:46,327 Expiring session 0x301ca123e240004, timeout of 30000ms exceeded
```
So at `14:43:28` (node 2) and `14:43:29` (node 3), Solr appears to be executing
session-expiration recovery even though ZooKeeper did not log server-side
expiration for those sessions until about four minutes later (`14:47:44` and
`14:47:46`).
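The timing can be checked directly from the log timestamps. The sketch below computes how long each Solr node spent between `SUSPENDED` and `RECONNECTED` and compares that with the 30000 ms session timeout; the helper `ms_between` and the `disconnects` table are illustrative names, and the timestamps are copied from the Solr logs above.

```python
from datetime import datetime

SESSION_TIMEOUT_MS = 30000  # -DzkClientTimeout from the environment section


def ms_between(start: str, end: str) -> int:
    """Milliseconds between two HH:MM:SS.mmm timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds() * 1000)


# (SUSPENDED, RECONNECTED) timestamps copied from the Solr logs.
disconnects = {
    "solr10-test-2": ("14:43:25.581", "14:43:28.518"),
    "solr10-test-3": ("14:43:25.616", "14:43:29.583"),
}

gaps = {node: ms_between(*span) for node, span in disconnects.items()}
for node, gap in gaps.items():
    print(node, gap, "ms disconnected; under timeout:", gap < SESSION_TIMEOUT_MS)
```

Both nodes were disconnected for roughly 3 to 4 seconds, well under the 30-second session timeout.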
### Impact
A short ZooKeeper leader failover is amplified into a SolrCloud-wide
instability event:
- Multiple Solr nodes publish themselves as `DOWN`
- `/live_nodes` temporarily drops to zero or near-zero
- Overseer leadership changes
- Hundreds of cores are re-registered
- Updates fail with 503
- Shard leader state becomes inconsistent
### Question / suspected issue
Is Solr 10.0.0 expected to run full session-expiration recovery on
`RECONNECTED` after a short ZooKeeper leader failover?
This looks like Solr may be treating a transient ZooKeeper reconnect as session
expiration, or otherwise overreacting during Curator/ZooKeeper reconnect
handling, causing unnecessary live node removal and core re-registration.
Could this be a Solr 10.0.0 regression in ZooKeeper
reconnect/session-expiration handling compared with Solr 8?
```
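The expected behavior described in this report can be modeled as a small state machine. The sketch below is a hypothetical Python model, not Solr or Curator code. The state names mirror the Curator `ConnectionState` values, and the model assumes that `LOST` is the signal that the session is actually gone. `ReconnectTracker` and `on_state` are illustrative names only.

```python
from enum import Enum, auto


class ConnState(Enum):
    CONNECTED = auto()
    SUSPENDED = auto()
    RECONNECTED = auto()
    LOST = auto()


class ReconnectTracker:
    """Tracks whether a reconnect followed a lost session or a brief suspend."""

    def __init__(self) -> None:
        self._session_lost = False

    def on_state(self, state: ConnState) -> bool:
        """Return True only when full expiration recovery would be warranted."""
        if state is ConnState.LOST:
            # Assumption: LOST means the session can no longer be resumed.
            self._session_lost = True
            return False
        if state is ConnState.RECONNECTED:
            needs_recovery = self._session_lost
            self._session_lost = False
            return needs_recovery
        return False


tracker = ReconnectTracker()
short_failover = [
    tracker.on_state(s) for s in (ConnState.SUSPENDED, ConnState.RECONNECTED)
]
real_expiry = [
    tracker.on_state(s)
    for s in (ConnState.SUSPENDED, ConnState.LOST, ConnState.RECONNECTED)
]
```

Under this model, the `SUSPENDED` -> `RECONNECTED` sequence seen in the Solr logs would not trigger live node removal or core re-registration; only a reconnect that follows `LOST` would.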