[jira] [Created] (SOLR-18273) Solr 10.0.0 treats short ZooKeeper leader failover as session-expiration recovery and removes/re-registers live_nodes

Yifei Geng (Jira) Sun, 31 May 2026 19:06:46 -0700

Yifei Geng created SOLR-18273:
---------------------------------

             Summary: Solr 10.0.0 treats short ZooKeeper leader failover as session-expiration recovery and removes/re-registers live_nodes
                 Key: SOLR-18273
                 URL: https://issues.apache.org/jira/browse/SOLR-18273
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: AutoScaling
    Affects Versions: 10.0
         Environment: Solr 10.0.0
            Reporter: Yifei Geng



```markdown
Title: Solr 10.0.0 treats short ZooKeeper leader failover as session-expiration recovery and removes/re-registers live_nodes

### Description

We are observing a SolrCloud stability issue after upgrading/testing with Solr 
10.0.0.

When the ZooKeeper leader node is manually stopped, ZooKeeper elects a new
leader successfully within about 2.4 seconds. However, multiple Solr 10 nodes
enter a disruptive recovery path after the ZooKeeper reconnect. They log
`ZooKeeper session re-connected ... refreshing core states after session expiration`,
remove their own `/live_nodes` entry, publish themselves as `DOWN`, and then
re-register hundreds of cores with `afterExpiration? true`.

This causes SolrCloud-level instability: the `live_nodes` count briefly drops
`3 -> 2 -> 1 -> 0`, Overseer leadership changes, shard leader state becomes
inconsistent, and update requests start failing with HTTP 503 errors.

The same ZooKeeper leader failover scenario did not cause this behavior in our 
Solr 8 cluster.

### Environment

- Solr version: `10.0.0`
- SolrCloud cluster: 3 Solr nodes
- ZooKeeper version: tested with `3.9.4`
- ZooKeeper ensemble: 3 nodes
- ZooKeeper settings:
  - `tickTime=2000`
  - `minSessionTimeout=4000`
  - `maxSessionTimeout=40000`
  - `initLimit=10`
  - `syncLimit=5`
- Solr JVM / ZK options:
  - `-Dsolr.zookeeper.client.timeout=30000`
  - `-DzkClientTimeout=30000`
  - `-DzkHost=zk-solr-test1:2181,zk-solr-test2:2181,zk-solr-test3:2181/solr10-test`
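
For reference, a ZooKeeper server clamps the session timeout a client requests into its `minSessionTimeout`/`maxSessionTimeout` range, so with the settings above a 30000 ms request should be granted unchanged. A minimal sketch of that arithmetic follows; the function name is ours for illustration and is not a ZooKeeper or Solr API.

```python
def negotiated_session_timeout(requested_ms: int, min_ms: int, max_ms: int) -> int:
    """Clamp the client's requested timeout into the server's allowed range."""
    return max(min_ms, min(requested_ms, max_ms))


# Values taken from the environment listed in this report.
TICK_TIME_MS = 2000
SYNC_LIMIT_TICKS = 5

print(negotiated_session_timeout(30000, 4000, 40000))  # 30000
print(SYNC_LIMIT_TICKS * TICK_TIME_MS)  # syncLimit window in ms: 10000
```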

### Steps to reproduce

1. Start a 3-node ZooKeeper ensemble.
2. Start a 3-node Solr 10.0.0 SolrCloud cluster connected to that ensemble.
3. Run write/update traffic against collections in SolrCloud.
4. Manually stop the current ZooKeeper leader process.
5. Observe Solr logs and `/live_nodes`.
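
For step 4, one way to find the current ZooKeeper leader is the `srvr` four-letter command, whose response includes a `Mode:` line. The helper below is a sketch under that assumption (it requires `srvr` to be allowed by the server's four-letter-word whitelist); the function names are ours, and the port is taken from the `zkHost` string above.

```python
import socket
from typing import Optional


def parse_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line (e.g. leader/follower), if present."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def zk_mode(host: str, port: int = 2181, timeout: float = 3.0) -> Optional[str]:
    """Send 'srvr' to one ZooKeeper server and report its mode."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        # The server closes the connection after replying, which ends the loop.
        while data := sock.recv(4096):
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))
```

Calling `zk_mode("zk-solr-test2")` against each ensemble member in turn would identify which process to stop.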

### Expected behavior

ZooKeeper leader failover should cause a short `SUSPENDED` period and then Solr 
should reconnect without treating the event as a full session-expiration 
recovery, as long as the ZooKeeper session has not actually expired.

Solr nodes should not remove their own live node entries and re-register all 
cores after a short ZooKeeper leader election.
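
To make the expectation concrete, here is a small sketch of the handling we have in mind. It is not Solr's or Curator's code: the class and action names are hypothetical, and the state names simply mirror the `SUSPENDED`/`RECONNECTED` values seen in the Solr logs, plus a `LOST` state that we assume stands for real session loss.

```python
from enum import Enum


class ConnState(Enum):
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"  # assumption: signals that the session is really gone


class ReconnectPolicy:
    """Hypothetical handler: full recovery only if the session was lost."""

    def __init__(self) -> None:
        self.session_lost = False
        self.actions = []  # recorded instead of touching a real cluster

    def on_state_change(self, state: ConnState) -> None:
        if state is ConnState.SUSPENDED:
            self.actions.append("pause-updates")
        elif state is ConnState.LOST:
            self.session_lost = True
        elif state is ConnState.RECONNECTED:
            if self.session_lost:
                # Ephemeral nodes are gone: re-register live node and cores.
                self.actions.append("full-recovery")
                self.session_lost = False
            else:
                # Same session survived: keep /live_nodes entry as-is.
                self.actions.append("resume")


policy = ReconnectPolicy()
for s in (ConnState.SUSPENDED, ConnState.RECONNECTED):
    policy.on_state_change(s)
print(policy.actions)  # ['pause-updates', 'resume']
```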

### Actual behavior

ZooKeeper leader election completed quickly:

```text
2026-05-29 14:43:25,444 Disconnected from leader zk-solr-test2
2026-05-29 14:43:27,969 LEADING - LEADER ELECTION TOOK - 2417 MS
```

But Solr nodes entered a disruptive recovery path.

Example from Solr node 2:

```text
2026-05-29 14:43:25.467 WARN  Session 0x301ca123e240004 ... EndOfStreamException
2026-05-29 14:43:25.581 INFO  State change: SUSPENDED
2026-05-29 14:43:26.180 WARN  zk-solr-test2:2181 ... Connection refused
2026-05-29 14:43:28.518 INFO  State change: RECONNECTED
2026-05-29 14:43:28.519 INFO  ZooKeeper session re-connected ... refreshing core states after session expiration.
2026-05-29 14:43:28.521 INFO  Remove node as live in ZooKeeper:/live_nodes/solr10-test-2:8983_solr
2026-05-29 14:43:29.495 INFO  Publish node=solr10-test-2:8983_solr as DOWN
2026-05-29 14:43:30.042 INFO  Register node as live in ZooKeeper: /live_nodes/solr10-test-2:8983_solr
2026-05-29 14:43:30.066 INFO  Registering core antibody_shard1_replica_n2 afterExpiration? true
```

During this period, live nodes changed as follows:

```text
Updated live nodes from ZooKeeper... (3) -> (2)
Updated live nodes from ZooKeeper... (2) -> (1)
Updated live nodes from ZooKeeper... (1) -> (0)
Updated live nodes from ZooKeeper... (0) -> (1)
```
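
A quick way to pull these transitions out of a Solr log and find the lowest live-node count is sketched below. The regular expression only assumes the `(N) -> (M)` suffix shown in the lines above; the helper name is ours.

```python
import re

# Matches the "(3) -> (2)" suffix of the "Updated live nodes" log lines.
TRANSITION = re.compile(r"\((\d+)\) -> \((\d+)\)")


def live_node_transitions(log_text: str) -> list:
    """Return (before, after) live-node counts in the order they were logged."""
    return [(int(a), int(b)) for a, b in TRANSITION.findall(log_text)]


LOG = """\
Updated live nodes from ZooKeeper... (3) -> (2)
Updated live nodes from ZooKeeper... (2) -> (1)
Updated live nodes from ZooKeeper... (1) -> (0)
Updated live nodes from ZooKeeper... (0) -> (1)
"""

transitions = live_node_transitions(LOG)
print(min(after for _, after in transitions))  # 0
```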

Update requests failed during and after this state transition:

```text
Cannot talk to ZooKeeper - Updates are disabled.
```

Later, Solr reported leader state inconsistency:

```text
ClusterState says we are the leader, but locally we don't think so.
```

A second Solr node showed the same pattern:

```text
2026-05-29 14:43:25.616 INFO  State change: SUSPENDED
2026-05-29 14:43:29.583 INFO  State change: RECONNECTED
2026-05-29 14:43:29.584 INFO  ZooKeeper session re-connected ... refreshing core states after session expiration.
2026-05-29 14:43:29.589 INFO  Remove node as live in ZooKeeper:/live_nodes/solr10-test-3:8983_solr
2026-05-29 14:43:31.040 INFO  Publish node=solr10-test-3:8983_solr as DOWN
```

### Important detail

ZooKeeper server-side session expiration happened much later than the Solr-side 
"after session expiration" log.

ZooKeeper logs:

```text
2026-05-29 14:47:44,327 Expiring session 0x301ca123e240005, timeout of 30000ms exceeded
2026-05-29 14:47:46,327 Expiring session 0x301ca123e240004, timeout of 30000ms exceeded
```

So at `14:43:28` (node 2) and `14:43:29` (node 3), Solr appears to be executing
session-expiration recovery even though ZooKeeper had not yet logged actual
server-side expiration for those sessions.
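
The size of that gap can be checked directly from the two timestamps for session `0x301ca123e240004`, which appears in both the Solr node 2 log and the ZooKeeper log above. A minimal sketch (the helper name is ours):

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Solr logs use '.' and ZooKeeper logs use ',' before the milliseconds.
    return datetime.strptime(ts.replace(",", "."), "%Y-%m-%d %H:%M:%S.%f")


solr_recovery = parse_ts("2026-05-29 14:43:28.519")  # Solr node 2: "after session expiration"
zk_expiry = parse_ts("2026-05-29 14:47:46,327")      # ZooKeeper: "Expiring session 0x301ca123e240004"

gap = (zk_expiry - solr_recovery).total_seconds()
print(f"{gap:.3f} s")  # 257.808 s
```

That is, Solr started its expiration recovery more than four minutes before the server expired the session.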

### Impact

A short ZooKeeper leader failover is amplified into a SolrCloud-wide 
instability event:

- Multiple Solr nodes publish themselves as `DOWN`
- `/live_nodes` temporarily drops to zero or near-zero
- Overseer leadership changes
- Hundreds of cores are re-registered
- Updates fail with 503
- Shard leader state becomes inconsistent

### Question / suspected issue

Is Solr 10.0.0 expected to run full session-expiration recovery on 
`RECONNECTED` after a short ZooKeeper leader failover?

This looks like Solr may be treating a transient ZooKeeper reconnect as session 
expiration, or otherwise overreacting during Curator/ZooKeeper reconnect 
handling, causing unnecessary live node removal and core re-registration.

Could this be a Solr 10.0.0 regression in ZooKeeper 
reconnect/session-expiration handling compared with Solr 8?
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
