GhoufranGhazaly opened a new issue, #12597:
URL: https://github.com/apache/ignite/issues/12597
I’m using Apache Ignite 3.1 in a production environment with a cluster of two server nodes.
Both nodes are configured as CMG and MetaStorage nodes.
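For context, the cluster was initialized with both nodes in both groups, roughly like this (I am quoting the CLI option names from memory, so they may differ slightly between versions, and the cluster name is a placeholder):

```shell
# Ignite 3 CLI, after connecting to one of the nodes.
# Option names are approximate; please check them against your CLI version.
cluster init \
  --name=prodCluster \
  --metastorage-group=Node1,Node2 \
  --cluster-management-group=Node1,Node2
```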
Recently, the cluster stopped working due to Raft leader election timeouts.
While investigating the logs, I found the following messages indicating that
nodes became stale:
```
2025-12-19 23:39:37:970 +0530 [WARNING][Node1-network-worker-10][HandshakeManagerUtils] Rejecting handshake: Node2:b1bbd239-630b-4da1-92f8-0fe86f6aa435 is stale, node should be restarted so that other nodes can connect

2025-12-19 23:39:47:777 +0530 [WARNING][Node1-network-worker-11][RecoveryAcceptorHandshakeManager] Handshake rejected by initiator: Node1:96c65e18-3a2c-47bb-b21a-3d3d2119e3eb is stale, node should be restarted so that other nodes can connect
```
After this, the cluster started failing with Raft-related timeouts, the MetaStorage leader could not be elected, and the cluster became unavailable (presumably because, with only two MetaStorage nodes, losing one of them leaves no Raft majority).
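In case it helps with diagnosis, this is how I have been checking the topology while the cluster is in this state (assuming I am reading the CLI commands correctly):

```shell
# Ignite 3 CLI, connected to one of the nodes.
# 'physical' lists nodes reachable over the network,
# 'logical' lists nodes that have actually joined the cluster.
cluster topology physical
cluster topology logical
```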
**Questions**
- What can cause a node to become “stale” in Ignite 3, even when no scale up/down was performed, no rebalancing was triggered, and the cluster topology was unchanged?
- How can this situation be avoided? Are there specific timeouts that should
be tuned? Are there best practices for preventing stale nodes in production?
- FailureHandler behavior
Currently, my node configuration uses:
```
failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=noop
    }
    oomBufferSizeBytes=16384
}
```
I am considering changing only the handler type from noop to stop:
```
failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=stop
    }
    oomBufferSizeBytes=16384
}
```
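If it is relevant, I plan to apply this change through the CLI rather than by editing the configuration file directly (the key path below mirrors my config layout above and may need an `ignite.` prefix depending on the version):

```shell
# Ignite 3 CLI, run while connected to each node in turn.
# Key path copied from my node config above; adjust it if your version
# nests node settings under an 'ignite' root.
node config update "failureHandler.handler.type=stop"
```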
Additionally, I configured the Ignite service to automatically restart the node 5 seconds after it stops due to a failure (a sketch of this kind of restart policy is included below).
Will this approach help in automatically restarting a stale node and
allowing it to rejoin the cluster cleanly?
Is this the recommended approach for production environments?
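For reference, on a systemd-managed host the 5-second auto-restart I described can be expressed as a drop-in override roughly like this (the unit name `ignite3db` is a placeholder for the actual service name):

```shell
# Add a drop-in so systemd restarts the node 5 seconds after a failure.
# 'ignite3db' is a placeholder; substitute the real unit name of the Ignite service.
sudo mkdir -p /etc/systemd/system/ignite3db.service.d
sudo tee /etc/systemd/system/ignite3db.service.d/restart.conf >/dev/null <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
sudo systemctl daemon-reload
```

Depending on how the stop failure handler terminates the JVM, `Restart=always` may be needed instead of `on-failure`, since `on-failure` only triggers on an unclean exit.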
- Cluster stability
Finally, I would appreciate guidance on:
* Recommended production configuration
* Cluster sizing considerations
* Any known limitations or best practices to ensure cluster stability and
avoid full outages
My main concern is to keep the cluster stable in production and avoid
complete unavailability.
Thank you for your guidance.