GhoufranGhazaly opened a new issue, #12597:
URL: https://github.com/apache/ignite/issues/12597
I’m using Apache Ignite 3.1 in a production environment with a cluster of two server nodes.
Both nodes are configured as CMG and MetaStorage nodes.
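For context, the cluster was initialized with both nodes in both groups, roughly like this (I am quoting the CLI option names from memory, so they may differ slightly between versions, and the cluster name is a placeholder):

```shell
# Ignite 3 CLI, after connecting to one of the nodes.
# Option names are approximate; please check them against your CLI version.
cluster init \
  --name=prodCluster \
  --metastorage-group=Node1,Node2 \
  --cluster-management-group=Node1,Node2
```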
Recently, the cluster stopped working due to Raft leader election timeouts.
While investigating the logs, I found the following messages indicating that
nodes became stale:
```
2025-12-19 23:39:37:970 +0530 [WARNING][Node1-network-worker-10][HandshakeManagerUtils] Rejecting handshake: Node2:b1bbd239-630b-4da1-92f8-0fe86f6aa435 is stale, node should be restarted so that other nodes can connect

2025-12-19 23:39:47:777 +0530 [WARNING][Node1-network-worker-11][RecoveryAcceptorHandshakeManager] Handshake rejected by initiator: Node1:96c65e18-3a2c-47bb-b21a-3d3d2119e3eb is stale, node should be restarted so that other nodes can connect
```
After this, the cluster started failing with Raft-related timeouts, the MetaStorage leader could not be elected, and the cluster became unavailable (presumably because, with only two MetaStorage nodes, losing one of them leaves no Raft majority).
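In case it helps with diagnosis, this is how I have been checking the topology while the cluster is in this state (assuming I am reading the CLI commands correctly):

```shell
# Ignite 3 CLI, connected to one of the nodes.
# 'physical' lists nodes reachable over the network,
# 'logical' lists nodes that have actually joined the cluster.
cluster topology physical
cluster topology logical
```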
**Questions**
- What can cause a node to become “stale” in Ignite 3, even when no scale up/down was performed, no rebalancing was triggered, and the cluster topology was unchanged?
- How can this situation be avoided? Are there specific timeouts that should
be tuned? Are there best practices for preventing stale nodes in production?
- FailureHandler behavior
Currently, my node configuration uses:
```
failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=noop
    }
    oomBufferSizeBytes=16384
}
```
I am considering changing only the handler type from noop to stop:
```
failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=stop
    }
    oomBufferSizeBytes=16384
}
```
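If it is relevant, I plan to apply this change through the CLI rather than by editing the configuration file directly (the key path below mirrors my config layout above and may need an `ignite.` prefix depending on the version):

```shell
# Ignite 3 CLI, run while connected to each node in turn.
# Key path copied from my node config above; adjust it if your version
# nests node settings under an 'ignite' root.
node config update "failureHandler.handler.type=stop"
```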
Additionally, I configured the Ignite service to automatically restart the node 5 seconds after it stops due to a failure (a sketch of this kind of restart policy is included below).
Will this approach help in automatically restarting a stale node and
allowing it to rejoin the cluster cleanly?
Is this the recommended approach for production environments?
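For reference, on a systemd-managed host the 5-second auto-restart I described can be expressed as a drop-in override roughly like this (the unit name `ignite3db` is a placeholder for the actual service name):

```shell
# Add a drop-in so systemd restarts the node 5 seconds after a failure.
# 'ignite3db' is a placeholder; substitute the real unit name of the Ignite service.
sudo mkdir -p /etc/systemd/system/ignite3db.service.d
sudo tee /etc/systemd/system/ignite3db.service.d/restart.conf >/dev/null <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
sudo systemctl daemon-reload
```

Depending on how the stop failure handler terminates the JVM, `Restart=always` may be needed instead of `on-failure`, since `on-failure` only triggers on an unclean exit.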
- Cluster stability
Finally, I would appreciate guidance on:
* Recommended production configuration
* Cluster sizing considerations
* Any known limitations or best practices to ensure cluster stability and
avoid full outages
My main concern is to keep the cluster stable in production and avoid
complete unavailability.
Thank you for your guidance.