OmShinde1513 opened a new issue, #12480:
URL: https://github.com/apache/ignite/issues/12480

   I have a setup with two Ignite (2.17.0) server nodes using 
ZookeeperDiscoverySpi for discovery. When a network partition occurs between 
the Ignite nodes, both nodes are still able to communicate with ZooKeeper. As 
expected, ZookeeperDiscoverySpi triggers a NODE_SEGMENTED event.
   According to my configured segmentation policy (RESTART_JVM), the affected 
Ignite node stops and restarts.
   
   However, during restart the network partition still exists, so the 
restarting node is again unable to join the cluster. This causes another 
NODE_SEGMENTED event, and the restart procedure is triggered again.
   
   At this point, the restart thread becomes blocked because it cannot acquire 
the instance lock on IgnitionEx (inside the synchronized stop0() method). The 
lock is already held by the main thread, which is stuck inside IgnitionEx’s 
synchronized start0() method.
   
   Inside start0(), the main thread hangs indefinitely in 
GridCachePartitionExchangeManager#onKernalStart() while waiting for an 
exchange future. The wait repeatedly times out because the node cannot 
communicate with the peer Ignite node. Each timeout exception is caught and 
the wait is retried, so the main thread loops forever and never releases the 
instance lock.
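
   The locking pattern described above can be reproduced outside Ignite. The 
following is a minimal sketch, not Ignite code: the class SegmentationLockDemo, 
its start()/stop() methods and the exchangeDone/abort flags are hypothetical 
stand-ins for the synchronized start0()/stop0() methods and the exchange 
future. It assumes only that both methods synchronize on the same instance, as 
described above. The abort flag exists solely so that the demo can terminate.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SegmentationLockDemo {
    // Stand-in for the exchange future completing; never set in this demo.
    private volatile boolean exchangeDone = false;
    // Demo-only escape hatch so the program can exit; the real loop has none.
    private volatile boolean abort = false;
    private final CountDownLatch startEntered = new CountDownLatch(1);

    // Hypothetical analogue of start0(): holds the instance monitor while retrying.
    synchronized void start() throws InterruptedException {
        startEntered.countDown();
        while (!exchangeDone && !abort) {
            // Each pass models one timed-out wait that is caught and retried.
            Thread.sleep(10);
        }
    }

    // Hypothetical analogue of stop0(): needs the same instance monitor.
    synchronized void stop() {
        // Nothing to do; acquiring the monitor is the point of the demo.
    }

    // Returns the state observed for the restart thread while start() is looping.
    static Thread.State restartThreadState() throws InterruptedException {
        SegmentationLockDemo node = new SegmentationLockDemo();

        Thread mainThread = new Thread(() -> {
            try {
                node.start();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }, "main-start");
        mainThread.start();
        node.startEntered.await(); // start() now owns the monitor

        Thread restartThread = new Thread(node::stop, "restart");
        restartThread.start();

        // Poll until the restart thread is parked on the monitor (or give up after 2 s).
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
        Thread.State state = restartThread.getState();
        while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(5);
            state = restartThread.getState();
        }

        node.abort = true; // let the demo finish
        mainThread.join();
        restartThread.join();
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("restart thread state: " + restartThreadState());
    }
}
```

   In this sketch the restart thread can only proceed once the looping thread 
leaves start(). In the scenario reported here that never happens while the 
network partition persists.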
   
   This results in:
   
   Restart thread blocked on instance lock
   
   Main thread stuck inside an infinite retry loop during partition exchange
   
   The node never recovers after segmentation while the network partition 
persists; the effect is like a deadlock


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
