OmShinde1513 opened a new issue, #12480: URL: https://github.com/apache/ignite/issues/12480
I have a setup with two Ignite (2.17.0) server nodes that use ZookeeperDiscoverySpi for discovery. When a network partition occurs between the Ignite nodes, both nodes can still communicate with Zookeeper.

As expected, ZookeeperDiscoverySpi triggers a NODE_SEGMENTED event. According to my configured segmentation policy (RESTART_JVM), the affected Ignite node stops and restarts. However, the network partition still exists during the restart, so the restarting node again fails to join the cluster. This causes another NODE_SEGMENTED event, and the restart procedure is triggered again.

At this point, the restart thread becomes blocked because it cannot acquire the instance lock on IgnitionEx (inside the synchronized stop0() method). The lock is already held by the main thread, which is stuck inside IgnitionEx’s synchronized start0() method.

Inside start0(), the main thread hangs indefinitely in GridCachePartitionExchangeManager#onKernalStart() while waiting for an exchange future. Each wait on the exchange future times out because the node cannot communicate with the peer Ignite node. The timeout exception is caught and the wait is retried indefinitely, so the main thread loops forever and never releases the instance lock.

This results in:
- Restart thread blocked on the instance lock
- Main thread stuck in an infinite retry loop during partition exchange
- Node never recovers after segmentation while the network partition persists

The effect is like a deadlock.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
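The lock interaction reported in this issue can be reproduced in isolation with plain JDK classes. The following is a minimal sketch, not Ignite code: the class LifecycleLockSketch, its start0() and stop0() methods, the 50 ms timeout, and the never-completed future are all hypothetical stand-ins for the behaviour described in the report. It only illustrates how a thread that retries timed waits inside a synchronized method keeps a second thread blocked on the same monitor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the lifecycle locking described in the report; not Ignite code. */
class LifecycleLockSketch {
    /** Never completed: models an exchange future that cannot finish while the peer is unreachable. */
    private final CompletableFuture<Void> exchangeFut = new CompletableFuture<>();

    /** Counted down once the starting thread owns the instance monitor. */
    private final CountDownLatch startEntered = new CountDownLatch(1);

    /** Models start0(): holds the instance monitor and retries the timed wait forever. */
    synchronized void start0() throws InterruptedException {
        startEntered.countDown();
        while (true) {
            try {
                exchangeFut.get(50, TimeUnit.MILLISECONDS);
                return;
            } catch (TimeoutException e) {
                // Timeout is swallowed and the wait is retried; the monitor is never released.
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Models stop0(): needs the same instance monitor, so it cannot run while start0() loops. */
    synchronized void stop0() {
        // Intentionally empty: only entering the method matters for this sketch.
    }

    /** Returns true if the stopping thread is still BLOCKED after observeMillis. */
    static boolean stopBlockedWhileStartRetries(long observeMillis) throws InterruptedException {
        LifecycleLockSketch node = new LifecycleLockSketch();

        Thread starter = new Thread(() -> {
            try {
                node.start0();
            } catch (InterruptedException ignored) {
                // Interrupt is used below only to end the sketch.
            }
        }, "main-start");
        starter.start();
        node.startEntered.await(); // start0() now owns the monitor

        Thread restarter = new Thread(node::stop0, "restart-stop");
        restarter.start();

        Thread.sleep(observeMillis);
        boolean blocked = restarter.getState() == Thread.State.BLOCKED;

        // Clean up: interrupting the starter ends its loop and releases the monitor.
        starter.interrupt();
        starter.join();
        restarter.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("stop0 blocked: " + stopBlockedWhileStartRetries(300));
    }
}
```

In this sketch the stopping thread can proceed only after the starting thread is interrupted from outside. A thread dump of a real node in the state described above should show a comparable picture: one thread in a timed wait that owns the monitor, and another thread in state BLOCKED waiting for it.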
