Could you please elaborate on your suspicion? The addRoleDelegationToCache and addDocument calls were made after killing node3. These calls push data into the cache, and we are not using any transaction API to explicitly start or commit a transaction while pushing data. The calls are made on node1 during regular application access. The application was not accessed immediately after killing node3; I tried to access it about 3-5 minutes later. Node3 was killed while the system was idle.
Unfortunately, I cannot share the application. I can try to reproduce the problem again and provide more logs if needed, or write a test program to simulate it. Let me know if you need more logs, probably at DEBUG level. Below are some pointers from the thread dump and logs:

1. The following frame from the thread dump made me assume the topology change had not completed and that any later cache operation was waiting on it:

    org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)

2. Node3's address is 10.107.186.137, and 17:03:48,223 is the time at which the Server2 log first detected the failed node. The relevant log lines:

    17:03:48,223 WARNING [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi] (tcp-disco-msg-worker-#2%TESTNODE%) Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
    17:03:48,237 WARNING [org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] (disco-event-worker-#254%TESTNODE%) Node FAILED: TcpDiscoveryNode [id=e840a775-36b9-48d3-993c-25dea95d59d0, addrs=[10.107.186.137, 10.245.1.1, 127.0.0.1, 192.168.122.1], sockAddrs=[/192.168.122.1:48500, /10.107.186.137:48500, /10.245.1.1:48500, /127.0.0.1:48500], discPort=48500, order=55, intOrder=29, lastExchangeTime=1478911399482, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false]
    17:03:48,240 INFO [stdout] (disco-event-worker-#254%TESTNODE%) [17:03:48] Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
    17:03:48,241 INFO [org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] (disco-event-worker-#254%TESTNODE%) Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
    17:03:58,480 WARNING [org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager] (exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that might be the cause:
    17:03:58,482 WARNING [org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager] (exchange-worker-#256%TESTNODE%) Ready affinity version: AffinityTopologyVersion [topVer=55, minorTopVer=1]

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9025.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
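As an aside, for anyone reproducing this: one quick way to confirm how many threads are parked on the affinity wait is to grep the captured thread dump for that frame. The sketch below uses an illustrative, made-up dump excerpt (not the actual dump from this cluster) just to show the filter; the file path is an assumption.

```shell
# Write an illustrative (not real) thread-dump excerpt to a file,
# to demonstrate filtering for the blocking frame.
cat > /tmp/threaddump.txt <<'EOF'
"exchange-worker-#256%TESTNODE%" waiting on condition
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)
"some-other-thread" runnable
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
EOF

# Count stack frames blocked in awaitTopologyVersion; a non-zero count
# suggests cache operations are stalled waiting on an incomplete
# partition map exchange.
grep -c 'awaitTopologyVersion' /tmp/threaddump.txt
```

On a live node, the dump itself can be taken with `jstack <pid>` before running the grep.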