Could you please elaborate on your suspicion?
 
The addRoleDelegationToCache and addDocument calls were made after node3 was
killed. These calls push data into the cache, and we do not use any
transaction API to explicitly start or commit a transaction on the cache
while pushing data. The calls are made by node1 during regular application
access. I did not access the application immediately after killing node3;
I tried about 3-5 minutes later. Node3 was killed while the system was idle.

Unfortunately, I cannot share the application. I can try to reproduce the
issue and provide more logs if needed, or try to write a test program that
simulates it.

Let me know if you need more logs, for example DEBUG-level logs.

Below are some pointers from the thread dump and the logs:

1. The frame below from the thread dump made me assume that the topology
change has not completed and that every later cache operation is waiting on it:
/org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)/

2. Node3's address is 10.107.186.137, and 17:03:48,223 is the time when the
Server2 log first detected the failed node. The relevant log lines:

/17:03:48,223 WARNING [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi]
(tcp-disco-msg-worker-#2%TESTNODE%) Local node has detected failed nodes and
started cluster-wide procedure. To speed up failure detection please see
'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
17:03:48,237 WARNING
[org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
(disco-event-worker-#254%TESTNODE%) Node FAILED: TcpDiscoveryNode
[id=e840a775-36b9-48d3-993c-25dea95d59d0, addrs=[10.107.186.137, 10.245.1.1,
127.0.0.1, 192.168.122.1], sockAddrs=[/192.168.122.1:48500,
/10.107.186.137:48500, /10.245.1.1:48500, /127.0.0.1:48500], discPort=48500,
order=55, intOrder=29, lastExchangeTime=1478911399482, loc=false,
ver=1.7.0#20160801-sha1:383273e3, isClient=false]
17:03:48,240 INFO  [stdout] (disco-event-worker-#254%TESTNODE%) [17:03:48]
Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
17:03:48,241 INFO 
[org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
(disco-event-worker-#254%TESTNODE%) Topology snapshot [ver=56, servers=2,
clients=0, CPUs=96, heap=4.0GB]
17:03:58,480 WARNING
[org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager]
(exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange
[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0],
node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that
might be the cause: 
17:03:58,482 WARNING
[org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager]
(exchange-worker-#256%TESTNODE%) Ready affinity version:
AffinityTopologyVersion [topVer=55, minorTopVer=1]/
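To make my reading of the last two log entries explicit, here is a minimal
sketch in plain Java (no Ignite dependency). It only parses the two version
strings as they appear in the log above and compares them. The class and
method names (TopologyVersionCheck, parseVersion, isBehind) are my own and
not Ignite APIs, and the ordering (topVer first, then minorTopVer) is my
assumption about how these versions compare.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopologyVersionCheck {
    // Matches "AffinityTopologyVersion [topVer=56, minorTopVer=0]" as printed
    // in the log. \s+ tolerates the line wrapping added by the mail client.
    private static final Pattern VERSION = Pattern.compile(
        "AffinityTopologyVersion\\s+\\[topVer=(\\d+),\\s+minorTopVer=(\\d+)\\]");

    /** Returns {topVer, minorTopVer} for the first match, or null if none. */
    static long[] parseVersion(String text) {
        Matcher m = VERSION.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
    }

    /** True if 'ready' is strictly older than 'pending' (assumed ordering). */
    static boolean isBehind(long[] ready, long[] pending) {
        if (ready[0] != pending[0]) {
            return ready[0] < pending[0];
        }
        return ready[1] < pending[1];
    }

    public static void main(String[] args) {
        // The two entries from the Server2 log, joined back onto single lines.
        String pendingLine = "Failed to wait for partition map exchange "
            + "[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], "
            + "node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]";
        String readyLine = "Ready affinity version: "
            + "AffinityTopologyVersion [topVer=55, minorTopVer=1]";

        long[] pending = parseVersion(pendingLine);
        long[] ready = parseVersion(readyLine);

        // Prints "exchange still pending: true" for the lines above.
        System.out.println("exchange still pending: " + isBehind(ready, pending));
    }
}
```

For these two lines the ready affinity version (55.1) is older than the
version the exchange worker is waiting for (56.0), which is why I think the
exchange for topology version 56 never finished on this node.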

--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9025.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
