Re: Cluster hung after a node killed
Sam,

There is no exact date as of now. I would recommend monitoring the dev list for activity around this.

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9883.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
Re: Cluster hung after a node killed
Thanks Val. The ticket has been fixed for 1.9. Any idea when 1.9 will be available?

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9880.html
Re: Cluster hung after a node killed
Hi Sam,

I reproduced the issue using your code and created a ticket: https://issues.apache.org/jira/browse/IGNITE-4450

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9616.html
Re: Cluster hung after a node killed
Hi Val,

Did you have a chance to look at the attached sample program? Did it help figure out what is going on?

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9465.html
Re: Cluster hung after a node killed
Attaching 4 classes:
- Node1.java - creates an empty cache node.
- Node2.java - seeds the cache with 100K <Integer, String> data.
- Node3.java - takes an explicit lock on a key and waits for 15 seconds before unlocking.
- Node4.java - fetches cached data.

Steps:
1. Run Node1, wait for the empty node to boot up.
2. Run Node2, wait for the completion message.
3. Run Node3, kill it when it prompts "kill me...".
4. Run Node4. The topology snapshot on the other nodes will show the node joined; however, Node4 will not be able to complete any fetch operation. Fetch from cache hangs.
5. Run another instance of Node2; it will not be able to complete any put operation. Put to cache hangs.

Node1.java <http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node1.java>
Node2.java <http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node2.java>
Node3.java <http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node3.java>
Node4.java <http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node4.java>

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9375.html
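For reference, the lock-and-kill step described above (Node3) can be sketched roughly as follows. This is a minimal sketch, not the attached Node3.java itself; the cache name "testCache" and the use of default discovery settings are assumptions. It requires a running Ignite cluster (Node1), so it is not standalone-runnable.

```java
import java.util.concurrent.locks.Lock;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class Node3 {
    public static void main(String[] args) throws Exception {
        // Join the cluster started by Node1 (default discovery assumed).
        Ignite ignite = Ignition.start();
        IgniteCache<Integer, String> cache = ignite.cache("testCache");

        // Take an explicit lock on one key. Killing the process with
        // "kill -9" at the prompt leaves the lock unreleased, which is
        // the scenario that reproduces the cluster hang.
        Lock lock = cache.lock(1);
        lock.lock();
        System.out.println("kill me...");
        Thread.sleep(15_000);
        lock.unlock();
    }
}
```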
Re: Cluster hung after a node killed
Hi Sam,

Can you create a small project on GitHub with instructions on how to run it? I just tried to repeat the same test, but everything works fine for me.

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9340.html
Re: Cluster hung after a node killed
Val,

I have reproduced this with a simple program:
1. Node 1 - run the example ExampleNodeStartup.
2. Node 2 - run a program which creates a transactional cache and adds 100K simple <String, String> entries:
   cfg.setCacheMode(CacheMode.PARTITIONED);
   cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
   cfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);
   cfg.setSwapEnabled(false);
   cfg.setBackups(0);
3. Node 3 - run a program which takes a lock (cache.lock(key)).
4. Kill Node 3 before it can unlock.
5. Node 4 - run a program which tries to get cached data.

Node4 is not able to join the cluster; it is hung. In fact the complete cluster is hung: any operation from Node2 also hangs.

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9337.html
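Put together, the cache setup above corresponds roughly to the following (Ignite 1.x API; the cache name "testCache" and the seeding loop are illustrative assumptions). As a configuration fragment, it needs a running cluster and is not standalone-runnable:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class Node2 {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Off-heap, transactional, partitioned cache with no backups,
        // matching the settings listed in the message above.
        CacheConfiguration<String, String> cfg = new CacheConfiguration<>("testCache");
        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        cfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);
        cfg.setSwapEnabled(false);
        cfg.setBackups(0);

        IgniteCache<String, String> cache = ignite.getOrCreateCache(cfg);
        for (int i = 0; i < 100_000; i++)
            cache.put("key" + i, "value" + i);
        System.out.println("seeding complete");
    }
}
```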
Re: Cluster hung after a node killed
Sam,

It depends on the reason for this hang. If there is a bug, then most likely a timeout will not help either :) Do you have a test that reproduces the issue?

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9317.html
Re: Cluster hung after a node killed
Hi Val,

Killing the node that acquired the lock did not release it automatically and left the whole cluster in a hung state: any operation on any cache (not related to the lock) was stuck waiting. The cluster is not able to recover seamlessly. Looks like a bug to me.

I understand a lock timeout can be error-prone, but if configured correctly it can provide a second path to auto recovery in such failover cases. Is there any way to configure timeouts on a lock?

Explicit locks are one of the use cases we have, but cluster auto recovery during any change to the cluster is the most important, so if these two don't go together then it's a show stopper.

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9309.html
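On the timeout question: IgniteCache.lock(key) returns a standard java.util.concurrent.locks.Lock, so a caller can at least bound its own wait with tryLock rather than blocking indefinitely. Note this does not release a lock held by a dead node; it only keeps the waiter from hanging. The sketch below uses a plain ReentrantLock as a stand-in to show the pattern without needing a cluster:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockDemo {
    /** Tries to acquire {@code lock} within {@code millis}; returns whether it succeeded. */
    static boolean tryWithTimeout(Lock lock, long millis) throws InterruptedException {
        if (lock.tryLock(millis, TimeUnit.MILLISECONDS)) {
            try {
                return true; // got the lock; do the cache work here
            } finally {
                lock.unlock();
            }
        }
        return false; // timed out instead of blocking forever
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Lock returned by IgniteCache.lock(key).
        Lock lock = new ReentrantLock();
        lock.lock(); // simulate a holder that never unlocks (e.g. a killed node)

        // A second thread times out instead of hanging.
        final boolean[] acquired = new boolean[1];
        Thread t = new Thread(() -> {
            try {
                acquired[0] = tryWithTimeout(lock, 100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        t.join();
        System.out.println("acquired=" + acquired[0]); // prints acquired=false
    }
}
```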
Re: Cluster hung after a node killed
Hi Val,

I am curious how the Ignite cluster will behave with a killed node that is in an active transaction with implicit cache locks. Thank you.

BTW, when will 1.8 be released?

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9297.html
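For transactional operations, Ignite does expose a timeout when a transaction is started explicitly, which bounds how long its locks can be waited on. A minimal sketch (the cache name "testCache", the 5-second timeout, and the concurrency/isolation choices are illustrative assumptions; it requires a running cluster with a TRANSACTIONAL cache):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TxTimeoutSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<String, String> cache = ignite.cache("testCache");

        // txStart(concurrency, isolation, timeout ms, expected tx size; 0 = unknown).
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                5_000, 0)) {
            cache.put("k", "v");
            tx.commit();
        }
    }
}
```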
Re: Cluster hung after a node killed
Hi Sam,

If there is an explicit lock that you acquired and did not release properly, there is not much we can do. Explicit locks must be released explicitly; releasing them by timeout is even more error-prone in my view.

BTW, simply stopping the node that acquired the lock should have helped as well, because locks are released automatically in that case.

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9247.html
Re: Cluster hung after a node killed
Ideally the cluster should recover seamlessly. Is there any lock timeout which I can configure? Or any other configuration which will make sure locks taken by a crashing node get released and the cluster still serves all requests? Is this a bug?

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9243.html
Re: Cluster hung after a node killed
Could you please elaborate on your suspicion?

The addRoleDelegationToCache and addDocument calls were made after killing node3. These calls try to push data into the cache, and we are not using any transaction API to explicitly start or commit a transaction on the cache while pushing data. These calls are made by node1 during regular application access. Application access was not attempted immediately after killing node3; I tried to access the application after about 3-5 minutes, and node3 was killed while the system was idle.

Unfortunately I cannot share the application. I can try to reproduce it again and provide more logs if needed, or try to write a test program to simulate it. Let me know if you need more logs, probably DEBUG level logs.

Below are some pointers from the thread dump and logs:

1. The following frame from the thread dump made me assume the topology change never completed and any later cache operation is waiting on it:

org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)

2. Node3's address is 10.107.186.137; 17:03:48,223 is the time when Server2's log first detected the failed node. The relevant log lines:

17:03:48,223 WARNING [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi] (tcp-disco-msg-worker-#2%TESTNODE%) Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
17:03:48,237 WARNING [org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] (disco-event-worker-#254%TESTNODE%) Node FAILED: TcpDiscoveryNode [id=e840a775-36b9-48d3-993c-25dea95d59d0, addrs=[10.107.186.137, 10.245.1.1, 127.0.0.1, 192.168.122.1], sockAddrs=[/192.168.122.1:48500, /10.107.186.137:48500, /10.245.1.1:48500, /127.0.0.1:48500], discPort=48500, order=55, intOrder=29, lastExchangeTime=1478911399482, loc=false, ver=1.7.0#20160801-sha1:383273e3, isClient=false]
17:03:48,240 INFO [stdout] (disco-event-worker-#254%TESTNODE%) [17:03:48] Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
17:03:48,241 INFO [org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] (disco-event-worker-#254%TESTNODE%) Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
17:03:58,480 WARNING [org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager] (exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that might be the cause:
17:03:58,482 WARNING [org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager] (exchange-worker-#256%TESTNODE%) Ready affinity version: AffinityTopologyVersion [topVer=55, minorTopVer=1]

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9025.html
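As an aside, the "Failure Detection" hint in the first warning above refers to tuning how quickly the cluster declares a node dead. A minimal sketch of that configuration (Ignite 1.5+; the 3-second value is an illustrative assumption, the default is 10 seconds). Note this only speeds up removal of the failed node from the topology; it would not release an explicit lock the node held:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FasterFailureDetection {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Lower the cluster-wide failure detection timeout so a killed
        // node is detected and dropped from the topology sooner.
        cfg.setFailureDetectionTimeout(3_000);
        Ignition.start(cfg);
    }
}
```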
Re: Cluster hung after a node killed
logs.zip <http://apache-ignite-users.70518.x6.nabble.com/file/n9010/logs.zip>

Attaching the thread dump and logs from 2 nodes. The logs from node 3, which was killed using "kill -9", were lost. Let me know if you need more logs.

In my understanding, after killing node 3 the topology version update got stuck, and Node 2 keeps complaining about the failed Node 3. Node 1 tried to access the application, which hung on a get or put Ignite call because of a topology mismatch or a lock.

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p9010.html
Re: Cluster hung after a node killed
Hi,

I think INFO log level would be enough (from all nodes).

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p8991.html
Re: Cluster hung after a node killed
Do you want Ignite to be running in DEBUG, or should System.out be enough from all 3 nodes?

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p8978.html
Re: Cluster hung after a node killed
Hi Sam,

Please attach full logs and full thread dumps if you want someone to take a look. There is not enough information in your message to understand the reason for the issue.

-Val

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965p8976.html
Cluster hung after a node killed
Hi,

I have configured the cache as an off-heap partitioned cache, running 3 nodes on separate machines. I loaded some data into the cache using my application's normal operations, then used "kill -9" to kill node 3.

Node 2 shows the warning below on the console every 10 seconds:

11:03:03,320 WARNING [org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager] (exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that might be the cause:

Node 1 looks fine. However, the application does not work anymore, and the thread dump shows it is waiting on a cache put:

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0007ecbd4a38> (a org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache$AffinityReadyFuture)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:159)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:117)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:434)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:387)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodes(GridCacheAffinityManager.java:259)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:295)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:286)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:310)
at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.entryExx(GridDhtColocatedCache.java:176)
at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.entryEx(GridNearTxLocal.java:1251)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.enlistWriteEntry(IgniteTxLocalAdapter.java:2354)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.enlistWrite(IgniteTxLocalAdapter.java:1990)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.putAsync0(IgniteTxLocalAdapter.java:2902)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.putAsync(IgniteTxLocalAdapter.java:1859)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter$22.op(GridCacheAdapter.java:2240)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter$22.op(GridCacheAdapter.java:2238)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:4351)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2238)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2215)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.put(IgniteCacheProxy.java:1214)

Is there any specific configuration I need to provide for self-recovery of the cluster? Losing cache data is fine; the data is backed up in a persistent store (e.g., a database).

-Sam

-- View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-hung-after-a-node-killed-tp8965.html