Is the node that experienced the long GC pause and eventually failed a client node? This is important to know.
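Just to make sure we mean the same thing by "client node": I mean a node started with client mode enabled, roughly like the sketch below (the class name and configuration details are my assumptions, not taken from your setup).

// Minimal sketch of starting an Ignite client node (assumed configuration).
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientNodeStart {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // A client node joins the topology but does not hold primary/backup
        // partitions, so its failure affects the cluster differently than
        // the failure of a server (data) node.
        cfg.setClientMode(true);

        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Started in client mode: "
                + ignite.configuration().isClientMode());
        }
    }
}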
From the thread dumps I see that some of the nodes are unable to roll back their transactions:

"pub-#1%DataGridServer-Development%" Id=35 in WAITING on lock=org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture@3e4dcc0c
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:155)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
    at org.apache.ignite.internal.processors.cache.transactions.TransactionProxyImpl.rollback(TransactionProxyImpl.java:296)
    at com.somecompany.grid.server.tradegen.BatchIdHelper.getListOfIds(BatchIdHelper.java:84)
    at com.somecompany.grid.server.tradegen.TradeGenerator.generateUniqueTradeId64(TradeGenerator.java:47)
    at com.somecompany.grid.server.tradegen.TradeGenerator.allocateTradesFromFills(TradeGenerator.java:158)

while the others are waiting for an affinity topology change, which, in my understanding, is what prevents the first group of nodes from completing the rollback:

    ... lock=org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache$AffinityReadyFuture@14a5b2c7
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:157)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:477)
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:435)
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.primaryPartitions(GridAffinityAssignmentCache.java:399)
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryPartitions(GridCacheAffinityManager.java:366)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.reservePartitions(GridMapQueryExecutor.java:316)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest(GridMapQueryExecutor.java:428)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onMessage(GridMapQueryExecutor.java:184)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor$2.onMessage(GridMapQueryExecutor.java:159)
    at org.apache.ignite.internal.managers.communication.GridIoManager$ArrayListener.onMessage(GridIoManager.java:1821)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
    at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)

—
Denis

> On Sep 16, 2016, at 2:44 AM, yfernando <[email protected]> wrote:
>
> Hi Denis,
>
> We have been able to reproduce this situation where a node failure freezes
> the entire grid.
>
> Please find the full thread dumps of the 5 nodes that are locked up.
>
> The memoryMode of the caches are configured to be OFFHEAP_TIERED
> The cacheMode is PARTITIONED
> The atomicityMode is TRANSACTIONAL
>
> We have also seen ALL the clients freeze during a FULL GC occurring on ANY
> single node.
>
> Please let us know if you require any more information.
>
> grid-tp1-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11220-201609141523318.txt>
>
> grid-tp1-dev-11223-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11223-201609141523318.txt>
>
> grid-tp3-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11220-201609141523318.txt>
>
> grid-tp3-dev-11221-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11221-201609141523318.txt>
>
> grid-tp4-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp4-dev-11220-201609141523318.txt>
>
> --
> View this message in context:
> http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7791.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
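For context on the first trace, here is a minimal sketch of the kind of transactional block I assume BatchIdHelper.getListOfIds is running. The cache name, key and timeout are my assumptions, and I am not suggesting this avoids the hang; it only shows the transaction demarcation and rollback path I have in mind.

// Sketch only: assumed cache name "batchIds" and assumed key "lastId".
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class BatchIdSketch {
    static long nextId(Ignite ignite) {
        IgniteCache<String, Long> ids = ignite.cache("batchIds");

        // txStart(concurrency, isolation, timeout, txSize): an explicit
        // timeout so the transaction cannot stay open indefinitely.
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                TimeUnit.SECONDS.toMillis(30),
                1)) {
            Long current = ids.get("lastId");
            long next = (current == null ? 0L : current) + 1L;
            ids.put("lastId", next);
            tx.commit();
            return next;
        }
        // If commit() is not reached, close() rolls the transaction back.
    }
}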
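For anyone else following the thread, the cache settings quoted above correspond roughly to a configuration like the following (Ignite 1.x API; the cache name and generic types are my assumptions):

// Sketch of a cache configured as described above: OFFHEAP_TIERED,
// PARTITIONED, TRANSACTIONAL. The cache name "trades" is assumed.
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheConfigSketch {
    static CacheConfiguration<Object, Object> tradeCacheConfig() {
        CacheConfiguration<Object, Object> ccfg = new CacheConfiguration<>("trades");

        ccfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);      // entries kept off-heap
        ccfg.setCacheMode(CacheMode.PARTITIONED);                // data split across nodes
        ccfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); // transactional cache

        return ccfg;
    }
}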
