[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280843#comment-17280843 ]
Anton Kalashnikov commented on IGNITE-14055: -------------------------------------------- [~ibessonov] can you take a look and merge please? > Deadlock in timeoutObjectProcessor between 'send message' & 'handshake > timeout' > ------------------------------------------------------------------------------- > > Key: IGNITE-14055 > URL: https://issues.apache.org/jira/browse/IGNITE-14055 > Project: Ignite > Issue Type: Bug > Reporter: Anton Kalashnikov > Assignee: Anton Kalashnikov > Priority: Major > Attachments: StartServerWithTxPuts (1).java, freeze (1).sh > > Time Spent: 20m > Remaining Estimate: 0h > > Cluster hangs after jvm pauses on one of server nodes. > Scenario: > 1. Start three server nodes with put operations using StartServerWithTxPuts. > 2. Emulate jvm freezes on one server node by running the attached script: > {{*sh freeze.sh <jvm process pid>*}} > 3. Wait until the script has finished. > Result: > The cluster hangs on tx put operations. > The first server node continuously prints: > {noformat} > [2020-11-03 09:36:01,719][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,124][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,326][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,528][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} > {noformat} > The second node prints long running transactions in prepared state ignoring > the default tx timeout: > > {noformat} > [2020-11-03 09:36:46,199][WARN > ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future > [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture > [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal > [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true, entry=GridDhtDetachedCacheEntry > [super=GridDistributedCacheEntry [super=GridCacheMapEntry > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, > ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, > flags=0]]], prepared=0, locked=false, > nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, > transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, > xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, > nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, > last=false, nearEntries=0, clientFirst=false, > node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], nearLocallyMapped=false, > colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false, > trackTimeout=false, systemTime=39312675, systemStartTime=0, > prepareStartTime=340253826207680, prepareTime=39265601, > commitOrRollbackStartTime=340253865473498, commitOrRollbackTime=46297, > lb=null, mvccOp=false, qryId=-1, crdVer=0, thread=pool-2-thread-2, > mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true, entry=GridDhtDetachedCacheEntry > [super=GridDistributedCacheEntry [super=GridCacheMapEntry > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, > ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, > flags=0]]], prepared=0, locked=false, > nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, > transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, > xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, > nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, > last=false, nearEntries=0, clientFirst=false, > node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], super=GridDhtTxLocalAdapter > [nearOnOriginatingNode=false, nearNodes=KeySetView [], dhtNodes=KeySetView > [], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null, > sndTransformedVals=false, depEnabled=false, > txState=IgniteTxImplicitSingleStateImpl [init=true, recovery=false, > useMvccCaching=false], super=IgniteTxAdapter [xidVer=GridCacheVersion > [topVer=215865159, order=1604385188157, nodeOrder=2], writeVer=null, > implicit=true, loc=true, threadId=123, startTime=1604385188496, > nodeId=61733533-7c90-4383-ab98-93f41d7059f4, isolation=READ_COMMITTED, > concurrency=OPTIMISTIC, timeout=10000, sysInvalidate=false, sys=false, plc=2, > commitVer=GridCacheVersion [topVer=215865159, order=1604385188157, > nodeOrder=2], finalizing=NONE, invalidParts=null, state=COMMITTED, > timedOut=false, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > mvccSnapshot=null, skipCompletedVers=false, parentTx=null, duration=217695ms, > onePhaseCommit=false], size=1]]], commit=true, > mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true, entry=GridDhtDetachedCacheEntry > [super=GridDistributedCacheEntry [super=GridCacheMapEntry > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, > ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, > flags=0]]], prepared=0, locked=false, > nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, > transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, > xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, > nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, > last=false, nearEntries=0, clientFirst=false, > node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], trackable=true, > finishOnePhaseCalled=false, innerFuts=TransformCollectionView > [FinishFuture[node=07583a9d-36c8-4100-a69c-8cbd26ca82c9, loc=false, > done=false]], super=GridCompoundIdentityFuture [super=GridCompoundFuture > [rdc=AlwaysTrueReducer [], initFlag=1, lsnrCalls=0, done=false, > cancelled=false, err=null, futs=TransformCollectionView [false]]]]]{noformat} > The third node indicated starvation in the striped pool: > {noformat} > [2020-11-03 09:39:25,852][WARN > ][grid-timeout-worker-#22%6d42f793-e8d7-4fc1-812a-5336f850d9ff%][G] >>> > Possible starvation in striped pool.Thread name: > sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%Queue: []Deadlock: > falseCompleted: 359Thread > [name="sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%", id=16, > state=TIMED_WAITING, blockCnt=33, waitCnt=1229] at > sun.misc.Unsafe.park(Native Method) at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338) > at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:227) > at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:168) > at > o.a.i.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:290) > at > o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1146) > at > o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1095) > at > o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1812) > at > o.a.i.i.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1927) > at > o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257) > at > o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:528) > at > o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:329) > at > o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.finishTx(GridDhtTxLocal.java:504) > at > o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.rollbackDhtLocalAsync(GridDhtTxLocal.java:586) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler.finishDhtLocal(IgniteTxHandler.java:1102) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler.finish(IgniteTxHandler.java:970) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler.processNearTxFinishRequest(IgniteTxHandler.java:925) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler.access$200(IgniteTxHandler.java:134) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:231) > at > o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:229) > at > o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) > at > o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) > at > o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) > at > o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) > at > o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) > at > o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) > at > o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721) > at > o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1328) > at > o.a.i.i.managers.communication.GridIoManager.access$4600(GridIoManager.java:157) > at > o.a.i.i.managers.communication.GridIoManager$8.execute(GridIoManager.java:1213) > at > o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:54) > at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:559) > at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:119) at > java.lang.Thread.run(Thread.java:748) > {noformat} > {{The request is sent from the timeout object processor. During this > request, the connection hangs. To resolve the hanging, the handshake timeout > object should be executed but it is impossible due to it awaits the finishing > of the initial request.}} -- This message was sent by Atlassian Jira (v8.3.4#803005)