[ 
https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280843#comment-17280843
 ] 

Anton Kalashnikov commented on IGNITE-14055:
--------------------------------------------

[~ibessonov] can you take a look and merge please?

> Deadlock in timeoutObjectProcessor between 'send message' & 'handshake 
> timeout'
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-14055
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14055
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Anton Kalashnikov
>            Assignee: Anton Kalashnikov
>            Priority: Major
>         Attachments: StartServerWithTxPuts (1).java, freeze (1).sh
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Cluster hangs after jvm pauses on one of server nodes.
>  Scenario:
>  1. Start three server nodes with put operations using StartServerWithTxPuts.
>  2. Emulate jvm freezes on one server node by running the attached script:
>  {{*sh freeze.sh <jvm process pid>*}}
>  3. Wait until the script has finished.
> Result:
>  The cluster hangs on tx put operations.
> The first server node continuously prints:
> {noformat}
> [2020-11-03 09:36:01,719][INFO 
> ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100, 
> rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO 
> ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Received incoming connection from remote node while connecting to this node, 
> rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, 
> rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 
> 09:36:01,922][INFO 
> ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100, 
> rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO 
> ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Received incoming connection from remote node while connecting to this node, 
> rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, 
> rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 
> 09:36:02,124][INFO 
> ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100, 
> rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO 
> ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Received incoming connection from remote node while connecting to this node, 
> rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, 
> rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 
> 09:36:02,326][INFO 
> ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100, 
> rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO 
> ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Received incoming connection from remote node while connecting to this node, 
> rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, 
> rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 
> 09:36:02,528][INFO 
> ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100, 
> rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO 
> ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
> Received incoming connection from remote node while connecting to this node, 
> rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, 
> rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}}
> {noformat}
>  The second node prints long running transactions in prepared state ignoring 
> the default tx timeout:
>  
> {noformat}
> [2020-11-03 09:36:46,199][WARN 
> ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future 
> [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture 
> [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal 
> [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping 
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], 
> cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl 
> [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], 
> oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, 
> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, 
> dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, 
> filtersSet=true, entry=GridDhtDetachedCacheEntry 
> [super=GridDistributedCacheEntry [super=GridCacheMapEntry 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, 
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, 
> flags=0]]], prepared=0, locked=false, 
> nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, 
> transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, 
> xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, 
> nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, 
> last=false, nearEntries=0, clientFirst=false, 
> node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], nearLocallyMapped=false, 
> colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false, 
> trackTimeout=false, systemTime=39312675, systemStartTime=0, 
> prepareStartTime=340253826207680, prepareTime=39265601, 
> commitOrRollbackStartTime=340253865473498, commitOrRollbackTime=46297, 
> lb=null, mvccOp=false, qryId=-1, crdVer=0, thread=pool-2-thread-2, 
> mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping 
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], 
> cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl 
> [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], 
> oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, 
> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, 
> dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, 
> filtersSet=true, entry=GridDhtDetachedCacheEntry 
> [super=GridDistributedCacheEntry [super=GridCacheMapEntry 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, 
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, 
> flags=0]]], prepared=0, locked=false, 
> nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, 
> transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, 
> xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, 
> nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, 
> last=false, nearEntries=0, clientFirst=false, 
> node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], super=GridDhtTxLocalAdapter 
> [nearOnOriginatingNode=false, nearNodes=KeySetView [], dhtNodes=KeySetView 
> [], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null, 
> sndTransformedVals=false, depEnabled=false, 
> txState=IgniteTxImplicitSingleStateImpl [init=true, recovery=false, 
> useMvccCaching=false], super=IgniteTxAdapter [xidVer=GridCacheVersion 
> [topVer=215865159, order=1604385188157, nodeOrder=2], writeVer=null, 
> implicit=true, loc=true, threadId=123, startTime=1604385188496, 
> nodeId=61733533-7c90-4383-ab98-93f41d7059f4, isolation=READ_COMMITTED, 
> concurrency=OPTIMISTIC, timeout=10000, sysInvalidate=false, sys=false, plc=2, 
> commitVer=GridCacheVersion [topVer=215865159, order=1604385188157, 
> nodeOrder=2], finalizing=NONE, invalidParts=null, state=COMMITTED, 
> timedOut=false, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], 
> mvccSnapshot=null, skipCompletedVers=false, parentTx=null, duration=217695ms, 
> onePhaseCommit=false], size=1]]], commit=true, 
> mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping 
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], 
> cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl 
> [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], 
> oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, 
> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, 
> dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, 
> filtersSet=true, entry=GridDhtDetachedCacheEntry 
> [super=GridDistributedCacheEntry [super=GridCacheMapEntry 
> [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, 
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, 
> flags=0]]], prepared=0, locked=false, 
> nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, 
> transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, 
> xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, 
> nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, 
> last=false, nearEntries=0, clientFirst=false, 
> node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], trackable=true, 
> finishOnePhaseCalled=false, innerFuts=TransformCollectionView 
> [FinishFuture[node=07583a9d-36c8-4100-a69c-8cbd26ca82c9, loc=false, 
> done=false]], super=GridCompoundIdentityFuture [super=GridCompoundFuture 
> [rdc=AlwaysTrueReducer [], initFlag=1, lsnrCalls=0, done=false, 
> cancelled=false, err=null, futs=TransformCollectionView [false]]]]]{noformat}
>  The third node indicated starvation in the striped pool:
> {noformat}
> [2020-11-03 09:39:25,852][WARN 
> ][grid-timeout-worker-#22%6d42f793-e8d7-4fc1-812a-5336f850d9ff%][G] >>> 
> Possible starvation in striped pool.Thread name: 
> sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%Queue: []Deadlock: 
> falseCompleted: 359Thread 
> [name="sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%", id=16, 
> state=TIMED_WAITING, blockCnt=33, waitCnt=1229]        at 
> sun.misc.Unsafe.park(Native Method)        at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)        
> at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:227)     
>    at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:168)   
>      at 
> o.a.i.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:290)
>         at 
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1146)
>         at 
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1095)
>         at 
> o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1812)    
>     at 
> o.a.i.i.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1927)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257)
>         at 
> o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:528)
>         at 
> o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:329)
>         at 
> o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.finishTx(GridDhtTxLocal.java:504)
>         at 
> o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.rollbackDhtLocalAsync(GridDhtTxLocal.java:586)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler.finishDhtLocal(IgniteTxHandler.java:1102)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler.finish(IgniteTxHandler.java:970)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler.processNearTxFinishRequest(IgniteTxHandler.java:925)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler.access$200(IgniteTxHandler.java:134)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:231)
>         at 
> o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:229)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
>         at 
> o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
>         at 
> o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721)
>         at 
> o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1328)
>         at 
> o.a.i.i.managers.communication.GridIoManager.access$4600(GridIoManager.java:157)
>         at 
> o.a.i.i.managers.communication.GridIoManager$8.execute(GridIoManager.java:1213)
>         at 
> o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:54)       
>  at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:559)        
> at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:119)        at 
> java.lang.Thread.run(Thread.java:748)
> {noformat}
>  {{The request is sent from the timeout object processor. During this 
> request, the connection hangs. To resolve the hanging, the handshake timeout 
> object should be executed but it is impossible due to it awaits the finishing 
> of the initial request.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to