[ 
https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630081#comment-16630081
 ] 

Maxim Muzafarov commented on IGNITE-7196:
-----------------------------------------

Folks,

I've found the cause. The main problem with `flakiness` is that even 
`waitForCheckpoint` completes unsuccessfully (e.g. InterruptedException occurs) 
we only log this exception and delete cache group descriptor anyway. That's why 
at the restore binary state on node startup we've got 
{{testDestroyCachesAbruptly}} failure. I think we should restore only that 
caches we have at node start. PR fixed according to this assumption.

I've merged PR with the latest changes of master branch and resolve all 
conflicts. Suppose, the IGNITE-9418 issue affects my branch and I've got some 
tests to hang at the moment of exchange future done.

I need to investigate this case, sorry. I'll back to you soon.

> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-7196
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7196
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.3
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>             Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and 
> starting caches, there's a log snippet from a just joined new node that shows 
> the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager]
>  Resolved page store work directory: 
> /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log work directory: 
> /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resolved write ahead log archive directory: 
> /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Started write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 
> KiB, checkpointBuffer=100.0 MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] 
> Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 
> MiB, checkpointBuffer=896.0 MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checking memory state [lastValidPos=FileWALPointer [idx=3582, 
> fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer 
> [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], 
> lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Ignite node stopped in the middle of checkpoint. Will restore memory state 
> and finish checkpoint on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to 
> wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], 
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, 
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], 
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, 
> /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false, 
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null, 
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, 
> done=false, state=SRV, evtLatch=0, 
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, 
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164, 
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], 
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
> hash=542833742]]]
> [21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, 
> pages=157778, markPos=FileWALPointer [idx=3629, fileOffset=50829700, 
> len=9229, forceFlush=false], pagesWrite=5460ms, fsync=45781ms, total=51241ms]
> [21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager]
>  Resuming logging to WAL segment 
> [file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
>  offset=50838929, ver=1]
> [21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still 
> waiting for initial partition map exchange 
> [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent 
> [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], 
> sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, 
> ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], 
> discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, 
> loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], 
> exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode 
> [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 
> 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, 
> /127.0.0.1:48500, 
> ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], 
> discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324, 
> loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, 
> nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], 
> nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter 
> [ignoreInterrupts=false, state=INIT, res=null, hash=820635914], init=false, 
> lastVer=null, partReleaseFut=null, exchActions=null, affChangeMsg=null, 
> initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, 
> done=false, state=SRV, evtLatch=0, 
> remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, 
> ee15ea85-f5f6-4513-871e-7b4b080911cf, 56c97317-26cf-43d2-bf76-0cab59c6fa5f, 
> f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164, 
> 979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], 
> super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, 
> hash=542833742]]]
> [21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] 
> Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, 
> mode=REPLICATED, atomicity=TRANSACTIONAL]
> [21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] 
> Started cache [name=statementp, memoryPolicyName=Default_Region, 
> mode=PARTITIONED, atomicity=TRANSACTIONAL]
> [21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finished waiting for partition release future 
> [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], waitTime=0ms, 
> futInfo=NA]
> [21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Read checkpoint status 
> [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
>  
> endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
> [21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Applying lost cache updates since last checkpoint record 
> [lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, 
> forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
>  Finished applying WAL changes [updatesApplied=0, time=607ms]
> [21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> crd=false]
> [21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Received full message, will finish exchange 
> [node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion 
> [topVer=57, minorTopVer=0]]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.24.98:58928]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.27.95:52532]
> [21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.17.174:54358]
> [21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/172.31.20.209:48100, 
> rmtAddr=/172.31.23.109:42528]
> [21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=57, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], 
> err=null]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to