[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Muzafarov updated IGNITE-7196: Fix Version/s: (was: 2.7) 2.8 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Maxim Muzafarov >Priority: Critical > Fix For: 2.8 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > {noformat} > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Pavlov updated IGNITE-7196: --- Description: Exchange can stuck and wait while new node restoring state from disk and starting caches, there's a log snippet from a just joined new node that shows the issue: {noformat} [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, customEvt=null, allowMerge=true] [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] Resolved page store work directory: /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Resolved write ahead log work directory: /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Resolved write ahead log archive directory: /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Started write-ahead log manager [mode=DEFAULT] [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 KiB, checkpointBuffer=100.0 MiB] [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 MiB, checkpointBuffer=896.0 MiB] [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] Read checkpoint status [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] Checking memory state [lastValidPos=FileWALPointer [idx=3582, fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start. [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.17.115:57148] [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false]] [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] Finished applying memory changes [changesApplied=165103, time=8189ms] [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.28.10:47964] [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.27.101:46000] [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are: ^-- Transactions in deadlock. ^-- Long running transactions (ignore if this is the case). ^-- Unreleased explicit locks. [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], topVer=57, nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992], crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.27.101], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500], discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Pavlov updated IGNITE-7196: --- Fix Version/s: (was: 2.8) 2.7 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Maxim Muzafarov >Priority: Critical > Fix For: 2.7 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Cherkasov updated IGNITE-7196: -- Fix Version/s: (was: 2.7) 2.8 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Maxim Muzafarov >Priority: Critical > Fix For: 2.8 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Pavlov updated IGNITE-7196: --- Fix Version/s: (was: 2.6) 2.7 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Alexey Goncharuk >Priority: Critical > Fix For: 2.7 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Goncharuk updated IGNITE-7196: - Fix Version/s: (was: 2.5) 2.6 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Alexey Goncharuk >Priority: Critical > Fix For: 2.6 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >
[jira] [Updated] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
[ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Ozerov updated IGNITE-7196: Fix Version/s: (was: 2.4) 2.5 > Exchange can stuck and wait while new node restoring state from disk and > starting caches > > > Key: IGNITE-7196 > URL: https://issues.apache.org/jira/browse/IGNITE-7196 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.3 >Reporter: Mikhail Cherkasov >Assignee: Alexey Goncharuk >Priority: Critical > Fix For: 2.5 > > > Exchange can stuck and wait while new node restoring state from disk and > starting caches, there's a log snippet from a just joined new node that shows > the issue: > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started > exchange init [topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], > crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1, > customEvt=null, allowMerge=true] > [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] > Resolved page store work directory: > /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log work directory: > /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Resolved write ahead log archive directory: > /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463 > [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] > Started write-ahead log manager [mode=DEFAULT] > [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 > KiB, checkpointBuffer=100.0 MiB] > [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] > Started page memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 > MiB, checkpointBuffer=896.0 MiB] > [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Read checkpoint status > [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin, > > endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin] > [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Checking memory state [lastValidPos=FileWALPointer [idx=3582, > fileOffset=59186076, len=9229, forceFlush=false], lastMarked=FileWALPointer > [idx=3629, fileOffset=50829700, len=9229, forceFlush=false], > lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af] > [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Ignite node stopped in the middle of checkpoint. Will restore memory state > and finish checkpoint on node start. > [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.17.115:57148] > [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, > pos=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, > forceFlush=false]] > [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager] > Finished applying memory changes [changesApplied=165103, time=8189ms] > [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.28.10:47964] > [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/172.31.20.209:48100, > rmtAddr=/172.31.27.101:46000] > [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to > wait for initial partition map exchange. Possible reasons are: > ^-- Transactions in deadlock. > ^-- Long running transactions (ignore if this is the case). > ^-- Unreleased explicit locks. > [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still > waiting for initial partition map exchange > [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent > [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, > addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], > sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, >