Pavel Pereslegin created IGNITE-15429: -----------------------------------------
Summary: Taking a snapshot may increase the PME execution time by the checkpointFrequency interval. Key: IGNITE-15429 URL: https://issues.apache.org/jira/browse/IGNITE-15429 Project: Ignite Issue Type: Bug Reporter: Pavel Pereslegin When a snapshot is taken, a checkpoint is forced on all cluster nodes. In a rare case, when forcing a checkpoint, the start of the snapshot operation is set to the planned (instead of the current) checkpoint. In this case, the local partition exchange future does not finish until the next checkpoint starts (but timeout), which significantly increases the execution time of the exchange. Log output on a node with a checkpoint frequency of 60 seconds. {noformat} 2021-08-31 23:30:04.792 [INFO ][exchange-worker-#179][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask] Snapshot operation is scheduled on local node and will be handled by the checkpoint l istener [sctx=SnapshotFutureTask [pageStore=GridCacheSharedManagerAdapter [starting=true, stop=false], srcNodeId=a49f4c59-a4d1-4b02-b416-ceede4ffc0ba, snpName=20210831233001_snapshot, tmpSnpWorkDir=/opt/ignite/ssd/data/epk_r b_sylvanas5_ca_sbrf_ru/snp/20210831233001_snapshot, locBuff=java.lang.ThreadLocal$SuppliedThreadLocal@6fd06d5f, ioFactory=org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIOFactory@54067b77, cpEnd Fut=java.util.concurrent.CompletableFuture@382a989[Not completed], startedFut=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=378462544], tmpConsIdDir=/opt/ignite/ssd/data/something/snp/20210831233001_snapshot/db/something, closeFut=null, err=null, started=true], topVer=AffinityTopologyVersion [topVer=515, minorTopVer=2]] 2021-08-31 23:30:05.444 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, startP tr=FileWALPointer [idx=13473, fileOff=1008077757, len=236399], checkpointBeforeLockTime=352ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=325ms, checkpointLockHoldTime=331ms, walCpRecordFsyncDuration=3ms, writeCh eckpointEntryDuration=0ms, splitAndSortCpPagesDuration=8ms, pages=15001, reason='timeout'] 2021-08-31 23:30:05.671 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=b0e9b43d-02f4-44fb-90c5-41dc6c294248, pages=15001, markPos=FileWALPointer [idx=13473, fileOff=1008077757, len=236399], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=343ms, pagesWrite=128ms, fsync=99ms, total=922ms] ...60 seconds later... 2021-08-31 23:31:05.779 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotSender] Resolved snapshot work directory: /opt/ignite/sas/snapshot/20210831233001_snapshot/db/something 2021-08-31 23:31:05.812 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.snapshot.SnapshotFutureTask] Submit partition processing tasks with partition allocated lengths: ... 2021-08-31 23:31:05.837 [INFO ][db-checkpoint-thread-#236][org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=328ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=298ms, checkpointLockHoldTime=304ms, reason='timeout'] {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)