[ https://issues.apache.org/jira/browse/IGNITE-19239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ilya Shishkov updated IGNITE-19239:
-----------------------------------
    Description: 
Error messages about checkpoint read lock acquisition timeouts and blocked system-critical threads may appear during the snapshot restore process (just after the caches start):
{quote}
[2023-04-06T10:55:46,561][ERROR]\[ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock] Checkpoint read lock acquisition has been timed out.
{quote}
{quote}
[2023-04-06T10:55:47,487][ERROR]\[tcp-disco-msg-worker-[crd]\-#23%node%\-#446%node%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour \[workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%, {color:red}blockedFor=100s{color}]
{quote}
There is also an active exchange process, which finishes with timings like the following (the duration of the "Restore partition states" stage is approximately equal to the blocking time of the threads):
{quote}
[2023-04-06T10:55:52,211][INFO ]\[exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in exchange queue" (0 ms), ..., stage="Restore partition states" ({color:red}100163 ms{color}), ..., stage="Total time" ({color:red}100334 ms{color})]
{quote}
As far as I understand, such errors do not affect the restore itself, but they can be confusing, so it would be great to fix them.

How to reproduce (see the configuration sketch below):
# Set the checkpoint frequency to a value less than the failure detection timeout.
# Ensure that restoring of cache group partition states lasts longer than the failure detection timeout, i.e. the issue is relevant for sufficiently large caches.

Reproducer: [^BlockingThreadsOnSnapshotRestoreReproducerTest.patch]
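A minimal sketch of such a reproduction setup, assuming a single-node persistent cluster. The class name, cache name, snapshot name, entry count and timeout values are illustrative assumptions and are not taken from the attached reproducer patch; the only requirements from the steps above are that the checkpoint frequency is below the failure detection timeout and that the restored caches are large enough:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SnapshotRestoreBlockedThreadsSketch {
    public static void main(String[] args) throws Exception {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureDetectionTimeout(10_000)                    // 10 s failure detection timeout (illustrative).
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setCheckpointFrequency(1_000)                     // 1 s checkpoint frequency, i.e. less than the failure detection timeout.
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)));                // Persistence is required for snapshots.

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().state(ClusterState.ACTIVE);

            // Populate a sufficiently large cache so that the "Restore partition states"
            // exchange stage of the restore outlasts the failure detection timeout.
            IgniteCache<Integer, byte[]> cache = ignite.getOrCreateCache("large-cache");
            for (int i = 0; i < 1_000_000; i++)
                cache.put(i, new byte[1024]);

            ignite.snapshot().createSnapshot("snp").get();

            // Destroy the cache and restore it from the snapshot: the checkpoint read lock
            // timeout and blocked system-critical thread messages appear during this restore.
            cache.destroy();
            ignite.snapshot().restoreSnapshot("snp", null).get();
        }
    }
}
{code}
With a cache of this size, the restore's "Restore partition states" stage takes longer than the failure detection timeout, and the messages quoted above are logged even though the restore itself completes.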
> Checkpoint read lock acquisition timeouts during snapshot restore
> -----------------------------------------------------------------
>
>                 Key: IGNITE-19239
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19239
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ilya Shishkov
>            Priority: Minor
>              Labels: iep-43, ise
>         Attachments: BlockingThreadsOnSnapshotRestoreReproducerTest.patch
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)