[ 
https://issues.apache.org/jira/browse/IGNITE-19239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Shishkov updated IGNITE-19239:
-----------------------------------
    Description: 
Error messages about checkpoint read lock acquisition timeouts and blocked 
system-critical threads may appear during the snapshot restore process (just 
after the caches start):
{quote} 
[2023-04-06T10:55:46,561][ERROR]\[ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock]
 Checkpoint read lock acquisition has been timed out. 
{quote} 

{quote} 
[2023-04-06T10:55:47,487][ERROR]\[tcp-disco-msg-worker-[crd]\-#23%node%\-#446%node%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour \[workerName=db-checkpoint-thread, 
threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%,
 {color:red}blockedFor=100s{color}] 
{quote} 

There is also an active exchange process, which finishes with timings like 
the following (the timing is approximately equal to the blocking time of the 
threads): 
{quote} 
[2023-04-06T10:55:52,211][INFO ]\[exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in exchange queue" (0 ms), ..., stage="Restore partition states" ({color:red}100163 ms{color}), ..., stage="Total time" ({color:red}100334 ms{color})] 
{quote} 
 

As I understand it, such errors do not affect the restore itself, but they 
can be confusing, so it would be good to fix them.
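
For context, a restore like the one producing these logs is typically triggered through the public snapshot API. Below is a minimal sketch, assuming Ignite 2.11+ (where {{IgniteSnapshot#restoreSnapshot}} is available); the snapshot and cache group names are made up for illustration:
{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class RestoreSnapshotExample {
    public static void main(String[] args) {
        // Start (or connect to) a persistent node; configuration omitted for brevity.
        try (Ignite ignite = Ignition.start()) {
            // Restore cache group "myCacheGroup" from snapshot "mySnapshot"
            // (both names are hypothetical). The errors quoted above appear
            // while the "Restore partition states" stage is still running.
            ignite.snapshot()
                .restoreSnapshot("mySnapshot", Collections.singleton("myCacheGroup"))
                .get();
        }
    }
}
{code}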

 

How to reproduce:
 # Set the checkpoint frequency to a value less than the failure detection timeout.
 # Ensure that restoring the cache group partition states takes longer than 
the failure detection timeout, i.e. this applies to sufficiently large caches 
(a configuration sketch follows the reproducer link below).

Reproducer: [^BlockingThreadsOnSnapshotRestoreReproducerTest.patch]
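
For illustration, here is a minimal configuration sketch for the steps above. The API calls ({{DataStorageConfiguration#setCheckpointFrequency}}, {{IgniteConfiguration#setFailureDetectionTimeout}}) are the standard public ones; the concrete timeout values are assumptions, chosen only so that the checkpoint frequency stays below the failure detection timeout:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointTimeoutReproducerConfig {
    public static void main(String[] args) {
        // Step 1: checkpoint frequency (3 s) below the failure detection timeout (10 s).
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            .setCheckpointFrequency(3_000L);

        // Persistence is required for checkpoints and snapshots to apply.
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureDetectionTimeout(10_000L)
            .setDataStorageConfiguration(dsCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().state(ClusterState.ACTIVE);
            // Step 2: load enough data that the "Restore partition states"
            // stage outlasts the failure detection timeout, then create and
            // restore a snapshot (see the attached reproducer patch).
        }
    }
}
{code}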



> Checkpoint read lock acquisition timeouts during snapshot restore
> -----------------------------------------------------------------
>
>                 Key: IGNITE-19239
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19239
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ilya Shishkov
>            Priority: Minor
>              Labels: iep-43, ise
>         Attachments: BlockingThreadsOnSnapshotRestoreReproducerTest.patch



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
