[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout

2021-07-15 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381800#comment-17381800
 ] 

Ignite TC Bot commented on IGNITE-15099:


{panel:title=Branch: [pull/9259/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/9259/head] Base: [master] : New Tests 
(1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#8b}Basic 3{color} [[tests 
1|https://ci.ignite.apache.org/viewLog.html?buildId=6087630]]
* {color:#013220}IgniteBasicWithPersistenceTestSuite: 
SystemWorkersBlockingTest.testBlockingSection - PASSED{color}

{panel}
[TeamCity *--> Run :: All* 
Results|https://ci.ignite.apache.org/viewLog.html?buildId=6087690&buildTypeId=IgniteTests24Java8_RunAll]

> Wrong heartbeat update while waiting for a checkpoint by timeout
> 
>
> Key: IGNITE-15099
> URL: https://issues.apache.org/jira/browse/IGNITE-15099
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11, 2.12
>Reporter: Ilya Shishkov
>Assignee: Aleksey Plekhanov
>Priority: Minor
>  Labels: ise
> Attachments: Checkpointer_fake_block_master.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This problem occurs under these conditions:
>  * native persistence is turned on
>  * failureDetectionTimeout < checkpointFrequency
>  * checkpoints are sometimes skipped by timeout (more often, more probable 
> the problem occurrence)
> There is a race condition between a listener execution and finishing of a 
> pending future (see CheckpointContextImpl#executor body [1]). In some cases 
> future can finish before listener closure, therefore updating of a heartbeat 
> in listener can occur after call of _blockingSectionBegin_ in 
> Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for 
> next checkpoint (see [2]).
> {code:java|title=CheckpointContextImpl#executor}
> @Override public Executor executor() {
> return asyncRunner == null ? null : cmd -> {
> try {
> GridFutureAdapter res = new GridFutureAdapter<>();
> res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // 
> Listener is invoked concurrently with pending future finish
> asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> pendingTaskFuture.add(res);
> }
> catch (RejectedExecutionException e) {
> assert false : "A task should never be rejected by async 
> runner";
> }
> };
> }
> {code}
> {code:java|title=Checkpointer#waitCheckpointEvent}
> try {
> synchronized (this) {
> long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> while (remaining > 0 && !isCancelled()) {
> blockingSectionBegin();
> try {
> wait(remaining); 
> // At this point and till blockingSectionEnd call heartbeat 
> should be equal to Long.MAX_VALUE
> remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> }
> finally {
> blockingSectionEnd();
> }
> }
> }
> }
> {code}
>  
> If interval between checkpoints (_checkpointFrequency_) is greater than the 
> _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ 
> may cause an error message in log, because a checkpoint thread is treated as 
> blocked (but in fact it was not).
> *Reproducer of problem:* [3]. Even if test was not failed you can see log 
> message with incorrect heartbeat after waiting for checkpoint.
>  
> Links:
>  # 
> [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104]
>  # 
> [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816]
> # Reproducer: [^Checkpointer_fake_block_master.patch]. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout

2021-07-15 Thread Ilya Shishkov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381379#comment-17381379
 ] 

Ilya Shishkov commented on IGNITE-15099:


[~alex_pl], LGTM.

> Wrong heartbeat update while waiting for a checkpoint by timeout
> 
>
> Key: IGNITE-15099
> URL: https://issues.apache.org/jira/browse/IGNITE-15099
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11, 2.12
>Reporter: Ilya Shishkov
>Assignee: Aleksey Plekhanov
>Priority: Minor
>  Labels: ise
> Attachments: Checkpointer_fake_block_master.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This problem occurs under these conditions:
>  * native persistence is turned on
>  * failureDetectionTimeout < checkpointFrequency
>  * checkpoints are sometimes skipped by timeout (more often, more probable 
> the problem occurrence)
> There is a race condition between a listener execution and finishing of a 
> pending future (see CheckpointContextImpl#executor body [1]). In some cases 
> future can finish before listener closure, therefore updating of a heartbeat 
> in listener can occur after call of _blockingSectionBegin_ in 
> Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for 
> next checkpoint (see [2]).
> {code:java|title=CheckpointContextImpl#executor}
> @Override public Executor executor() {
> return asyncRunner == null ? null : cmd -> {
> try {
> GridFutureAdapter res = new GridFutureAdapter<>();
> res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // 
> Listener is invoked concurrently with pending future finish
> asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> pendingTaskFuture.add(res);
> }
> catch (RejectedExecutionException e) {
> assert false : "A task should never be rejected by async 
> runner";
> }
> };
> }
> {code}
> {code:java|title=Checkpointer#waitCheckpointEvent}
> try {
> synchronized (this) {
> long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> while (remaining > 0 && !isCancelled()) {
> blockingSectionBegin();
> try {
> wait(remaining); 
> // At this point and till blockingSectionEnd call heartbeat 
> should be equal to Long.MAX_VALUE
> remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> }
> finally {
> blockingSectionEnd();
> }
> }
> }
> }
> {code}
>  
> If interval between checkpoints (_checkpointFrequency_) is greater than the 
> _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ 
> may cause an error message in log, because a checkpoint thread is treated as 
> blocked (but in fact it was not).
> *Reproducer of problem:* [3]. Even if test was not failed you can see log 
> message with incorrect heartbeat after waiting for checkpoint.
>  
> Links:
>  # 
> [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104]
>  # 
> [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816]
> # Reproducer: [^Checkpointer_fake_block_master.patch]. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout

2021-07-15 Thread Aleksey Plekhanov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381307#comment-17381307
 ] 

Aleksey Plekhanov commented on IGNITE-15099:


[~shishkovilja], can you please have a look at the patch?

> Wrong heartbeat update while waiting for a checkpoint by timeout
> 
>
> Key: IGNITE-15099
> URL: https://issues.apache.org/jira/browse/IGNITE-15099
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11, 2.12
>Reporter: Ilya Shishkov
>Assignee: Aleksey Plekhanov
>Priority: Minor
>  Labels: ise
> Attachments: Checkpointer_fake_block_master.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This problem occurs under these conditions:
>  * native persistence is turned on
>  * failureDetectionTimeout < checkpointFrequency
>  * checkpoints are sometimes skipped by timeout (more often, more probable 
> the problem occurrence)
> There is a race condition between a listener execution and finishing of a 
> pending future (see CheckpointContextImpl#executor body [1]). In some cases 
> future can finish before listener closure, therefore updating of a heartbeat 
> in listener can occur after call of _blockingSectionBegin_ in 
> Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for 
> next checkpoint (see [2]).
> {code:java|title=CheckpointContextImpl#executor}
> @Override public Executor executor() {
> return asyncRunner == null ? null : cmd -> {
> try {
> GridFutureAdapter res = new GridFutureAdapter<>();
> res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // 
> Listener is invoked concurrently with pending future finish
> asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> pendingTaskFuture.add(res);
> }
> catch (RejectedExecutionException e) {
> assert false : "A task should never be rejected by async 
> runner";
> }
> };
> }
> {code}
> {code:java|title=Checkpointer#waitCheckpointEvent}
> try {
> synchronized (this) {
> long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> while (remaining > 0 && !isCancelled()) {
> blockingSectionBegin();
> try {
> wait(remaining); 
> // At this point and till blockingSectionEnd call heartbeat 
> should be equal to Long.MAX_VALUE
> remaining = U.nanosToMillis(scheduledCp.nextCpNanos - 
> System.nanoTime());
> }
> finally {
> blockingSectionEnd();
> }
> }
> }
> }
> {code}
>  
> If interval between checkpoints (_checkpointFrequency_) is greater than the 
> _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ 
> may cause an error message in log, because a checkpoint thread is treated as 
> blocked (but in fact it was not).
> *Reproducer of problem:* [3]. Even if test was not failed you can see log 
> message with incorrect heartbeat after waiting for checkpoint.
>  
> Links:
>  # 
> [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104]
>  # 
> [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816]
> # Reproducer: [^Checkpointer_fake_block_master.patch]. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)