[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout
[ https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381800#comment-17381800 ] Ignite TC Bot commented on IGNITE-15099: {panel:title=Branch: [pull/9259/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel} {panel:title=Branch: [pull/9259/head] Base: [master] : New Tests (1)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1} {color:#8b}Basic 3{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=6087630]] * {color:#013220}IgniteBasicWithPersistenceTestSuite: SystemWorkersBlockingTest.testBlockingSection - PASSED{color} {panel} [TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=6087690&buildTypeId=IgniteTests24Java8_RunAll] > Wrong heartbeat update while waiting for a checkpoint by timeout > > > Key: IGNITE-15099 > URL: https://issues.apache.org/jira/browse/IGNITE-15099 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.11, 2.12 >Reporter: Ilya Shishkov >Assignee: Aleksey Plekhanov >Priority: Minor > Labels: ise > Attachments: Checkpointer_fake_block_master.patch > > Time Spent: 10m > Remaining Estimate: 0h > > This problem occurs under these conditions: > * native persistence is turned on > * failureDetectionTimeout < checkpointFrequency > * checkpoints are sometimes skipped by timeout (more often, more probable > the problem occurrence) > There is a race condition between a listener execution and finishing of a > pending future (see CheckpointContextImpl#executor body [1]). In some cases > future can finish before listener closure, therefore updating of a heartbeat > in listener can occur after call of _blockingSectionBegin_ in > Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for > next checkpoint (see [2]). > {code:java|title=CheckpointContextImpl#executor} > @Override public Executor executor() { > return asyncRunner == null ? null : cmd -> { > try { > GridFutureAdapter res = new GridFutureAdapter<>(); > res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // > Listener is invoked concurrently with pending future finish > asyncRunner.execute(U.wrapIgniteFuture(cmd, res)); > pendingTaskFuture.add(res); > } > catch (RejectedExecutionException e) { > assert false : "A task should never be rejected by async > runner"; > } > }; > } > {code} > {code:java|title=Checkpointer#waitCheckpointEvent} > try { > synchronized (this) { > long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > while (remaining > 0 && !isCancelled()) { > blockingSectionBegin(); > try { > wait(remaining); > // At this point and till blockingSectionEnd call heartbeat > should be equal to Long.MAX_VALUE > remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > } > finally { > blockingSectionEnd(); > } > } > } > } > {code} > > If interval between checkpoints (_checkpointFrequency_) is greater than the > _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ > may cause an error message in log, because a checkpoint thread is treated as > blocked (but in fact it was not). > *Reproducer of problem:* [3]. Even if test was not failed you can see log > message with incorrect heartbeat after waiting for checkpoint. > > Links: > # > [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104] > # > [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816] > # Reproducer: [^Checkpointer_fake_block_master.patch]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout
[ https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381379#comment-17381379 ] Ilya Shishkov commented on IGNITE-15099: [~alex_pl], LGTM. > Wrong heartbeat update while waiting for a checkpoint by timeout > > > Key: IGNITE-15099 > URL: https://issues.apache.org/jira/browse/IGNITE-15099 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.11, 2.12 >Reporter: Ilya Shishkov >Assignee: Aleksey Plekhanov >Priority: Minor > Labels: ise > Attachments: Checkpointer_fake_block_master.patch > > Time Spent: 10m > Remaining Estimate: 0h > > This problem occurs under these conditions: > * native persistence is turned on > * failureDetectionTimeout < checkpointFrequency > * checkpoints are sometimes skipped by timeout (more often, more probable > the problem occurrence) > There is a race condition between a listener execution and finishing of a > pending future (see CheckpointContextImpl#executor body [1]). In some cases > future can finish before listener closure, therefore updating of a heartbeat > in listener can occur after call of _blockingSectionBegin_ in > Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for > next checkpoint (see [2]). > {code:java|title=CheckpointContextImpl#executor} > @Override public Executor executor() { > return asyncRunner == null ? null : cmd -> { > try { > GridFutureAdapter res = new GridFutureAdapter<>(); > res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // > Listener is invoked concurrently with pending future finish > asyncRunner.execute(U.wrapIgniteFuture(cmd, res)); > pendingTaskFuture.add(res); > } > catch (RejectedExecutionException e) { > assert false : "A task should never be rejected by async > runner"; > } > }; > } > {code} > {code:java|title=Checkpointer#waitCheckpointEvent} > try { > synchronized (this) { > long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > while (remaining > 0 && !isCancelled()) { > blockingSectionBegin(); > try { > wait(remaining); > // At this point and till blockingSectionEnd call heartbeat > should be equal to Long.MAX_VALUE > remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > } > finally { > blockingSectionEnd(); > } > } > } > } > {code} > > If interval between checkpoints (_checkpointFrequency_) is greater than the > _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ > may cause an error message in log, because a checkpoint thread is treated as > blocked (but in fact it was not). > *Reproducer of problem:* [3]. Even if test was not failed you can see log > message with incorrect heartbeat after waiting for checkpoint. > > Links: > # > [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104] > # > [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816] > # Reproducer: [^Checkpointer_fake_block_master.patch]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout
[ https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381307#comment-17381307 ] Aleksey Plekhanov commented on IGNITE-15099: [~shishkovilja], can you please have a look at the patch? > Wrong heartbeat update while waiting for a checkpoint by timeout > > > Key: IGNITE-15099 > URL: https://issues.apache.org/jira/browse/IGNITE-15099 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.11, 2.12 >Reporter: Ilya Shishkov >Assignee: Aleksey Plekhanov >Priority: Minor > Labels: ise > Attachments: Checkpointer_fake_block_master.patch > > Time Spent: 10m > Remaining Estimate: 0h > > This problem occurs under these conditions: > * native persistence is turned on > * failureDetectionTimeout < checkpointFrequency > * checkpoints are sometimes skipped by timeout (more often, more probable > the problem occurrence) > There is a race condition between a listener execution and finishing of a > pending future (see CheckpointContextImpl#executor body [1]). In some cases > future can finish before listener closure, therefore updating of a heartbeat > in listener can occur after call of _blockingSectionBegin_ in > Checkpointer#waitCheckpointEvent, i.e. after Checkpointer started to wait for > next checkpoint (see [2]). > {code:java|title=CheckpointContextImpl#executor} > @Override public Executor executor() { > return asyncRunner == null ? null : cmd -> { > try { > GridFutureAdapter res = new GridFutureAdapter<>(); > res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // > Listener is invoked concurrently with pending future finish > asyncRunner.execute(U.wrapIgniteFuture(cmd, res)); > pendingTaskFuture.add(res); > } > catch (RejectedExecutionException e) { > assert false : "A task should never be rejected by async > runner"; > } > }; > } > {code} > {code:java|title=Checkpointer#waitCheckpointEvent} > try { > synchronized (this) { > long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > while (remaining > 0 && !isCancelled()) { > blockingSectionBegin(); > try { > wait(remaining); > // At this point and till blockingSectionEnd call heartbeat > should be equal to Long.MAX_VALUE > remaining = U.nanosToMillis(scheduledCp.nextCpNanos - > System.nanoTime()); > } > finally { > blockingSectionEnd(); > } > } > } > } > {code} > > If interval between checkpoints (_checkpointFrequency_) is greater than the > _failureDetectionTimeout_, then update of heartbeat in _blockingSectionEnd_ > may cause an error message in log, because a checkpoint thread is treated as > blocked (but in fact it was not). > *Reproducer of problem:* [3]. Even if test was not failed you can see log > message with incorrect heartbeat after waiting for checkpoint. > > Links: > # > [CheckpointContextImpl#executor|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104] > # > [Checkpointer#waitCheckpointEvent|https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816] > # Reproducer: [^Checkpointer_fake_block_master.patch]. -- This message was sent by Atlassian Jira (v8.3.4#803005)