Hello, Ilya

> But anyway, I propose to remove the update of the heartbeat from other
threads altogether and wrap the call to listeners in a blockingSection.
I don't quite understand your proposal. Which call to listeners do you
mean? If we wrap the listener into the blocking section the result will be
the same.
Alternatively, I think we can wrap awaitPendingTasksFinished into the
blocking section, this will also solve the problem, but in this case we can
never detect lack of progress by async executor threads and checkpointer
thread hangs due to this reason.
What's wrong with the current fix? It solves the current problem and I hope
will protect us from problems like this in the future.

вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <kazakov.i...@gmail.com>:

> Hi Igniters, hi Alexey.
>
> I want to discuss this issue:
> https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it too.
>
> I was able to determine where there is a race.
>
> The update of the heartbeat happens asynchronously into the listener code.
> But we always wait in the checkpoint thread for all pending async
> tasks. And this is reasonable.
>
> for (CheckpointListener lsnr : dbLsnrs)
>   lsnr.beforeCheckpointBegin(ctx0);
>
> ctx0.awaitPendingTasksFinished();
>
> The race was because of inappropriate order of future registration. In
> CheckpointContextImpl.executor () (inside listeners execution)
>
> GridFutureAdapter<?> res = new GridFutureAdapter<>();
> res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> pendingTaskFuture.add(res);
>
> Here we create a task, submit a task to the executor, and only after this
> do we register the task. Thus we got a situation where checkpointer thread
> was moving on after ctx0.awaitPendingTasksFinished(); and still,
> the unregistered asyncRunner task was moving on in parallel.
>
> But anyway, I propose to remove the update of the heartbeat from other
> threads altogether and wrap the call to listeners in a blockingSection.
>
> As I understand heartbeat was designed just to indicate self-progress by a
> worker. If a worker can not indicate self-progress we should wrap such code
> into blockingSections. In case of listeners, worker can not indicate
> self-progress, thus let's wrap it into blockingSection.
>
> Guys, what do you think about this?
>
> -------------
> Ilya Kazakov
>

Reply via email to