Hi Igniters, hi Alexey.

I want to discuss this issue:
https://issues.apache.org/jira/browse/IGNITE-15099. I have run into it too.

I was able to determine where the race is.

The heartbeat update happens asynchronously in the listener code.
But in the checkpoint thread we always wait for all pending async
tasks, which is reasonable:

for (CheckpointListener lsnr : dbLsnrs)
  lsnr.beforeCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();

The race was caused by the wrong order of future registration. In
CheckpointContextImpl.executor() (called during listener execution):

GridFutureAdapter<?> res = new GridFutureAdapter<>();
res.listen(fut -> heartbeatUpdater.updateHeartbeat());
asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
pendingTaskFuture.add(res);

Here we create a task, submit it to the executor, and only then register
it. As a result, the checkpointer thread could move on past
ctx0.awaitPendingTasksFinished() while the not-yet-registered asyncRunner
task was still running in parallel.
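
To illustrate the ordering, here is a minimal sketch in plain
java.util.concurrent. It is not the Ignite code: PendingTasks and its
methods are hypothetical stand-ins for the checkpoint context, and
CompletableFuture replaces GridFutureAdapter. The only point is that the
future is registered before the task is handed to the executor, so a later
await cannot miss it.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;

// Hypothetical stand-in for the checkpoint context (not an Ignite class).
class PendingTasks {
    private final List<CompletableFuture<Void>> pending = new CopyOnWriteArrayList<>();
    private final Executor asyncRunner;

    PendingTasks(Executor asyncRunner) {
        this.asyncRunner = asyncRunner;
    }

    void execute(Runnable cmd) {
        CompletableFuture<Void> res = new CompletableFuture<>();

        // Register first: from this point on, any await will see the task.
        pending.add(res);

        // Only then submit the work to the executor.
        asyncRunner.execute(() -> {
            try {
                cmd.run();
                res.complete(null);
            }
            catch (Throwable t) {
                res.completeExceptionally(t);
            }
        });
    }

    // Blocks until every task registered so far has completed.
    void awaitPendingTasksFinished() {
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    int registered() {
        return pending.size();
    }
}
```

With this order, every task that execute() has accepted is already in the
pending list by the time the caller reaches awaitPendingTasksFinished().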

In any case, I propose removing the heartbeat update from other threads
altogether and wrapping the listener calls in a blockingSection.

As I understand it, the heartbeat was designed only to let a worker
indicate its own progress. If a worker cannot indicate its own progress,
we should wrap such code in a blockingSection. While listeners run, the
worker cannot indicate its own progress, so let's wrap that call in a
blockingSection.
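
A rough sketch of what I mean, again in plain Java rather than the real
worker class. SketchWorker, looksHung() and runListeners() are
hypothetical; the blockingSectionBegin()/blockingSectionEnd() pair is only
modeled loosely on the worker API, under the assumption that a blocking
section suspends the liveness check and that leaving it refreshes the
heartbeat.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a worker with a heartbeat (not an Ignite class).
class SketchWorker {
    private final AtomicLong heartbeatTs = new AtomicLong(System.currentTimeMillis());

    void updateHeartbeat() {
        heartbeatTs.set(System.currentTimeMillis());
    }

    // Assumption: a sentinel timestamp marks "inside a blocking section".
    void blockingSectionBegin() {
        heartbeatTs.set(Long.MAX_VALUE);
    }

    void blockingSectionEnd() {
        updateHeartbeat();
    }

    // Liveness check: never reports a hang while in a blocking section.
    boolean looksHung(long now, long timeoutMs) {
        long ts = heartbeatTs.get();
        return ts != Long.MAX_VALUE && now - ts > timeoutMs;
    }

    // Listeners run inside a blocking section; no other thread touches the heartbeat.
    void runListeners(List<Runnable> listeners) {
        blockingSectionBegin();
        try {
            for (Runnable lsnr : listeners)
                lsnr.run();
        }
        finally {
            blockingSectionEnd();
        }
    }
}
```

This way only the worker thread itself ever writes the heartbeat, and the
ordering of async tasks no longer matters for liveness tracking.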

Guys, what do you think about this?

-------------
Ilya Kazakov
