1.
I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():

This
------------------------------------------
for (CheckpointListener lsnr : dbLsnrs)
    lsnr.beforeCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();
------------------------------------------

And this:

------------------------------------------
// Listeners must be invoked before we write checkpoint record to WAL.
for (CheckpointListener lsnr : dbLsnrs)
    lsnr.onMarkCheckpointBegin(ctx0);

ctx0.awaitPendingTasksFinished();
------------------------------------------

Inside lsnr.beforeCheckpointBegin(ctx0) and lsnr.onMarkCheckpointBegin(ctx0)
we call CheckpointContextImpl.executor() which submit heartbeat update
tasks in threadpool. But due to a bug in registration,
ctx0.awaitPendingTasksFinished() do not work correctly. Checpoint thread
does not wait for all tasks to complete and moves on.

This could lead to other bugs because as written in the comment "//
Listeners must be invoked before we write checkpoint record to WAL."

2.
About the fix.
Yes, the fix resolves the issue, but does not resolve the root cause - a
race between checkpoint thread and threads run in asyncRunner. Also, as I
understand, there should be no attempts to update heartbeat inside
blockingSection, but the fix does not exclude such attempts but blocks them.

3.
But my main point is that it looks strange to update the heartbeat of
thread A from thread B. It's like doing artificial respiration and chest
compressions. Thread A is waiting on async tasks completion, but these
tasks are updating progress of thread A. I suppose that blockingSection was
designed for such situations when the thread is waiting for something and
does not perform any progress.

вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky <ivanda...@gmail.com>:

> +1 For current fix. Code is clean and understandable. I suppose that the
> current fix is a correct variant to update heartbeatTs.
>
> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov <plehanov.a...@gmail.com>:
>
> > Hello, Ilya
> >
> > > But anyway, I propose to remove the update of the heartbeat from other
> > threads altogether and wrap the call to listeners in a blockingSection.
> > I don't quite understand your proposal. Which call to listeners do you
> > mean? If we wrap the listener into the blocking section the result will
> be
> > the same.
> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
> > blocking section, this will also solve the problem, but in this case we
> can
> > never detect lack of progress by async executor threads and checkpointer
> > thread hangs due to this reason.
> > What's wrong with the current fix? It solves the current problem and I
> hope
> > will protect us from problems like this in the future.
> >
> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <kazakov.i...@gmail.com>:
> >
> > > Hi Igniters, hi Alexey.
> > >
> > > I want to discuss this issue:
> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught it
> > too.
> > >
> > > I was able to determine where there is a race.
> > >
> > > The update of the heartbeat happens asynchronously into the listener
> > code.
> > > But we always wait in the checkpoint thread for all pending async
> > > tasks. And this is reasonable.
> > >
> > > for (CheckpointListener lsnr : dbLsnrs)
> > >   lsnr.beforeCheckpointBegin(ctx0);
> > >
> > > ctx0.awaitPendingTasksFinished();
> > >
> > > The race was because of inappropriate order of future registration. In
> > > CheckpointContextImpl.executor () (inside listeners execution)
> > >
> > > GridFutureAdapter<?> res = new GridFutureAdapter<>();
> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> > > pendingTaskFuture.add(res);
> > >
> > > Here we create a task, submit a task to the executor, and only after
> this
> > > do we register the task. Thus we got a situation where checkpointer
> > thread
> > > was moving on after ctx0.awaitPendingTasksFinished(); and still,
> > > the unregistered asyncRunner task was moving on in parallel.
> > >
> > > But anyway, I propose to remove the update of the heartbeat from other
> > > threads altogether and wrap the call to listeners in a blockingSection.
> > >
> > > As I understand heartbeat was designed just to indicate self-progress
> by
> > a
> > > worker. If a worker can not indicate self-progress we should wrap such
> > code
> > > into blockingSections. In case of listeners, worker can not indicate
> > > self-progress, thus let's wrap it into blockingSection.
> > >
> > > Guys, what do you think about this?
> > >
> > > -------------
> > > Ilya Kazakov
> > >
> >
>
>
> --
> Sincerely yours, Ivan Daschinskiy
>

Reply via email to