Re: [Discussion] Race on heartbeat update in checkpoint thread

Andrey Gura Wed, 18 Aug 2021 08:03:51 -0700

Alexey, Ilya,

I took a look at the problem and corresponding fixes. It seems that
you are both right. But Alexey's fix looks like some kind of hack to
me.


We have two problems:

1. Heartbeat update from thread that will complete a future.

I agree with Ilya and Andrey. Only a critical worker itself can update
heartbeat. So, removal of `res.listen(fut ->
heartbeatUpdater.updateHeartbeat());` from
CheckpointContextImpl#executor() method is a good idea. It will avoid
a race and makes Alexey's fix redundant.

2. blockingSection in checkpointer thread really can never detect the
lack of pending tasks progress.

Alexey is right. It is a problem. But what I actually see:
checkpointWritePageThreads and checkpointCollectInfoThreads which
execute pending tasks, are actually critical themselves and should be
also monitored by WorkersRegistry. In such a case the checkpointer
thread can safely use blocking sections at places designated by Ilya
and forementioned threads should also use blocking sections on IO
operations.

So proposal:

- Remove Alexey's fix
- Remove asynchronous heartbeat update
- Add blocking sections (fix of Ilya)
- Monitor checkpointWritePageThreads and checkpointCollectInfoThreads
if they exist (it depends on configuration)

WDYT?

On Thu, Jul 29, 2021 at 5:48 AM Ilya Kazakov <[email protected]> wrote:
>
> Guys. I have discussed just these changes. Look at my PR, please. What do
> you think?
>
> https://issues.apache.org/jira/browse/IGNITE-15192
> https://github.com/apache/ignite/pull/9280/files
>
> чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov <[email protected]>:
>
> > Hi guys,
> >
> > I think updateHeartBeat() method was misused in the future listener and
> > this must be fixed.
> >
> > Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the
> > worker itself.
> > It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc,
> > which said "Notifying dispatcher that work is in progress and thread didn't
> > freeze."
> > GridWorker.heartbeatTs field is marked volatile just to provide consistent
> > visibility guarantee to a watcher.
> >
> > So, CheckpointPagesWriter violates this contract, when runs in another
> > thread(s).
> > But it works fine just because the main (Checkpointer) thread waits for
> > all the Writers to finish their work before it can fall into the blocking
> > section, therefore there is no race possible.
> >
> > As for the broken case, there are 2 possible solutions,
> > 1. Main thread must wait for all children's tasks to finish before going
> > into the blocking section.
> > 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End.
> > Seems, there is no need to mark the method as synchronized, but use call
> > AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v :
> > U.currentTimeMillis()).
> >
> > I'd suggest
> > * Remove the "thread" mentioning from WorkProgressDispatcher interface to
> > avoid confusion with the current usage and make WorkProgressDispatcher more
> > general.
> > * Avoid unsafe heartbeatUpdater leak to the outside via wrapping
> > CheckpointContextImpl.heartbeatUpdater instance to perform consistent
> > heartbeat update from p.2 above.
> >
> > Does it make sense?
> >
> > On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov <[email protected]>
> > wrote:
> >
> >> Ilya,
> >>
> >> Race only affects the future listener (which only updates heartbeat), but
> >> not checkpoint listeners, so no other bugs possible caused by this race
> >> except wrong heartbeat update.
> >>
> >> As I've written before, I think it's not a good idea to wait for other
> >> threads by the critical thread in the blocking section. If these threads
> >> are not monitored we can never detect lack of progress and never trigger
> >> the failure handler. Perhaps a good solution will be to register async
> >> threads as critical threads (additionally to waiting for these threads in
> >> the blocing section) and update their own heartbeat. But the current fix
> >> is
> >> required too, to avoid such problems for other critical threads in the
> >> future.
> >>
> >> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov <[email protected]>:
> >>
> >> > 1.
> >> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
> >> >
> >> > This
> >> > ------------------------------------------
> >> > for (CheckpointListener lsnr : dbLsnrs)
> >> >     lsnr.beforeCheckpointBegin(ctx0);
> >> >
> >> > ctx0.awaitPendingTasksFinished();
> >> > ------------------------------------------
> >> >
> >> > And this:
> >> >
> >> > ------------------------------------------
> >> > // Listeners must be invoked before we write checkpoint record to WAL.
> >> > for (CheckpointListener lsnr : dbLsnrs)
> >> >     lsnr.onMarkCheckpointBegin(ctx0);
> >> >
> >> > ctx0.awaitPendingTasksFinished();
> >> > ------------------------------------------
> >> >
> >> > Inside lsnr.beforeCheckpointBegin(ctx0) and
> >> > lsnr.onMarkCheckpointBegin(ctx0) we call
> >> CheckpointContextImpl.executor()
> >> > which submit heartbeat update tasks in threadpool. But due to a bug in
> >> > registration, ctx0.awaitPendingTasksFinished() do not work correctly.
> >> > Checpoint thread does not wait for all tasks to complete and moves on.
> >> >
> >> > This could lead to other bugs because as written in the comment "//
> >> > Listeners must be invoked before we write checkpoint record to WAL."
> >> >
> >> > 2.
> >> > About the fix.
> >> > Yes, the fix resolves the issue, but does not resolve the root cause - a
> >> > race between checkpoint thread and threads run in asyncRunner. Also, as
> >> I
> >> > understand, there should be no attempts to update heartbeat inside
> >> > blockingSection, but the fix does not exclude such attempts but blocks
> >> them.
> >> >
> >> > 3.
> >> > But my main point is that it looks strange to update the heartbeat of
> >> > thread A from thread B. It's like doing artificial respiration and chest
> >> > compressions. Thread A is waiting on async tasks completion, but these
> >> > tasks are updating progress of thread A. I suppose that blockingSection
> >> was
> >> > designed for such situations when the thread is waiting for something
> >> and
> >> > does not perform any progress.
> >> >
> >> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky <[email protected]>:
> >> >
> >> >> +1 For current fix. Code is clean and understandable. I suppose that
> >> the
> >> >> current fix is a correct variant to update heartbeatTs.
> >> >>
> >> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov <[email protected]>:
> >> >>
> >> >> > Hello, Ilya
> >> >> >
> >> >> > > But anyway, I propose to remove the update of the heartbeat from
> >> other
> >> >> > threads altogether and wrap the call to listeners in a
> >> blockingSection.
> >> >> > I don't quite understand your proposal. Which call to listeners do
> >> you
> >> >> > mean? If we wrap the listener into the blocking section the result
> >> will
> >> >> be
> >> >> > the same.
> >> >> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
> >> >> > blocking section, this will also solve the problem, but in this case
> >> we
> >> >> can
> >> >> > never detect lack of progress by async executor threads and
> >> checkpointer
> >> >> > thread hangs due to this reason.
> >> >> > What's wrong with the current fix? It solves the current problem and
> >> I
> >> >> hope
> >> >> > will protect us from problems like this in the future.
> >> >> >
> >> >> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <[email protected]>:
> >> >> >
> >> >> > > Hi Igniters, hi Alexey.
> >> >> > >
> >> >> > > I want to discuss this issue:
> >> >> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught
> >> it
> >> >> > too.
> >> >> > >
> >> >> > > I was able to determine where there is a race.
> >> >> > >
> >> >> > > The update of the heartbeat happens asynchronously into the
> >> listener
> >> >> > code.
> >> >> > > But we always wait in the checkpoint thread for all pending async
> >> >> > > tasks. And this is reasonable.
> >> >> > >
> >> >> > > for (CheckpointListener lsnr : dbLsnrs)
> >> >> > >   lsnr.beforeCheckpointBegin(ctx0);
> >> >> > >
> >> >> > > ctx0.awaitPendingTasksFinished();
> >> >> > >
> >> >> > > The race was because of inappropriate order of future
> >> registration. In
> >> >> > > CheckpointContextImpl.executor () (inside listeners execution)
> >> >> > >
> >> >> > > GridFutureAdapter<?> res = new GridFutureAdapter<>();
> >> >> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
> >> >> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
> >> >> > > pendingTaskFuture.add(res);
> >> >> > >
> >> >> > > Here we create a task, submit a task to the executor, and only
> >> after
> >> >> this
> >> >> > > do we register the task. Thus we got a situation where checkpointer
> >> >> > thread
> >> >> > > was moving on after ctx0.awaitPendingTasksFinished(); and still,
> >> >> > > the unregistered asyncRunner task was moving on in parallel.
> >> >> > >
> >> >> > > But anyway, I propose to remove the update of the heartbeat from
> >> other
> >> >> > > threads altogether and wrap the call to listeners in a
> >> >> blockingSection.
> >> >> > >
> >> >> > > As I understand heartbeat was designed just to indicate
> >> self-progress
> >> >> by
> >> >> > a
> >> >> > > worker. If a worker can not indicate self-progress we should wrap
> >> such
> >> >> > code
> >> >> > > into blockingSections. In case of listeners, worker can not
> >> indicate
> >> >> > > self-progress, thus let's wrap it into blockingSection.
> >> >> > >
> >> >> > > Guys, what do you think about this?
> >> >> > >
> >> >> > > -------------
> >> >> > > Ilya Kazakov
> >> >> > >
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Sincerely yours, Ivan Daschinskiy
> >> >>
> >> >
> >>
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >

Re: [Discussion] Race on heartbeat update in checkpoint thread

Reply via email to