Alexey, Ilya, I took a look at the problem and corresponding fixes. It seems that you are both right. But Alexey's fix looks like some kind of hack to me.
We have two problems: 1. Heartbeat update from thread that will complete a future. I agree with Ilya and Andrey. Only a critical worker itself can update heartbeat. So, removal of `res.listen(fut -> heartbeatUpdater.updateHeartbeat());` from CheckpointContextImpl#executor() method is a good idea. It will avoid a race and makes Alexey's fix redundant. 2. blockingSection in checkpointer thread really can never detect the lack of pending tasks progress. Alexey is right. It is a problem. But what I actually see: checkpointWritePageThreads and checkpointCollectInfoThreads which execute pending tasks, are actually critical themselves and should be also monitored by WorkersRegistry. In such a case the checkpointer thread can safely use blocking sections at places designated by Ilya and forementioned threads should also use blocking sections on IO operations. So proposal: - Remove Alexey's fix - Remove asynchronous heartbeat update - Add blocking sections (fix of Ilya) - Monitor checkpointWritePageThreads and checkpointCollectInfoThreads if they exist (it depends on configuration) WDYT? On Thu, Jul 29, 2021 at 5:48 AM Ilya Kazakov <kazakov.i...@gmail.com> wrote: > > Guys. I have discussed just these changes. Look at my PR, please. What do > you think? > > https://issues.apache.org/jira/browse/IGNITE-15192 > https://github.com/apache/ignite/pull/9280/files > > чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov <andrey.mashen...@gmail.com>: > > > Hi guys, > > > > I think updateHeartBeat() method was misused in the future listener and > > this must be fixed. > > > > Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the > > worker itself. > > It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc, > > which said "Notifying dispatcher that work is in progress and thread didn't > > freeze." > > GridWorker.heartbeatTs field is marked volatile just to provide consistent > > visibility guarantee to a watcher. > > > > So, CheckpointPagesWriter violates this contract, when runs in another > > thread(s). > > But it works fine just because the main (Checkpointer) thread waits for > > all the Writers to finish their work before it can fall into the blocking > > section, therefore there is no race possible. > > > > As for the broken case, there are 2 possible solutions, > > 1. Main thread must wait for all children's tasks to finish before going > > into the blocking section. > > 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End. > > Seems, there is no need to mark the method as synchronized, but use call > > AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v : > > U.currentTimeMillis()). > > > > I'd suggest > > * Remove the "thread" mentioning from WorkProgressDispatcher interface to > > avoid confusion with the current usage and make WorkProgressDispatcher more > > general. > > * Avoid unsafe heartbeatUpdater leak to the outside via wrapping > > CheckpointContextImpl.heartbeatUpdater instance to perform consistent > > heartbeat update from p.2 above. > > > > Does it make sense? > > > > On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov <plehanov.a...@gmail.com> > > wrote: > > > >> Ilya, > >> > >> Race only affects the future listener (which only updates heartbeat), but > >> not checkpoint listeners, so no other bugs possible caused by this race > >> except wrong heartbeat update. > >> > >> As I've written before, I think it's not a good idea to wait for other > >> threads by the critical thread in the blocking section. If these threads > >> are not monitored we can never detect lack of progress and never trigger > >> the failure handler. Perhaps a good solution will be to register async > >> threads as critical threads (additionally to waiting for these threads in > >> the blocing section) and update their own heartbeat. But the current fix > >> is > >> required too, to avoid such problems for other critical threads in the > >> future. > >> > >> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov <kazakov.i...@gmail.com>: > >> > >> > 1. > >> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin(): > >> > > >> > This > >> > ------------------------------------------ > >> > for (CheckpointListener lsnr : dbLsnrs) > >> > lsnr.beforeCheckpointBegin(ctx0); > >> > > >> > ctx0.awaitPendingTasksFinished(); > >> > ------------------------------------------ > >> > > >> > And this: > >> > > >> > ------------------------------------------ > >> > // Listeners must be invoked before we write checkpoint record to WAL. > >> > for (CheckpointListener lsnr : dbLsnrs) > >> > lsnr.onMarkCheckpointBegin(ctx0); > >> > > >> > ctx0.awaitPendingTasksFinished(); > >> > ------------------------------------------ > >> > > >> > Inside lsnr.beforeCheckpointBegin(ctx0) and > >> > lsnr.onMarkCheckpointBegin(ctx0) we call > >> CheckpointContextImpl.executor() > >> > which submit heartbeat update tasks in threadpool. But due to a bug in > >> > registration, ctx0.awaitPendingTasksFinished() do not work correctly. > >> > Checpoint thread does not wait for all tasks to complete and moves on. > >> > > >> > This could lead to other bugs because as written in the comment "// > >> > Listeners must be invoked before we write checkpoint record to WAL." > >> > > >> > 2. > >> > About the fix. > >> > Yes, the fix resolves the issue, but does not resolve the root cause - a > >> > race between checkpoint thread and threads run in asyncRunner. Also, as > >> I > >> > understand, there should be no attempts to update heartbeat inside > >> > blockingSection, but the fix does not exclude such attempts but blocks > >> them. > >> > > >> > 3. > >> > But my main point is that it looks strange to update the heartbeat of > >> > thread A from thread B. It's like doing artificial respiration and chest > >> > compressions. Thread A is waiting on async tasks completion, but these > >> > tasks are updating progress of thread A. I suppose that blockingSection > >> was > >> > designed for such situations when the thread is waiting for something > >> and > >> > does not perform any progress. > >> > > >> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky <ivanda...@gmail.com>: > >> > > >> >> +1 For current fix. Code is clean and understandable. I suppose that > >> the > >> >> current fix is a correct variant to update heartbeatTs. > >> >> > >> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov <plehanov.a...@gmail.com>: > >> >> > >> >> > Hello, Ilya > >> >> > > >> >> > > But anyway, I propose to remove the update of the heartbeat from > >> other > >> >> > threads altogether and wrap the call to listeners in a > >> blockingSection. > >> >> > I don't quite understand your proposal. Which call to listeners do > >> you > >> >> > mean? If we wrap the listener into the blocking section the result > >> will > >> >> be > >> >> > the same. > >> >> > Alternatively, I think we can wrap awaitPendingTasksFinished into the > >> >> > blocking section, this will also solve the problem, but in this case > >> we > >> >> can > >> >> > never detect lack of progress by async executor threads and > >> checkpointer > >> >> > thread hangs due to this reason. > >> >> > What's wrong with the current fix? It solves the current problem and > >> I > >> >> hope > >> >> > will protect us from problems like this in the future. > >> >> > > >> >> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <kazakov.i...@gmail.com>: > >> >> > > >> >> > > Hi Igniters, hi Alexey. > >> >> > > > >> >> > > I want to discuss this issue: > >> >> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught > >> it > >> >> > too. > >> >> > > > >> >> > > I was able to determine where there is a race. > >> >> > > > >> >> > > The update of the heartbeat happens asynchronously into the > >> listener > >> >> > code. > >> >> > > But we always wait in the checkpoint thread for all pending async > >> >> > > tasks. And this is reasonable. > >> >> > > > >> >> > > for (CheckpointListener lsnr : dbLsnrs) > >> >> > > lsnr.beforeCheckpointBegin(ctx0); > >> >> > > > >> >> > > ctx0.awaitPendingTasksFinished(); > >> >> > > > >> >> > > The race was because of inappropriate order of future > >> registration. In > >> >> > > CheckpointContextImpl.executor () (inside listeners execution) > >> >> > > > >> >> > > GridFutureAdapter<?> res = new GridFutureAdapter<>(); > >> >> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat()); > >> >> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res)); > >> >> > > pendingTaskFuture.add(res); > >> >> > > > >> >> > > Here we create a task, submit a task to the executor, and only > >> after > >> >> this > >> >> > > do we register the task. Thus we got a situation where checkpointer > >> >> > thread > >> >> > > was moving on after ctx0.awaitPendingTasksFinished(); and still, > >> >> > > the unregistered asyncRunner task was moving on in parallel. > >> >> > > > >> >> > > But anyway, I propose to remove the update of the heartbeat from > >> other > >> >> > > threads altogether and wrap the call to listeners in a > >> >> blockingSection. > >> >> > > > >> >> > > As I understand heartbeat was designed just to indicate > >> self-progress > >> >> by > >> >> > a > >> >> > > worker. If a worker can not indicate self-progress we should wrap > >> such > >> >> > code > >> >> > > into blockingSections. In case of listeners, worker can not > >> indicate > >> >> > > self-progress, thus let's wrap it into blockingSection. > >> >> > > > >> >> > > Guys, what do you think about this? > >> >> > > > >> >> > > ------------- > >> >> > > Ilya Kazakov > >> >> > > > >> >> > > >> >> > >> >> > >> >> -- > >> >> Sincerely yours, Ivan Daschinskiy > >> >> > >> > > >> > > > > > > -- > > Best regards, > > Andrey V. Mashenkov > >