Alright, it took me longer than expected to get back and look into this. Sorry for the delay. Overall, folks, things look scary, seriously. I see three primary issues, ranked by priority.
First, until the failure handler gets smart enough to deal with SYSTEM_WORKER_BLOCKED/SYSTEM_CRITICAL_OPERATION_TIMEOUT events, we have to avoid false positives and print a warning message instead of stopping the node. *Andrey*, that's the new behavior of the 2.7.5 release according to JIRA, right?

Second, the format of the warning/exception message gives no hints for troubleshooting, nor any clue why the failure happened. I have no idea what to suggest to those who see exceptions of this kind [1]; they have to call for help from Andrey and other committers. For instance, taking [1] as a reference:

Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [*tryStop*=false, *timeout*=0, super=AbstractFailureHandler [*ignoredFailureTypes*=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=grid-timeout-worker, igniteInstanceName=TravelInventoryTesting, finished=false, *heartbeatTs*=1553481506244]]]

class org.apache.ignite.IgniteException: GridWorker [name=grid-timeout-worker, igniteInstanceName=TravelInventoryTesting, finished=false, *heartbeatTs*=1553481506244]

A lot of the details might be hidden but, unfortunately, the interpretation of parameters like heartbeatTs, tryStop, finished, timeout, etc. is hard. It reads like a message that has to be fed into a complementary tool which will give me an answer. The format of the message has to help the user (a developer/devops/administrator/architect who has zero affiliation with the Ignite community) with troubleshooting, without calling for help on the user list:
- What happened - out of memory/critical error/hanging threads. We're already pretty good at that.
- Why it happened - supply context in human language. For instance, "the discovery thread was not responding within N seconds because of starvation or a long GC pause."
- Troubleshooting guidance - help the user work around the issue.
For instance, "Check your GC logs, and ensure that compute tasks are not oversaturating the CPUs and causing livelocks. Tune parameters Y and Z."

Do you see anything else? Let's design and enhance this.

Third, full cluster shutdown. Agreed, that's harder. Do we have stats on when it usually happens?

[1] http://apache-ignite-users.70518.x6.nabble.com/Replace-or-Put-after-PutAsync-causes-Ignite-to-hang-td27871.html#a27873

- Denis
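P.S. For anyone following along, the interim workaround from the first point can be sketched roughly like this. It is a minimal sketch assuming the 2.7.x failure-handling API (StopNodeOrHaltFailureHandler, AbstractFailureHandler.setIgnoredFailureTypes, FailureType.SYSTEM_WORKER_BLOCKED); the instance name and constructor values are illustrative, not recommendations:

```java
import java.util.Collections;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class IgnoreBlockedWorkerExample {
    public static void main(String[] args) {
        // tryStop=false, timeout=0: halt the JVM immediately on a
        // non-ignored critical failure, mirroring the handler in [1].
        StopNodeOrHaltFailureHandler hnd =
            new StopNodeOrHaltFailureHandler(false, 0);

        // Ignored failure types are still reported as warnings in the
        // log, but no longer stop/halt the node, avoiding the
        // false-positive shutdowns discussed above.
        hnd.setIgnoredFailureTypes(
            Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureHandler(hnd);

        Ignition.start(cfg);
    }
}
```

This is a configuration sketch only; it needs the ignite-core dependency on the classpath and, of course, doesn't fix the underlying hang, it just keeps a blocked worker from taking the node down while we design the real solution.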