Have you tried to get a thread dump to understand which code is blocking the sys-stripe thread?
On Thu, Oct 31, 2024 at 5:56 AM Raymond Liu <[email protected]> wrote: > Hm, the scenario you described doesn't sound accurate here. I guess it's > my fault for not including logs. I will paste them below. The logs show all > four *inner* entry processors complete without being blocked - we see > four "Entered InnerEntryProcessor" and four "Exiting InnerEntryProcessor" > lines. And yet, one of the OuterEntryProcessors does not seem to get > notified that the inner entry processors have all completed. This is > indicated by the presence of "OuterEntryProcessor with key outer1 waiting > for inner future 1" without a corresponding log stating > "OuterEntryProcessor with key outer1 done waiting for all inner futures". > ------- > 2024-10-30T23:27:55.056-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Starting test > 2024-10-30T23:27:55.057-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Invoking OuterEntryProcessor with key outer0 > 2024-10-30T23:27:55.077-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [mmunicationSpi%] > o.a.i.s.c.t.TcpCommunicationSpi : Accepted incoming communication > connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560] > 2024-10-30T23:27:55.077-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [mmunicationSpi%] > o.a.i.s.c.t.TcpCommunicationSpi : Established outgoing > communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/ > 127.0.0.1:47101] > 2024-10-30T23:27:55.081-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Invoked OuterEntryProcessor with key outer0 > 2024-10-30T23:27:55.081-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Invoking OuterEntryProcessor with key outer1 > 2024-10-30T23:27:55.082-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Invoked OuterEntryProcessor with key outer1 > 2024-10-30T23:27:55.082-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Done invoking OuterEntryProcessors. Waiting for them to > complete... > 2024-10-30T23:27:55.082-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Test Runner waiting for outer future 0 > 2024-10-30T23:27:55.093-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : Entered OuterEntryProcessor with key outer1. Sleeping > before further actions... > 2024-10-30T23:27:55.093-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Entered OuterEntryProcessor with key outer0. Sleeping > before further actions... > 2024-10-30T23:27:55.195-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : Invoking with inner key: outer1-inner0 > 2024-10-30T23:27:55.195-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Invoking with inner key: outer0-inner0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : Invoked with inner key: outer1-inner0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Invoked with inner key: outer0-inner0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : Invoking with inner key: outer1-inner1 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Invoking with inner key: outer0-inner1 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Invoked with inner key: outer0-inner1 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : Invoked with inner key: outer1-inner1 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : OuterEntryProcessor with key outer1 waiting for inner > future 0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : OuterEntryProcessor with key outer0 waiting for inner > future 0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest > : Entered InnerEntryProcessor with inner key: outer0-inner0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest > : Exiting InnerEntryProcessor with inner key: outer0-inner0 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest > : Entered InnerEntryProcessor with inner key: outer0-inner1 > 2024-10-30T23:27:55.209-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest > : Exiting InnerEntryProcessor with inner key: outer0-inner1 > 2024-10-30T23:27:55.210-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest > : Entered InnerEntryProcessor with inner key: outer1-inner1 > 2024-10-30T23:27:55.210-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest > : Exiting InnerEntryProcessor with inner key: outer1-inner1 > 2024-10-30T23:27:55.215-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : OuterEntryProcessor with key outer0 waiting for inner > future 1 > 2024-10-30T23:27:55.216-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : OuterEntryProcessor with key outer0 done waiting for all > inner futures > 2024-10-30T23:27:55.216-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Entered InnerEntryProcessor with inner key: outer1-inner0 > 2024-10-30T23:27:55.216-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest > : Exiting InnerEntryProcessor with inner key: outer1-inner0 > 2024-10-30T23:27:55.216-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [ Test worker] c.p.i.BlockedThreadTest > : Test Runner waiting for outer future 1 > 2024-10-30T23:27:55.217-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest > : OuterEntryProcessor with key outer1 waiting for inner > future 1 > 2024-10-30T23:27:59.621-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [-notifier-timer] > o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. > 2024-10-30T23:28:01.283-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [-notifier-timer] > o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. > 2024-10-30T23:28:02.877-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [-notifier-timer] > o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. > 2024-10-30T23:28:04.399-04:00 INFO 17056 --- > [ignite-blocked-thread-sample] [-notifier-timer] > o.a.i.i.p.c.GridUpdateNotifier : Your version is up to date. > 2024-10-30T23:28:08.508-04:00 ERROR 17056 --- > [ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G > : Blocked system-critical thread has been detected. This can > lead to cluster-wide undefined behaviour [workerName=sys-stripe-6, > threadName=sys-stripe-6-#7%server1%, blockedFor=13s] > ------- > > Thanks for the advice - we are probably using entry processors in a way > originally intended for compute jobs with transactions, and we will > probably move towards that direction eventually. But to learn from this and > prevent future issues, I'd like to understand exactly what the problem is, > and with these logs, I can't understand how it's a deadlock. > > - Raymond > > > On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <[email protected]> > wrote: > >> Looks like a deadlock to me. >> >> Processor 1 invokes processor 2 and blocks on a future, stripe thread T >> is blocked >> Processor 2 invokes processor 3, which needs stripe thread T and waits >> forever >> >> General advice - do not block striped pool threads. Avoid IO, waiting for >> futures, etc. An entry processor is supposed to perform a local computation >> and return quickly. >> >> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <[email protected]> >> wrote: >> >>> Hi team, >>> >>> We see a failure with nested asynchronous entry processor calls where the >>> nested calls appear to complete, but >>> the parent entry processor is not aware the child entry processor is done. >>> >>> In this sample test setup... >>> 1. a test runner invokes an "outer" entry processor twice asynchronously. >>> 2. The two outer entry processor invocations each invoke two inner entry >>> processors asynchronously. >>> 3. All four inner entry processor invocations appear to complete. >>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for >>> the last inner entry processor to complete. >>> The program hangs, prints thread dumps, and prints the following error: >>> "Blocked system-critical thread has been detected. This can lead to >>> cluster-wide undefined behaviour [workerName=sys-stripe-6, >>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]" >>> >>> I've uploaded this in a Github repo called "ignite-blocked-thread-sample": >>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java >>> >>> While this sample uses asynchronous invocations, a version of this test >>> with synchronous invocations runs into the same >>> issue. >>> >>> This appears to happen only when there is more than one server node and the >>> nested invocation goes to a different >>> server node than the outer invocation. A debug breakpoint shows that, on >>> the server executing the inner entry processor, >>> the sys-stripe thread pool is idle; but on the server executing the outer >>> entry processor, there is one sys-stripe >>> thread pool waiting on IgniteFuture.get(). >>> >>> We originally suspected a deadlock, but this error also occurs when we try >>> setting thread pool sizes to a large number. >>> This almost appears to be a failure to communicate that a future should be >>> complete. >>> >>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s >>> executing on the public thread pool, do not run >>> into this problem. So, we worked around this issue by wrapping our entry >>> processors inside `IgniteCallable`s. That alone >>> was already enough to resolve the issue; of course, at higher scale, we'd >>> run into deadlocks, so we also followed the >>> advice in the "thread pools tuning" documentation page about using a custom >>> thread pool. >>> >>> Regardless, we would like to make sure we understand the issue. Can anyone >>> think of an explanation? Is this expected >>> behavior or possibly a bug? >>> >>> Thank you, >>> Raymond >>> >>>
