Have you tried to get a thread dump to understand which code is blocking
the sys-stripe thread?

On Thu, Oct 31, 2024 at 5:56 AM Raymond Liu <[email protected]> wrote:

> Hm, the scenario you described doesn't sound accurate here. I guess it's
> my fault for not including logs. I will paste them below. The logs show all
> four *inner* entry processors complete without being blocked - we see
> four "Entered InnerEntryProcessor" and four "Exiting InnerEntryProcessor"
> lines. And yet, one of the OuterEntryProcessors does not seem to get
> notified that the inner entry processors have all completed. This is
> indicated by the presence of "OuterEntryProcessor with key outer1 waiting
> for inner future 1" without a corresponding log stating
> "OuterEntryProcessor with key outer1 done waiting for all inner futures".
> -------
> 2024-10-30T23:27:55.056-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Starting test
> 2024-10-30T23:27:55.057-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Invoking OuterEntryProcessor with key outer0
> 2024-10-30T23:27:55.077-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [mmunicationSpi%]
> o.a.i.s.c.t.TcpCommunicationSpi          : Accepted incoming communication
> connection [locAddr=/127.0.0.1:47101, rmtAddr=/127.0.0.1:61560]
> 2024-10-30T23:27:55.077-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [mmunicationSpi%]
> o.a.i.s.c.t.TcpCommunicationSpi          : Established outgoing
> communication connection [locAddr=/127.0.0.1:61560, rmtAddr=/
> 127.0.0.1:47101]
> 2024-10-30T23:27:55.081-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Invoked OuterEntryProcessor with key outer0
> 2024-10-30T23:27:55.081-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Invoking OuterEntryProcessor with key outer1
> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Invoked OuterEntryProcessor with key outer1
> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Done invoking OuterEntryProcessors. Waiting for them to
> complete...
> 2024-10-30T23:27:55.082-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Test Runner waiting for outer future 0
> 2024-10-30T23:27:55.093-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : Entered OuterEntryProcessor with key outer1. Sleeping
> before further actions...
> 2024-10-30T23:27:55.093-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Entered OuterEntryProcessor with key outer0. Sleeping
> before further actions...
> 2024-10-30T23:27:55.195-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : Invoking with inner key: outer1-inner0
> 2024-10-30T23:27:55.195-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Invoking with inner key: outer0-inner0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : Invoked with inner key: outer1-inner0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Invoked with inner key: outer0-inner0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : Invoking with inner key: outer1-inner1
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Invoking with inner key: outer0-inner1
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Invoked with inner key: outer0-inner1
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : Invoked with inner key: outer1-inner1
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : OuterEntryProcessor with key outer1 waiting for inner
> future 0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : OuterEntryProcessor with key outer0 waiting for inner
> future 0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
>                : Entered InnerEntryProcessor with inner key: outer0-inner0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-1-#2%server1%] c.p.i.BlockedThreadTest
>                : Exiting InnerEntryProcessor with inner key: outer0-inner0
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
>                : Entered InnerEntryProcessor with inner key: outer0-inner1
> 2024-10-30T23:27:55.209-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-0-#1%server1%] c.p.i.BlockedThreadTest
>                : Exiting InnerEntryProcessor with inner key: outer0-inner1
> 2024-10-30T23:27:55.210-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
>                : Entered InnerEntryProcessor with inner key: outer1-inner1
> 2024-10-30T23:27:55.210-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [6-#298%server3%] c.p.i.BlockedThreadTest
>                : Exiting InnerEntryProcessor with inner key: outer1-inner1
> 2024-10-30T23:27:55.215-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : OuterEntryProcessor with key outer0 waiting for inner
> future 1
> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : OuterEntryProcessor with key outer0 done waiting for all
> inner futures
> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Entered InnerEntryProcessor with inner key: outer1-inner0
> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [5-#142%server2%] c.p.i.BlockedThreadTest
>                : Exiting InnerEntryProcessor with inner key: outer1-inner0
> 2024-10-30T23:27:55.216-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [    Test worker] c.p.i.BlockedThreadTest
>                : Test Runner waiting for outer future 1
> 2024-10-30T23:27:55.217-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [e-6-#7%server1%] c.p.i.BlockedThreadTest
>                : OuterEntryProcessor with key outer1 waiting for inner
> future 1
> 2024-10-30T23:27:59.621-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [-notifier-timer]
> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
> 2024-10-30T23:28:01.283-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [-notifier-timer]
> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
> 2024-10-30T23:28:02.877-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [-notifier-timer]
> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
> 2024-10-30T23:28:04.399-04:00  INFO 17056 ---
> [ignite-blocked-thread-sample] [-notifier-timer]
> o.a.i.i.p.c.GridUpdateNotifier           : Your version is up to date.
> 2024-10-30T23:28:08.508-04:00 ERROR 17056 ---
> [ignite-blocked-thread-sample] [%-#118%server1%] o.a.i.i.u.t.G
>                : Blocked system-critical thread has been detected. This can
> lead to cluster-wide undefined behaviour [workerName=sys-stripe-6,
> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]
> -------
>
> Thanks for the advice - we are probably using entry processors in a way
> originally intended for compute jobs with transactions, and we will
> probably move towards that direction eventually. But to learn from this and
> prevent future issues, I'd like to understand exactly what the problem is,
> and with these logs, I can't understand how it's a deadlock.
>
> - Raymond
>
>
> On Wed, Oct 30, 2024 at 7:33 AM Pavel Tupitsyn <[email protected]>
> wrote:
>
>> Looks like a deadlock to me.
>>
>> Processor 1 invokes processor 2 and blocks on a future, stripe thread T
>> is blocked
>> Processor 2 invokes processor 3, which needs stripe thread T and waits
>> forever
>>
>> General advice - do not block striped pool threads. Avoid IO, waiting for
>> futures, etc. An entry processor is supposed to perform a local computation
>> and return quickly.
>>
>> On Wed, Oct 30, 2024 at 7:05 AM Raymond Liu <[email protected]>
>> wrote:
>>
>>> Hi team,
>>>
>>> We see a failure with nested asynchronous entry processor calls where the 
>>> nested calls appear to complete, but
>>> the parent entry processor is not aware the child entry processor is done.
>>>
>>> In this sample test setup...
>>> 1. a test runner invokes an "outer" entry processor twice asynchronously.
>>> 2. The two outer entry processor invocations each invoke two inner entry 
>>> processors asynchronously.
>>> 3. All four inner entry processor invocations appear to complete.
>>> 4. One outer entry processor is stuck waiting (via IgniteFuture.get()) for 
>>> the last inner entry processor to complete.
>>>    The program hangs, prints thread dumps, and prints the following error:
>>>    "Blocked system-critical thread has been detected. This can lead to 
>>> cluster-wide undefined behaviour [workerName=sys-stripe-6, 
>>> threadName=sys-stripe-6-#7%server1%, blockedFor=13s]"
>>>
>>> I've uploaded this in a Github repo called "ignite-blocked-thread-sample": 
>>> https://github.com/Philosobyte/ignite-blocked-thread-sample/blob/main/src/test/java/com/philosobyte/igniteblockedthreadsample/BlockedThreadTest.java
>>>
>>> While this sample uses asynchronous invocations, a version of this test 
>>> with synchronous invocations runs into the same
>>> issue.
>>>
>>> This appears to happen only when there is more than one server node and the 
>>> nested invocation goes to a different
>>> server node than the outer invocation. A debug breakpoint shows that, on 
>>> the server executing the inner entry processor,
>>> the sys-stripe thread pool is idle; but on the server executing the outer 
>>> entry processor, there is one sys-stripe
>>> thread pool waiting on IgniteFuture.get().
>>>
>>> We originally suspected a deadlock, but this error also occurs when we try 
>>> setting thread pool sizes to a large number.
>>> This almost appears to be a failure to communicate that a future should be 
>>> complete.
>>>
>>> Nested compute tasks, such as `IgniteCallable`s and `IgniteClosure`s 
>>> executing on the public thread pool, do not run
>>> into this problem. So, we worked around this issue by wrapping our entry 
>>> processors inside `IgniteCallable`s. That alone
>>> was already enough to resolve the issue; of course, at higher scale, we'd 
>>> run into deadlocks, so we also followed the
>>> advice in the "thread pools tuning" documentation page about using a custom 
>>> thread pool.
>>>
>>> Regardless, we would like to make sure we understand the issue. Can anyone 
>>> think of an explanation? Is this expected
>>> behavior or possibly a bug?
>>>
>>> Thank you,
>>> Raymond
>>>
>>>

Reply via email to