[Issue 15939] GC.collect causes deadlock in multi-threaded environment
https://issues.dlang.org/show_bug.cgi?id=15939

Iain Buclaw changed: CC added bra...@puremagic.com

--- Comment #31 from Iain Buclaw ---
*** Issue 13416 has been marked as a duplicate of this issue. ***
--- Comment #30 from Iain Buclaw ---
*** Issue 10351 has been marked as a duplicate of this issue. ***
Iain Buclaw changed: Status NEW -> RESOLVED, Resolution FIXED, CC added ibuc...@gdcproject.org

--- Comment #29 from Iain Buclaw ---
PR got merged. https://github.com/dlang/druntime/pull/3617
Dlang Bot changed: Keywords added pull

--- Comment #28 from Dlang Bot ---
@hatf0 updated dlang/druntime pull request #3617 "Move SIGUSR1/SIGUSR2 to SIGRT for GC" fixing this issue:

- Fix Issue 15939 -- Move SIGUSR1/SIGUSR2 to SIGRT for GC

https://github.com/dlang/druntime/pull/3617
Илья Ярошенко changed: Assignee ilyayaroshe...@gmail.com -> nob...@puremagic.com
safety0ff.bugz changed: See Also added https://issues.dlang.org/show_bug.cgi?id=16979
--- Comment #27 from Martin Nowak ---
(In reply to Илья Ярошенко from comment #26)
> Probably related issue:
> http://forum.dlang.org/post/igqwbqawrtxnigplg...@forum.dlang.org

No, that looks like an unrelated crash in a finalizer.
--- Comment #26 from Илья Ярошенко ---
Probably related issue:
http://forum.dlang.org/post/igqwbqawrtxnigplg...@forum.dlang.org
--- Comment #25 from Martin Nowak ---
(In reply to Aleksei Preobrazhenskii from comment #24)
> Since I changed signals to real-time and migrated to a recent kernel I
> haven't seen that issue in the release builds; however, I tried running a
> profile build recently (unfortunately only on the old kernel) and it was
> consistently stuck every time.

Thanks, good to hear from you. There is a chance that these are kernel bugs fixed in 3.10
https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db
and 3.18
https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0.
--- Comment #24 from Aleksei Preobrazhenskii ---
(In reply to Martin Nowak from comment #23)
> Anyone still experiencing this issue? Can't seem to fix it w/o reproducing
> it.

Since I changed signals to real-time and migrated to a recent kernel I haven't seen that issue in the release builds; however, I tried running a profile build recently (unfortunately only on the old kernel) and it was consistently stuck every time.

It might be related to this issue; I will try to reproduce it with simpler code when I have time.
--- Comment #23 from Martin Nowak ---
Anyone still experiencing this issue? Can't seem to fix it w/o reproducing it.
--- Comment #22 from Илья Ярошенко ---
(In reply to Martin Nowak from comment #21)
> Nope, that doesn't seem to be the problem.
> All the thread exit code synchronizes on Thread.slock_nothrow.
> It shouldn't even be possible to send a signal to an exiting thread, b/c
> they get removed from the thread list before that, and that is synchronized
> around the suspend loop.
>
> Might still be a problem with the synchronization of m_isRunning and/or
> thread_cleanupHandler. Did your apps by any chance use thread cancellation
> or pthread_exit?

No, but an Exception may be thrown in a thread.
--- Comment #21 from Martin Nowak ---
Nope, that doesn't seem to be the problem. All the thread exit code synchronizes on Thread.slock_nothrow. It shouldn't even be possible to send a signal to an exiting thread, b/c threads get removed from the thread list before that, and that removal is synchronized around the suspend loop.

There might still be a problem with the synchronization of m_isRunning and/or thread_cleanupHandler. Did your apps by any chance use thread cancellation or pthread_exit?
--- Comment #20 from Илья Ярошенко ---
I no longer have access to the source code :/
--- Comment #19 from Martin Nowak ---
(In reply to Илья Ярошенко from comment #17)
> > https://github.com/dlang/druntime/pull/1110, that would affect dmd >=
> > 2.070.0.
> > Could someone test their code with 2.069.2?
>
> Yes, the bug was first found with 2.069.

But that change is not in 2.069.x, only in 2.070.0 and following.

Can you reproduce it at all? That would simplify my life a lot. Following my hypothesis, it should be fairly simple to trigger with one thread continuously looping on GC.collect() while concurrently spawning many short-lived threads, to increase the chance of triggering the race between signal delivery and the thread exiting. If real-time signals are delivered faster (before pthread_kill returns), then they might indeed avoid the race condition by pure chance.
--- Comment #17 from Илья Ярошенко ---
(In reply to Martin Nowak from comment #16)
> To be a hypothesis it must be verifiable, but as we can't explain why RT
> signals would help, it's not a real hypothesis. Can anyone somewhat
> repeatedly reproduce the issue?

It is not easy to catch on a PC. The bug was found when the program was running on multiple CPUs on multiple servers over the course of a day.

> I would suspect that this issue came with the recent parallel suspend
> feature, https://github.com/dlang/druntime/pull/1110; that would affect
> dmd >= 2.070.0. Could someone test their code with 2.069.2?

Yes, the bug was first found with 2.069.
--- Comment #16 from Martin Nowak ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> All suspending signals were delivered, but it seems that the number of calls
> to sem_wait was different from the number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with the
> original deadlock, though.

To be a hypothesis it must be verifiable, but as we can't explain why RT signals would help, it's not a real hypothesis. Can anyone somewhat repeatedly reproduce the issue?

I would suspect that this issue came with the recent parallel suspend feature, https://github.com/dlang/druntime/pull/1110; that would affect dmd >= 2.070.0. Could someone test their code with 2.069.2?
Artem Tarasov changed: CC added lomerei...@gmail.com

--- Comment #15 from Artem Tarasov ---
I'm apparently bumping into the same problem. Here's the latest stack trace that I've received from a user, very similar to the one posted here: https://gist.github.com/rtnh/e2eab6afa7c0a37dbc96578d0f73c540

The prominent kernel bug mentioned here has already been ruled out. Another hint I've got is that reportedly the 'error doesn't happen on XenServer hypervisors, only on KVM' (the full discussion is taking place at https://github.com/lomereiter/sambamba/issues/189).
--- Comment #14 from safety0ff.bugz ---
(In reply to Aleksei Preobrazhenskii from comment #13)
> All suspending signals were delivered, but it seems that the number of calls
> to sem_wait was different from the number of calls to sem_post (or something
> similar). I have no reasonable explanation for that.
>
> It doesn't invalidate the hypothesis that RT signals helped with the
> original deadlock, though.

I haven't looked too closely at whether there are any races in thread termination. My suspicions are still on a low-level synchronization bug.

Have you tried a more recent kernel (3.19+) or a newer glibc? I'm aware of this bug [1], which was supposed to affect kernels 3.14 - 3.18, but perhaps there's a preexisting bug which affects your machine?

[1] https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
--- Comment #13 from Aleksei Preobrazhenskii ---
I saw a new deadlock with different symptoms today.

Stack trace of the collecting thread:

Thread XX (Thread 0x7fda6700 (LWP 32383)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x007b4046 in thread_suspendAll ()
#2  0x007998dd in gc.gc.Gcx.fullcollect() ()
#3  0x00797e24 in gc.gc.Gcx.bigAlloc() ()
#4  0x0079bb5f in gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked() ()
#5  0x0079548e in gc.gc.GC.malloc() ()
#6  0x00760ac7 in gc_qalloc ()
#7  0x0076437b in _d_arraysetlengthT ()
...application stack

Stack traces of the other threads:

Thread XX (Thread 0x7fda5cff9700 (LWP 32402)):
#0  0x7fda78927454 in do_sigsuspend (set=0x7fda5cff76c0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __GI___sigsuspend (set=) at ../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2  0x0075d979 in core.thread.thread_suspendHandler() ()
#3  0x0075e220 in core.thread.callWithStackShell() ()
#4  0x0075d907 in thread_suspendHandler ()
#5  <signal handler called>
#6  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:160
#7  0x00760069 in core.sync.condition.Condition.wait() ()
...application stack

All suspending signals were delivered, but it seems that the number of calls to sem_wait was different from the number of calls to sem_post (or something similar). I have no reasonable explanation for that.

It doesn't invalidate the hypothesis that RT signals helped with the original deadlock, though.
--- Comment #12 from Aleksei Preobrazhenskii ---
(In reply to Martin Nowak from comment #11)
> Did you have gdb attached while the signal was sent? That sometimes causes
> issues w/ signal delivery.

No, I didn't. I attached gdb to investigate a deadlock which had already happened at that point.

> Are there any other reasons for switching to real-time signals?

I read that traditional signals are internally mapped to real-time signals. If that's true, I see no reason to stick with an inferior emulated mechanism with weaker guarantees.

> Which real-time signals are usually not used for other purposes?

Basically all real-time signals in the range SIGRTMIN .. SIGRTMAX are intended for custom use (SIGRTMIN might vary from platform to platform, though, because of things like NPTL and LinuxThreads).
--- Comment #11 from Martin Nowak ---
Having the main thread hang while waiting for semaphore posts in thread_suspendAll is a good indication that the signal was lost. Did you have gdb attached while the signal was sent? That sometimes causes issues w/ signal delivery.

The setup looks fairly simple (a few threads allocating classes and extending arrays) to be run for a few days; maybe we can reproduce the problem.

Are there any other reasons for switching to real-time signals? Which real-time signals are usually not used for other purposes?
Илья Ярошенко changed: Assignee nob...@puremagic.com -> ilyayaroshe...@gmail.com
--- Comment #10 from Aleksei Preobrazhenskii ---
(In reply to safety0ff.bugz from comment #9)
> Could you run strace to get a log of the signal usage?

I did that before to catch the deadlock, but I wasn't able to reproduce it while strace was running. And, unfortunately, I no longer have the original code running in production.

> I'm wondering if there are any other signal handler invocations in the
> "...application stack" part of your stack traces.

No, there was no signal-related code in the hidden parts of the stack trace.

> I've seen a deadlock caused by an assert firing within the
> thread_suspendHandler, which deadlocks on the GC lock.

In my case that was a release build, so I assume no asserts.

> What should happen in this case is since the signal is masked upon signal
> handler invocation, the new suspend signal is marked as "pending" and run
> once thread_suspendHandler returns and the signal is unblocked.

Yeah, my reasoning was wrong. I did a quick test and saw that signals weren't delivered; apparently, I forgot that pthread_kill is asynchronous, so the signals should have coalesced in my test.

> Their queuing and ordering guarantees should be irrelevant due to
> synchronization and signal masks.

Ideally, yeah, but as I said, I just changed SIGUSR1/SIGUSR2 to SIGRTMIN/SIGRTMIN+1 and didn't see any deadlocks for a long time, whereas I saw them pretty consistently before. So either the "irrelevant" part is wrong, or there is something else which is different and relevant (and probably not documented) about real-time signals. The other explanation is that the bug is still there and real-time signals just somehow reduced the probability of it happening.

Also, I have no other explanation for why the stack traces look like that; the simplest one is that the signal wasn't delivered.
safety0ff.bugz changed: CC added safety0ff.b...@gmail.com

--- Comment #9 from safety0ff.bugz ---
Could you run strace to get a log of the signal usage? For example:

strace -f -e signal -o signals.log command_to_run_program

Then add the resulting signals.log to the bug report? I don't know if it'll be useful, but it will be something more to look at for hints.

I'm wondering if there are any other signal handler invocations in the "...application stack" part of your stack traces. I've seen a deadlock caused by an assert firing within the thread_suspendHandler, which deadlocks on the GC lock.

(In reply to Aleksei Preobrazhenskii from comment #6)
> Like, if thread_suspendAll happens while some threads are still in the
> thread_suspendHandler (already handled the resume signal, but still haven't
> left the suspend handler).

What should happen in this case is: since the signal is masked upon signal handler invocation, the new suspend signal is marked as "pending" and run once thread_suspendHandler returns and the signal is unblocked. The suspended thread cannot receive another resume or suspend signal until after the sem_post in thread_suspendHandler. I've mocked up the suspend/resume code and it does not deadlock in the situation you've described.

> Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery
> guarantees

Their queuing and ordering guarantees should be irrelevant due to synchronization and signal masks. I don't see any other benefits of RT signals.

(In reply to Walter Bright from comment #8)
> Since you've written the code to fix it, please write a Pull Request for it.
> That way you get the credit!

He modified his code to use the thread_setGCSignals function:
https://dlang.org/phobos/core_thread.html#.thread_setGCSignals

P.S.: I don't mean to sound doubtful, I just want a sound explanation of the deadlock so it can be properly addressed at its cause.
Walter Bright changed: CC added bugzi...@digitalmars.com

--- Comment #8 from Walter Bright ---
(In reply to Aleksei Preobrazhenskii from comment #7)
> I was running tests for the past five days and haven't seen any deadlocks
> since I switched the GC to real-time POSIX signals
> (thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)).

Since you've written the code to fix it, please write a Pull Request for it. That way you get the credit!
Vladimir Panteleev changed: CC added c...@dawg.eu, thecybersha...@gmail.com
--- Comment #7 from Aleksei Preobrazhenskii ---
I was running tests for the past five days and haven't seen any deadlocks since I switched the GC to real-time POSIX signals (thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1)). I would recommend changing the default signals accordingly.
Aleksei Preobrazhenskii changed: See Also added https://issues.dlang.org/show_bug.cgi?id=10351
--- Comment #6 from Aleksei Preobrazhenskii ---
I think I saw the same behaviour in debug builds; I will try to verify it. As for the 32-bit question, due to the nature of the program I can't test it in a 32-bit environment.

After investigating the problem a little further, I think the issue might be in the GC relying on traditional POSIX signals. One way to get such stack traces is if the suspend signal (SIGUSR1 by default) wasn't delivered, which could happen with traditional POSIX signals if they occur in quick succession. Like, if thread_suspendAll happens while some threads are still in the thread_suspendHandler (already handled the resume signal, but still haven't left the suspend handler).

Real-time POSIX signals (SIGRTMIN .. SIGRTMAX) have stronger delivery guarantees; I'm going to try the same code but with thread_setGCSignals(SIGRTMIN, SIGRTMIN + 1).
--- Comment #5 from Sobirari Muhomori ---
Also, what about 32-bit mode?
--- Comment #4 from Sobirari Muhomori ---
(In reply to Aleksei Preobrazhenskii from comment #0)
> dmd 2.071.0 with -O -release -inline -boundscheck=off

Do these flags affect the hang?
Ivan Kazmenko changed: CC added ga...@mail.ru
Marco Leise changed: CC added marco.le...@gmx.de

--- Comment #3 from Marco Leise ---
This issue has a smell of https://issues.dlang.org/show_bug.cgi?id=10351. In the absence of a repro case that works without the profiler, I just kept it open for future reference. Note how the GC hangs in thread_suspendAll() in both cases.
Илья Ярошенко changed: CC added ilyayaroshe...@gmail.com

--- Comment #2 from Илья Ярошенко ---
+1, I had the same problems.
--- Comment #1 from Aleksei Preobrazhenskii ---
I wasn't able to reproduce the issue with simpler code using GC operations only. I noticed that nanosleep is a syscall which should be interrupted by the GC signal, so probably there is something else involved aside from the GC. I use the standard library only, and I have no custom signal-related code.