This looks like a shutdown bug in mono. Do you have a reliable way to reproduce it? How loaded are the machines running your workload?
On Thu, Apr 28, 2016 at 6:41 PM, George, Glover E ERDC-RDE-ITL-MS CIV < glover.e.geo...@erdc.dren.mil> wrote: > One more important point I forgot to mention, when this occurs, it is > generally only a few mono instances out of the 500+ simultaneous mono > instances. Most of them terminate. And more importantly, the mono > instances that do hang actually complete their work! It seems to be when > mono is trying to close down. > > This has occurred with mono 4.2.x, 4.3.x, 4.4.x, and current git. > > Happened today with this configuration: > Mono JIT compiler version 4.3.2 (Stable 4.3.2.467/ba2e5e4 Tue Mar 15 > 11:44:06 CDT 2016) > Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. > www.mono-project.com > TLS: __thread > SIGSEGV: altstack > Notifications: epoll > Architecture: amd64 > Disabled: none > Misc: bigarrays softdebug > LLVM: supported, not enabled. > GC: sgen > > > From: <mono-devel-list-boun...@lists.ximian.com> on behalf of Glover > George <glover.e.geo...@erdc.dren.mil> > Date: Thursday, April 28, 2016 at 5:25 PM > To: "mono-devel-list@lists.ximian.com" <mono-devel-list@lists.ximian.com> > Subject: [Mono-dev] Random hangs while running mono app > > Hi all, > > Quick background: We’re running mono on an HPC platform (SGI ICE-X / SUSE > Enterprise Linux 11) with a Lustre filesystem. For several months, I’ve > noticed random hangs with our application. In a given batch job, we run > 500+ instances of mono in parallel (no interprocess communication). I > haven’t been able to reliably reproduce the issue which leads me to believe > it’s a timing issue, and more specifically, I have reason to believe it may > be an issue with the interaction of mono and the filesystem. Being a > clustered file-system, the load on the backing stores can vary due to other > users. > > When I notice that my job hasn’t finished in a reasonable amount of time, > I log into the compute node and do a “ ps -efL | grep mono “ which > reveals: > > george 16728 16575 16728 6 3 16:13 ? 00:03:58 mono > --runtime=v4.0.30319 /app/MyConsole.exe > ggeorge 16728 16575 16768 0 3 16:13 ? 00:00:02 mono > --runtime=v4.0.30319 /app/MyConsole.exe > ggeorge 16728 16575 16815 0 3 16:13 ? 00:00:04 mono > --runtime=v4.0.30319 /app/MyConsole.exe > > 16728 is the top proc which spawns 16768 and 16815 (main, gc and maybe > finalizer thread?). > > Attaching to each of these 3 pid’s with gdb and doing a backtrace reveals > the traces below. > > 1. Is it possible this is a GC hang? > > 2. Is it possible this is a race condition that has anything to do with > the filesystem? > > 3. Any idea how this sequence could occur? If there’s a way to gather > more information, please let me know. > > The reason for the file system suspicion is my app writes several small > files, times the number of mono instances (500+). It only happens > randomly, but seems to be more frequent the more mono instances I run. > > Any help is HIGHLY appreciated. > > PID 16728 > ——— > (gdb) bt > #0 0x00007fffecccd324 in __lll_lock_wait () from /lib64/libpthread.so.0 > #1 0x00007fffeccc8684 in _L_lock_1091 () from /lib64/libpthread.so.0 > #2 0x00007fffeccc84f6 in pthread_mutex_lock () from /lib64/libpthread.so.0 > #3 0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2 > #4 0x00007fffec842530 in do_dlopen () from /lib64/libc.so.6 > #5 0x00007fffed8f2e86 in _dl_catch_error () from > /lib64/ld-linux-x86-64.so.2 > #6 0x00007fffec8425e5 in dlerror_run () from /lib64/libc.so.6 > #7 0x00007fffec8426d7 in __libc_dlopen_mode () from /lib64/libc.so.6 > #8 0x00007fffec81d2e5 in init () from /lib64/libc.so.6 > #9 0x00007fffecccbd03 in pthread_once () from /lib64/libpthread.so.0 > #10 0x00007fffec81d43c in backtrace () from /lib64/libc.so.6 > #11 0x00000000004ac025 in mono_handle_native_sigsegv (signal=<optimized > out>, ctx=<optimized out>, info=<optimized out>) > at mini-exceptions.c:2309 > #12 <signal handler called> > #13 0x00007fffec75e875 in raise () from /lib64/libc.so.6 > #14 0x00007fffec75fe51 in abort () from /lib64/libc.so.6 > #15 0x000000000064528a in monoeg_log_default_handler (log_domain=0x0, > log_level=G_LOG_LEVEL_ERROR, message= > 0x1ac7660 "suspend_thread suspend took 200 ms, which is more than the > allowed 200 ms", unused_data=0x0) at goutput.c:233 > #16 0x0000000000645077 in monoeg_g_logv (log_domain=0x0, > log_level=G_LOG_LEVEL_ERROR, format= > 0x7015d8 "suspend_thread suspend took %d ms, which is more than the > allowed %d ms", args=0x7fffffffce58) at goutput.c:113 > #17 0x000000000064512d in monoeg_g_log (log_domain=0x0, > log_level=G_LOG_LEVEL_ERROR, format= > 0x7015d8 "suspend_thread suspend took %d ms, which is more than the > allowed %d ms") at goutput.c:123 > #18 0x000000000063a13f in mono_threads_wait_pending_operations () at > mono-threads.c:238 > #19 0x000000000063a8cd in suspend_sync (interrupt_kernel=1, > tid=140737159329536) at mono-threads.c:877 > #20 suspend_sync_nolock (interrupt_kernel=1, id=140737159329536) at > mono-threads.c:892 > #21 mono_thread_info_safe_suspend_and_run (id=140737159329536, > interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry= > 0x58d5c0 <abort_thread_critical>, > user_data=user_data@entry=0x7fffffffd3e0) > at mono-threads.c:935 > #22 0x0000000000591a86 in abort_thread_internal > (thread=thread@entry=0x7fffec6e0230, > install_async_abort=install_async_abort@entry=1, > can_raise_exception=1) at threads.c:4728 > #23 0x0000000000591b29 in mono_thread_internal_stop > (thread=0x7fffec6e0230) at threads.c:2385 > ---Type <return> to continue, or q <return> to quit--- > #24 0x00000000005b123e in mono_gc_cleanup () at gc.c:842 > #25 0x00000000005aab8e in mono_runtime_cleanup (domain=domain@entry=0x9d9e00) > at appdomain.c:356 > #26 0x0000000000426c8b in mini_cleanup (domain=0x9d9e00) at > mini-runtime.c:4017 > #27 0x000000000047fac6 in mono_main (argc=11, argv=<optimized out>) at > driver.c:2115 > #28 0x0000000000424c68 in mono_main_with_options (argv=0x7fffffffd698, > argc=11) at main.c:20 > #29 main (argc=<optimized out>, argv=<optimized out>) at main.c:53 > > Thread 16768 > ——— > (gdb) bt > #0 0x00007fffeccca66c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x000000000060c873 in mono_os_cond_wait (mutex=0x97e640 <lock>, > cond=0x97e600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105 > #2 thread_func (thread_data=0x0) at sgen-thread-pool.c:118 > #3 0x00007fffeccc6806 in start_thread () from /lib64/libpthread.so.0 > #4 0x00007fffec80a9bd in clone () from /lib64/libc.so.6 > #5 0x0000000000000000 in ?? () > (gdb) > > > Thread 16815 > ———— > Thread #0 0x00007fffec75ec8b in sigsuspend () from /lib64/libc.so.6 > #1 0x000000000063cda6 in suspend_signal_handler (_dummy=<optimized out>, > info=<optimized out>, context=0x7fffec633dc0) > at mono-threads-posix-signals.c:209 > #2 <signal handler called> > #3 0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2 > #4 0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2 > #5 0x00007fffed8eade0 in open_path () from /lib64/ld-linux-x86-64.so.2 > #6 0x00007fffed8ece7c in _dl_map_object () from > /lib64/ld-linux-x86-64.so.2 > #7 0x00007fffed8f7400 in dl_open_worker () from > /lib64/ld-linux-x86-64.so.2 > #8 0x00007fffed8f2e86 in _dl_catch_error () from > /lib64/ld-linux-x86-64.so.2 > #9 0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2 > #10 0x00007fffecedcf9b in dlopen_doit () from /lib64/libdl.so.2 > #11 0x00007fffed8f2e86 in _dl_catch_error () from > /lib64/ld-linux-x86-64.so.2 > #12 0x00007fffecedd33c in _dlerror_run () from /lib64/libdl.so.2 > #13 0x00007fffecedcf01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 > #14 0x0000000000630b79 in mono_dl_open (name=name@entry=0x1c02790 > "libSystem.Data", flags=flags@entry=1, error_msg=error_msg@entry= > 0x7fffec634e80) at mono-dl.c:150 > #15 0x000000000054b9f0 in cached_module_load (name=name@entry=0x1c02790 > "libSystem.Data", err=err@entry=0x7fffec634e80, flags=1) > at loader.c:1398 > > > _______________________________________________ > Mono-devel-list mailing list > Mono-devel-list@lists.ximian.com > http://lists.ximian.com/mailman/listinfo/mono-devel-list > >
_______________________________________________ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list