We have seen some similar Mono crashes on Linux (Ubuntu and Amazon Linux).
One thing we have done that has greatly reduced the frequency of the crashes so far (eliminating 95% or more of them) is setting:

    MONO_GC_DEBUG=clear-at-gc

There is also an issue at https://bugzilla.xamarin.com/show_bug.cgi?id=18151 that is likely related.

On Thu, Jul 23, 2015 at 3:03 PM, Taloth Saldono <talothsald...@gmail.com> wrote:
> Hey guys,
>
> (Initially I incorrectly posted this to the mono-list, so to those
> receiving this message twice, my apologies.)
>
> I'm looking for a Mono expert on the managed threading system; hopefully
> you can point me to where to look.
>
> The problem a couple of my users are experiencing is that, since Linux
> kernel 4.1, Mono crashes reproducibly (using test case bug-18026, a
> threadpool stress test, in a loop).
>
> A similar problem occurred in 3.13.0, but that was fixed by backporting
> some commits in the Ubuntu kernel. (See
> https://bugzilla.xamarin.com/show_bug.cgi?id=29212)
>
> Initially I believed that those commits had been reverted in 4.1, but
> tests indicated that wasn't the cause.
> So I did a full bisect of Linux 4.0-4.1 on a 64-bit Ubuntu 14.04.2
> VirtualBox VM (~13 kernel compiles, which took a couple of days),
> and it landed on
> https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634.
>
> The problem seems to cause NullReferenceExceptions and possibly native
> SIGSEGVs in a variety of places. (I can dump some stack traces if
> desired, but I suspect they won't be helpful because the corruption is
> likely caused elsewhere.)
>
> To me it seems impossible that reading the TSC could in any way result
> in the null references, so my guess is that this is a side effect of the
> memory barrier. From what I understand of the commit, 'mfence+lfence'
> changed to 'mfence or lfence' (depending on what the CPU supports), and
> mfence = lfence+sfence (not entirely true, but close), so I have no idea
> what the heck is going on there.
> But if I had to venture a guess: somewhere, indirectly, Mono unknowingly
> relies on that barrier being there.
> Theoretically that means other native apps could experience the same
> problem, but I would have expected reports about that already.
>
> My experience in these matters is pretty much non-existent. But dumping
> issues on devs is the least productive way to get them fixed, so I try
> to investigate as far as I can, especially since the issue could be
> caused by either Mono or the kernel.
>
> So my question is: is there a likely candidate in Mono where it uses the
> TSC (possibly for profiling) where the changed barrier could cause this
> odd behavior? And, obviously, is there anything in particular I could
> try to narrow this down further?
>
> Almost forgot: I did the bisect using Mono 4.0.2.5, but I tested the
> nightly version as well.
>
> Thank you for your time.
>
> Taloth
>
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list@lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list

-- 
Studying for the Turing test
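To make the barrier change discussed in the quoted message a bit more concrete, here is a minimal C sketch of the two fencing patterns around a TSC read. It assumes an x86-64 CPU and GCC or Clang (`x86intrin.h`); the function names are hypothetical, this is not the kernel code from the bisected commit, and the per-CPU "mfence or lfence" selection is reduced to a plain flag.

```c
/* Hypothetical sketch (x86-64, GCC or Clang only): two ways of fencing a
   TSC read, following the description in the quoted message. These are
   NOT the kernel's actual functions. */
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc, _mm_lfence, _mm_mfence */

/* "mfence+lfence": per Intel's RDTSC documentation, this sequence makes
   RDTSC execute only after all earlier instructions have executed and all
   earlier loads AND stores are globally visible. */
static inline uint64_t tsc_read_mfence_lfence(void)
{
    _mm_mfence();
    _mm_lfence();
    return __rdtsc();
}

/* "mfence or lfence": a single fence picked per CPU. The choice is a plain
   flag here (assumption); how the kernel selects it is not modelled.
   With LFENCE alone, only earlier loads are guaranteed globally visible
   before the timestamp is taken; earlier stores may still be pending. */
static inline uint64_t tsc_read_single_fence(int use_lfence)
{
    if (use_lfence)
        _mm_lfence();
    else
        _mm_mfence();
    return __rdtsc();
}
```

The contrast is on the store side: only the first variant guarantees that earlier stores are globally visible before the timestamp is read. If some code path depended on that as a side effect (the hypothesis above), it would be relying on ordering it never requested itself.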