We have seen similar mono crashes on Linux (Ubuntu and Amazon Linux).

One thing that has greatly reduced the frequency of the crashes so far
(it removed 95%+ of them) is setting MONO_GC_DEBUG=clear-at-gc.
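
For anyone who wants to measure the effect of that setting on their own
machine, here is a rough sketch (not something we run ourselves) that runs a
command repeatedly with MONO_GC_DEBUG=clear-at-gc in its environment and
counts the runs that exit abnormally. The helper names, the run count of 100,
and the example command line are assumptions made for illustration.

```python
import os
import subprocess
import sys


def env_with_clear_at_gc(base=None):
    """Return a copy of the environment with MONO_GC_DEBUG=clear-at-gc set."""
    env = dict(os.environ if base is None else base)
    env["MONO_GC_DEBUG"] = "clear-at-gc"
    return env


def count_failures(cmd, runs, env):
    """Run cmd `runs` times and count the runs with a non-zero exit status.

    A process killed by a signal (for example SIGSEGV) has a negative
    return code, so it is counted as a failure too.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python run_loop.py mono bug-18026.exe
    runs = 100  # arbitrary; pick whatever reproduces the crash reliably
    failed = count_failures(sys.argv[1:], runs, env_with_clear_at_gc())
    print(f"{failed} of {runs} runs failed")
```

Running the same loop once with os.environ unchanged and once with
env_with_clear_at_gc() gives a before/after comparison of the failure rate.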

There is also a bug report at
https://bugzilla.xamarin.com/show_bug.cgi?id=18151 that is likely
related.

On Thu, Jul 23, 2015 at 3:03 PM, Taloth Saldono <talothsald...@gmail.com> wrote:
> Hey guys,
>
> (Initially I incorrectly posted this to the mono-list, so for those
> receiving this message twice, my apologies.)
>
> I'm looking for a mono expert on the managed threading system, hopefully you
> can give me a pointer to where to look.
>
> The problem a couple of my users experience is that since linux kernel 4.1
> mono crashes in a reproducible manner. (Using test case bug-18026 in a loop,
> which is a threadpool stress-test)
>
> A similar problem occurred in 3.13.0 but that was fixed by backporting some
> commits in the ubuntu kernel. (See
> https://bugzilla.xamarin.com/show_bug.cgi?id=29212)
>
> Initially I believed that in 4.1 those commits were reverted, but tests
> indicated that wasn't the cause.
> So I did a full bisect on linux 4.0-4.1 on a 64-bit Ubuntu 14.04.2
> Virtualbox. (~13 compiles of the kernel, took a couple of days)
> And it ended up on
> https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634.
>
> The problem seems to cause NullReferenceExceptions and possibly native
> SIGSEGVs in a variety of places. (I can dump some stacktraces if desired,
> but I suspect they won't be helpful because the corruption is likely caused
> elsewhere.)
>
> To me it seems impossible that reading the tsc could in any way result in
> the nullrefs. So my guess is that it's a side-effect of the memory barrier.
> From what I understand from the commit, the 'mfence+lfence' changed to
> 'mfence or lfence' (depending on what the cpu supports) and
> mfence=lfence+sfence (not entirely true, but close), so I have no idea what
> the heck is going on there.
> But if I were to venture a guess, it would be that somewhere, indirectly,
> mono unknowingly relies on that barrier being there.
> Theoretically it still means other native apps could experience the same
> problem, but I would've expected reports about that already.
>
> My experience in these matters is pretty much non-existent. But dumping
> issues on devs is the least productive way to get them fixed, so I try to
> investigate as far as I can. Especially since it involves an issue that
> could be caused by either mono or the kernel.
>
> So my question is: Is there a likely candidate in mono where it uses the tsc
> (possibly for profiling) where the changed barrier could cause this odd
> behavior? And obviously, is there anything in particular I could try to
> narrow this down further?
>
> Almost forgot: I did the bisect using mono 4.0.2.5, but I tested the
> nightly version as well.
>
> Thank you for your time.
>
> Taloth
>
> _______________________________________________
> Mono-devel-list mailing list
> Mono-devel-list@lists.ximian.com
> http://lists.ximian.com/mailman/listinfo/mono-devel-list
>



-- 
Studying for the Turing test