Re: [Mono-dev] Investigating mono crashes on linux 4.1

2015-07-23 Thread Taloth Saldono
Hey Greg,

With my current test case it crashes anywhere between 0.1 and 30 sec,
occasionally longer.
If I run my test case until it crashes, 10 times in a row, and measure the
total run time:
vanilla = 3m9.216s 2m26.571s 2m31.168s 3m8.670s
clear-at-gc = 1m50.81s 2m01.85s 1m10.21s 1m10.21s
disable-minor = 0m16.74s 0m16.32s (duh, more major collections; the reverse
happens if you increase the nursery size)

So yeah, clear-at-gc actually makes it worse. ;)
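
For anyone who wants to repeat this kind of measurement, here is a rough
sketch of a timing harness in Python. It is not the script used for the
numbers above. It assumes a crash shows up as a non-zero exit status, and the
function name time_until_crashes is made up for illustration. MONO_GC_DEBUG
is set for each run, or removed from the environment for a "vanilla" run.

```python
import os
import subprocess
import time


def time_until_crashes(cmd, crashes=10, gc_debug=None):
    """Rerun cmd until it has failed `crashes` times; return total seconds.

    Assumption: a crash is visible as a non-zero exit status (Python reports
    death by signal as a negative return code, which also counts here).
    """
    env = dict(os.environ)
    if gc_debug is not None:
        # Values used in this thread: "clear-at-gc" and "disable-minor".
        env["MONO_GC_DEBUG"] = gc_debug
    else:
        # "vanilla" run: make sure no inherited setting leaks in.
        env.pop("MONO_GC_DEBUG", None)

    seen = 0
    start = time.monotonic()
    while seen < crashes:
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            seen += 1
    return time.monotonic() - start
```

A call along the lines of
time_until_crashes(["mono", "bug-18026.exe"], gc_debug="clear-at-gc") would
then produce one of the totals; the executable name is only a guess at how
the test case gets built.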

It quite possibly has something to do with the GC, but I'm trying to find
the link with that rdtsc instruction.
Assuming the tsc isn't used in some convoluted way, that points to a
missing memory barrier somewhere.

Taloth


On Thu, Jul 23, 2015 at 2:11 PM, Greg Young wrote:

> We have seen some similar crashes of mono in linux (ubuntu and amazon
> linux).
>
> One thing we have done that greatly reduces the frequency of the
> crashes so far (removed 95%+ of them) is setting MONO_GC_DEBUG=clear-at-gc.
>
> There is an issue here as well
> https://bugzilla.xamarin.com/show_bug.cgi?id=18151 that is likely
> related.
>
> On Thu, Jul 23, 2015 at 3:03 PM, Taloth Saldono wrote:
> > Hey guys,
> >
> > (Initially I incorrectly posted this to the mono-list, so for those
> > receiving this message twice, my apologies.)
> >
> > I'm looking for a mono expert on the managed threading system, hopefully you
> > can give me a pointer to where to look.
> >
> > The problem a couple of my users experience is that since linux kernel 4.1
> > mono crashes in a reproducible manner. (Using test case bug-18026 in a loop,
> > which is a threadpool stress-test)
> >
> > A similar problem occurred in 3.13.0 but that was fixed by backporting some
> > commits in the ubuntu kernel. (See
> > https://bugzilla.xamarin.com/show_bug.cgi?id=29212)
> >
> > Initially I believed that in 4.1 those commits were reverted, but tests
> > indicated that wasn't the cause.
> > So I did a full bisect on linux 4.0-4.1 on a 64-bit Ubuntu 14.04.2
> > Virtualbox. (~13 compiles of the kernel, took a couple of days)
> > And it ended up on
> > https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634.
> >
> > The problem seems to cause NullReferenceException and possibly native
> > SIGSEGVs in a variety of places. (I can dump some stacktraces if desired,
> > but I suspect that won't be helpful coz the corruption is likely caused
> > elsewhere.)
> >
> > To me it seems impossible that reading the tsc in any way could result in
> > the nullrefs. So my guess would be that it's a side effect of the memory
> > barrier. From what I understand from the commit, the 'mfence+lfence' changed
> > to 'mfence or lfence' (depending on what the cpu supports) and
> > mfence=lfence+sfence (not entirely true, but close), so I have no idea what
> > the heck is going on there.
> > But if I were to venture a guess, it would be that somewhere, indirectly,
> > mono unknowingly relies on that barrier being there.
> > Theoretically it still means other native apps could experience the same
> > problem, but I would've expected reports about that already.
> >
> > My experience in these matters is pretty much non-existent. But dumping
> > issues on devs is the least productive way to get them fixed, so I try to
> > investigate as far as I can. Especially since it involves an issue that
> > could be caused by either mono or the kernel.
> >
> > So my question is: Is there a likely candidate in mono where it uses the tsc
> > (possibly for profiling) where the changed barrier could cause this odd
> > behavior? And obviously, is there anything in particular I could try to
> > narrow this down further?
> >
> > Almost forgot: I did the bisect using mono 4.0.2.5, but I tested the
> > nightly version as well.
> >
> > Thank you for your time.
> >
> > Taloth
> >
> > ___
> > Mono-devel-list mailing list
> > Mono-devel-list@lists.ximian.com
> > http://lists.ximian.com/mailman/listinfo/mono-devel-list
> >
>
>
>
> --
> Studying for the Turing test
>


Re: [Mono-dev] Investigating mono crashes on linux 4.1

2015-07-23 Thread Greg Young
We have seen some similar crashes of mono in linux (ubuntu and amazon linux).

One thing we have done that greatly reduces the frequency of the
crashes so far (removed 95%+ of them) is setting MONO_GC_DEBUG=clear-at-gc.

There is an issue here as well
https://bugzilla.xamarin.com/show_bug.cgi?id=18151 that is likely
related.

On Thu, Jul 23, 2015 at 3:03 PM, Taloth Saldono wrote:
> Hey guys,
>
> (Initially I incorrectly posted this to the mono-list, so for those
> receiving this message twice, my apologies.)
>
> I'm looking for a mono expert on the managed threading system, hopefully you
> can give me a pointer to where to look.
>
> The problem a couple of my users experience is that since linux kernel 4.1
> mono crashes in a reproducible manner. (Using test case bug-18026 in a loop,
> which is a threadpool stress-test)
>
> A similar problem occurred in 3.13.0 but that was fixed by backporting some
> commits in the ubuntu kernel. (See
> https://bugzilla.xamarin.com/show_bug.cgi?id=29212)
>
> Initially I believed that in 4.1 those commits were reverted, but tests
> indicated that wasn't the cause.
> So I did a full bisect on linux 4.0-4.1 on a 64-bit Ubuntu 14.04.2
> Virtualbox. (~13 compiles of the kernel, took a couple of days)
> And it ended up on
> https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634.
>
> The problem seems to cause NullReferenceException and possibly native
> SIGSEGVs in a variety of places. (I can dump some stacktraces if desired,
> but I suspect that won't be helpful coz the corruption is likely caused
> elsewhere.)
>
> To me it seems impossible that reading the tsc in any way could result in
> the nullrefs. So my guess would be that it's a side effect of the memory
> barrier. From what I understand from the commit, the 'mfence+lfence' changed
> to 'mfence or lfence' (depending on what the cpu supports) and
> mfence=lfence+sfence (not entirely true, but close), so I have no idea what
> the heck is going on there.
> But if I were to venture a guess, it would be that somewhere, indirectly,
> mono unknowingly relies on that barrier being there.
> Theoretically it still means other native apps could experience the same
> problem, but I would've expected reports about that already.
>
> My experience in these matters is pretty much non-existent. But dumping
> issues on devs is the least productive way to get them fixed, so I try to
> investigate as far as I can. Especially since it involves an issue that
> could be caused by either mono or the kernel.
>
> So my question is: Is there a likely candidate in mono where it uses the tsc
> (possibly for profiling) where the changed barrier could cause this odd
> behavior? And obviously, is there anything in particular I could try to
> narrow this down further?
>
> Almost forgot: I did the bisect using mono 4.0.2.5, but I tested the
> nightly version as well.
>
> Thank you for your time.
>
> Taloth



-- 
Studying for the Turing test


[Mono-dev] Investigating mono crashes on linux 4.1

2015-07-23 Thread Taloth Saldono
Hey guys,

(Initially I incorrectly posted this to the mono-list, so for those
receiving this message twice, my apologies.)

I'm looking for a mono expert on the managed threading system, hopefully
you can give me a pointer to where to look.

The problem a couple of my users experience is that since linux kernel 4.1
mono crashes in a reproducible manner. (Using test case bug-18026 in a
loop, which is a threadpool stress-test)

A similar problem occurred in 3.13.0 but that was fixed by backporting some
commits in the ubuntu kernel. (See
https://bugzilla.xamarin.com/show_bug.cgi?id=29212)

Initially I believed that in 4.1 those commits were reverted, but tests
indicated that wasn't the cause.
So I did a full bisect on linux 4.0-4.1 on a 64-bit Ubuntu 14.04.2
Virtualbox. (~13 compiles of the kernel, took a couple of days)
And it ended up on
https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634.
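
As a side note on the ~13 compiles: a bisect halves the range of candidate
commits with every build, so the number of builds grows with log2 of the
range size. Below is a small Python sketch of that search. It assumes a
hypothetical list of commits whose first entry is known good and whose last
entry is known bad, and that every commit after the first bad one is bad as
well. The names bisect_first_bad and is_bad are illustrative; the real work
of building and booting a kernel and running the test would hide behind
is_bad.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, builds).

    Assumptions: commits[0] is known good, commits[-1] is known bad, and
    is_bad is False for every commit before the first bad one and True
    from there on.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    builds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        builds += 1  # one kernel build + test run per probe
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], builds
```

With a made-up range of about 12,000 commits between the good and bad end
points, this takes 13 or 14 builds, which is in line with the number of
compiles mentioned above.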

The problem seems to cause NullReferenceException and possibly native
SIGSEGVs in a variety of places. (I can dump some stacktraces if desired,
but I suspect that won't be helpful coz the corruption is likely caused
elsewhere.)

To me it seems impossible that reading the tsc in any way could result in
the nullrefs. So my guess would be that it's a side effect of the memory
barrier. From what I understand from the commit, the 'mfence+lfence' changed
to 'mfence or lfence' (depending on what the cpu supports) and
mfence=lfence+sfence (not entirely true, but close), so I have no idea
what the heck is going on there.
But if I were to venture a guess, it would be that somewhere, indirectly,
mono unknowingly relies on that barrier being there.
Theoretically it still means other native apps could experience the same
problem, but I would've expected reports about that already.

My experience in these matters is pretty much non-existent. But dumping
issues on devs is the least productive way to get them fixed, so I try to
investigate as far as I can. Especially since it involves an issue that
could be caused by either mono or the kernel.

So my question is: Is there a likely candidate in mono where it uses the
tsc (possibly for profiling) where the changed barrier could cause this odd
behavior? And obviously, is there anything in particular I could try to
narrow this down further?

Almost forgot: I did the bisect using mono 4.0.2.5, but I tested the
nightly version as well.

Thank you for your time.

Taloth