[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
Launchpad has imported 7 comments from the remote bug at
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

On 2014-04-23T07:06:09+00:00 Anton Blanchard wrote:

Created attachment 32659
Bump page size to 64kB

We are seeing random failures with go programs on a 64kB page size
ppc64 box. They look like garbage collection issues: sometimes we SEGV
in timer code, sometimes we SEGV in the code that wraps a kernel read
syscall. If I prevent the garbage collector from running, the programs
work.

The libgo malloc hard codes the page size, so I wrote a quick hack to
bump this (and a few other dependent variables). This makes the problem
go away, but we will need to come up with a better way to do this at
runtime.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/15

On 2014-04-23T07:11:55+00:00 Pinskia wrote:

This is going to be true on AARCH64 also, where most distros are going
to be using 64k pages (some might use 4k pages if they also support
AARCH32). MIPS has many different page sizes too (4k, 8k, 16k, 32k, and
64k). So hard coding the page size seems wrong; maybe you should call
getpagesize instead.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/17

On 2014-04-23T07:26:53+00:00 Anton Blanchard wrote:

I agree, but when I tried this I found a few places that expect
PageSize to be a compile time constant, so it is not as trivial as I
had hoped.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/18

On 2014-04-23T16:42:11+00:00 Ian Lance Taylor wrote:

It would be extremely helpful if you could find a test case that can
recreate this problem with some reliability.

There is no obvious dependency on the system page size in libgo. The
PageSize constant is the unit that the memory allocator deals in, and
should have no direct relationship to the system page size.

I believe that there is a bug, but we need to track it down.

If you set the environment variable GOGC=1 the garbage collector will
run much more frequently; perhaps that will help get a reproducible
test case.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/19

On 2014-04-24T00:17:06+00:00 Anton Blanchard wrote:

Created attachment 32669
Don't use madvise(DONT_NEED) on sub pages

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/20

On 2014-04-24T00:18:11+00:00 Anton Blanchard wrote:

I think I see it:

19112 madvise(0xc21103, 4096, MADV_DONTNEED) = 0

That 4kB madvise(MADV_DONTNEED) gets rounded up to the system page size
of 64kB, and we end up covering memory that is still in use.

The following patch fixes it for me, but it just ignores any sub pages.
We should keep them around so later calls have a chance at
consolidating regions up to a system page size.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/21

On 2014-04-24T05:38:26+00:00 Jakub-gcc wrote:

Perhaps it would be better, instead of not doing the madvise at all if
start or length isn't page aligned, to round the start up to the next
page boundary and the end down to the previous page boundary, and
madvise if the rounded end is above the rounded start.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/comments/22

** Changed in: gcc
       Status: Unknown => New

** Changed in: gcc
   Importance: Unknown => Medium

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1304754 Title: gccgo on ppc64el using split stacks when not supported To manage notifications about this bug go to: https://bugs.launchpad.net/gcc/+bug/1304754/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
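The rounding suggested in the last imported comment above (start up to the next system page boundary, end down to the previous one, madvise only if something is left) can be sketched as below. This is only an illustration of the arithmetic, not the libgo patch: the helper name shrinkToPages and the addresses are hypothetical, and the page size is assumed to be a power of two.

```go
package main

import (
	"fmt"
	"syscall"
)

// shrinkToPages rounds start up and end down to multiples of pageSize,
// which is assumed to be a power of two. ok is false when no whole
// system page lies inside [start, end); in that case the caller would
// skip the madvise call instead of releasing memory still in use.
func shrinkToPages(start, end, pageSize uintptr) (alignedStart, alignedEnd uintptr, ok bool) {
	mask := pageSize - 1
	alignedStart = (start + mask) &^ mask // round up
	alignedEnd = end &^ mask              // round down
	return alignedStart, alignedEnd, alignedEnd > alignedStart
}

func main() {
	const page64k = 64 << 10

	// Hypothetical 4kB span inside a 64kB system page: nothing to release.
	s, e, ok := shrinkToPages(0xc2103000, 0xc2104000, page64k)
	fmt.Printf("4kB span:   start=%#x end=%#x madvise=%v\n", s, e, ok)

	// Hypothetical larger span: only the fully covered pages are released.
	s, e, ok = shrinkToPages(0xc2103000, 0xc2133000, page64k)
	fmt.Printf("192kB span: start=%#x end=%#x madvise=%v\n", s, e, ok)

	// The same helper with the page size reported by the running system.
	sys := uintptr(syscall.Getpagesize())
	s, e, ok = shrinkToPages(0xc2103000, 0xc2133000, sys)
	fmt.Printf("system page size %d: start=%#x end=%#x madvise=%v\n", sys, s, e, ok)
}
```

With a 4kB system page size both spans are already aligned and are returned unchanged, so the helper only alters behaviour on systems with larger pages.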
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
** Also affects: gcc via http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931 Importance: Unknown Status: Unknown
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
Hi Dave, It does look like a page size issue. I submitted the following bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931 ** Bug watch added: GCC Bugzilla #60931 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931
Re: [Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
An excellent point. Timers are managed by a single goroutine, using a
priority queue of events to wait on and channels to deliver the timer
events. It should be doable to write some code that stresses timers.

However, I don't believe that SIGALRM is used, at least not in gc,
from which most of the gccgo standard library is derived; gccgo might be
slightly different.

The event that crashes the go process is related to a watchdog timer
that expires and tries to kill the subprocess.

On Wed, Apr 16, 2014 at 6:04 PM, Anton Blanchard wrote:
> There shouldn't be any difference in terms of signal handling.
>
> I've now seen a couple of failures in mongodb/TLS networking code:
>
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal 0xb code=0x1 addr=0x38]
>
> goroutine 16 [running]:
> crypto_tls.SetWriteDeadline.pN15_crypto_tls.Conn
>         ../../../gcc/libgo/go/crypto/tls/conn.go:111
> labix.org_v2_mgo.updateDeadline.pN28_labix.org_v2_mgo.mongoSocket
>         /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:273
> labix.org_v2_mgo.Query.pN28_labix.org_v2_mgo.mongoSocket
>         /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:474
> labix.org_v2_mgo.SimpleQuery.pN28_labix.org_v2_mgo.mongoSocket
>         /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:320
> labix.org_v2_mgo.pinger.pN28_labix.org_v2_mgo.mongoServer
>         /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:278
> created by mgo.newServer
>         /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:80
>
> which is:
>
> func (c *Conn) SetWriteDeadline(t time.Time) error {
>         return c.conn.SetWriteDeadline(t)
> }
>
> SetWriteDeadline will end up in timer code, and I've previously seen
> failures in the timer code.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1304754
>
> Title:
>   gccgo on ppc64el using split stacks when not supported
>
> Status in “gccgo-4.9” package in Ubuntu:
>   Confirmed
>
> Bug description:
>   On kernels 3.13-18 and 3.13-23 (there may be others) the kernel is
>   killing gccgo compiled binaries
>
>   [18519.444748] jujud[19277]: bad frame in setup_rt_frame:
>   nip lr
>   [18519.673632] init: juju-agent-ubuntu-local main process (19220)
>   killed by SEGV signal
>   [18519.673651] init: juju-agent-ubuntu-local main process ended, respawning
>
>   In powerpc/kernel/signal_64.c:
>
>   sys_rt_sigreturn is jumping to the badframe: label and executing an
>   unconditional force_sigsegv which is delivered to the userland
>   process. Like C++, gccgo tries to decode SIGSEGV as a nil pointer
>   access and blame some random function that happened to be the top
>   stack frame.
>
>   Reverting to the 3.13-08 kernel appears to resolve the issue which
>   (weakly) points the finger at the recent switch to 64k pages.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/+subscriptions
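A minimal sketch of the kind of timer stress code mentioned at the top of this message, under some assumptions: the function name stressTimers, the worker and round counts, and the timer durations are all hypothetical, and debug.SetGCPercent(1) is used as a rough in-program stand-in for running with GOGC=1. It has not been shown to reproduce the crash.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
	"sync/atomic"
	"time"
)

// stressTimers runs `workers` goroutines that each arm `rounds` short
// timers. Even rounds stop the timer early; odd rounds wait for it to
// fire. Every round also allocates a little garbage so the collector
// has work to do. It returns how many timers were waited on until they
// fired.
func stressTimers(workers, rounds int) int64 {
	var fired int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				// Non-constant size, so this is normally a heap allocation.
				garbage := make([]byte, 1024+(id+i)%64)
				garbage[0] = byte(i)

				t := time.NewTimer(time.Duration(1+(id+i)%5) * time.Microsecond)
				if i%2 == 0 {
					if !t.Stop() {
						// Already expired: drain without blocking.
						select {
						case <-t.C:
						default:
						}
					}
					continue
				}
				<-t.C
				atomic.AddInt64(&fired, 1)
			}
		}(w)
	}
	wg.Wait()
	return fired
}

func main() {
	// Collect very frequently, similar in spirit to GOGC=1.
	debug.SetGCPercent(1)
	fmt.Println("timers fired:", stressTimers(8, 1000))
}
```

Mixing stopped and expired timers is a deliberate choice here: it exercises both insertion into and removal from the runtime's timer queue while the collector is running.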
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
There shouldn't be any difference in terms of signal handling.

I've now seen a couple of failures in mongodb/TLS networking code:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x38]

goroutine 16 [running]:
crypto_tls.SetWriteDeadline.pN15_crypto_tls.Conn
        ../../../gcc/libgo/go/crypto/tls/conn.go:111
labix.org_v2_mgo.updateDeadline.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:273
labix.org_v2_mgo.Query.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:474
labix.org_v2_mgo.SimpleQuery.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:320
labix.org_v2_mgo.pinger.pN28_labix.org_v2_mgo.mongoServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:278
created by mgo.newServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:80

which is:

func (c *Conn) SetWriteDeadline(t time.Time) error {
        return c.conn.SetWriteDeadline(t)
}

SetWriteDeadline will end up in timer code, and I've previously seen
failures in the timer code.
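To make the delegation pattern in the quoted SetWriteDeadline method concrete, here is a small sketch under stated assumptions: deadlineConn and writeTimesOut are hypothetical names, net.Pipe stands in for the real TLS and mongo sockets, and the example only shows a write deadline expiring; it does not reproduce the nil pointer dereference.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// deadlineConn is a hypothetical wrapper with the same shape as the
// method quoted above: it forwards the deadline to an inner connection.
type deadlineConn struct {
	conn net.Conn
}

func (c *deadlineConn) SetWriteDeadline(t time.Time) error {
	return c.conn.SetWriteDeadline(t)
}

// writeTimesOut reports whether a write that nobody reads fails with a
// timeout once a deadline of ms milliseconds has passed.
func writeTimesOut(ms int) bool {
	local, remote := net.Pipe()
	defer local.Close()
	defer remote.Close()

	c := &deadlineConn{conn: local}
	deadline := time.Now().Add(time.Duration(ms) * time.Millisecond)
	if err := c.SetWriteDeadline(deadline); err != nil {
		return false
	}

	// The remote end never reads, so the write blocks until the deadline.
	_, err := local.Write([]byte("ping"))
	ne, ok := err.(net.Error)
	return ok && ne.Timeout()
}

func main() {
	fmt.Println("write timed out:", writeTimesOut(10))
}
```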
Re: [Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
Hi Anton,

I've been looking at another angle via a different crash. I see a crash
if a child process gets a signal, which sort of reflects back on the
parent. Are there any alignment requirements for signal handling on 64k
kernels?

Dave

On Wed, Apr 16, 2014 at 4:28 PM, Anton Blanchard wrote:
> This doesn't explain why we failed in the first place however. Using
> gdb, I have seen a couple of SEGVs in:
>
> * 1 Thread 0x3fffa8c447e0 (LWP 5562) "jujud" timerproc
> (dummy=) at ../../../gcc/libgo/runtime/time.goc:217
>
> ie:
>
> f = (void*)t->fv->fn;
>
> Perhaps a stale timer that we aren't cancelling?
>
> I've also seen a fail here:
>
> fatal error: runtime_lock: lock count
>
> goroutine 2 [running]:
> runtime_dopanic
>         ../../../gcc/libgo/runtime/panic.c:78
> runtime_throw
>         ../../../gcc/libgo/runtime/panic.c:116
> runtime_lock
>         ../../../gcc/libgo/runtime/lock_futex.c:41
> runtime_allocmcache
>         ../../../gcc/libgo/runtime/malloc.goc:337
> runtime_startpanic
>         ../../../gcc/libgo/runtime/panic.c:46
> runtime_throw
>         ../../../gcc/libgo/runtime/panic.c:114
> runtime_unlock
>         ../../../gcc/libgo/runtime/lock_futex.c:101
> runtime_MHeap_Scavenger
>         ../../../gcc/libgo/runtime/mheap.c:482
> kickoff
>         ../../../gcc/libgo/runtime/proc.c:237
>
>         :0
>
>         :0
> created by runtime_main
>         ../../../gcc/libgo/runtime/proc.c:565
Re: [Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
On Wed, Apr 16, 2014 at 4:26 PM, Anton Blanchard wrote:
> I've made some progress with these fails. A lot of the confusion is
> around the way gccgo hooks the SEGV handler and attempts to backtrace
> all goroutines (the code is in runtime_tracebackothers())
>
> It does this by calling runtime_gogo() which temporarily switches to the
> goroutine using setcontext(). If the context is bad in any way, this
> will cause us to SEGV again. I printed out the stack pointer (r1) and
> the NIA during this stack backtracing, and we see where things go south
> just as we are about to dump goroutine 0:
>
> goroutine 0 [idle]:
> DEBUG: runtime_gogo r1 0 nia 0
>
> r1 = 0, nia = 0. When we call setcontext on this invalid context we die
> with:
>
> jujud[5258]: bad frame in setup_rt_frame: nip
> lr
>
> Perhaps we aren't saving away the context for goroutine 0 correctly.

Hmm, could be. It looks like the process was crashing anyway.
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
This doesn't explain why we failed in the first place however. Using
gdb, I have seen a couple of SEGVs in:

* 1 Thread 0x3fffa8c447e0 (LWP 5562) "jujud" timerproc
(dummy=) at ../../../gcc/libgo/runtime/time.goc:217

ie:

f = (void*)t->fv->fn;

Perhaps a stale timer that we aren't cancelling?

I've also seen a fail here:

fatal error: runtime_lock: lock count

goroutine 2 [running]:
runtime_dopanic
        ../../../gcc/libgo/runtime/panic.c:78
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:116
runtime_lock
        ../../../gcc/libgo/runtime/lock_futex.c:41
runtime_allocmcache
        ../../../gcc/libgo/runtime/malloc.goc:337
runtime_startpanic
        ../../../gcc/libgo/runtime/panic.c:46
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:114
runtime_unlock
        ../../../gcc/libgo/runtime/lock_futex.c:101
runtime_MHeap_Scavenger
        ../../../gcc/libgo/runtime/mheap.c:482
kickoff
        ../../../gcc/libgo/runtime/proc.c:237

        :0

        :0
created by runtime_main
        ../../../gcc/libgo/runtime/proc.c:565
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
I've made some progress with these fails. A lot of the confusion is
around the way gccgo hooks the SEGV handler and attempts to backtrace
all goroutines (the code is in runtime_tracebackothers())

It does this by calling runtime_gogo() which temporarily switches to the
goroutine using setcontext(). If the context is bad in any way, this
will cause us to SEGV again. I printed out the stack pointer (r1) and
the NIA during this stack backtracing, and we see where things go south
just as we are about to dump goroutine 0:

goroutine 0 [idle]:
DEBUG: runtime_gogo r1 0 nia 0

r1 = 0, nia = 0. When we call setcontext on this invalid context we die
with:

jujud[5258]: bad frame in setup_rt_frame: nip
lr

Perhaps we aren't saving away the context for goroutine 0 correctly.
[Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported
Anton: I've done some experiments with the peano.go test and confirmed
that gccgo on ppc is correctly configured to not use -fsplit-stack.

It turns out that peano.go can't pass without split stacks. On
gccgo/ppc64 the program crashes at the following stack depth:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fffb770 (LWP 24713)]
0x10004e0c in main.is_zero ()
(gdb) bt
#0 0x10004e0c in main.is_zero ()
#1 0x100051fc in main.count ()
#2 0x1000522c in main.count ()
...
#31380 0x1000522c in main.count ()
#31381 0x10005854 in main.main ()

I think the peano example is just a straight 'fall off the stack' type
error; it also generates a slightly different message in dmesg:

ubuntu@winton-02:~/go/test$ ./a.out
Segmentation fault (core dumped)
ubuntu@winton-02:~/go/test$ dmesg | tail -n1
[501663.078093] a.out[25679]: bad frame in setup_rt_frame:
00c20ffaf0e0 nip 10004e0c lr 100051fc
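For readers without the Go test suite at hand, the following sketch shows the shape of recursion that produces a backtrace like the one above. It is written in the spirit of peano.go but is not that test: the Number type, the helper names, and the depth constant are hypothetical, with the depth chosen only to be close to the number of main.count frames in the backtrace.

```go
package main

import "fmt"

// Number is a Peano-style natural number: nil is zero, and a non-nil
// node is the successor of the number it points to.
type Number struct {
	pred *Number
}

func isZero(n *Number) bool { return n == nil }

func succ(n *Number) *Number { return &Number{pred: n} }

// build constructs the number k iteratively, so only count recurses.
func build(k int) *Number {
	var n *Number
	for i := 0; i < k; i++ {
		n = succ(n)
	}
	return n
}

// count recurses once per successor, so the stack depth grows linearly
// with the value of n. Without growable (split) stacks, a large enough
// n runs off the end of a fixed-size goroutine stack.
func count(n *Number) int {
	if isZero(n) {
		return 0
	}
	return count(n.pred) + 1
}

func main() {
	const depth = 31380 // hypothetical, close to the frame count above
	fmt.Println(count(build(depth)))
}
```

With a toolchain that grows goroutine stacks on demand this is expected to run to completion; whether it crashes elsewhere depends on the fixed stack size in use.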