Re: [PATCH] fix for Re: crash in gc with upside-down stack
2008/11/13 Ludovic Courtès <[EMAIL PROTECTED]>:
> Hi,
>
> "Linas Vepstas" <[EMAIL PROTECTED]> writes:
>
>> The patch below fixes a crash during garbage collection, where, during
>> the mark-stack phase, the top and bottom of the stack are found to be
>> in backwards order, typically because scm_with_guile() was called when
>> the stack is much shorter than when a thread was first guilified. That
>> is, the stack base pointer is stale, and can be inverted from the stack
>> top. If GC runs due to activity in some other thread, the stale base
>> pointer leads to the crash (as base-top is approximately 2^32 or 2^64).
>
> Good catch!  I applied it, along with a test case that reproduced the
> problem:
>
>   http://git.savannah.gnu.org/gitweb/?p=guile.git;a=commitdiff;h=cd1a1e47b5e781277560d9933a44e6aabd0c9c49

Yes indeed.  Nice work Linas, and nice test Ludovic.

    Neil
Re: [PATCH] fix for Re: crash in gc with upside-down stack
Hi,

"Linas Vepstas" <[EMAIL PROTECTED]> writes:

> The patch below fixes a crash during garbage collection, where, during
> the mark-stack phase, the top and bottom of the stack are found to be
> in backwards order, typically because scm_with_guile() was called when
> the stack is much shorter than when a thread was first guilified. That
> is, the stack base pointer is stale, and can be inverted from the stack
> top. If GC runs due to activity in some other thread, the stale base
> pointer leads to the crash (as base-top is approximately 2^32 or 2^64).

Good catch!  I applied it, along with a test case that reproduced the
problem:

  http://git.savannah.gnu.org/gitweb/?p=guile.git;a=commitdiff;h=cd1a1e47b5e781277560d9933a44e6aabd0c9c49

Thanks!

Ludo'.
[PATCH] fix for Re: crash in gc with upside-down stack
Patch below; I'm also attaching the same patch, in case gmail
is scrambling this thing :-/

Also, I've long had a generic assignment on file with the FSF.

--linas

The patch below fixes a crash during garbage collection, where, during
the mark-stack phase, the top and bottom of the stack are found to be
in backwards order, typically because scm_with_guile() was called when
the stack is much shorter than when a thread was first guilified. That
is, the stack base pointer is stale, and can be inverted from the stack
top. If GC runs due to activity in some other thread, the stale base
pointer leads to the crash (as base-top is approximately 2^32 or 2^64).

A typical symptom of this bug, on a 32-bit system, is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435           SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
    at gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375

Notice that 4294966782 == 0xfffffdfe == -0x202 (that is, -514 when read
as a signed 32-bit value).

Please apply in time for guile-1.8.6!

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

---
 libguile/threads.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: guile-1.8.5/libguile/threads.c
===================================================================
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 15:17:12.0 -0600
+++ guile-1.8.5/libguile/threads.c 2008-11-13 15:32:07.0 -0600
@@ -577,9 +577,24 @@ scm_i_init_thread_for_guile (SCM_STACKIT
       /* This thread is already guilified but not in guile mode, just
         resume it.
 
-        XXX - base might be lower than when this thread was first
-        guilified.
+        A user call to scm_with_guile() will lead us to here.  This
+        could happen anywhere on the stack, and in particular, the
+        stack can be *much* shorter than what it was when this thread
+        was first guilified.  This will typically happen in
+        on_thread_exit(), where the stack is *always* shorter than
+        when the thread was first guilified.  If the GC happens to
+        get triggered due to some other thread, we'd end up with
+        t->top "upside-down" w.r.t. t->base, which will result in
+        chaos in scm_threads_mark_stacks() when top-base=2^32 or 2^64.
+        Thus, reset the base, if needed.
        */
+#if SCM_STACK_GROWS_UP
+      if (base < t->base)
+         t->base = base;
+#else
+      if (base > t->base)
+         t->base = base;
+#endif
       scm_enter_guile ((scm_t_guile_ticket) t);
       return 1;
     }
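
For readers who want to see the failure mode in isolation, here is a
minimal sketch of the idea behind the patch. It is not libguile code:
stack_item, struct thread_bounds, items_to_mark() and
reset_base_if_needed() are hypothetical stand-ins, and the sketch
assumes a stack that grows down (the #else branch above).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SCM_STACKITEM; not the libguile type. */
typedef uintptr_t stack_item;

/* Hypothetical, trimmed-down stand-in for scm_i_thread: only the two
   stack bounds discussed in this thread. */
struct thread_bounds {
  stack_item *base;   /* recorded when the thread entered guile mode */
  stack_item *top;    /* recorded when the thread was suspended */
};

/* Item count a marker would be handed for a downward-growing stack.
   The signed pointer difference is converted to an unsigned count, so
   an inverted pair (base below top) wraps to a huge value. */
static size_t
items_to_mark (const struct thread_bounds *t)
{
  return (size_t) (t->base - t->top);
}

/* Same rule as the #else branch of the patch: if the thread re-enters
   guile mode from a frame above the recorded base, move the base up. */
static void
reset_base_if_needed (struct thread_bounds *t, stack_item *base)
{
  if (base > t->base)
    t->base = base;
}
```

With a stale base ten items below the top, items_to_mark() returns
SIZE_MAX - 9, which is the "approximately 2^32 or 2^64" count described
above; after reset_base_if_needed() is given an address above the top,
the count is small and positive again.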
Re: crash in gc with upside-down stack
Attached below is a debugging patch, and its output, which shows
that the stack bounds are frequently upside-down, and are sometimes
upside-down when the GC runs, thus leading to a crash.

In the next email, I'll propose a patch that fixes the problem.

The original problem report:

> 2008/11/11 Linas Vepstas <[EMAIL PROTECTED]>:
>>
>> My stack below.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0xf5333b90 (LWP 20587)]
>> 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at
>> gc-mark.c:435
>> 435           SCM obj = * (SCM *) &x[m];
>> Current language:  auto; currently c
>> (gdb) bt
>> #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
>>    at gc-mark.c:435
>> #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
>> #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
>> #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598
>

A debugging patch. Yes, it's ugly, it's intentionally ugly.
More of an eye-catcher that way.

Index: guile-1.8.5/libguile/threads.c
===================================================================
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 07:58:22.0 -0600
+++ guile-1.8.5/libguile/threads.c 2008-11-13 13:14:00.0 -0600
@@ -395,6 +395,10 @@ static scm_t_guile_ticket
 scm_leave_guile ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards stack %d\n", sz);
+}
   scm_i_pthread_mutex_unlock (&t->heap_mutex);
   return (scm_t_guile_ticket) t;
 }
@@ -694,7 +698,15 @@ scm_i_with_guile_and_parent (void *(*fun
   really_entered = scm_i_init_thread_for_guile (&base_item, parent);
   res = scm_c_with_continuation_barrier (func, data);
   if (really_entered)
-    scm_leave_guile ();
+{
+// scm_leave_guile ();
+scm_i_thread * t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+int szb=t->base - &base_item;
+if(0>sz) {
+printf("duuude scm_leav_guile and parent %d %d\n", sz, szb);
+}
+}
   return res;
 }
 
@@ -704,6 +716,11 @@ scm_without_guile (void *(*func)(void *)
   void *res;
   scm_t_guile_ticket t;
   t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_wo guile %d\n", sz);
+}
   res = func (data);
   scm_enter_guile (t);
   return res;
@@ -1371,8 +1388,15 @@ scm_threads_mark_stacks (void)
 
 #if SCM_STACK_GROWS_UP
       scm_mark_locations (t->base, t->top - t->base);
+
 #else
+int sz=t->base - t->top;
+if(0<=sz) {
       scm_mark_locations (t->top, t->base - t->top);
+} else {
+printf ("duude bugg!!\n");
+printf ("duude stack top=%p base=%p sz=%d\n", t->top, t->base, t->base - t->top);
+}
 #endif
       scm_mark_locations ((SCM_STACKITEM *) t->regs,
                           ((size_t) sizeof(t->regs)
@@ -1441,6 +1465,11 @@ int
 scm_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_mutexe %d\n", sz);
+}
   int res = scm_i_pthread_mutex_lock (mutex);
   scm_enter_guile (t);
   return res;
@@ -1463,6 +1492,11 @@ int
 scm_pthread_cond_wait (scm_i_pthread_cond_t *cond, scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_conde %d\n", sz);
+}
   int res = scm_i_pthread_cond_wait (cond, mutex);
   scm_enter_guile (t);
   return res;
@@ -1578,7 +1612,12 @@ scm_i_thread_put_to_sleep ()
     {
       scm_i_thread *t;
 
-      scm_leave_guile ();
+      // scm_leave_guile ();
+      t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards was scm_i_thread_put_to_sleep %d\n", sz);
+}
       scm_i_pthread_mutex_lock (&thread_admin_mutex);
 
       /* Signal all threads to go to sleep
@@ -1620,6 +1659,10 @@ void
 scm_i_thread_sleep_for_gc ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_i_thread_sleep_for_gc backwards stack %d\n", sz);
+}
   scm_i_pthread_cond_wait (&wake_up_cond, &t->heap_mutex);
   resume (t);
 }

Here is an example of the output generated:

duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backw
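
The debugging patch stores each pointer difference in an int and prints
it with %d, which is fine for a quick 32-bit experiment. A slightly
tidier version of the same check might look like the sketch below. It
is only an illustration: stack_item, stack_extent() and
check_stack_bounds() are made-up names rather than libguile functions,
and a downward-growing stack is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for SCM_STACKITEM; not the libguile type. */
typedef uintptr_t stack_item;

/* Signed extent of the stack, in items, for a stack that grows down.
   A negative result means the bounds are upside-down.  ptrdiff_t keeps
   the full width of the difference on 64-bit builds. */
static ptrdiff_t
stack_extent (const stack_item *base, const stack_item *top)
{
  return base - top;
}

/* Returns 1 when the bounds are in order; otherwise logs them to
   stderr and returns 0 so the caller can skip marking that stack. */
static int
check_stack_bounds (const char *where,
                    const stack_item *base, const stack_item *top)
{
  ptrdiff_t sz = stack_extent (base, top);
  if (sz < 0)
    {
      fprintf (stderr, "%s: backwards stack top=%p base=%p sz=%td\n",
               where, (const void *) top, (const void *) base, sz);
      return 0;
    }
  return 1;
}
```

Note that the difference counts stack items, not bytes, because it is
taken between pointers of the item type.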
Re: crash in gc with upside-down stack
Some minor updates:

2008/11/11 Linas Vepstas <[EMAIL PROTECTED]>:
>
> My stack below.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xf5333b90 (LWP 20587)]
> 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
> 435           SCM obj = * (SCM *) &x[m];
> Current language:  auto; currently c
> (gdb) bt
> #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
>    at gc-mark.c:435
> #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
> #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
> #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598

My current code reproduces this fairly readily; I am seeing it
many dozens/hundreds of times a day.

I tweaked guile to check that the stack bounds are in order,
to print an error message when they are not, and then to just
troop on -- and so I see dozens/hundreds of prints.

When the stack bounds are reversed, the difference is *always*
58 bytes; and in fact, the two bad stack bounds are always the
same.

It appears to happen *only* when I have multiple threads all
trying to define functions at the same time; it never happens
when one thread goes off to do some heavy computing.

--linas
crash in gc with upside-down stack
Here's another one, I'm trying to dig into this:

It's more or less the same crash as the one reported at:
http://bugs.gentoo.org/228097
and
http://www.mail-archive.com/bug-guile@gnu.org/msg04568.html

My stack below.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435           SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
    at gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
#2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
#3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598
#4  0xf7710f4d in scm_gc_for_newcell (freelist=0xf779b76c,
    free_cells=0x1228e9b0) at gc.c:509
#5  0xf7768bd8 in scm_c_catch (tag=0x104, body=0xf76f3830,
    body_data=0xf528, handler=0xf76f3850,
    handler_data=0xf528, pre_unwind_handler=0xf77683e0,
    pre_unwind_handler_data=0x0) at ../libguile/inline.h:186
#6  0xf76f3cf2 in scm_i_with_continuation_barrier (body=0xf76f3830,
    body_data=0xf528, handler=0xf76f3850,
    handler_data=0xf528, pre_unwind_handler=0xf77683e0,
    pre_unwind_handler_data=0x0) at continuations.c:326
#7  0xf76f3dd3 in scm_c_with_continuation_barrier (
    func=0xf7767ab0, data=0x1228e938)
    at continuations.c:368
---Type <return> to continue, or q <return> to quit---
#8  0xf77678f9 in scm_i_with_guile_and_parent (func=0xf7767ab0,
    data=0x1228e938, parent=0x19f63670) at threads.c:695
#9  0xf77679ee in scm_with_guile (func=0xf7767ab0,
    data=0x1228e938) at threads.c:683
#10 0xf7767a43 in on_thread_exit (v=0x1228e938) at threads.c:505
#11 0xf7d7abb0 in __nptl_deallocate_tsd ()
   from /lib/tls/i686/cmov/libpthread.so.0
#12 0xf7d7b509 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#13 0xf7b79e5e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb)

I've seen this twice now in two days, but it's not readily reproducible.

By plugging the insanely large n into a hex calc, you'll see it's
actually 0xfffsomething, i.e. a small negative number that has
wrapped around. Looking carefully near threads.c:1375 seems to imply
that stack top and stack bottom are reversed. So I added a printf at
that location, and tried to reproduce the crash. Several gazillion
print statements later, no crash.

I suspect that this is some sort of thread-race condition; I think
it happens when I am defining some functions from several different
threads at once. It seems *not* to occur once I get into hard-core
computations -- i.e. it happens no later than the first few dozen gc's.

This is on guile-1.8.5, --with-threads, on Ubuntu, Intel (actually
AMD64 cpu.)

--linas
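
The hex-calculator step can also be done in a few lines of C. This is
just a worked-arithmetic sketch; the helper name as_signed32() is made
up for illustration and is not part of Guile.

```c
#include <stdint.h>

/* Reinterpret a 32-bit unsigned count as the signed value it wrapped
   from.  The arithmetic is done in 64 bits so that no step relies on
   an implementation-defined narrowing conversion. */
static int32_t
as_signed32 (uint32_t n)
{
  if (n <= (uint32_t) INT32_MAX)
    return (int32_t) n;
  return (int32_t) ((int64_t) n - ((int64_t) UINT32_MAX + 1));
}
```

For the n in the backtrace, 4294966782 is 0xfffffdfe, and
as_signed32 (4294966782u) gives -514 (-0x202): a small negative
difference that was handed to scm_mark_locations() as an unsigned
count.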