Re: [PATCH] fix for Re: crash in gc with upside-down stack

2008-11-15 Thread Neil Jerram
2008/11/13 Ludovic Courtès <[EMAIL PROTECTED]>:
> Hi,
>
> "Linas Vepstas" <[EMAIL PROTECTED]> writes:
>
>> The patch below fixes a crash during garbage collection, where, during
>> the mark-stack phase, the top and bottom of the stack are found to be
>> in backwards order, typically because scm_with_guile() was called when
>> the stack is much shorter than when a thread was first guilified. That
>> is, the stack base pointer is stale, and can be inverted from the stack
>> top. If GC runs due to activity in some other thread, the stale base
>> pointer leads to the crash (as base-top is approximately 2^32 or 2^64).
>
> Good catch!  I applied it, along with a test case that reproduced the
> problem:
>
> http://git.savannah.gnu.org/gitweb/?p=guile.git;a=commitdiff;h=cd1a1e47b5e781277560d9933a44e6aabd0c9c49

Yes indeed.  Nice work Linas, and nice test Ludovic.

Neil




Re: [PATCH] fix for Re: crash in gc with upside-down stack

2008-11-13 Thread Ludovic Courtès
Hi,

"Linas Vepstas" <[EMAIL PROTECTED]> writes:

> The patch below fixes a crash during garbage collection, where, during
> the mark-stack phase, the top and bottom of the stack are found to be
> in backwards order, typically because scm_with_guile() was called when
> the stack is much shorter than when a thread was first guilified. That
> is, the stack base pointer is stale, and can be inverted from the stack
> top. If GC runs due to activity in some other thread, the stale base
> pointer leads to the crash (as base-top is approximately 2^32 or 2^64).

Good catch!  I applied it, along with a test case that reproduced the
problem:

  
http://git.savannah.gnu.org/gitweb/?p=guile.git;a=commitdiff;h=cd1a1e47b5e781277560d9933a44e6aabd0c9c49

Thanks!

Ludo'.





[PATCH] fix for Re: crash in gc with upside-down stack

2008-11-13 Thread Linas Vepstas
Patch below; I'm also attaching the same patch, in case
gmail is scrambling this thing :-/  Also, I've long had a
generic assignment on file with the FSF.

--linas

The patch below fixes a crash during garbage collection, where, during
the mark-stack phase, the top and bottom of the stack are found to be
in backwards order, typically because scm_with_guile() was called when
the stack is much shorter than when a thread was first guilified. That
is, the stack base pointer is stale, and can be inverted from the stack
top. If GC runs due to activity in some other thread, the stale base
pointer leads to the crash (as base-top is approximately 2^32 or 2^64).

A typical symptom of this bug, on a 32-bit system, is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435   SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375

Notice that 4294966782 == 0xfffffdfe == -0x202 (that is, -514 when read
as a signed 32-bit value).
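
The wraparound behind that number can be reproduced in isolation. The following is only a sketch: it models the 32-bit target with uint32_t, and the helper names (wrapped_count32, as_signed32, check_backtrace_value) are made up for illustration and are not part of libguile.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not libguile code: the element count that a
   32-bit unsigned subtraction yields for a stack range.  When
   base < top the result wraps around modulo 2^32. */
static uint32_t
wrapped_count32 (uint32_t base, uint32_t top)
{
  return base - top;
}

/* The same bit pattern read as a signed 32-bit value, computed
   explicitly so that no implementation-defined conversion is used. */
static int32_t
as_signed32 (uint32_t n)
{
  return (n <= INT32_MAX) ? (int32_t) n
                          : -(int32_t) (UINT32_MAX - n) - 1;
}

/* Self-check using the value of n from the backtrace. */
static void
check_backtrace_value (void)
{
  assert (UINT32_C (4294966782) == UINT32_C (0xfffffdfe));
  assert (as_signed32 (UINT32_C (4294966782)) == -514);   /* -0x202 */
  /* A base that sits 514 items below top produces exactly this n. */
  assert (wrapped_count32 (0, 514) == UINT32_C (4294966782));
}
```

So a base only a few hundred items on the wrong side of top is enough to make the marker try to walk almost the whole address space.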

Please apply in time for guile-1.8.6!

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

---
 libguile/threads.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 15:17:12.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-13 15:32:07.0 -0600
@@ -577,9 +577,24 @@ scm_i_init_thread_for_guile (SCM_STACKIT
   /* This thread is already guilified but not in guile mode, just
 resume it.

-XXX - base might be lower than when this thread was first
-guilified.
+ A user call to scm_with_guile() will lead us to here. This
+ could happen anywhere on the stack, and in particular, the
+ stack can be *much* shorter than what it was when this thread
+ was first guilified. This will typically happen in
+ on_thread_exit(), where the stack is *always* shorter than
+ when the thread was first guilified. If the GC happens to
+ get triggered due to some other thread, we'd end up with
+ t->top "upside-down" w.r.t. t->base, which will result in
+ chaos in scm_threads_mark_stacks() when top-base=2^32 or 2^64.
+ Thus, reset the base, if needed.
*/
+#if SCM_STACK_GROWS_UP
+  if (base < t->base)
+    t->base = base;
+#else
+  if (base > t->base)
+    t->base = base;
+#endif
   scm_enter_guile ((scm_t_guile_ticket) t);
   return 1;
 }
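
For readers who want to see the effect of the base reset outside of libguile, here is a minimal standalone sketch. It assumes a stack that grows toward lower addresses (the #else branch of the patch), and the types and function names (stack_item, thread_sketch, widen_base, words_to_mark) are stand-ins invented for this example, not the real scm_i_thread or SCM_STACKITEM.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for illustration only; the real scm_i_thread and
   SCM_STACKITEM are defined by libguile. */
typedef long stack_item;

struct thread_sketch
{
  stack_item *base;   /* outermost stack address recorded so far */
  stack_item *top;    /* innermost stack address saved at suspend */
};

/* Assumption: the stack grows toward lower addresses.  On re-entry,
   move base outward if the new entry point lies beyond the old one. */
static void
widen_base (struct thread_sketch *t, stack_item *base)
{
  if (base > t->base)
    t->base = base;
}

/* Number of items a marker would scan between top and base. */
static size_t
words_to_mark (const struct thread_sketch *t)
{
  assert (t->base >= t->top);
  return (size_t) (t->base - t->top);
}

static void
check_widen_base (void)
{
  stack_item fake_stack[64];
  /* Stale base below top: the "upside-down" state. */
  struct thread_sketch t = { &fake_stack[10], &fake_stack[20] };

  widen_base (&t, &fake_stack[32]);       /* re-entry on a shorter stack */
  assert (words_to_mark (&t) == 12);

  widen_base (&t, &fake_stack[5]);        /* deeper entry: base unchanged */
  assert (t.base == &fake_stack[32]);
}
```

The point of the design is that base only ever moves outward, so the range handed to the marker can never be inverted by a later, shallower re-entry.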

Re: crash in gc with upside-down stack

2008-11-13 Thread Linas Vepstas
Attached below is a debugging patch, and its output,
which shows that the stack bounds are frequently
upside-down, and are sometimes upside-down
when the GC runs, thus leading to a crash.

In the next email, I'll propose a patch that fixes the
problem.

The original problem report:

> 2008/11/11 Linas Vepstas <[EMAIL PROTECTED]>:
>>
>> My stack below.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0xf5333b90 (LWP 20587)]
>> 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
>> 435   SCM obj = * (SCM *) &x[m];
>> Current language:  auto; currently c
>> (gdb) bt
>> #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
>>     at gc-mark.c:435
>> #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
>> #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
>> #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598
>

A debugging patch. Yes, it's ugly; it's intentionally ugly.
More of an eye-catcher that way.

Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 07:58:22.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-13 13:14:00.0 -0600
@@ -395,6 +395,10 @@ static scm_t_guile_ticket
 scm_leave_guile ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards stack %d\n", sz);
+}
   scm_i_pthread_mutex_unlock (&t->heap_mutex);
   return (scm_t_guile_ticket) t;
 }
@@ -694,7 +698,15 @@ scm_i_with_guile_and_parent (void *(*fun
   really_entered = scm_i_init_thread_for_guile (&base_item, parent);
   res = scm_c_with_continuation_barrier (func, data);
   if (really_entered)
-scm_leave_guile ();
+{
+// scm_leave_guile ();
+scm_i_thread * t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+int szb=t->base - &base_item;
+if(0>sz) {
+printf("duuude scm_leav_guile and parent %d %d\n", sz, szb);
+}
+}
   return res;
 }

@@ -704,6 +716,11 @@ scm_without_guile (void *(*func)(void *)
   void *res;
   scm_t_guile_ticket t;
   t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_wo guile %d\n", sz);
+}
   res = func (data);
   scm_enter_guile (t);
   return res;
@@ -1371,8 +1388,15 @@ scm_threads_mark_stacks (void)

 #if SCM_STACK_GROWS_UP
   scm_mark_locations (t->base, t->top - t->base);
+
 #else
+int sz=t->base - t->top;
+if(0<=sz) {
   scm_mark_locations (t->top, t->base - t->top);
+} else {
+printf ("duude bugg!!\n");
+printf ("duude stack top=%p base=%p sz=%d\n", t->top, t->base, t->base - t->top);
+}
 #endif
   scm_mark_locations ((SCM_STACKITEM *) t->regs,
  ((size_t) sizeof(t->regs)
@@ -1441,6 +1465,11 @@ int
 scm_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_mutexe %d\n", sz);
+}
   int res = scm_i_pthread_mutex_lock (mutex);
   scm_enter_guile (t);
   return res;
@@ -1463,6 +1492,11 @@ int
 scm_pthread_cond_wait (scm_i_pthread_cond_t *cond, scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_conde %d\n", sz);
+}
   int res = scm_i_pthread_cond_wait (cond, mutex);
   scm_enter_guile (t);
   return res;
@@ -1578,7 +1612,12 @@ scm_i_thread_put_to_sleep ()
 {
   scm_i_thread *t;

-  scm_leave_guile ();
+  // scm_leave_guile ();
+   t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards was scm_i_thread_put_to_sleep %d\n", sz);
+}
   scm_i_pthread_mutex_lock (&thread_admin_mutex);

   /* Signal all threads to go to sleep
@@ -1620,6 +1659,10 @@ void
 scm_i_thread_sleep_for_gc ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_i_thread_sleep_for_gc backwards stack %d\n", sz);
+}
   scm_i_pthread_cond_wait (&wake_up_cond, &t->heap_mutex);
   resume (t);
 }


Here is an example of the output generated:

duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backw
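
The sz=-54 in this output is a count of stack items, not bytes, because it comes from a pointer subtraction. A small sketch of the arithmetic follows. It assumes 4-byte stack items, which is what the printed addresses imply on this 32-bit build; items_between and check_debug_output are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: signed distance from top to base, in stack
   items of item_size bytes.  Addresses are passed as plain integers
   so that the example does not depend on the host pointer width. */
static int64_t
items_between (uint32_t base, uint32_t top, int64_t item_size)
{
  int64_t bytes = (int64_t) base - (int64_t) top;
  return bytes / item_size;
}

/* Values taken from the "duude stack top=... base=..." line. */
static void
check_debug_output (void)
{
  /* 0xf355b908 - 0xf355b9e0 = -0xd8 = -216 bytes = -54 items. */
  assert (items_between (UINT32_C (0xf355b908),
                         UINT32_C (0xf355b9e0), 4) == -54);
}
```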

Re: crash in gc with upside-down stack

2008-11-12 Thread Linas Vepstas
Some minor updates:

2008/11/11 Linas Vepstas <[EMAIL PROTECTED]>:
>
> My stack below.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xf5333b90 (LWP 20587)]
> 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
> 435   SCM obj = * (SCM *) &x[m];
> Current language:  auto; currently c
> (gdb) bt
> #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
>     at gc-mark.c:435
> #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
> #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
> #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598

My current code reproduces this fairly readily, I am seeing
it many dozens/hundreds of times a day.

I tweaked guile to check that the stack bounds are in order,
and to print an error message when they are not, and then to
just troop on -- and so I see dozens/hundreds of prints.
When the stack bounds are reversed, the difference
is *always* 58 bytes; and in fact, the two bad stack
bounds are always the same.
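
The kind of check described here can be sketched on its own, roughly as follows. This is not the actual tweak: it assumes a downward-growing stack, and stack_item, checked_mark_count and check_checked_mark_count are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef long stack_item;   /* stand-in for SCM_STACKITEM */

/* Hypothetical guard: return how many items lie between top and base,
   or report the problem and return 0 when the bounds are reversed.
   Assumes the stack grows toward lower addresses. */
static size_t
checked_mark_count (const stack_item *top, const stack_item *base)
{
  if (base < top)
    {
      fprintf (stderr, "stack bounds reversed: top=%p base=%p\n",
               (const void *) top, (const void *) base);
      return 0;   /* skip the scan and carry on */
    }
  return (size_t) (base - top);
}

static void
check_checked_mark_count (void)
{
  stack_item fake_stack[64];

  /* Bounds in order: 32 items would be scanned. */
  assert (checked_mark_count (&fake_stack[20], &fake_stack[52]) == 32);
  /* Bounds reversed: a message is printed and nothing is scanned. */
  assert (checked_mark_count (&fake_stack[52], &fake_stack[20]) == 0);
}
```

Skipping the scan avoids the crash but leaves that thread's stack unmarked for the collection in question, so it is a diagnostic aid rather than a fix.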

It appears to happen *only* when I have multiple threads
all trying to define functions at the same time, it never
happens when one thread goes off to do some heavy
computing.

--linas




crash in gc with upside-down stack

2008-11-12 Thread Linas Vepstas
Here's another one, I'm trying to dig into this:

It's more or less the same crash as the one reported at:

http://bugs.gentoo.org/228097
and
http://www.mail-archive.com/bug-guile@gnu.org/msg04568.html

My stack below.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435   SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
at gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
#2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
#3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598
#4  0xf7710f4d in scm_gc_for_newcell (freelist=0xf779b76c,
free_cells=0x1228e9b0)
at gc.c:509
#5  0xf7768bd8 in scm_c_catch (tag=0x104, body=0xf76f3830 ,
body_data=0xf528, handler=0xf76f3850 ,
handler_data=0xf528,
pre_unwind_handler=0xf77683e0 ,
pre_unwind_handler_data=0x0) at ../libguile/inline.h:186
#6  0xf76f3cf2 in scm_i_with_continuation_barrier (body=0xf76f3830 ,
body_data=0xf528, handler=0xf76f3850 ,
handler_data=0xf528,
pre_unwind_handler=0xf77683e0 ,
pre_unwind_handler_data=0x0) at continuations.c:326
#7  0xf76f3dd3 in scm_c_with_continuation_barrier (
func=0xf7767ab0 , data=0x1228e938) at continuations.c:368
---Type <return> to continue, or q <return> to quit---
#8  0xf77678f9 in scm_i_with_guile_and_parent (func=0xf7767ab0
,
data=0x1228e938, parent=0x19f63670) at threads.c:695
#9  0xf77679ee in scm_with_guile (func=0xf7767ab0 ,
data=0x1228e938) at threads.c:683
#10 0xf7767a43 in on_thread_exit (v=0x1228e938) at threads.c:505
#11 0xf7d7abb0 in __nptl_deallocate_tsd ()
   from /lib/tls/i686/cmov/libpthread.so.0
#12 0xf7d7b509 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#13 0xf7b79e5e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb)

I've seen this twice now in two days, but it's not readily reproducible.
By plugging the insanely large n into a hex calc, you'll see it's actually
0xfffsomething. Looking carefully near threads.c:1375 seems to imply
that stack top and stack bottom are reversed. So I added a printf at that
location, and tried to reproduce the crash. Several gazillion print
statements later, no crash.

I suspect that this is some sort of thread-race condition; I think it
happens when I am defining some functions from several different
threads at once. It seems *not* to occur once I get into hard-core
computations-- i.e. it happens no later than the first few dozen gc's.

This is on guile-1.8.5, built --with-threads, on Ubuntu, Intel (actually an AMD64 CPU).

--linas