bug#38041: crypto with gnutls aka nettle (libhogweed) and scm_realloc

2019-11-02 Thread Linas Vepstas
I've got an app that links gnutls (for crypto code) which links nettle
(libhogweed), which is a GMP-using crypto library that seems to have wanted
to call plain-old realloc, and ended up calling scm_realloc instead.  Note
that nettle does NOT use guile, so there's no plausible way that I know of
to end up in guile code.  This only seems to happen when nettle is used
from multiple threads (so is maybe a nettle bug??) but the stack trace is
so bizarre, I thought I'd report it here.

It would seem that someone, somewhere, is doing some low-level thunking or
trampolining of realloc().  First, the crazy stack trace:

It's currently highly reproducible and exact:
(gdb) r
Starting program:
/home/linas/src/novamente/src/atomspace-dht/build/tests/persist/dht/MultiUserUTest

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running cxxtest tests (2 tests)Start creating 6 user sessions
[2019-11-03 00:46:03:350] [DEBUG] BEGIN TEST: test_multiuser
Collecting from unknown thread

Thread 13 "MultiUserUTest" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe0ff9700 (LWP 3844)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x76834535 in __GI_abort () at abort.c:79
#2  0x75c80ded in GC_push_all_stacks () at pthread_stop_world.c:585
#3  0x75c777df in GC_mark_some (
cold_gc_frame=0x7fffe0ff59d0 "\274\327\354\365\377\177") at mark.c:322
#4  0x75c6d15d in GC_stopped_mark (
stop_func=stop_func@entry=0x75c6cbf0 )
at alloc.c:698
#5  0x75c6dc69 in GC_try_to_collect_inner (
stop_func=0x75c6cbf0 ) at alloc.c:486
#6  0x75c6deea in GC_try_to_collect_general (
stop_func=stop_func@entry=0x0, force_unmap=force_unmap@entry=0)
at alloc.c:1065
#7  0x75c6dfbd in GC_gcollect () at alloc.c:1089
#8  0x76df3e5e in scm_gc_register_allocation (size=size@entry=136)
at ../../libguile/gc.c:596
#9  0x76df3554 in do_realloc (new_size=136, from=0x0)
at ../../libguile/gc-malloc.c:70
#10 scm_realloc (mem=0x0, size=136) at ../../libguile/gc-malloc.c:117
#11 0x7630431f in _nettle_gmp_alloc ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#12 0x762fc968 in nettle_mpz_random_size ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#13 0x762fc9f4 in nettle_mpz_random ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#14 0x762fcd63 in _nettle_generate_pocklington_prime ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#15 0x762fd2ce in nettle_random_prime ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#16 0x76300b53 in nettle_rsa_generate_keypair ()
   from /usr/lib/x86_64-linux-gnu/libhogweed.so.4
#17 0x77e0729e in ?? () from
/usr/lib/x86_64-linux-gnu/libgnutls.so.30
#18 0x77da8f07 in gnutls_x509_privkey_generate2 ()
   from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#19 0x77f16990 in dht::crypto::PrivateKey::generate(unsigned int) ()
   from
/home/linas/src/novamente/src/atomspace-dht/build/opencog/persist/dht/libpersist-dht.so

Next, verify that nettle does not use scm:

$ nm /usr/lib/x86_64-linux-gnu/libhogweed.a |grep scm
(nothing printed)
$ nm /usr/lib/x86_64-linux-gnu/libhogweed.a |grep GC
(nothing printed)
$ nm /usr/lib/x86_64-linux-gnu/libhogweed.a |grep alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
 U _nettle_gmp_alloc
05f0 T _nettle_gmp_alloc
04e0 T _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs
 U _nettle_gmp_alloc_limbs

Debugging suggestions?
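
One mechanism that would explain a trace like the one above, stated as an assumption rather than a diagnosis: GMP lets any code in the process replace its allocator through `mp_set_memory_functions`, and a library that asks GMP for the current allocator then calls whatever was installed last. Whether that is what happens here is not confirmed by the trace. The sketch below models the idea with a plain function-pointer table; `hook_realloc`, `set_memory_hook`, `counting_realloc` and `library_alloc` are hypothetical names, not GMP, nettle, or Guile symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical process-wide allocator hook, standing in for the table
   that GMP exposes through mp_set_memory_functions(). */
typedef void *(*realloc_fn)(void *ptr, size_t size);

static realloc_fn hook_realloc = realloc;   /* default: plain libc realloc */
static size_t hooked_calls = 0;             /* how often the replacement ran */

/* Stand-in for an embedding runtime's allocator (the role scm_realloc
   plays in the trace): it does some bookkeeping, then defers to libc. */
static void *counting_realloc(void *ptr, size_t size)
{
    hooked_calls++;
    return realloc(ptr, size);
}

/* Any component in the process may swap the hook; the change is global. */
static void set_memory_hook(realloc_fn fn)
{
    hook_realloc = fn ? fn : realloc;
}

/* Stand-in for a library helper such as _nettle_gmp_alloc: it never names
   the runtime, yet ends up inside it through the shared hook. */
static void *library_alloc(size_t size)
{
    assert(hook_realloc != NULL);
    return hook_realloc(NULL, size);
}

/* Number of allocations that were routed through the replacement. */
static size_t hooked_call_count(void)
{
    return hooked_calls;
}
```

If this is the mechanism, the replacement allocator can run on any thread that touches the library, including threads the runtime has never seen, which would fit the "Collecting from unknown thread" message.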
-- 
cassette tapes - analog TV - film cameras - you


bug#33642: guile-2.9.1 failing to load ~/.guile file!?

2018-12-05 Thread Linas Vepstas
There is now an observable difference between saying
```
$ guile  /some/file.scm
```
and
```
$ guile
scheme@(guile-user)> (load "/some/file.scm")
```
The latter loads `~/.guile` before providing the guile prompt.  The
former either doesn't load `~/.guile` at all (I didn't check), or loads
it only after digesting `/some/file.scm`. This was not the case for
guile-2.2.

-- Linas






bug#33641: guile-2.9.1 multi-threading crash

2018-12-05 Thread Linas Vepstas
The following bug report is informal, without any simple test, right now.
Very reproducible, though.

I have a unit test (it passes with guile-2.2) that creates 120 threads and
races them as fast as possible, each thread launched from C++, entering
guile, and then from guile, calling some wrapped C++ code. With 2.9.1,
the test crashes about half the time, always with the same stack trace
```
(gdb) info threads
  Id   Target Id         Frame
  1    Thread 0x77fdcbc0 (LWP 24595) "MultiThreadUTes"
0x77bc298d in pthread_join (threadid=140737247344384,
thread_return=0x0)
at pthread_join.c:90
```
and most of the rest in `__lll_lock_wait` (that my c++ code asks for) or
`pthread_cond_wait@@GLIBC_2.3.2` from GC_wait_marker. The stack trace
itself is useless; the core issue is the `thread_return=0x0` above.
```
(gdb) bt
#0  0x7ffdb03c7040 in ?? ()
#1  0x0001 in ?? ()
#2  0x740a553c in __GI___libc_free (mem=)
at malloc.c:2968
#3  0x in ?? ()
```
It's hard to see what this has to do with guile, other than that this test
has been run thousands of times on guile-2.2 without issues.

(Reproducible by running the "MultiThreadUTest" of
https://github.com/opencog/atomspace)

-- Linas


bug#27234: Followup.

2017-09-09 Thread Linas Vepstas
The original bug report failed to explain how to trigger the bug.  I think
I know of a simple way to trigger the bug, but have not yet created the
simple test-case.  I believe that the bug can be triggered like so:

Create some medium-sized "random" (arbitrary) snippet of guile code. Send
it to the guile repl server.  Do it again and again, as fast as possible.
The bug will trigger.

By "arbitrary/random" I mean some code that is simple but non-repeating:
say, concatenate some random strings, count the number of letters in them,
take the square root -- just do assorted, non-repeating computations that
force the REPL to evaluate some new blob of scheme code every time.

Now, once out of every 5 or 50 times, introduce a bug in the scheme code,
causing the REPL to throw an exception.  Using utf8 strings might also be a
required ingredient.

I believe that this will trigger the bug. I believe that the bug is some
unprotected section in the guile interpreter, and a race condition clobbers
the guile stack(s) and it all goes downhill from there.
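
A minimal sketch of a generator for such snippets, assuming the client is written in C and sends each expression to the REPL server itself. `make_snippet`, the word list, and the unbound `no-such-proc-N` procedure are all hypothetical; nothing here has been run against the actual bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the n-th test expression into buf.  Every period-th expression
   calls an unbound procedure so that the REPL throws an exception.
   Returns the number of bytes written (excluding the NUL), or -1 if
   buf is too small.  All names here are hypothetical. */
static int make_snippet(char *buf, size_t len, unsigned n, unsigned period)
{
    /* Non-ASCII words, since utf8 strings may be a required ingredient. */
    static const char *words[] = { "Småland", "Ćićolina", "guile", "係拉丁" };
    const char *a = words[n % 4];
    const char *b = words[(n / 4) % 4];
    int written;

    assert(buf != NULL && len > 0);

    if (period != 0 && n % period == 0)
        /* Deliberate bug: no-such-proc-N is not defined anywhere. */
        written = snprintf(buf, len,
            "(no-such-proc-%u (string-append \"%s\" \"%s\"))\n", n, a, b);
    else
        /* Concatenate, count the letters, add a varying offset, take
           the square root: cheap work that differs on every call. */
        written = snprintf(buf, len,
            "(sqrt (+ %u (string-length (string-append \"%s\" \"%s\"))))\n",
            n, a, b);

    return (written < 0 || (size_t)written >= len) ? -1 : written;
}
```

Calling `make_snippet` with an increasing `n` and a `period` between 5 and 50 gives the mix described above; the counter `n` is embedded in each expression so that no two snippets are textually identical.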


bug#27234: Hang in GC, inf loop while walking frame pointers

2017-06-04 Thread Linas Vepstas
what: guile-2.2-stable, from git.

I've got a large, complex, heavily multi-threaded guile program that
hangs during garbage collection; usually after running for half a day.
It hangs in a tight loop in scm_i_vm_mark_stack, spinning at 100% of CPU.

This is due to the for-loop line fp = SCM_FRAME_DYNAMIC_LINK (fp))
at libguile/vm.c line 975 failing to advance the frame pointer.
There's no "obvious" corruption in the stack; it simply looks like
the frame was incompletely set up, and so incrementing to the next
fp does not go anywhere.

I have recompiled guile with VM_ENABLE_ASSERTIONS and am trying to
reproduce the bug now.  The rest of this email is a record of a long
debug session isolating the problem, and showing that, overall, the
thread and stack data look more-or-less correct and uncorrupted,
except for the inability to walk forward in the frame.

-- linas

(gdb) bt
#0  scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
#1  0x7f3e9db8835e in GC_mark_from (mark_stack_top=0x7f3e9b783ee0,
mark_stack_top@entry=0x7f3e9b783f00,
mark_stack=mark_stack@entry=0x7f3e9b783eb0,
mark_stack_limit=mark_stack_limit@entry=0x7f3e9b793eb0) at ../mark.c:772
#2  0x7f3e9db8897e in GC_do_local_mark (local_mark_stack=0x7f3e9b783eb0,
local_top=0x7f3e9b783f00) at ../mark.c:1037
#3  0x7f3e9db88b98 in GC_mark_local (
local_mark_stack=local_mark_stack@entry=0x7f3e9b783eb0, id=id@entry=4)
at ../mark.c:1170
#4  0x7f3e9db88eaa in GC_help_marker (my_mark_no=my_mark_no@entry=80003)
at ../mark.c:1238
#5  0x7f3e9db92e3c in GC_mark_thread (id=)
at ../pthread_support.c:380
#6  0x7f3e9e2ae6ba in start_thread (arg=0x7f3e9b794700)
at pthread_create.c:333
#7  0x7f3e9dfdd82d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) step

#0  find_slot_map (cache=0x7f3e9b783ab0, ip=0x1) at
../../libguile/vm.c:935
935   if (cache->entries[slot].ip == ip)
(gdb) print slot
$1 = 0
(gdb) print cache->entries[slot].ip
$2 = (scm_t_uint32 *) 0x1
(gdb) print cache->entries[slot].map
$4 = (const scm_t_uint8 *) 0x0
(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache);
(gdb) print fp
$5 = (union scm_vm_stack_element *) 0x7f3e94cdee38

#define SCM_FRAME_RETURN_ADDRESS(fp)((fp)[0].as_ip)
(gdb) print (fp)[0].as_ip
$6 = (scm_t_uint32 *) 0x1

OK that looks weird ... is this corrupted ?? but whatever,
because the returned slot_map is never used ... because ...

(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:975
975          fp = SCM_FRAME_DYNAMIC_LINK (fp))
(gdb) print fp
$7 = (union scm_vm_stack_element *) 0x7f3e94cdee38

frames.h:#define SCM_FRAME_DYNAMIC_LINK(fp)  ((fp) + (fp)[1].as_uint)
(gdb) print (fp)[1].as_uint
$8 = 0

OK, that seems bad, because now fp never advances, it just repeats
over and over with this same value.

(gdb)
979   for (slot = nlocals - 1; sp < fp; sp++, slot--)
(gdb) print nlocals
$10 = -2
(gdb) print sp
$11 = (union scm_vm_stack_element *) 0x7f3e94cdee48
#define SCM_FRAME_NUM_LOCALS(fp, sp)((fp) - (sp))
(gdb) print fp
$12 = (union scm_vm_stack_element *) 0x7f3e94cdee38
(gdb) print  ((fp) - (sp))
$13 = -2
Ohh .. it's -2 and not -16 because pointer subtraction counts elements, not
bytes: -16 bytes / sizeof (union scm_vm_stack_element) = -2, so that's OK.
So the for loop is skipped, and it should go to
  sp = SCM_FRAME_PREVIOUS_SP (fp);
frames.h:#define SCM_FRAME_PREVIOUS_SP(fp)((fp) + 2)

and so now it loops around and repeats.
(gdb) print cache
$19 = {entries = {{ip = 0x1, map = 0x0}, {ip = 0x0,
  map = 0x0} , {ip = 0x1a1b5e0, map = 0x0}, {ip =
0x0,
  map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
  map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
  map = 0x0}}}
(gdb) print &cache
$20 = (struct slot_map_cache *) 0x7f3e9b783ab0

and so the loop repeats forever, because
fp = SCM_FRAME_DYNAMIC_LINK (fp) never advances fp, because
(fp)[1].as_uint  is zero.
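
The failure mode can be shown in isolation. The sketch below is a simplified model, not the code in libguile/vm.c: `union stack_element`, `FRAME_DYNAMIC_LINK` and `walk_frames` are hypothetical stand-ins, and the guard is only one possible way to turn the spin into a detectable error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for union scm_vm_stack_element: slot 0 of a frame
   holds the return address, slot 1 the offset to the caller's frame. */
union stack_element {
    uintptr_t as_uint;
    void     *as_ptr;
};

#define FRAME_DYNAMIC_LINK(fp) ((fp) + (fp)[1].as_uint)

/* Walk frames from fp up to stack_top.  Returns the number of frames
   visited, or -1 if a frame is truncated, its dynamic link is zero, or
   the link points past stack_top.  Without the zero check the loop
   would revisit the same frame forever. */
static long walk_frames(union stack_element *fp,
                        union stack_element *stack_top)
{
    long frames = 0;

    assert(fp != NULL && stack_top != NULL);

    while (fp < stack_top) {
        ptrdiff_t room = stack_top - fp;   /* elements left above fp */
        uintptr_t link;

        if (room < 2)
            return -1;                     /* no room for the link slot */

        link = fp[1].as_uint;
        if (link == 0 || link > (uintptr_t)room)
            return -1;                     /* fp would not advance, or overshoots */

        fp = FRAME_DYNAMIC_LINK(fp);
        frames++;
    }
    return frames;
}
```

With a link of zero, as in the `$8 = 0` output above, `walk_frames` returns -1 on the first such frame instead of looping.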

So where is fp pointing to?  recall fp == 0x7f3e94cdee38

(gdb) x/20x 0x7f3e94cdee00
0x7f3e94cdee00: 0x0904  0x200c
0x7f3e94cdee10: 0x01a1b7e0  0x01a1b5e0
0x7f3e94cdee20: 0x0004  0x200c
0x7f3e94cdee30: 0x00169bd6  0x0001
0x7f3e94cdee40: 0x  0x0192acd0
0x7f3e94cdee50: 0x01fa4bc0  0x01a1c6d0
0x7f3e94cdee60: 0x44507950  0x0002
0x7f3e94cdee70: 0x0005a6f5  0x018febd0
0x7f3e94cdee80: 0x6d490d00  0x7f3e9e97241c
0x7f3e94cdee90: 0x0002  0x7f3e9e97241c


(gdb) x/s 0x7f3e94cdee60
0x7f3e94cdee60: "PyPD"  <<< ?? is this a meaningful string?
(gdb) x/s 

bug#26616: also a bug in guile-2.0.11

2017-04-23 Thread Linas Vepstas
I just noticed that this bug DOES reproduce in guile 2.0.11 which is the
version that ships with Ubuntu Xenial (Ubuntu 16.04) 

The above loop took 37 cpu-minutes to execute. The non-parallel version
executes instantly (fractions of a cpu-second).


bug#26616: close

2017-04-22 Thread Linas Vepstas
Per last message, close.


bug#26616: close

2017-04-22 Thread Linas Vepstas
I just retested with a pull from today's git and the problem does NOT
reproduce.

Attempting to close this bug.

tested with

$ guile --version
guile (GNU Guile) 2.2.2.1-886ac


bug#26616: guile-2.2 par-for-each hangs for large lists

2017-04-22 Thread Linas Vepstas
The following spins, burning about 250% cpu time, for large lists; it works
fine for smaller lists.   This is for a recent version of guile from git:

guile -v
guile (GNU Guile) 2.1.5.19-7e9395

will retry with newer guile shortly.

On fresh guile:

(use-modules (srfi srfi-1))
(define al (list-tabulate 10 values))
(define foo 0)
(par-for-each (lambda (x) (set! foo (+ x foo))) al)

^CERROR: In procedure scm-error:
ERROR: User interrupt

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,bt
In ice-9/threads.scm:
   289:22  5 (loop _)
In ice-9/futures.scm:
   265:11  4 (touch #)
   243:14  3 (work)
In unknown file:
   2 (wait-condition-variable #t)
While executing meta-command:
ERROR: In procedure cdr: Wrong type argument in position 1 (expecting
pair): ()
scheme@(guile-user) [1]>

The above is after interrupting after an hour or so of accumulated CPU
time.

Again, shorter lists (e.g. 10K long) work fine. My production lists are
about 400K to 2M in size.

--linas


bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string

2017-03-01 Thread Linas Vepstas
In the bad old days, not everything was documented ... My use of scm_puts
dates back to guile-1.8.  I only ever send it utf8.  I can change my code,
no problem ... I just thought I'd report a regression in case others
are affected.

Linas

On Wednesday, March 1, 2017, Andy Wingo <wi...@pobox.com> wrote:

> On Tue 10 Jan 2017 04:34, Linas Vepstas <linasveps...@gmail.com
> <javascript:;>> writes:
>
> > void *wrap_puts(void* p)
> > {
> >char *wtf = p;
> >
> >SCM port = scm_current_output_port ();
> >
> >scm_puts("the port-encoding is=", port);
> >scm_puts(scm_to_utf8_string(scm_port_encoding(port)), port);
> >
> >scm_puts("\nThe string to display is =", port);
> >scm_puts (wtf, port);
> >
> >scm_puts("\nWas expecting to see this=", port);
> >SCM str = scm_from_utf8_string(wtf);
> >scm_display(str, port);
> >scm_puts("\n\n", port);
> >
> >return NULL;
> > }
>
> So, there are a few questions here.  scm_puts and scm_lfwrite are not
> documented, so we need to do basic science on them to see what they are
> supposed to do.
>
> Firstly, is scm_puts() a textual interface or a binary interface?
> I.e. does it write a sequence of characters or a sequence of bytes?
>
> If I look at uses of scm_puts in Guile sources, it seems clear that it's
> a textual interface.  That is to say, at all points, the intention seems
> to be to write characters on a Guile port.  All of the uses are of
> strings.  Please do a "git grep" on your source to see if your
> perceptions correspond.
>
> Now the question is, what encoding is the argument in?  If the port is
> UTF-16, that byte string should be decoded to characters, and that
> character sequence encoded to UTF-16.
>
> All of the scm_puts calls in Guile are of one-byte characters with
> codepoints less than 128, so when doing some port refactoring I chose to
> interpret the argument as latin1.
>
> FTR, in Guile 2.0, this was effectively a binary interface.  Guile 2.0's
> scm_lfwrite interpreted the incoming bytes as ISO-8859-1 codepoints for
> the purposes of updating line and column, but scm_puts and scm_lfwrite
> just wrote out the bytes to the port directly, regardless of the
> encoding.  That was the wrong thing.
>
> Are you arguing that the byte string given to scm_puts should be decoded
> from UTF-8?  That would be OK.
>
> Andy
>


bug#25498: Crash in open-file; patch attached

2017-02-15 Thread Linas Vepstas
Specifically, this crashes, now: GNU Guile 2.1.6.10-710eb



On Wed, Feb 15, 2017 at 2:25 AM, Linas Vepstas <linasveps...@gmail.com>
wrote:

> I'm using version 2.1 pulled from git, maybe a few days or week before the
> bug was opened.
>
> On Sat, Feb 11, 2017 at 3:10 PM, Ludovic Courtès <l...@gnu.org> wrote:
>
>> Hi Linas,
>>
>> Linas Vepstas <linasveps...@gmail.com> skribis:
>>
>> > The following crashes instantly; I used single-quotes by accident.
>> >
>> > (open-file "/tmp/lg" 'w')
>> >
>> > Stack:
>> >
>> > Enter `,help' for help.
>> > scheme@(guile-user)> (open-file "/tmp/lg" 'w')
>> >
>> > Thread 1 "guile" received signal SIGSEGV, Segmentation fault.
>> > scm_i_mode_to_open_flags (mode=mode@entry=0x55ac5660,
>> > is_binary=is_binary@entry=0x7fffd46c,
>> > FUNC_NAME=FUNC_NAME@entry=0x77b89a7d "open-file")
>> > at ../../libguile/fports.c:168
>>
>> What version of Guile are you using?  With 2.0.13, I get:
>>
>> --8<---cut here---start->8---
>> scheme@(guile-user)> (open-file "/tmp/lg" 'w')
>> ERROR: In procedure open-file:
>> ERROR: In procedure open-file: Value out of range: w'
>> --8<---cut here---end--->8---
>>
>> Ludo’.
>>
>
>


bug#25498: Crash in open-file; patch attached

2017-02-15 Thread Linas Vepstas
I'm using version 2.1 pulled from git, maybe a few days or week before the
bug was opened.

On Sat, Feb 11, 2017 at 3:10 PM, Ludovic Courtès <l...@gnu.org> wrote:

> Hi Linas,
>
> Linas Vepstas <linasveps...@gmail.com> skribis:
>
> > The following crashes instantly; I used single-quotes by accident.
> >
> > (open-file "/tmp/lg" 'w')
> >
> > Stack:
> >
> > Enter `,help' for help.
> > scheme@(guile-user)> (open-file "/tmp/lg" 'w')
> >
> > Thread 1 "guile" received signal SIGSEGV, Segmentation fault.
> > scm_i_mode_to_open_flags (mode=mode@entry=0x55ac5660,
> > is_binary=is_binary@entry=0x7fffd46c,
> > FUNC_NAME=FUNC_NAME@entry=0x77b89a7d "open-file")
> > at ../../libguile/fports.c:168
>
> What version of Guile are you using?  With 2.0.13, I get:
>
> --8<---cut here---start->8---
> scheme@(guile-user)> (open-file "/tmp/lg" 'w')
> ERROR: In procedure open-file:
> ERROR: In procedure open-file: Value out of range: w'
> --8<---cut here---end--->8---
>
> Ludo’.
>


bug#25498: Crash in open-file; patch attached

2017-01-20 Thread Linas Vepstas
The following crashes instantly; I used single-quotes by accident.

(open-file "/tmp/lg" 'w')

Stack:

Enter `,help' for help.
scheme@(guile-user)> (open-file "/tmp/lg" 'w')

Thread 1 "guile" received signal SIGSEGV, Segmentation fault.
scm_i_mode_to_open_flags (mode=mode@entry=0x55ac5660,
is_binary=is_binary@entry=0x7fffd46c,
FUNC_NAME=FUNC_NAME@entry=0x77b89a7d "open-file")
at ../../libguile/fports.c:168
168  switch (*md)
(gdb) bt
#0  scm_i_mode_to_open_flags (mode=mode@entry=0x55ac5660,
is_binary=is_binary@entry=0x7fffd46c,
FUNC_NAME=FUNC_NAME@entry=0x77b89a7d "open-file")
at ../../libguile/fports.c:168
#1  0x77b057e9 in scm_open_file_with_encoding (
filename=filename@entry=0x55b7fd98, mode=mode@entry=0x55ac5660,
guess_encoding=0x4, encoding=0x4) at ../../libguile/fports.c:242
#2  0x77b05b83 in scm_i_open_file (filename=0x55b7fd98,
mode=0x55ac5660, keyword_args=)
at ../../libguile/fports.c:380
#3  0x77b6a221 in vm_debug_engine (thread=0x55ac5660,
vp=0x55844f30, registers=0x54aad62357094bc, resume=39)
at ../../libguile/vm-engine.c:760

A patch that seems reasonable to me:

$ git diff
diff --git a/libguile/fports.c b/libguile/fports.c
index 8fa69933d..28e666b6a 100644
--- a/libguile/fports.c
+++ b/libguile/fports.c
@@ -230,6 +230,9 @@ scm_open_file_with_encoding (SCM filename, SCM mode,
   unsigned int retries;
   char *file;

+  if (SCM_UNLIKELY (!scm_is_string (mode)))
+    scm_wrong_type_arg_msg (FUNC_NAME, 2, mode, "mode to be string");
+
   if (SCM_UNLIKELY (!(scm_is_false (encoding) || scm_is_string (encoding))))
     scm_wrong_type_arg_msg (FUNC_NAME, 0, encoding,
                             "encoding to be string or false");

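The same validate-before-dereference idea, sketched in plain C without libguile. This is an illustration only: `mode_to_open_flags` is a hypothetical helper, not Guile's `scm_i_mode_to_open_flags`, and it checks a C string where the patch checks the Scheme type of `mode`.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Translate an fopen-style mode string into open(2) flags.
   Returns -1 for a NULL, empty, or malformed mode instead of
   switching on whatever the pointer happens to reference. */
static int mode_to_open_flags(const char *mode)
{
    int flags;

    if (mode == NULL || mode[0] == '\0')
        return -1;

    switch (mode[0]) {
    case 'r': flags = O_RDONLY; break;
    case 'w': flags = O_WRONLY | O_CREAT | O_TRUNC; break;
    case 'a': flags = O_WRONLY | O_CREAT | O_APPEND; break;
    default:  return -1;
    }

    /* Remaining characters may only be the '+' and 'b' modifiers. */
    for (const char *p = mode + 1; *p != '\0'; p++) {
        switch (*p) {
        case '+':
            flags = (flags & ~O_ACCMODE) | O_RDWR;
            break;
        case 'b':
            break;              /* no effect on POSIX systems */
        default:
            return -1;          /* e.g. the stray quote in "w'" */
        }
    }

    assert(flags >= 0);
    return flags;
}
```

A caller can then report a "value out of range" style error when the result is -1, which is roughly what 2.0.13 does in the quoted reply.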




bug#25387: guile-2.2 multi-thread segfault in SCM_VALIDATE_WEAK_TABLE

2017-01-11 Thread Linas Vepstas
Hi Andy: I just code-reviewed, it looks like a good fix;  you're saying that
the dynamic state was being accidentally collected when it shouldn't
have been.

Tested, it tests OK; after 40 mins cpu time, it's still running.

--linas





bug#25386: This can be closed

2017-01-10 Thread Linas Vepstas
This can be closed as 'fixed'; I tested on today's git

guile (GNU Guile) 2.1.5.19-7e9395

and the worst of it seems to be over.  mem usage growth on the
original test case:

(heap-size . 7921664) (gc-times . 40)
(heap-size . 14344192) (gc-times . 953)
(heap-size . 14344192) (gc-times . 5219)  ; after 4 minutes CPU
(heap-size . 26419200) (gc-times . 64975) ; after 77 minutes CPU
(heap-size . 26419200) (gc-times . 133346) ; after 154 mins CPU
(heap-size . 26419200) (gc-times . 170083) ; after 192 mins CPU
(heap-size . 26419200) (gc-times . 249102) ; after 283 mins cpu
(heap-size . 26419200) (gc-times . 420031) ; after 468 min cpu
(heap-size . 26419200) (gc-times . 557039) ; after 804 mins CPU

i.e. 26MBytes - larger than it needs to be, but acceptable.

The last entry was, in full,
((gc-time-taken . 355210357) (heap-size . 26419200) (heap-free-size .
20336640) (heap-total-allocated . 2522568563312)
(heap-allocated-since-gc . 57648) (protected-objects . 0) (gc-times .
557619))

i.e. of the 26MB, only 6MB is in use, the rest is free.  The 6MB is
close to what it starts with.  2522 GB were chewed through in the
process, so this is OK, I guess.

A variant test case, create 510 threads before calling join:  (change
10 to 510 in above test)

(heap-size . 10604544) (gc-times . 32)
(heap-size . 19505152) (gc-times . 484)
(heap-size . 35926016) (gc-times . 1761)
(heap-size . 48238592) (gc-times . 4217)  ; after 8 minutes cpu time
(heap-size . 48238592) (gc-times . 47902) ; after 76 mins CPU
(heap-size . 48238592) (gc-times . 73063) ; after 114 mins CPU
(heap-size . 65540096) (gc-times . 128094) ; after 209 mins cpu
(heap-size . 65540096) (gc-times . 248321) ; after 399 mins
(heap-size . 65540096) (gc-times . 344197) ; after 546 min

i.e. 65MBytes .. acceptable, I guess.

The last one was:
((gc-time-taken . 218714374) (heap-size . 65540096) (heap-free-size .
54419456) (heap-total-allocated . 2057186203744)
(heap-allocated-since-gc . 4553872) (protected-objects . 0) (gc-times
. 344799))

so of the 65MB, only 11MB is in-use.


My production server is doing this:

(heap-size . 652918784) (gc-times . 233) ; about 8 mins CPU
(heap-size . 737722368) (gc-times . 339) ; 12 mins CPU
(heap-size . 1332973568) (gc-times . 1797) ; 120 mins CPU
(heap-size . 1441443840) (gc-times . 2221) ; 151 min CPU
(heap-size . 1521213440) (gc-times . 2441) ; 168 min cpu
(heap-size . 1595101184) (gc-times . 3061) ; 218 min cpu
(heap-size . 1726119936) (gc-times . 3292) ; 237 min
(heap-size . 1960865792) (gc-times . 6698) ; 510 min
(heap-size . 1960865792) (gc-times . 10383) ; 805 min
(heap-size . 2931556352) (gc-times . 14211) ; 1199 min

about 3GB --

Last one is, in  full:
(gc-stats)
((gc-time-taken . 19818394581722) (heap-size . 2931556352)
(heap-free-size . 1767579648) (heap-total-allocated . 731393991040)
(heap-allocated-since-gc . 4063680) (protected-objects . 318)
(gc-times . 14211))

so of the 3GB, 1.8GB is free, and 1.2GB in use which is surprisingly
high for my app, but I can live with that.
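
The "in use" figures above are just heap-size minus heap-free-size from each `(gc-stats)` result. A small sketch of that arithmetic, with `heap_in_use` and `to_megabytes` as hypothetical helper names and megabytes taken as 10^6 bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of heap currently occupied: total heap minus the free part. */
static uint64_t heap_in_use(uint64_t heap_size, uint64_t heap_free_size)
{
    assert(heap_free_size <= heap_size);
    return heap_size - heap_free_size;
}

/* Whole decimal megabytes, rounded down. */
static uint64_t to_megabytes(uint64_t bytes)
{
    return bytes / 1000000u;
}
```

For the three full results quoted above this gives 6082560 bytes (10-thread test), 11120640 bytes (510-thread test) and 1163976704 bytes (production server).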

Thanks!

--linas





bug#25387: better but still an issue.

2017-01-10 Thread Linas Vepstas
Retested with today's version of git. still crashes, but not instantly;
it now takes 20 seconds to 5 minutes to reproduce.

guile (GNU Guile) 2.1.5.19-7e9395





bug#25267: guile-2.2 crash in GC

2017-01-09 Thread Linas Vepstas
On Mon, Jan 9, 2017 at 3:53 PM, Andy Wingo <wi...@pobox.com> wrote:
> On Sat 24 Dec 2016 19:43, Linas Vepstas <linasveps...@gmail.com> writes:
>
>> [Switching to Thread 0x7fffc0ff9700 (LWP 3680)]
>> thread_mark (addr=0x558f7700, mark_stack_ptr=,
>> mark_stack_limit=0x7fffc0ff7c50, env=)
>> at ../../libguile/threads.c:111
>> 111  while ((chain = *(void **)chain))
>
> I ran into this one too!  I think I fixed it; can you verify?

Yep, this is now fixed. You can close this.

(20 minutes of cpu time racked up on it. git version as of today:
7e93950552cd9e85a1f3eb73faf16e8423b0fbbe )

--linas





bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string

2017-01-09 Thread Linas Vepstas
This short C program illustrates the issue.  The locale, the output port etc.
are UTF-8.  The bad results are no surprise: the code currently in git for
scm_puts etc. explicitly ignores the locale setting, always, and always
assumes latin1 -- it's hard-coded in there.

--linas

#include <libguile.h>

void *wrap_eval(void* p)
{
   char *wtf = "(setlocale LC_ALL \"\")";
   SCM eval_str = scm_from_utf8_string(wtf);
   scm_eval_string(eval_str);

   return NULL;
}

void *wrap_puts(void* p)
{
   char *wtf = p;

   SCM port = scm_current_output_port ();

   scm_puts("the port-encoding is=", port);
   scm_puts(scm_to_utf8_string(scm_port_encoding(port)), port);

   scm_puts("\nThe string to display is =", port);
   scm_puts (wtf, port);

   scm_puts("\nWas expecting to see this=", port);
   SCM str = scm_from_utf8_string(wtf);
   scm_display(str, port);
   scm_puts("\n\n", port);

   return NULL;
}

int main(int argc, char* argv[])
{
   scm_with_guile(wrap_eval, 0x0);

   char * wtf = "Ćićolina";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Thủ Dầu Một";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Småland";
   scm_with_guile(wrap_puts, wtf);

   wtf = "Hòa Phú Phú Tân";
   scm_with_guile(wrap_puts, wtf);

   wtf = "係 拉 丁 字 母";
   scm_with_guile(wrap_puts, wtf);
}

The output is always this:

the port-encoding is=UTF-8
The string to display is =Ćićolina
Was expecting to see this=Ćićolina

the port-encoding is=UTF-8
The string to display is =Thủ Dầu Một
Was expecting to see this=Thủ Dầu Một

the port-encoding is=UTF-8
The string to display is =Småland
Was expecting to see this=Småland

the port-encoding is=UTF-8
The string to display is =Hòa Phú Phú Tân
Was expecting to see this=Hòa Phú Phú Tân

the port-encoding is=UTF-8
Was expecting to see this=係 拉 丁 字 母 æ¯


What's cool is that all this stuff works in email!

--linas

On Mon, Jan 9, 2017 at 4:03 PM, Andy Wingo <wi...@pobox.com> wrote:
> On Sun 08 Jan 2017 19:16, Linas Vepstas <linasveps...@gmail.com> writes:
>
>> There appears to be a regression in guile-2.2 with utf8 handling
>> in the scm_puts() scm_lfwrite() and scm_c_put_string() functions.
>>
>> In guile-2.0, one could give these utf8-encoded strings, and these
>> would display just fine.  In 2.2 they get mangled.
>
> Could it be this from NEWS:
>
>   ** Better locale support in Guile scripts
>
>   When Guile is invoked directly, either from the command line or via a
>   hash-bang line (e.g. "#!/usr/bin/guile"), it now installs the current
>   locale via a call to `(setlocale LC_ALL "")'.  For users with a unicode
>   locale, this makes all ports unicode-capable by default, without the
>   need to call `setlocale' in your program.  This behavior may be
>   controlled via the GUILE_INSTALL_LOCALE environment variable; see the
>   manual for more.





bug#25386: Please ignore previous stack trace.

2017-01-08 Thread Linas Vepstas
Sorry, please ignore the previous stack trace.  I updated to version
2.0.13 (manually compiled) and it crashed with a zillion messages:

guile: warning: weak hash table corruption (https://bugs.gnu.org/19180)

so that's a done deal, then.





bug#25386: calling gc too often triggers a crash

2017-01-08 Thread Linas Vepstas
Since the above works so swimmingly in the example above, I tried it in a
production system.  Calling gc shortly before thread exit results in a
crash, always the same crash, always in under 20 minutes:

guile: hashtab.c:137: vacuum_weak_hash_table: Assertion `removed <= len'
failed.
Aborted

again, this is for guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-10)

Perhaps this is fixed in 2.0.13 ???

guile: hashtab.c:137: vacuum_weak_hash_table: Assertion `removed <= len' failed.

Thread 1416 "guile" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffe7b7fe700 (LWP 29883)]
0x7749e428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x7749e428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x774a002a in __GI_abort () at abort.c:89
#2  0x77496bd7 in __assert_fail_base (fmt=,
assertion=assertion@entry=0x77b5f7a2 "removed <= len",
file=file@entry=0x77b5f798 "hashtab.c", line=line@entry=137,
function=function@entry=0x77b5ff60 "vacuum_weak_hash_table")
at assert.c:92
#3  0x77496c82 in __GI___assert_fail (
assertion=0x77b5f7a2 "removed <= len",
file=0x77b5f798 "hashtab.c", line=137,
function=0x77b5ff60 "vacuum_weak_hash_table") at assert.c:101
#4  0x77ac3108 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#5  0x77ac31af in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#6  0x77ac5b1c in scm_c_hook_run ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#7  0x77207ff5 in GC_try_to_collect_inner ()
   from /usr/lib/x86_64-linux-gnu/libgc.so.1
#8  0x772082aa in GC_try_to_collect_general ()
   from /usr/lib/x86_64-linux-gnu/libgc.so.1
#9  0x7720838d in GC_gcollect ()
   from /usr/lib/x86_64-linux-gnu/libgc.so.1
#10 0x77ab9109 in scm_gc ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#11 0x77b3402b in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#12 0x77aab107 in scm_call_1 ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#13 0x77b34093 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#14 0x77aab21e in scm_call_3 ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#15 0x77b34093 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#16 0x77aab283 in scm_call_4 ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#17 0x7fffefb90d79 in
opencog::SchemeEval::do_eval(std::__cxx11::basic_string const&) (
this=0x7ffe74000980, expr=...)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:564
#18 0x7fffefb90e2a in opencog::SchemeEval::c_wrap_eval(void*) (
p=0x7ffe74000980)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:493
#19 0x77aa158a in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#20 0x77b34093 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#21 0x77aab283 in scm_call_4 ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#22 0x77aa1d21 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#23 0x77aa1e05 in scm_c_with_continuation_barrier ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#24 0x772198e7 in GC_call_with_gc_active ()
   from /usr/lib/x86_64-linux-gnu/libgc.so.1
#25 0x77b21c01 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#26 0x77213952 in GC_call_with_stack_base ()
   from /usr/lib/x86_64-linux-gnu/libgc.so.1
#27 0x77b21fe8 in scm_with_guile ()
   from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#28 0x7fffefb90eae in
opencog::SchemeEval::eval_expr(std::__cxx11::basic_string const&) (
this=0x7ffe74000980, expr=...)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:465
#29 0x7fffe386cc86 in opencog::GenericShell::eval_loop (
this=0x7ffe4c001380)
at /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:446
#30 0x7fffee768c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#31 0x778396ba in start_thread (arg=0x7ffe7b7fe700)
at pthread_create.c:333
#32 0x7756f82d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109





bug#25397: guile-2.2 regression in utf8 support in scm_puts scm_lfwrite scm_c_put_string

2017-01-08 Thread Linas Vepstas
There appears to be a regression in guile-2.2 with utf8 handling
in the scm_puts() scm_lfwrite() and scm_c_put_string() functions.

In guile-2.0, one could give these utf8-encoded strings, and these
would display just fine.  In 2.2 they get mangled.

The source of the mangling seems to be an assumption that these
three are being given latin1 strings, which they then attempt to
convert to utf8, thus wrecking the encoding.  See, e.g. libguile/ports.c
line 3526

Presumably this change was intentional, but I don't understand
why; guile-2.0 seems utf-8 clean, correctly handling utf-8 in
essentially all cases.  Why would one want to go back to the
bad old days of latin1 and iso-8859-1 for guile 2.2?

I could submit a patch for this, but would it be wanted?

Test case is straightforward:

printf("duuude port-encoding is=%s\n",
   scm_to_utf8_string(scm_port_encoding(scm_current_output_port ())));
scm_puts ("係 拉 丁 字 母", scm_current_output_port ());

which works in guile-2.0 but is garbled in 2.2
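
The suspected mangling can be mimicked outside Guile. The sketch below is plain Python, not the libguile port code; it only assumes the failure mode guessed at above, namely that bytes which are already UTF-8 get treated as Latin-1 and are converted to UTF-8 a second time.

```python
# Illustrative only: plain Python, not Guile. It mimics the suspected bug,
# where UTF-8 bytes are assumed to be Latin-1 and converted to UTF-8 again.
original = "係 拉 丁 字 母"

utf8_bytes = original.encode("utf-8")      # what the C caller passes in
misread = utf8_bytes.decode("latin-1")     # each byte taken as one character
double_encoded = misread.encode("utf-8")   # what would reach the terminal

print(len(utf8_bytes), "bytes in,", len(double_encoded), "bytes out")

# The damage is mechanical, so reversing the two steps restores the text.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)
```

If the garbled output from 2.2 can be repaired by the reverse transformation in the last step, that would support the Latin-1 assumption as the cause.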





bug#25386: test case update

2017-01-08 Thread Linas Vepstas
Ran the three test cases overnight; saw mostly no increase in mem usage.

-- gc before every thread exit: up to 9M thread exits, no change in heap-size

-- gc before every third thread exit: 25M thread exits, no change.

-- gc before every 17th thread exit: 44M thread exits, relatively
small increase, from (heap-size . 1875902464) to (heap-size .
2068840448)





bug#25386: Manual gc helps

2017-01-07 Thread Linas Vepstas
I did a fairly thorough review of the thread-creation and thread-join
code in the git master branch, and it looks to be just fine. Thus,
some experimentation is in order:

Going back to guile-2.0, I see this behavior:
guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-10)

If I add a manual gc to the exit of the thread, like so:

   (define (mkthr v) (call-with-new-thread (lambda ()
 (set! junk (+ junk 1)) (gc) )))

then the heap blows up, in minutes, to about 180MB but then stops
growing, even after hours and millions of thread creates:
(heap-size . 183734272) (gc-times . 1957954)

If I gc only every third thread create, it quickly blows up to about
400MB, and then stabilizes, for hours:
(heap-size . 428638208) (gc-times . 1292663)

If I gc every 17th thread, it blows up to about 1.8GB and then is stable:
(heap-size . 1875902464) (gc-times . 327462)

This last one after about 5.5 million thread creates and joins.
The counting is done like so:

   (define (mkthr v) (call-with-new-thread (lambda ()
(lock-mutex mtx)
(if (eq? 0 (modulo junk 17)) (gc))
(set! junk (+ junk 1))
(unlock-mutex mtx)
)))

In each case, it seems to hit a plateau at about (n+1)*100MB when gc
is done on one out of every n threads.  This seems quite bizarre to
me: why is there this inverse relation between the number of gcs and
the number of thread creates? What's magic about 100MB? Clearly 100MB is wayyy too
large for this very simple program.  I mean, even if I gc at *every*
thread-exit ...

(I have not yet explored above in guile-2.2)

Since I cannot find any 'obvious' bugs in guile, this suggests some
strange stochastic behavior in bdw-gc?
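
As a quick check of the (n+1)*100MB rule of thumb, the reported heap sizes can be compared against it. This is only arithmetic on the numbers quoted above; it assumes "MB" means decimal megabytes, and the helper name predicted_plateau is made up for the illustration.

```python
# Heap sizes copied from the (heap-size . N) lines above, keyed by n,
# where a gc is forced on one out of every n thread exits.
observed_heap_bytes = {1: 183734272, 3: 428638208, 17: 1875902464}

MB = 10**6  # assumption: "MB" above means decimal megabytes


def predicted_plateau(n):
    """Rule of thumb from the message: about (n+1)*100MB."""
    return (n + 1) * 100 * MB


ratios = {}
for n, heap in sorted(observed_heap_bytes.items()):
    ratios[n] = heap / predicted_plateau(n)
    print(f"n={n:2d} observed={heap / MB:7.1f}MB "
          f"predicted={predicted_plateau(n) / MB:6.0f}MB "
          f"ratio={ratios[n]:.2f}")
```

All three observations land within about ten percent of the rule, which is why the plateau looks like a fixed cost per thread exit that survives until the next gc, rather than random growth.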





bug#25387: also crashes in guile-2.0

2017-01-07 Thread Linas Vepstas
Also crashes in guile-2.0, but takes much longer - 5 minutes

--linas





bug#25387: guile-2.2 multi-thread segfault in SCM_VALIDATE_WEAK_TABLE

2017-01-07 Thread Linas Vepstas
Following program crashes immediately (fraction of a second)
in guile-2.2, current git version (as of 29 Dec 2016
a0656ad4cf976b3845e9b9663a90b46b4cf9fc5a )

It runs fine in guile-2.0. It's doing something slightly squonky:
referencing the variable 'cnt' in a thread.  Note that the use of
the variable appears before its definition.

It's deterministic - always crashes in the same place.

(define junk 0)
(define halt #f)

(define (wtf-thr)
   (define start (- (current-time) 0.1))

   ; Create thread that does junk and exits.  Yes, the increment
   ; of `junk` is not protected, and its racey, but so what.
   (define (mkthr v) (call-with-new-thread (lambda ()
  (if (eq? 0 (modulo cnt 30)) (gc))      ; crashes here!!!
(set! junk (+ junk 1)))))

   ; thread arguments
   (define thrarg (make-list 10 0))

   (define cnt 0)
   (define (mke)
  ; Create a limited number of threads
  (define thr-list (map mkthr thrarg))
  ; (display (length (all-threads)))
  (map join-thread thr-list)

  ; Some handy debug printing.
  (set! cnt (+ cnt 1))
  (if (eq? 0 (modulo cnt 500))
 (begin
(display "rate=")
(display (/ cnt (- (current-time) start))) (newline)
(display "cnt=") (display cnt) (newline)
(display (gc-stats)) (newline) (newline)
 )))

   ; tail recursive infinite loop.
   (define (aloop) (mke) (if (not halt) (aloop)))

   ; while forever.
   (aloop)
)

; Run elsewhere, so that we have a shell prompt
; (not required for the bug)
(call-with-new-thread wtf-thr)

; halt if desired.
; (set! halt #t)


Thread 621 "guile" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedbe1700 (LWP 10504)]
0x77b78af1 in scm_c_weak_table_ref (table=0x0,
raw_hash=2738445758486295669, pred=0x77b77bb0 ,
closure=0x558fff00, dflt=0x904) at ../../libguile/weak-table.c:862
warning: Source file is more recent than executable.
862  SCM_VALIDATE_WEAK_TABLE (1, table);
(gdb) bt
#0  0x77b78af1 in scm_c_weak_table_ref (table=0x0,
raw_hash=2738445758486295669, pred=0x77b77bb0 ,
closure=0x558fff00, dflt=0x904) at ../../libguile/weak-table.c:862
#1  0x77b02fa4 in fluid_ref (dynamic_state=0x55f8ce60,
fluid=0x558fff00) at ../../libguile/fluids.c:287
#2  0x77b0325f in scm_fluid_ref (fluid=0x558fff00)
at ../../libguile/fluids.c:308
#3  0x77b34424 in scm_i_default_port_conversion_strategy ()
at ../../libguile/ports.c:1015
#4  0x77b5e4df in scm_i_default_string_failed_conversion_handler ()
at ../../libguile/strings.c:1619
#5  scm_from_locale_stringn (
str=0x77b88d50 "Wrong type argument in position ~A: ~S",
len=len@entry=18446744073709551615) at ../../libguile/strings.c:1626
#6  0x77b5e51c in scm_from_locale_string (str=)
at ../../libguile/strings.c:1613
#7  0x77af76c6 in scm_error (key=0x558fa960,
subr=subr@entry=0x77b8a080 
"set-current-dynamic-state", message=,
args=0x55c6ce30,
rest=rest@entry=0x55c6ce50) at ../../libguile/error.c:59
#8  0x77af7968 in scm_wrong_type_arg (
subr=subr@entry=0x77b8a080 
"set-current-dynamic-state", pos=pos@entry=1,
bad_value=bad_value@entry=0x55c6c3b0)
at ../../libguile/error.c:251
#9  0x77b03096 in scm_set_current_dynamic_state (
state=state@entry=0x55c6c3b0) at ../../libguile/fluids.c:496
#10 0x77b6351a in guilify_self_2 (
dynamic_state=dynamic_state@entry=0x55c6c3b0)
at ../../libguile/threads.c:466
#11 0x77b63e0c in scm_i_init_thread_for_guile (base=0x7fffedbe0ec0,
dynamic_state=0x55c6c3b0) at ../../libguile/threads.c:595
#12 0x77b63e59 in with_guile (base=base@entry=0x7fffedbe0ec0,
data=data@entry=0x7fffedbe0ef0) at ../../libguile/threads.c:638
#13 0x76c71812 in GC_call_with_stack_base (
fn=fn@entry=0x77b63e40 , arg=arg@entry=0x7fffedbe0ef0)
at misc.c:1925
#14 0x77b635cc in scm_i_with_guile (dynamic_state=,
data=0x55c6c410, func=0x77b635e0 )
at ../../libguile/threads.c:688
#15 launch_thread (d=0x55c6c410) at ../../libguile/threads.c:750
#16 0x7735f464 in start_thread (arg=0x7fffedbe1700)
at pthread_create.c:333
#17 0x770a29df in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105





bug#25386: guile-2.0 and 2.2 thread leakage+crash; very simple test attached.

2017-01-07 Thread Linas Vepstas
The (very simple) program below leaks ... something, very rapidly, and
then crashes after about 15-30 seconds.  Last thing printed before
crash:

rate=194.80519560944032
num threads=2
((gc-time-taken . 2791348254) (heap-size . 7532883968) (heap-free-size
. 2449408) (heap-total-allocated . 23912882640)
(heap-allocated-since-gc . 1073995264) (protected-objects . 90)
(gc-times . 87))

Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
Aborted

Similar issue in guile-2.2 except it takes longer (8 minutes) and
crashes in gc somewhere.  I assume that some sort of
continuation is left lying about, even though the thread has
exited.

(define junk 0)
(define halt #f)

(define (wtf-thr)
   (define start (- (current-time) 0.1))

   ; Create thread that does junk and exits.  Yes, the increment
   ; of `junk` is not protected, and its racey, but so what.
   (define (mkthr v) (call-with-new-thread (lambda () (set! junk (+ junk
1)))))

   ; thread arguments
   (define thrarg (make-list 10 0))

   (define cnt 0)
   (define (mke)
  ; Create a limited number of threads
  (define thr-list (map mkthr thrarg))
  ; (display (length (all-threads)))
  (map join-thread thr-list)

  ; Some handy debug printing.
  (set! cnt (+ cnt 1))
  (if (eq? 0 (modulo cnt 500))
 (begin
(display "rate=")
(display (/ cnt (- (current-time) start))) (newline)
(display "num threads=")
(display (length (all-threads))) (newline)
(display (gc-stats)) (newline) (newline)
 )))

   ; tail recursive infinite loop.
   (define (aloop) (mke) (if (not halt) (aloop)))

   ; while forever.
   (aloop)
)

; Run elsewhere, so that we have a shell prompt
; (not required for the bug)
(call-with-new-thread wtf-thr)

; halt if desired.
; (set! halt #t)





bug#25267: crashes here only for invalid scheme

2016-12-24 Thread Linas Vepstas
FYI: important note: this crashes only because an exception path is
taken. Due to a "bug" in the shell script above, `ctr` is undefined,
so an unbound-variable exception is thrown.  When the scheme is valid,
then it does NOT crash here!

--linas


opencog> (NumberNode ctr)
Entering scheme shell; use ^D or a single . on a line by itself to exit.
guile> Backtrace:
In ice-9/boot-9.scm:
 157: 12 [catch #t # ...]
In unknown file:
   ?: 11 [apply-smob/1 #]
In ice-9/boot-9.scm:
 157: 10 [catch #t # ...]
In unknown file:
   ?: 9 [apply-smob/1 #]
   ?: 8 [call-with-input-string "(NumberNode ctr)\n" ...]
In ice-9/boot-9.scm:
2320: 7 [save-module-excursion #]
In ice-9/eval-string.scm:
  44: 6 [read-and-eval # #:lang ...]
  37: 5 [lp (NumberNode ctr)]
In ice-9/eval.scm:
 387: 4 [eval # ()]
 393: 3 [eval # ()]
In unknown file:
   ?: 2 [memoize-variable-access! # #]
In ice-9/boot-9.scm:
 102: 1 [# unbound-variable ...]
In unknown file:
   ?: 0 [apply-smob/1 # unbound-variable ...]

ERROR: In procedure apply-smob/1:
ERROR: Unbound variable: ctr
ABORT: unbound-variable





bug#24446: close - not a bug

2016-12-24 Thread Linas Vepstas
Seems to me this bug report can be closed as "not a bug", given the
post above.

-- linas





bug#25267: guile-2.2 crash in GC

2016-12-24 Thread Linas Vepstas
FYI, this is quickly and easily reproducible, happens within seconds,
and hits the same spot every time. Note-to-self (not for general
consumption): my unit test to provoke this is to start the cogserver
and run this shell script:

#!/bin/bash

i=0
while true ; do
  let i=$i+1
  if [ "$(($i % 2000))" -eq "0" ] ; then
echo loop $i
  fi
  # echo '(display ctr)' | nc localhost 17001
  echo '(NumberNode ctr)' | nc localhost 17001
done

other testing variants are described in
https://github.com/opencog/opencog/issues/2550





bug#25267: guile-2.2 crash in GC

2016-12-24 Thread Linas Vepstas
Merry Christmas!

Below is a crash observed in guile-2.2, the git version of 21 December
2016  (last commit 0ce8a9a5e01d3a12d83fea85968e1abb602c9298 Author:
Andy Wingo 
Date:   Sun Dec 18 23:00:07 2016 +0100)

I do not have any simple test-case to reproduce this (yet?) so this is
an FYI bug report.  It was provoked by a stress test, with the goal of
running some 60+ calls to scm_c_catch in 60+ distinct C++ threads.  I
have no idea if this will crash any other version of guile; I have
never done this stress test before.

Here's what GDB says:

Thread 296 "cogserver" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc0ff9700 (LWP 3680)]
thread_mark (addr=0x558f7700, mark_stack_ptr=,
mark_stack_limit=0x7fffc0ff7c50, env=)
at ../../libguile/threads.c:111
111  while ((chain = *(void **)chain))
(gdb) bt
#0  thread_mark (addr=0x558f7700, mark_stack_ptr=,
mark_stack_limit=0x7fffc0ff7c50, env=)
at ../../libguile/threads.c:111
#1  0x72a80ffb in GC_mark_from (mark_stack_top=0x7fffc0fe7c60,
mark_stack_top@entry=0x7fffc0fe7ca0,
mark_stack=mark_stack@entry=0x7fffc0fe7c50,
mark_stack_limit=mark_stack_limit@entry=0x7fffc0ff7c50) at mark.c:737
#2  0x72a8163e in GC_do_local_mark (local_mark_stack=0x7fffc0fe7c50,
local_top=0x7fffc0fe7ca0) at mark.c:994
#3  0x72a81864 in GC_mark_local (
local_mark_stack=local_mark_stack@entry=0x7fffc0fe7c50, id=id@entry=0)
at mark.c:1129
#4  0x72a819bf in GC_do_parallel_mark () at mark.c:1157
#5  0x72a8282d in GC_mark_some (
cold_gc_frame=0x7fffc0ff7cb0 "\344\207\315\362\377\177") at mark.c:372
#6  0x72a782dd in GC_stopped_mark (
stop_func=0x72a77d70 ) at alloc.c:698
#7  0x72a78dca in GC_try_to_collect_inner (
stop_func=0x72a77d70 ) at alloc.c:486
#8  0x72a79782 in GC_collect_or_expand (
needed_blocks=needed_blocks@entry=1,
ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0)
at alloc.c:1344
#9  0x72a79942 in GC_allocobj (gran=gran@entry=2, kind=1)
at alloc.c:1434
#10 0x72a7f0a6 in GC_generic_malloc_inner (lb=lb@entry=32, k=k@entry=1)
at malloc.c:140
#11 0x72a80114 in GC_generic_malloc_many (lb=32, k=1,
result=0x563f7d88) at mallocx.c:439
#12 0x77728c34 in scm_inline_gc_alloc (kind=,
idx=, freelist=)
at ../../libguile/gc-inline.h:94
#13 scm_inline_gc_malloc (thread=, bytes=)
at ../../libguile/gc-inline.h:125
#14 scm_inline_gc_malloc_words (words=, thread=)
at ../../libguile/gc-inline.h:132
#15 scm_inline_words (n_words=, car=,
thread=) at ../../libguile/gc-inline.h:163
#16 vm_regular_engine (thread=0x0, vp=0x566fbd80,
registers=0x7fffc0ff7c50, resume=1434328064)
at ../../libguile/vm-engine.c:1622
#17 0x7772928e in scm_call_n (proc=0x7fffd971dd70,
argv=argv@entry=0x7fffc0ff80b0, nargs=nargs@entry=4)
at ../../libguile/vm.c:1250
#18 0x776ac224 in scm_call_4 (proc=,
arg1=arg1@entry=0x56750fa0, arg2=arg2@entry=0x56870fa0,
arg3=arg3@entry=0x5607d890, arg4=arg4@entry=0x52)
at ../../libguile/eval.c:502
#19 0x7769dd55 in display_backtrace_body (a=)
at ../../libguile/backtrace.c:244
#20 0x777251da in vm_regular_engine (thread=0x0, vp=0x566fbd80,
registers=0x7fffc0ff7c50, resume=1434328064)
at ../../libguile/vm-engine.c:760
#21 0x7772928e in scm_call_n (proc=proc@entry=0x56870f80,
argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1250
#22 0x776ac189 in scm_call_0 (proc=proc@entry=0x56870f80)
at ../../libguile/eval.c:475
#23 0x77718280 in catch (tag=tag@entry=0x404, thunk=0x56870f80,
handler=0x56870f60, pre_unwind_handler=0x4)
at ../../libguile/throw.c:138
#24 0x777185c5 in scm_catch_with_pre_unwind_handler (
key=key@entry=0x404, thunk=, handler=,
pre_unwind_handler=) at ../../libguile/throw.c:252
#25 0x7771877f in scm_c_catch (tag=tag@entry=0x404,
body=body@entry=0x7769dc30 ,
body_data=body_data@entry=0x7fffc0ff8480,
handler=handler@entry=0x7769e050 ,
handler_data=handler_data@entry=0x56870fa0,
pre_unwind_handler=pre_unwind_handler@entry=0x0,
pre_unwind_handler_data=0x0) at ../../libguile/throw.c:375
#26 0x7771878e in scm_internal_catch (tag=tag@entry=0x404,
body=body@entry=0x7769dc30 ,
body_data=body_data@entry=0x7fffc0ff8480,
handler=handler@entry=0x7769e050 ,
handler_data=handler_data@entry=0x56870fa0)
at ../../libguile/throw.c:384
#27 0x7769dc25 in scm_display_backtrace_with_highlights (
stack=, port=port@entry=0x56870fa0,
first=first@entry=0x4, depth=depth@entry=0x4,
highlights=highlights@entry=0x304) at ../../libguile/backtrace.c:282

bug#25238: guile-2.2 threading bug

2016-12-20 Thread Linas Vepstas
Merry Christmas!  A guile-git threading bug!

-- linas


$ cat fail.cc
//
// fail.cc
//
// This C++ program crashes, when compiled against today's (20 Dec 2016)
// guile-2.2 from git. git log says a recent commit is
//0ce8a9a5e01d3a12d83fea85968e1abb602c9298
// but I believe any guile-2.2 version from late 2016 will crash.
//
// I built this as:
// cc fail.cc -I /usr/local/include/guile/2.2 -lguile-2.2  -lpthread -lstdc++
//
// gdb gives the following stack trace:
/*
Thread 7 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x73306700 (LWP 23578)]
0x77b03076 in is_dynamic_state (x=0x0) at ../../libguile/fluids.c:97
97  return SCM_HAS_TYP7 (x, scm_tc7_dynamic_state);
(gdb) bt
#0  0x77b03076 in is_dynamic_state (x=0x0) at ../../libguile/fluids.c:97
#1  scm_set_current_dynamic_state (state=state@entry=0x0)
at ../../libguile/fluids.c:496
#2  0x77b6351a in guilify_self_2 (dynamic_state=dynamic_state@entry=0x0)
at ../../libguile/threads.c:466
#3  0x77b63e0c in scm_i_init_thread_for_guile (base=0x73305df0,
dynamic_state=0x0) at ../../libguile/threads.c:595
#4  0x77b63e59 in with_guile (base=base@entry=0x73305df0,
data=data@entry=0x73305e20) at ../../libguile/threads.c:638
#5  0x76c15812 in GC_call_with_stack_base (
fn=fn@entry=0x77b63e40 , arg=arg@entry=0x73305e20)
at misc.c:1925
#6  0x77b641f8 in scm_i_with_guile (dynamic_state=,
data=, func=) at ../../libguile/threads.c:688
#7  scm_with_guile (func=, data=)
at ../../libguile/threads.c:694
#8  0x5064 in foo(int) ()
#9  0x6776 in void std::_Bind_simple::_M_invoke<0ul>(std::_Index_tuple<0ul>) ()
#10 0x66c9 in std::_Bind_simple::operator()() ()
#11 0x66a8 in
std::thread::_State_impl::_M_run() ()
#12 0x775cccdf in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x7789b464 in start_thread (arg=0x73306700) at
pthread_create.c:333
#14 0x770469df in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb)
*/


#include <libguile.h>

#include <cstdio>
#include <thread>
#include <vector>

void * wrap_foo(void *p)
{
   scm_c_eval_string ("(setlocale LC_ALL \"\")");
   return nullptr;
}

static volatile bool hold = true;

void foo(int thread_id)
{
   while (hold) {} // spin
   // A long sleep here avoids the crash
   // usleep(thread_id * 10);
   scm_with_guile(wrap_foo, nullptr);
}

int main()
{
   int n_threads = 12;
   std::vector<std::thread> thread_pool;
   for (int i=0; i < n_threads; i++)
  thread_pool.push_back(std::thread(&foo, i));

   printf("Done creating %d threads\n", n_threads);
   hold = false;

   for (std::thread& t : thread_pool) t.join();
   printf("Done joining %d threads\n", n_threads);
}
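
The shape of this test (create all the threads, hold them at a gate, then release them together so that they all enter the contended code at once) is independent of Guile. Below is a minimal sketch of that pattern in plain Python. The names run_gated and work are hypothetical, and a threading.Event stands in for the spin on `hold`, so the threads block rather than spin.

```python
import threading


def run_gated(n_threads, work):
    """Start n_threads workers, hold them at a gate, release them together."""
    gate = threading.Event()
    results = [None] * n_threads

    def worker(i):
        gate.wait()            # plays the role of `while (hold) {}`
        results[i] = work(i)   # the contended call would go here

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    gate.set()                 # plays the role of `hold = false`
    for t in threads:
        t.join()
    return results


squares = run_gated(12, lambda i: i * i)
print(squares)
```

Releasing every thread at the same instant is what makes the first-time initialization race likely; staggering the starts, as the commented-out usleep does, hides it.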





bug#19180: guile bug#19180: vacuum_weak_hash_table error

2016-06-22 Thread Linas Vepstas
Anand, I've been using guile, straight from git for maybe almost 2
years(??) in a semi-harsh environment (lots of threads, lots of C++ smob
jiggery-pokery, entering and exiting guile, redirecting ports, using fluids
at the scheme/c++ boundary, catching and throwing exceptions to/from C++,
interleaving all this with python, too, -- and medium cpu burn over many
days/week) and it seems to work fine.  Try it -- fix a tag, run system
test, I bet it will work for you.  I think Ludo and Andy have done a good
job.

The only recent glitch is that the setvbuf API changed.  An old quasi-issue
is that garbage collection seems to not be aggressive enough.  After
several layers of C calling guile calling C and so on, mem usage seems to
bloat (a lot -- many many GB's) unless I forcibly run GC every 50th time I
re-enter guile. But I think it does that in guile-2.0 too.

--linas

On Wed, Jun 22, 2016 at 10:43 AM, Anand Mohanadoss 
wrote:

> Hi Andy,
>
> Thanks a lot for looking into this and your response!  Any idea when we
> will have a stable 2.2 release that we can move to given that 2.1 has been
> out for a few months.
>
> Thanks,
> Anand
>
> On Wed, Jun 22, 2016 at 8:25 PM, Andy Wingo  wrote:
>
>> Hi :)
>>
>> On Mon 15 Dec 2014 07:36, Anand Mohanadoss  writes:
>>
>> > Here is what we changed in hashtab.c -
>> >
>> > 130a131
>> >> size_t orig_len = len;
>> > 137,138c138,144
>> > < assert (removed <= len);
>> > < len -= removed;
>> > ---
>> >> if (removed <= len)
>> >> len -= removed;
>> >> else
>> >> {
>> >> printf ("Vacuum weak hash table assert Table=%p len=%zi removed=%zi
>> > orig_len=%zi n_items=%zi\n", table, len, removed, orig_len,
>> > SCM_HASHTABLE_N_ITEMS (table));
>> >> len = 0;
>> >> }
>> >
>> > With this change, we got lines similar to the following printed
>> > periodically -
>> >
>> > Vacuum weak hash table assert Table=0x9bdb840 len=0 removed=1
>> > orig_len=2321 n_items=2321
>>
>> I guess printing a warning is not worse than crashing.  I was unable to
>> make this table work in a reliable way in 2.0 without rewriting it, so
>> in 2.2 there's a new implementation with hopefully no bug in this
>> regard.
>>
>> Ludovic what do you think, should we just be sloppy in 2.0 and remove
>> the assertion?  I don't think it's fixable.  The other option I see is
>> to close as WONTFIX.
>>
>> Andy
>>
>
>
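
For context on why the quoted assertion matters: if the subtraction were allowed to proceed with removed greater than len, an unsigned length would wrap around to a huge value. The sketch below imitates that in Python. It assumes a 64-bit size_t (the thread does not say which width was in use), and the function names are invented for the illustration.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def unchecked_sub(length, removed):
    """What `len -= removed` does for unsigned values: wrap modulo 2**64."""
    return (length - removed) & SIZE_MAX


def clamped_sub(length, removed):
    """The workaround quoted above: subtract if possible, else clamp to 0."""
    if removed <= length:
        return length - removed
    return 0


# Values from the quoted log line: len=0 removed=1
print(unchecked_sub(0, 1))
print(clamped_sub(0, 1))
```

Clamping keeps the process alive, but the count is then known to be out of sync with the table, which matches Andy's remark that the underlying problem is not really fixed by it.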


bug#21221: closed (Re: bug#21221: guile-2.2 socket ports used with display does not send utf8 !)

2016-06-21 Thread Linas Vepstas
On Mon, Jun 20, 2016 at 10:56 AM, GNU bug Tracking System <
help-debb...@gnu.org> wrote:

> (display "Hòa Phú Phú Tân Hiệp An  Tương Bình Hiệp Định Hòa\n" sss)


Yep,  it now works for me with guile-2.1.3

--linas


bug#21221: guile-2.2 socket ports used with display does not send utf8 !

2015-08-08 Thread Linas Vepstas
The following simple client-server program fails for me.
For simplicity, for the server, just use netcat listening on port :

$ nc -l 

In a guile shell, try this:
(setlocale LC_ALL "")
(define sss (socket PF_INET SOCK_STREAM 0))
(set-port-encoding! sss "utf-8")
(connect sss AF_INET (inet-pton AF_INET "127.0.0.1") )
(set-port-encoding! sss "utf-8")
(display "SmålandSmåland\n" sss)
(close-port sss)


The "SmålandSmåland" gets corrupted:  nc receives "Sm?landSm?land"

Some types of utf8 do go through, so e.g. (display "Ćićolina\n" sss)
seems to work fine.  This Finnish/Norwegian thing, though, fails,
Vietnamese too.
(display "Hòa Phú Phú Tân Hiệp An  Tương Bình Hiệp Định Hòa\n" sss)

I suppose the answer is "don't use display for sending strings on a
socket", but I'm stumped as to why there should even be an encoding
error, why it's not utf8 end-to-end.

This is for guile-2.2 from a recent git pull of the master source from
about June 2015, but I believe the problem occurs on guile-2.0 as
well.

guile --version
guile (GNU Guile) 2.1.0.305-e7097-dirty

this is on ubuntu 14.04 aka ubuntu trusty

-- Linas Vepstas





Re: UTF-8 regression in guile 1.9.5

2009-12-11 Thread Linas Vepstas
2009/12/11 Mike Gran spk...@yahoo.com:

> I think I prefer that the coder take the responsibility of calling
> setlocale, but, I only think that because it is how C works.  I'm used
> to that convention.


OK works for me.

--linas




UTF-8 regression in guile 1.9.5

2009-12-06 Thread Linas Vepstas
Hi,

I seem to see either a regression in guile-1.9.5 with regard
to UTF-8 strings, or at least some sort of incompatible change.

In guile-1.8.6, I am able to do the following:

SCM new_node (SCM sname)
{
char * cname = scm_to_locale_string(sname);
printf ("The name is %s\n", cname);
free (cname);
return SCM_EOL;
}

scm_c_define_gsubr("new-node", 1, 0, 0, new_node);

Then, from the guile prompt, I can evaluate the following:

   (new-node "てみました。")

and get the output "The name is てみました。"


However, in guile-1.9.5, the above gives me:

   The name is てみましたã€

Now, it is very possible that I've forgotten to say

  (use-modules some-new-utf8-module)

but I am unclear on what that module is (and why it's not
specified by default).

In both cases, my shell has: LANG=en_US.UTF-8

--linas




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Linas Vepstas
2009/12/6 Mike Gran spk...@yahoo.com:
> From: Linas Vepstas linasveps...@gmail.com
>
>> Then, from the guile prompt, I can evaluate the following:
>>
>>    (new-node "てみました。")
>>
>> and get the output "The name is てみました。"
>>
>> However, in guile-1.9.5, the above gives me:
>>
>>    The name is ã¦ã¿ã¾ããã
>
> Hmm.  The ã is a dead giveaway that you are printing a UTF-8 string
> that is being interpreted as an ISO-8859-1 string.
>
> You've already said that you're in a UTF-8 locale.  It could be that you
> need to call (setlocale LC_ALL "")

That cured it.

> as well as having a setlocale call in your program.

Doesn't seem to be required, after the above.

Thanks!

Why this happened is strange; I'm now investigating.  Sorry to
have bothered you with something that is dohh .. basic.

--linas
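
The "ã" giveaway mentioned in the exchange above is easy to reproduce. In UTF-8, every character in the range U+3000 to U+3FFF starts with the byte 0xE3, and 0xE3 is "ã" in ISO-8859-1. The sketch below is plain Python, independent of Guile, and simply reads the UTF-8 bytes of the test string as ISO-8859-1.

```python
text = "てみました。"

utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("latin-1")   # read UTF-8 bytes as ISO-8859-1

# Every character of this string encodes to three bytes led by 0xE3,
# so the misread form has one "ã" per original character.
lead_count = misread.count("ã")
print(len(text), len(misread), lead_count)
```

The remaining two bytes of each character fall in 0x80 to 0xBF, which ISO-8859-1 maps to control codes and punctuation, so a terminal may show or drop them. That is consistent with the two differently garbled outputs quoted in this thread.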




Re: UTF-8 regression in guile 1.9.5

2009-12-06 Thread Linas Vepstas
2009/12/6 Mike Gran spk...@yahoo.com:

>  need to call (setlocale LC_ALL "")
>
> But for Guile to store characters as codepoints, declaring a locale
> is pretty much a requirement now.

Would it make sense to add (setlocale LC_ALL "") to some default,
e.g. boot-9.scm  ?

--linas




Re: GNU Guile 1.9.5 released (alpha)

2009-11-28 Thread Linas Vepstas
2009/11/17 Ludovic Courtès l...@gnu.org:
> We are pleased to announce GNU Guile release 1.9.5.

FWIW, it appears that guile-1.9.5 does not work with the default
bdw-gc in ubuntu/debian, which is gc-6.8 -- I got the crash below

downloading, compiling, installing gc-7.1 seems to fix the
problem.

BTW, I am vaguely thinking of using bdw-gc for my code,
which links to guile .. will this be a problem ??

--linas

==
a crash, when evaluating (+ 1 1)

I believe that this is another threading bug -- my app loads a
number of scm files during startup, and then starts an
interactive shell. The crash occurs in the shell, with the following
stack trace:

#0  0xf73fbce5 in GC_local_malloc () from /usr/lib/libgc.so.1
#1  0xf7d53ca3 in scm_gc_malloc (size=348, what=0xf7df6634 thread)
at gc-malloc.c:200
#2  0xf7db453c in guilify_self_1 (base=0xf54a49dc) at threads.c:326
#3  0xf7db633d in scm_i_init_thread_for_guile (base=0xf54a49dc,
parent=0xa293af0) at threads.c:593
#4  0xf7db6545 in scm_i_with_guile_and_parent (func=0xf7c56180
opencog::SchemeEval::c_wrap_eval(void*),
data=0xa2167d8, parent=0xa293af0) at threads.c:732
#5  0xf7db668e in scm_with_guile (func=0xf7c56180
opencog::SchemeEval::c_wrap_eval(void*), data=0xa2167d8)
at threads.c:715
#6  0xf7c564ba in opencog::SchemeEval::eval (this=0xa2167d8, ex...@0xf54a4a6c)
at 
/home/linas/src/novamente/src/opencog-embodiment/opencog/guile/SchemeEval.cc:460

FWIW, the above was built against gc-6.8, which is what
ubuntu and debian come with.  Suspecting that this is a
gc problem, I downloaded, compiled, installed gc-7.1 and
went to rebuild guile.   it works!

--linas




[bug #24867] `define' should be thread-safe

2008-12-23 Thread Linas Vepstas

Follow-up Comment #5, bug #24867 (project guile):

1) Easier said than done, because: 

1a) the mutex needs to be recursive, since sym2var evaluates code in
boot9.scm which can cause sym2var to run again. The core problem is that the
mechanism for specifying recursive mutexes seems to be somewhat OS-dependent
(and possibly some OS'es don't support recursive mutexes??) and so a
portability wrapper might be needed. :-(

1b) There's still a strange deadlock somehow; am debugging.

3) Fine-grained usually means speedy. *if* there was some per-module C
struct, then the mutex could be put in there. (I don't know of any, but I
don't know guile internals).  The alternative would be somehow grabbing a lock
in the boot9.scm code, but I don't see how, without making some symbol lookup
(i.e. race).
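
To make point 1a concrete: a plain mutex cannot be taken again by the thread that already holds it, while a recursive one can. The sketch below shows the difference using Python's threading.Lock and threading.RLock as stand-ins. It says nothing about how libguile or any particular OS spells recursive mutexes, and the helper name is made up.

```python
import threading


def reentrant_acquire_ok(lock):
    """Hold the lock, then try to take it again from the same thread."""
    lock.acquire()
    try:
        got_it = lock.acquire(blocking=False)  # non-blocking, cannot deadlock
        if got_it:
            lock.release()
        return got_it
    finally:
        lock.release()


plain_result = reentrant_acquire_ok(threading.Lock())
recursive_result = reentrant_acquire_ok(threading.RLock())
print(plain_result, recursive_result)
```

With a blocking acquire, the plain-lock case would hang forever, which is the self-deadlock that a sym2var re-entering itself through boot9.scm would hit.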


___

Reply to this item at:

  http://savannah.gnu.org/bugs/?24867

___
  Message sent via/by Savannah
  http://savannah.gnu.org/





[bug #24867] `define' should be thread-safe

2008-12-23 Thread Linas Vepstas

Follow-up Comment #6, bug #24867 (project guile):

I've attached a patch to lock the entry and exit to scm_sym2var().  However,
it doesn't fix the problem (at all).

-- There are still deadlocks in garbage collection, (but different from the
previously reported one !?)
-- There are still crashes (with same stack trace as previously posted)
-- There are still races.



(file #17129)
___

Additional Item Attachment:

File name: symvar-race.patch  Size:1 KB







[bug #24867] `define' should be thread-safe

2008-12-22 Thread Linas Vepstas

Follow-up Comment #2, bug #24867 (project guile):


The following patch protects the update of the module hash tables
to be thread-safe. This is a partial solution to the bug reported 
in https://savannah.gnu.org/bugs/?24867 This is not a full solution,
because other threads might still be reading the hash tables while
they are being updated, and thus may obtain stale/bad data.

Signed-off-by: Linas Vepstas linasveps...@gmail.com

---
 libguile/modules.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: guile-1.8.6/libguile/modules.c
===
--- guile-1.8.6.orig/libguile/modules.c 2008-12-22 18:38:41.0 -0600
+++ guile-1.8.6/libguile/modules.c  2008-12-22 20:22:19.0 -0600
@@ -555,11 +555,16 @@ scm_c_define (const char *name, SCM valu
   return scm_define (scm_from_locale_symbol (name), value);
 }
 
+scm_i_pthread_mutex_t scm_i_define_mutex;
+
 SCM
 scm_define (SCM sym, SCM value)
 {
-  SCM var =
+  SCM var;
+  scm_pthread_mutex_lock(&scm_i_define_mutex);
+  var =
 scm_sym2var (sym, scm_current_module_lookup_closure (), SCM_BOOL_T);
+  scm_i_pthread_mutex_unlock(&scm_i_define_mutex);
   SCM_VARIABLE_SET (var, value);
   return var;
 }
@@ -651,6 +656,8 @@ void
 scm_init_modules ()
 {
 #include "libguile/modules.x"
+  scm_i_pthread_mutex_init (&scm_i_define_mutex, NULL);
+
   module_make_local_var_x_var = scm_c_define ("module-make-local-var!",
SCM_UNDEFINED);
   scm_tc16_eval_closure = scm_make_smob_type ("eval-closure", 0);


(file #17118)
___

Additional Item Attachment:

File name: define-race.patch  Size:1 KB
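
The idea behind the patch, and the gap it leaves open, can be sketched without Guile. In the Python illustration below a dict stands in for the module hash table and a single lock plays the role of scm_i_define_mutex. Unlike the patch, the lookup path also takes the lock, which is the missing half described in the patch comment. All names here are hypothetical.

```python
import threading

module_obarray = {}               # stand-in for a module's symbol table
define_mutex = threading.Lock()   # plays the role of scm_i_define_mutex


def define(sym, value):
    """Insert or update a binding while holding the mutex."""
    with define_mutex:
        module_obarray[sym] = value


def lookup(sym):
    """Readers take the same mutex; the patch above does not do this."""
    with define_mutex:
        return module_obarray.get(sym)


def definer(thread_id, count):
    for i in range(count):
        define(f"x{thread_id}-{i}", i)


threads = [threading.Thread(target=definer, args=(t, 1000)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(module_obarray))
```

Locking readers as well as writers is the simplest complete fix in this toy model, at the price of serializing every lookup.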







[bug #24867] `define' should be thread-safe

2008-12-22 Thread Linas Vepstas

Additional Item Attachment, bug #24867 (project guile):

File name: define-race.c  Size:2 KB







[bug #24867] `define' should be thread-safe

2008-12-22 Thread Linas Vepstas

Follow-up Comment #3, bug #24867 (project guile):

Note that the attached test case shows 4 distinct behaviours:

1) runs fine, exiting normally, w/o any errors

2) Spewing messages such as:
ERROR: Unbound variable: x2-347
ERROR: Unbound variable: x2-347
ERROR: Unbound variable: x2-347
ERROR: Unbound variable: define
ERROR: Unbound variable: x1-2525
ERROR: Unbound variable: x1-2525
ERROR: Unbound variable: x1-2525
ERROR: Unbound variable: x1-2525
ERROR: Unbound variable: x1-2525
ERROR: Unbound variable: x1-2525

  the numbers being different each time, of course, but then exiting
normally.

3) Deadlocking in garbage collection, with all four threads
   stuck in 
   #5  0xf7f01f6b in scm_gc_for_newcell (freelist=0xf7f8c8ec,
free_cells=0x90c810c) at gc.c:484

  (this may be preceded by the prints above)

4) Segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf74c5b90 (LWP 30649)]
0xf7f426ad in scm_assert_smob_type (tag=639, val=0x0) at smob.c:63
63if (!SCM_SMOB_PREDICATE (tag, val))
(gdb) bt
#0  0xf7f426ad in scm_assert_smob_type (tag=639, val=0x0) at smob.c:63
#1  0xf7f0b4c5 in scm_make_dynamic_state (parent=0x0) at fluids.c:508
#2  0xf7f62bdf in guilify_self_2 (parent=0x0) at threads.c:508
#3  0xf7f64992 in scm_i_init_thread_for_guile (base=0xf74c5388, parent=0x0)
at threads.c:611
#4  0xf7f649c5 in scm_i_with_guile_and_parent (
func=0x8048754 guile_mode_definer, data=0xffaf0cbc, parent=0x0)
at threads.c:743
#5  0xf7f64ace in scm_with_guile (func=0x8048754 guile_mode_definer, 
data=0xffaf0cbc) at threads.c:732
#6  0x080488a8 in definer ()
#7  0xf7fa14fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#8  0xf7e45e5e in clone () from /lib/tls/i686/cmov/libc.so.6


I'm guessing that the use of immutable vlists, as proposed by Ludo, for use
in the module-obarray per mail discussion, would solve all of the above.  







Re: [PATCH] Final: thread lock nesting debugging

2008-11-20 Thread Linas Vepstas
2008/11/19 Neil Jerram [EMAIL PROTECTED]:
> 2008/11/17 Linas Vepstas [EMAIL PROTECTED]:
>> I've been seeing all sorts of deadlocks in guile, and so I wrote a small
>> debugging utility to try to track down the problems.
>
> Interesting patch!
>
> One query; I may be being a bit dumb, I'm only just recovering from a
> bad cold, but anyway...  Your patch checks for a thread unlocking
> mutexes in the reverse order that it locked them in (let's call this
> point A).  But I thought your recent investigations had shown that
> the problem was threads doing locking in inconsistent order, e.g.
> thread 1 locks M1 and then M2, while thread 2 locks M2 and then M1
> (point B).  Are points A and B equivalent?  (It isn't obvious to
> me if so.)

Hi Neil,

There is (should be) only one lock in guile that is inconsistent
in its locking order, and this is the t->heap_mutex lock.

My guess is that valgrind is tripping over this one. I guess
I should argue that this is why one needs a custom patch,
instead of using valgrind (which is otherwise pretty fantastic
for mem corruption and the like).

The t->heap_mutex lock is held whenever a thread is
guilified or is in guile mode.  Its primary reason for
being is to keep the garbage collector from running
until all threads have been halted. (This is done by
scm_i_thread_put_to_sleep.)

After applying my set of patches, the only inconsistent
(and widespread!) lock ordering problem that I'm seeing
stems from the asymmetric way in which
scm_i_scm_pthread_mutex_lock is used to take a lock,
and then drop it.  If you follow the #define for
scm_i_scm_pthread_mutex_lock, you find that it's of
the form:

   drop (thread->heap_mutex)
   take (some lock)
   take (thread->heap_mutex)

Whereas the unlock is just

   drop(some lock)

You can see this, for example, in ports.c line 728

  scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
  scm_i_remove_port (port);
  scm_i_pthread_mutex_unlock (&scm_i_port_table_mutex);

To be correctly nested, the unlock should have dropped
the heap mutex first, and then reacquired it.  I believe that
doing this would be enough to quiet down helgrind (at least
for most cases ... what remains would be interesting).

OK, the above was just facts; below, some random comments
which might be incorrect (reasoning about locks can be
deceptive; I've certainly mis-reasoned several times w.r.t. guile)

-- I had decided that the way that the dropping of the
lock is done is OK, and that it would be silly (and a
performance hit) to try to fix the unlocking order.
For debugging with helgrind, you may want to do this
anyway, but for production, it seemed unnecessary.
Prod me, and perhaps I can reconstruct why I came to
this conclusion.

-- The reason for dropping the heap_mutex before grabbing
the other lock (for example scm_i_port_table_mutex)
is somewhat obscure, but at one point I decided that this
was OK, and arguably correct.  As I write this, I've
forgotten why. However, this should be a focus of attention,
and should be thought through again.  If you are willing to
think about it, prod me and maybe I can say something
intelligent.  Changing this would also quiet helgrind (and
boost performance).  It might be safe; I need to rethink this.

Anyway, my patch allows for this single occurrence of the
out-of-sequence heap_mutex unlock.

--linas




Re: [PATCH] Final: thread lock nesting debugging

2008-11-20 Thread Linas Vepstas
2008/11/20 Linas Vepstas [EMAIL PROTECTED]:

 -- The reason for dropping the heap_mutex before grabbing
 the other lock (for example  scm_i_port_table_mutex),
 is somewhat obscure, but at one point I decided that this
 was OK, and arguably correct.  As I write this, I've
 forgotten why. However, this should be a focus of attention,
 and re-thought-out.

Well, a quick look reminds me of the situation: in many/most
cases, locked sections might trigger garbage collection.
Thus, the heap_mutex *must* be dropped before the lock
is taken.

My gut impression is that this is a poor design point; and
that the correct thing to do would be make locks fine-grained,
so that there is never a need to run GC while a lock is held.
This would require extensive auditing of the guile code.

--linas




Re: Hung threads

2008-11-16 Thread Linas Vepstas
Hi,

2008/11/14 Linas Vepstas [EMAIL PROTECTED]:
 Here's a deadlock I saw today.

Here's a different deadlock that is fully debugged. The
guilty code leading to the deadlock is in make_struct(),
in struct.c circa line 463, which tries to alloc memory
while holding a CRITICAL_SECTION lock.  Of course,
everything deadlocks in GC.

I am trying to figure out how to fix this now. It's kind
of gnarly.

Summary:
thread 7 -- holding critical section lock, sleeping on scm_i_sweep_mutex
thread 5 -- holding heap_mutex, sleeping on critical section
thread 12 -- holding scm_i_sweep_mutex, sleeping on heap_mutex


^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0xf79f86c0 (LWP 10364)]
0xe425 in __kernel_vsyscall ()
(gdb) info threads
  12 Thread 0xf24fdb90 (LWP 10395)  0xe425 in __kernel_vsyscall ()
  10 Thread 0xf34ffb90 (LWP 10389)  0xe425 in __kernel_vsyscall ()
  9 Thread 0xf3e7db90 (LWP 10387)  0xe425 in __kernel_vsyscall ()
  7 Thread 0xf4e7fb90 (LWP 10380)  0xe425 in __kernel_vsyscall ()
  6 Thread 0xf5680b90 (LWP 10377)  0xe425 in __kernel_vsyscall ()
  5 Thread 0xf5ec4b90 (LWP 10374)  0xe425 in __kernel_vsyscall ()
  2 Thread 0xf76c7b90 (LWP 10365)  0xe425 in __kernel_vsyscall ()
* 1 Thread 0xf79f86c0 (LWP 10364)  0xe425 in __kernel_vsyscall ()

lock summary
thread 5 -- holding heap_mutex
sleeping on CRITICAL_SECTION in scm_c_catch

thread 6 -- holds no locks
sleeping on scm_i_sweep_mutex in increase_mtrigger

thread 7 -- holding critical section lock  in make_struct
then tries to alloc mem ... !
sleeping on scm_i_sweep_mutex in increase_mtrigger

thread 9 -- holding heap_mutex
sleeping on CRITICAL SECTION in scm_c_catch

thread 10 -- holding ?? in  increase_mtrigger
called from scm_gc_register_collectable_memory
sleeping on scm_i_sweep_mutex in increase_mtrigger

thread 12 --  trying to put everything to sleep,
 holding admin mutex, and most heap_mutexes
 sleeping on remaining heap_mutexes that are still held.

The guilty party is thread 7 which tries to alloc memory while holding
a critical section lock. This leads to a deadlock:

thread 7 -- holding critical section lock, sleeping on scm_i_sweep_mutex
thread 5 -- holding heap_mutex, sleeping on critical section
thread 12 -- holding scm_i_sweep_mutex, sleeping on heap_mutex


(gdb) call prt_lockholders()

Thread 0xf5680b90 -- thread 6

Thread 0xf34ffb90  -- thread 10
0: mutex (0xf7816d0c) in:
/usr/lib/libguile.so.17 [0xf778f9f1]
c_register_collectable_memory+0x2a) [0xf778fbca]
/libguile.so.17(scm_gc_malloc+0x40) [0xf7790010]
.so.17(scm_gc_calloc+0x2c) [0xf77901bc]

Thread 0xf24fdb90 -- thread 12
0: mutex (0xf7812780) in: -- this is the scm_i_sweep_mutex
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
f]
x19) [0xf778db29]
(scm_gc_register_collectable_memory+0x2a) [0xf778fbca]
1: thread_admin_mutex (0xf78191ec) in:
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
/usr/lib/libguile.so.17(scm_i_gc+0x19) [0xf778db29]
/usr/lib/libguile.so.17 [0xf778fa5c]
/usr/lib/libguile.so.17(scm_gc_register_collectable_memory+0x2a)
[0xf778fbca]
2: t-heap_mutex (0x8c11484) in:
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
/usr/lib/libguile.so.17(scm_i_gc+0x19) [0xf778db29]
/usr/lib/libguile.so.17 [0xf778fa5c]
/usr/lib/libguile.so.17(scm_gc_register_collectable_memory+0x2a)
[0xf778fbca]
3: t-heap_mutex (0x8c682bc) in:
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
/usr/lib/libguile.so.17(scm_i_gc+0x19) [0xf778db29]
/usr/lib/libguile.so.17 [0xf778fa5c]
/usr/lib/libguile.so.17(scm_gc_register_collectable_memory+0x2a)
[0xf778fbca]
4: t-heap_mutex (0xf3518c74) in:
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
/usr/lib/libguile.so.17(scm_i_gc+0x19) [0xf778db29]
/usr/lib/libguile.so.17 [0xf778fa5c]
/usr/lib/libguile.so.17(scm_gc_register_collectable_memory+0x2a)
[0xf778fbca]
5: t-heap_mutex (0x8c10e84) in:
/usr/lib/libguile.so.17(scm_i_thread_put_to_sleep+0x7f) [0xf77e395f]
/usr/lib/libguile.so.17(scm_i_gc+0x19) [0xf778db29]
/usr/lib/libguile.so.17 [0xf778fa5c]
/usr/lib/libguile.so.17(scm_gc_register_collectable_memory+0x2a)
[0xf778fbca]

Thread 0xf4e7fb90 -- thread 7
0: scm_i_critical_section_mutex (0xf781c808) in:
/usr/lib/libguile.so.17(scm_make_struct+0xe2) [0xf77e1d52]
/usr/lib/libguile.so.17(scm_make_stack+0x181) [0xf77c6b11]

/home/linas/src/novamente/src/opencog-stage4/staging/bin/opencog/guile/libsmob.so(_ZN7opencog10SchemeEval17preunwind_handlerEP17scm_unused_structS2_+0x26)
[0xf7879696]

/home/linas/src/novamente/src/opencog-stage4/staging/bin/opencog

[PATCH]: deadlock in make_struct()

2008-11-16 Thread Linas Vepstas
Hi,

Here's a deadlock patch. When committing this patch, please copy
the text below into the source code commit message, as it provides
a record of what the patch does, and why it was made.

--linas


The patch below fixes a deadlock in the multi-threading code.
It fixes this by simply removing the CRITICAL_SECTION lock.
There does not seem to be any need for a lock on this section,
since all variables accessed are local, and all other literals
are compile-time constants.

A typical deadlock that was witnessed was:

thread 7 -- holding critical section lock, sleeping on scm_i_sweep_mutex
held in scm_make_struct() at struct.c:463,
sleeping in increase_mtrigger() at  gc-malloc.c:234
thread 5 -- holding heap_mutex, sleeping on critical section
held because thread is in guile mode
sleeping in scm_c_catch() at throw.c:187
thread 12 -- holding scm_i_sweep_mutex, sleeping on heap_mutex
 held in increase_mtrigger() at gc-malloc.c:234
 sleeping in scm_i_thread_put_to_sleep() at threads.c:1621

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

---
 libguile/struct.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: guile-1.8.5/libguile/struct.c
===
--- guile-1.8.5.orig/libguile/struct.c  2008-11-16 10:58:31.0 -0600
+++ guile-1.8.5/libguile/struct.c   2008-11-16 12:17:05.0 -0600
@@ -450,7 +450,14 @@ SCM_DEFINE (scm_make_struct, make-struc
 goto bad_tail;
 }

-  SCM_CRITICAL_SECTION_START;
+  /* In guile 1.8.5 and earlier, everything below was covered by a
+ CRITICAL_SECTION lock. This leads to deadlocks in garbage collection,
+ since other threads might be holding the heap_mutex, while sleeping
+ on the CRITICAL_SECTION lock. There does not seem to be any need
+ for a lock on the section below, as it does not access or update
+ any globals. vtable, basic_size, tail_elts are all local variables,
+ scm_tc3_struct and scm_struct_i_* are all compile-time consts.
+ So the lock has been removed. */
   if (SCM_STRUCT_DATA (vtable)[scm_struct_i_flags] & SCM_STRUCTF_ENTITY)
 {
   data = scm_alloc_struct (basic_size + tail_elts,
@@ -466,7 +473,6 @@ SCM_DEFINE (scm_make_struct, make-struc
   handle = scm_double_cell ((((scm_t_bits) SCM_STRUCT_DATA (vtable))
 + scm_tc3_struct),
(scm_t_bits) data, 0, 0);
-  SCM_CRITICAL_SECTION_END;

   /* In guile 1.8.1 and earlier, the SCM_CRITICAL_SECTION_END above covered
  also the following scm_struct_init.  But that meant if scm_struct_init

[PATCH] Final: thread lock nesting debugging

2008-11-16 Thread Linas Vepstas
I've been seeing all sorts of deadlocks in guile, and so I wrote a small
debugging utility to try to track down the problems.  I'd like to see
this patch included in future versions of guile.

I found one bug with this tool, and have submitted a patch for that
already.  It looks like there's another bug involving signals --
there is a probable deadlock in garbage collection if a signal
is sent at just the wrong time. The bug can be seen by enabling
this patch, and then running 'make check'.  I'm going to ignore
this, as I'm not worried about signals right now.

This is my final version of this patch; I'd sent a beta version
a few days ago. It's final because I'm not anticipating any
further changes.

--linas

Add a deadlock debugging facility to guile.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

---
 config.h.in|3
 configure.in   |   11 +++
 libguile/Makefile.am   |2
 libguile/debug-locks.c |  159 +
 libguile/pthread-threads.h |   15 
 libguile/threads.c |   53 +++
 libguile/threads.h |8 ++
 7 files changed, 250 insertions(+), 1 deletion(-)

Index: guile-1.8.5/libguile/pthread-threads.h
===
--- guile-1.8.5.orig/libguile/pthread-threads.h 2008-11-16 18:57:19.0 -0600
+++ guile-1.8.5/libguile/pthread-threads.h  2008-11-16 18:57:48.0 -0600
@@ -91,6 +91,21 @@ extern pthread_mutexattr_t scm_i_pthread
 #define scm_i_scm_pthread_cond_wait scm_pthread_cond_wait
 #define scm_i_scm_pthread_cond_timedwait scm_pthread_cond_timedwait

+#ifdef GUILE_DEBUG_LOCKS
+#undef scm_i_pthread_mutex_lock
+#define scm_i_pthread_mutex_lock(ARG) scm_i_pthread_mutex_lock_dbg(ARG, #ARG)
+
+#undef scm_i_pthread_mutex_unlock
+#define scm_i_pthread_mutex_unlock(ARG) scm_i_pthread_mutex_unlock_dbg(ARG, #ARG)
+
+int scm_i_pthread_mutex_lock_dbg(pthread_mutex_t *, const char *);
+int scm_i_pthread_mutex_unlock_dbg(pthread_mutex_t *, const char *);
+
+void prt_lockholders(void);
+void prt_this_lockholder(void);
+
+#endif
+
 #endif  /* SCM_PTHREADS_THREADS_H */

 /*
Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-16 18:57:19.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-16 18:57:48.0 -0600
@@ -441,6 +441,24 @@ guilify_self_1 (SCM_STACKITEM *base)
   SCM_SET_FREELIST_LOC (scm_i_freelist, &t->freelist);
   SCM_SET_FREELIST_LOC (scm_i_freelist2, &t->freelist2);

+#ifdef GUILE_DEBUG_LOCKS
+  int i, j;
+  for(i=0; i<LOCK_STACK_DEPTH; i++)
+{
+  t->lockname[i] = NULL;
+  t->lockmutex[i] = NULL;
+  for(j=0; j<TRACE_STACK_DEPTH; j++)
+{
+  t->lockholder[i][j] = NULL;
+}
+}
+  if (scm_initialized_p == 0)
+{
+  t->lockname[0] = "&scm_i_init_mutex";
+  t->lockmutex[0] = &scm_i_init_mutex;
+}
+#endif
+
   scm_i_pthread_setspecific (scm_i_thread_key, t);

   scm_i_pthread_mutex_lock (&t->heap_mutex);
@@ -1624,8 +1642,21 @@ scm_i_thread_wake_up ()
   scm_i_thread *t;

   scm_i_pthread_cond_broadcast (&wake_up_cond);
+#ifndef GUILE_DEBUG_LOCKS
   for (t = all_threads; t; t = t->next_thread)
 scm_i_pthread_mutex_unlock (&t->heap_mutex);
+#else
+  /* Unlock in reverse order from locking */
+  scm_i_thread *tt = NULL;
+  while (tt != all_threads)
+{
+  scm_i_thread *tp = NULL;
+  for (t = all_threads; t != tt; t = t->next_thread)
+ tp = t;
+  scm_i_pthread_mutex_unlock (&tp->heap_mutex);
+  tt = tp;
+}
+#endif
   scm_i_pthread_mutex_unlock (&thread_admin_mutex);
   scm_enter_guile ((scm_t_guile_ticket) SCM_I_CURRENT_THREAD);
 }
@@ -1721,6 +1752,28 @@ scm_init_threads_default_dynamic_state (
   scm_i_default_dynamic_state = scm_permanent_object (state);
 }

+#ifdef GUILE_DEBUG_LOCKS
+extern int guile_do_abort_on_badlock;
+
+void prt_one_lockholder(scm_i_thread *);
+void prt_lockholders(void)
+{
+  scm_i_thread *t;
+
+  if (!guile_do_abort_on_badlock) return;
+
+  for (t = all_threads; t; t = t->next_thread)
+{
+  prt_one_lockholder(t);
+}
+}
+
+void prt_this_lockholder(void)
+{
+  prt_one_lockholder(SCM_I_CURRENT_THREAD);
+}
+#endif
+
 void
 scm_init_thread_procs ()
 {
Index: guile-1.8.5/libguile/threads.h
===
--- guile-1.8.5.orig/libguile/threads.h 2008-11-16 18:57:19.0 -0600
+++ guile-1.8.5/libguile/threads.h  2008-11-16 18:57:48.0 -0600
@@ -108,6 +108,14 @@ typedef struct scm_i_thread {
   SCM_STACKITEM *top;
   jmp_buf regs;

+#ifdef GUILE_DEBUG_LOCKS
+#define LOCK_STACK_DEPTH 250
+#define TRACE_STACK_DEPTH 4
+  const char *lockname[LOCK_STACK_DEPTH];
+  char *lockholder[LOCK_STACK_DEPTH][TRACE_STACK_DEPTH];
+  pthread_mutex_t *lockmutex[LOCK_STACK_DEPTH];
+#endif

Hung threads

2008-11-14 Thread Linas Vepstas
Here's a deadlock I saw today. It appears to be a case of
improperly-nested locks.

Thread 16 is trying to exit.

CRITICAL SECTION -- threads 16
scm_i_port_table_mutex -- threads 18,24,27,28,29,30,32,33,37,39
scm_i_sweep_mutex -- threads 19,21, 35, 36, 38
t->heap_mutex -- threads 23
thread_admin_mutex -- threads 34

thread 16 -- SCM_CRITICAL_SECTION_START -throw.c 201
 scm_c_catch throw.c:201
thread 18 -- scm_i_scm_pthread_mutex_lock (scm_i_port_table_mutex);
 ports.c 764
thread 19 -- scm_pthread_mutex_lock (scm_i_sweep_mutex);
 in scm_gc_for_newcell gc.c:486
thread 21 -- scm_pthread_mutex_lock (scm_i_sweep_mutex);
 increase_mtrigger gc-malloc.c:234
thread 23 -- pthread_mutex_lock (&t->heap_mutex);
 scm_i_thread_put_to_sleep in scm_i_gc gc.c:552
thread 24 -- scm_pthread_mutex_lock (scm_i_port_table_mutex);
 scm_mkstrport in strports.c:321
thread 27 -- scm_pthread_mutex_lock
 scm_mkstrport in strports.c:321
thread 28 -- scm_pthread_mutex_lock (scm_i_port_table_mutex)
 scm_close_port ports.c:764
thread 29 -- scm_pthread_mutex_lock
 scm_mkstrport strports.c:321
thread 30 -- scm_pthread_mutex_lock
 scm_mkstrport strports.c:321
thread 32 -- scm_pthread_mutex_lock
 scm_close_port ports.c:764
thread 33 -- scm_pthread_mutex_lock
 scm_mkstrport strports.c:321
thread 34 -- scm_pthread_mutex_lock (thread_admin_mutex);
 do_thread_exit threads.c:483
thread 35 -- scm_pthread_mutex_lock
 scm_gc_for_newcell gc.c:486
thread 36 -- scm_pthread_mutex_lock
 scm_gc_for_newcell gc.c:486
thread 37 -- scm_pthread_mutex_lock
 scm_close_port  ports.c:764
thread 38 -- scm_pthread_mutex_lock
 scm_gc_for_newcell gc.c:486
thread 39 -- scm_pthread_mutex_lock
 scm_mkstrport strports.c:321



(gdb) info threads
  39 Thread 0xe8c66b90 (LWP 22581)  0xe425 in __kernel_vsyscall ()
  38 Thread 0xe9467b90 (LWP 22580)  0xe425 in __kernel_vsyscall ()
  37 Thread 0xe9c68b90 (LWP 22579)  0xe425 in __kernel_vsyscall ()
  36 Thread 0xea469b90 (LWP 22578)  0xe425 in __kernel_vsyscall ()
  35 Thread 0xeac6ab90 (LWP 22577)  0xe425 in __kernel_vsyscall ()
  34 Thread 0xeb46bb90 (LWP 22576)  0xe425 in __kernel_vsyscall ()
  33 Thread 0xebc6cb90 (LWP 22575)  0xe425 in __kernel_vsyscall ()
  32 Thread 0xec46db90 (LWP 22561)  0xe425 in __kernel_vsyscall ()
  30 Thread 0xee0d4b90 (LWP 22553)  0xe425 in __kernel_vsyscall ()
  29 Thread 0xee8d5b90 (LWP 22552)  0xe425 in __kernel_vsyscall ()
  28 Thread 0xef0d6b90 (LWP 22551)  0xe425 in __kernel_vsyscall ()
  27 Thread 0xef9d2b90 (LWP 22550)  0xe425 in __kernel_vsyscall ()
  24 Thread 0xf28cdb90 (LWP 22547)  0xe425 in __kernel_vsyscall ()
  23 Thread 0xf42feb90 (LWP 22533)  0xe425 in __kernel_vsyscall ()
  21 Thread 0xf30ceb90 (LWP 22525)  0xe425 in __kernel_vsyscall ()
  19 Thread 0xf5e1db90 (LWP 22523)  0xe425 in __kernel_vsyscall ()
* 18 Thread 0xf6e1fb90 (LWP 22522)  0xe425 in __kernel_vsyscall ()
  16 Thread 0xf561cb90 (LWP 22516)  0xe425 in __kernel_vsyscall ()
  2 Thread 0xf7620b90 (LWP 22466)  0xe425 in __kernel_vsyscall ()
  1 Thread 0xf794f6c0 (LWP 22465)  0xe425 in __kernel_vsyscall ()


(gdb) thread 16
[Switching to thread 16 (Thread 0xf561cb90 (LWP 22516))]#0  0xe425
in __kernel_vsyscall ()
(gdb) bt
#0  0xe425 in __kernel_vsyscall ()
#1  0xf7d95589 in __lll_lock_wait () from
/lib/tls/i686/cmov/libpthread.so.0
#2  0xf7d90bb4 in _L_lock_236 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xf7d9060b in pthread_mutex_lock () from
/lib/tls/i686/cmov/libpthread.so.0

this is in SCM_CRITICAL_SECTION_START
#4  0xf773dd7d in scm_c_catch (tag=0x104, body=0xf76c89d0 <c_body>,
body_data=0xf561c328, handler=0xf76c89f0 <c_handler>,
handler_data=0xf561c328,
pre_unwind_handler=0xf773d630 <scm_handle_by_message_noexit>,
pre_unwind_handler_data=0x0) at throw.c:201
#5  0xf76c8e92 in scm_i_with_continuation_barrier (body=0xf76c89d0 <c_body>,
body_data=0xf561c328, handler=0xf76c89f0 <c_handler>,
handler_data=0xf561c328,
pre_unwind_handler=0xf773d630 <scm_handle_by_message_noexit>,
pre_unwind_handler_data=0x0) at continuations.c:326
#6  0xf76c8f73 in scm_c_with_continuation_barrier (
func=0xf773cd00 <do_thread_exit>, data=0x9cc5900) at continuations.c:368
#7  0xf773cb49 in scm_i_with_guile_and_parent (func=0xf773cd00 <do_thread_exit>,
data=0x9cc5900, parent=0xf4da9e90) at threads.c:710
#8  0xf773cc3e in scm_with_guile (func=0xf773cd00 <do_thread_exit>,
data=0x9cc5900) at threads.c:698
#9  0xf773cc93 in on_thread_exit (v=0x9cc5900) at threads.c:505
#10 0xf7d8dbb0 in __nptl_deallocate_tsd ()
   from /lib/tls/i686/cmov/libpthread.so.0
#11 0xf7d8e509 in start_thread () from

[PATCH] thread lock nesting debugging

2008-11-14 Thread Linas Vepstas
I've been seeing all sorts of deadlocks in guile, and so I wrote a small
debugging utility to try to track down the problems (and I'm finding
bugs with this already).

Right now, this is an FYI patch, as it's subject to change. However, I'd
like to eventually have it applied permanently to the guile source tree;
any suggestions on how it should be organized so that it would be
acceptable?


Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

---
 config.h.in|3 +
 configure.in   |   11 
 libguile/Makefile.am   |2
 libguile/debug-locks.c |  111 +
 libguile/pthread-threads.h |   11 
 libguile/threads.c |   31 
 libguile/threads.h |8 +++
 7 files changed, 176 insertions(+), 1 deletion(-)

Index: guile-1.8.5/libguile/pthread-threads.h
===
--- guile-1.8.5.orig/libguile/pthread-threads.h 2008-11-14 14:28:58.0 -0600
+++ guile-1.8.5/libguile/pthread-threads.h  2008-11-14 17:51:52.0 -0600
@@ -91,6 +91,17 @@ extern pthread_mutexattr_t scm_i_pthread
 #define scm_i_scm_pthread_cond_wait scm_pthread_cond_wait
 #define scm_i_scm_pthread_cond_timedwait scm_pthread_cond_timedwait

+#ifdef GUILE_DEBUG_LOCKS
+#undef scm_i_pthread_mutex_lock
+#define scm_i_pthread_mutex_lock(ARG) scm_i_pthread_mutex_lock_dbg(ARG, #ARG)
+
+#undef scm_i_pthread_mutex_unlock
+#define scm_i_pthread_mutex_unlock(ARG) scm_i_pthread_mutex_unlock_dbg(ARG, #ARG)
+
+int scm_i_pthread_mutex_lock_dbg(pthread_mutex_t *, const char *);
+int scm_i_pthread_mutex_unlock_dbg(pthread_mutex_t *, const char *);
+#endif
+
 #endif  /* SCM_PTHREADS_THREADS_H */

 /*
Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-14 14:21:18.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-14 17:39:25.0 -0600
@@ -441,6 +441,19 @@ guilify_self_1 (SCM_STACKITEM *base)
   SCM_SET_FREELIST_LOC (scm_i_freelist, &t->freelist);
   SCM_SET_FREELIST_LOC (scm_i_freelist2, &t->freelist2);

+#ifdef GUILE_DEBUG_LOCKS
+  int i, j;
+  for(i=0; i<LOCK_STACK_DEPTH; i++)
+{
+  t->lockname[i] = NULL;
+  t->lockmutex[i] = NULL;
+  for(j=0; j<TRACE_STACK_DEPTH; j++)
+{
+  t->lockholder[i][j] = NULL;
+}
+}
+#endif
+
   scm_i_pthread_setspecific (scm_i_thread_key, t);

   scm_i_pthread_mutex_lock (&t->heap_mutex);
@@ -1721,6 +1734,24 @@ scm_init_threads_default_dynamic_state (
   scm_i_default_dynamic_state = scm_permanent_object (state);
 }

+#ifdef GUILE_DEBUG_LOCKS
+extern int guile_do_abort_on_badlock;
+
+void prt_one_lockholder(scm_i_thread *);
+void prt_lockholders(void);
+void prt_lockholders(void)
+{
+  scm_i_thread *t;
+
+  if (!guile_do_abort_on_badlock) return;
+
+  for (t = all_threads; t; t = t->next_thread)
+{
+  prt_one_lockholder(t);
+}
+}
+#endif
+
 void
 scm_init_thread_procs ()
 {
Index: guile-1.8.5/libguile/threads.h
===
--- guile-1.8.5.orig/libguile/threads.h 2008-11-14 14:17:20.0 -0600
+++ guile-1.8.5/libguile/threads.h  2008-11-14 17:20:12.0 -0600
@@ -108,6 +108,14 @@ typedef struct scm_i_thread {
   SCM_STACKITEM *top;
   jmp_buf regs;

+#ifdef GUILE_DEBUG_LOCKS
+#define LOCK_STACK_DEPTH 8
+#define TRACE_STACK_DEPTH 4
+  const char *lockname[LOCK_STACK_DEPTH];
+  char *lockholder[LOCK_STACK_DEPTH][TRACE_STACK_DEPTH];
+  pthread_mutex_t *lockmutex[LOCK_STACK_DEPTH];
+#endif
+
 } scm_i_thread;

 #define SCM_I_IS_THREAD(x)    SCM_SMOB_PREDICATE (scm_tc16_thread, x)
Index: guile-1.8.5/configure.in
===
--- guile-1.8.5.orig/configure.in   2008-11-14 15:01:15.0 -0600
+++ guile-1.8.5/configure.in2008-11-14 16:42:59.0 -0600
@@ -122,6 +122,13 @@ AC_ARG_ENABLE(debug-malloc,
   [Define this if you want to debug scm_must_malloc/realloc/free calls.])
   fi)

+AC_ARG_ENABLE(debug-locks,
+  [  --enable-debug-locks    include thread lock debugging code],
+  if test "$enable_debug_locks" = y || test "$enable_debug_locks" = yes; then
+AC_DEFINE(GUILE_DEBUG_LOCKS, 1,
+  [Define this if you want to debug pthread lock nesting and deadlock trouble.])
+  fi)
+
 SCM_I_GSC_GUILE_DEBUG=0
 AC_ARG_ENABLE(guile-debug,
   [AC_HELP_STRING([--enable-guile-debug],
@@ -263,6 +270,10 @@ if test $enable_debug_malloc = yes; th
AC_LIBOBJ([debug-malloc])
 fi

+if test "$enable_debug_locks" = yes; then
+   AC_LIBOBJ([debug-locks])
+fi
+
 if test "$enable_elisp" = yes; then
   SCM_I_GSC_ENABLE_ELISP=1
 else
Index: guile-1.8.5/config.h.in
===
--- guile-1.8.5.orig/config.h.in2008-11-14 15:22:54.0 -0600
+++ guile-1.8.5/config.h.in 2008-11-14 15:23:38.0 -0600
@@ -42,6 +42,9

Many, many lock races

2008-11-14 Thread Linas Vepstas
I'm now going through guile-1.8.5 code, and see the potential
for races  leading to deadlocks in many dozens of places.

What I'm seeing is lots of this:

scm_i_scm_pthread_mutex_lock(some_lock)
do_something()
scm_i_pthread_mutex_unlock(some_lock)

With the current set of #defines, this turns into the following

pthread_mutex_unlock(thread->heap_mutex); // leave guile
pthread_mutex_lock(some_lock)
pthread_mutex_lock(thread->heap_mutex) // enter guile
do_something()
pthread_mutex_unlock(some_lock)

The above is very clearly badly nested, and leads to a race
with garbage collection, resulting in a deadlock.  I hope this
is obvious to the reader: ... right? ... but, to be clear, consider
the following:

thread A:
pthread_mutex_unlock(thread->heap_mutex);  // leave guile
pthread_mutex_lock(some_lock)
pthread_mutex_lock(thread->heap_mutex) { // enter guile
   sleep, since thread C just grabbed this heap_mutex

thread B:
in guile mode (i.e. its own heap_mutex is held)
sleeping on some_lock, which A is holding.

thread C:
scm_i_gc() {
   scm_i_thread_put_to_sleep() {
 scm_i_pthread_mutex_lock (thread A)
 scm_i_pthread_mutex_lock (thread B) {
  sleep, since thread B is already holding it.

and so A is waiting on C is waiting on B is waiting on A ...

I'm planning on going through all of these instances on a
case-by-case basis, but there seem to be many dozens
of these, and this will result in many dozens of patches.

Suggestions?

--linas




Re: crash in gc with upside-down stack

2008-11-13 Thread Linas Vepstas
Attached below is a debugging patch, and its output,
which shows that the stack bounds are frequently
upside-down, and are sometimes upside-down
when the GC runs, thus leading to a crash.

In the next email, I'll propose a patch that fixes
the problem.

The original problem report:

 2008/11/11 Linas Vepstas [EMAIL PROTECTED]:

 My stack below.

 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0xf5333b90 (LWP 20587)]
 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
 435   SCM obj = * (SCM *) &x[m];
 Current language:  auto; currently c
 (gdb) bt
 #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
at gc-mark.c:435
 #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
 #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
 #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598


A debugging patch. Yes, it's ugly; it's intentionally ugly.
More of an eye-catcher that way.

Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 07:58:22.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-13 13:14:00.0 -0600
@@ -395,6 +395,10 @@ static scm_t_guile_ticket
 scm_leave_guile ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards stack %d\n", sz);
+}
   scm_i_pthread_mutex_unlock (&t->heap_mutex);
   return (scm_t_guile_ticket) t;
 }
@@ -694,7 +698,15 @@ scm_i_with_guile_and_parent (void *(*fun
   really_entered = scm_i_init_thread_for_guile (&base_item, parent);
   res = scm_c_with_continuation_barrier (func, data);
   if (really_entered)
-scm_leave_guile ();
+{
+// scm_leave_guile ();
+scm_i_thread * t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+int szb=t->base - &base_item;
+if(0>sz) {
+printf("duuude scm_leav_guile and parent %d %d\n", sz, szb);
+}
+}
   return res;
 }

@@ -704,6 +716,11 @@ scm_without_guile (void *(*func)(void *)
   void *res;
   scm_t_guile_ticket t;
   t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_wo guile %d\n", sz);
+}
   res = func (data);
   scm_enter_guile (t);
   return res;
@@ -1371,8 +1388,15 @@ scm_threads_mark_stacks (void)

 #if SCM_STACK_GROWS_UP
   scm_mark_locations (t->base, t->top - t->base);
+
 #else
+int sz=t->base - t->top;
+if(0<=sz) {
   scm_mark_locations (t->top, t->base - t->top);
+} else {
+printf ("duude bugg!!\n");
+printf ("duude stack top=%p base=%p sz=%d\n", t->top, t->base, t->base - t->top);
+}
 #endif
   scm_mark_locations ((SCM_STACKITEM *) t->regs,
  ((size_t) sizeof(t->regs)
@@ -1441,6 +1465,11 @@ int
 scm_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_mutexe %d\n", sz);
+}
   int res = scm_i_pthread_mutex_lock (mutex);
   scm_enter_guile (t);
   return res;
@@ -1463,6 +1492,11 @@ int
 scm_pthread_cond_wait (scm_i_pthread_cond_t *cond,
scm_i_pthread_mutex_t *mutex)
 {
   scm_t_guile_ticket t = scm_leave_guile ();
+scm_i_thread * s = (scm_i_thread *) t;
+int sz=s->base - s->top;
+if(0>sz) {
+printf("duuude scm_conde %d\n", sz);
+}
   int res = scm_i_pthread_cond_wait (cond, mutex);
   scm_enter_guile (t);
   return res;
@@ -1578,7 +1612,12 @@ scm_i_thread_put_to_sleep ()
 {
   scm_i_thread *t;

-  scm_leave_guile ();
+  // scm_leave_guile ();
+   t = (scm_i_thread *) scm_leave_guile ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_leav_guile backwards was scm_i_thread_put_to_sleep %d\n", sz);
+}
   scm_i_pthread_mutex_lock (&thread_admin_mutex);

   /* Signal all threads to go to sleep
@@ -1620,6 +1659,10 @@ void
 scm_i_thread_sleep_for_gc ()
 {
   scm_i_thread *t = suspend ();
+int sz=t->base - t->top;
+if(0>sz) {
+printf("duuude scm_i_thread_sleep_for_gc backwards stack %d\n", sz);
+}
   scm_i_pthread_cond_wait (&wake_up_cond, &t->heap_mutex);
   resume (t);
 }


Here is an example of the output generated:

duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duude bugg!!
duude stack top=0xf355b9e0 base=0xf355b908 sz=-54
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
duuude scm_leav_guile and parent -54 -76
duuude scm_leav_guile backwards stack -54
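
The sz=-54 in the output above can be checked by hand from the printed
addresses. The sketch below is only an illustration, not Guile code: the
function name is made up, and it assumes a 4-byte stack item, which matches
the 32-bit addresses in the log.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Guile.
   Signed distance from top to base, counted in stack items of
   item_size bytes. A negative result means base lies below top,
   i.e. the bounds are inverted for a downward-growing stack. */
long stack_items(uintptr_t base, uintptr_t top, size_t item_size)
{
    if (item_size == 0)
        return 0;
    /* Subtract the smaller from the larger so the unsigned
       arithmetic never wraps, then apply the sign afterwards. */
    if (base >= top)
        return (long)((base - top) / item_size);
    return -(long)((top - base) / item_size);
}
```

With base=0xf355b908 and top=0xf355b9e0 the byte distance is 0xd8 = 216,
and 216 / 4 = 54 items, so the signed result is -54.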

[PATCH] fix for Re: crash in gc with upside-down stack

2008-11-13 Thread Linas Vepstas
Patch below; I'm also attaching the same patch, in case
gmail is scrambling this thing :-/  Also, I've long had a
generic assignment on file with the FSF.

--linas

The patch below fixes a crash during garbage collection, where, during
the mark-stack phase, the top and bottom of the stack are found to be
in backwards order, typically because scm_with_guile() was called when
the stack is much shorter than when a thread was first guilified. That
is, the stack base pointer is stale, and can be inverted from the stack
top. If GC runs due to activity in some other thread, the stale base
pointer leads to the crash (as base-top is approximately 2^32 or 2^64).

A typical symptom of this bug, on a 32-bit system, is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435   SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at
gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375

Notice that 4294966782 == 0xfffffdfe == -0x202 as a signed 32-bit value
(that is, -514 decimal).
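
To see why the huge n is really a small negative number, here is a minimal
sketch (the function name is hypothetical, not from Guile) that reinterprets
a 32-bit unsigned count as the signed value with the same bit pattern:

```c
#include <stdint.h>

/* Hypothetical helper, not part of Guile.
   Return the signed 32-bit value that shares n's bit pattern.
   Plain arithmetic is used instead of a cast, so the result does
   not depend on implementation-defined conversion rules. */
int64_t as_signed32(uint32_t n)
{
    if (n <= (uint32_t)INT32_MAX)
        return (int64_t)n;
    /* Values above INT32_MAX wrap around by 2^32. */
    return (int64_t)n - ((int64_t)1 << 32);
}
```

For n = 4294966782 this gives 4294966782 - 4294967296 = -514, which is
-0x202 in hex.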

Please apply in time for guile-1.8.6!

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

---
 libguile/threads.c |   19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: guile-1.8.5/libguile/threads.c
===
--- guile-1.8.5.orig/libguile/threads.c 2008-11-13 15:17:12.0 -0600
+++ guile-1.8.5/libguile/threads.c  2008-11-13 15:32:07.0 -0600
@@ -577,9 +577,24 @@ scm_i_init_thread_for_guile (SCM_STACKIT
   /* This thread is already guilified but not in guile mode, just
 resume it.

-XXX - base might be lower than when this thread was first
-guilified.
+ A user call to scm_with_guile() will lead us to here. This
+ could happen anywhere on the stack, and in particular, the
+ stack can be *much* shorter than what it was when this thread
+ was first guilified. This will typically happen in
+ on_thread_exit(), where the stack is *always* shorter than
+ when the thread was first guilified. If the GC happens to
+ get triggered due to some other thread, we'd end up with
+ t->top upside-down w.r.t. t->base, which will result in
+ chaos in scm_threads_mark_stacks() when top-base=2^32 or 2^64.
+ Thus, reset the base, if needed.
*/
+#if SCM_STACK_GROWS_UP
+  if (base < t->base)
+ t->base = base;
+#else
+  if (base > t->base)
+ t->base = base;
+#endif
   scm_enter_guile ((scm_t_guile_ticket) t);
   return 1;
 }
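
The idea of the fix can be shown in isolation. This is only a sketch under
assumptions: STACK_GROWS_UP and refresh_base() are made-up stand-ins (not
Guile names), addresses are plain integers, and the comparison is taken to be
`<` when the stack grows up and `>` when it grows down.

```c
#include <stdint.h>

/* Hypothetical stand-in for Guile's SCM_STACK_GROWS_UP; 0 = grows down. */
#ifndef STACK_GROWS_UP
#define STACK_GROWS_UP 0
#endif

/* Hypothetical helper, not part of Guile.
   Return the base to keep: the recorded one, unless the frame that is
   re-entering guile mode lies outside it, in which case that frame's
   address becomes the new base. */
uintptr_t refresh_base(uintptr_t recorded_base, uintptr_t current)
{
#if STACK_GROWS_UP
    /* Base is the lowest address; a lower current address is "shorter". */
    return current < recorded_base ? current : recorded_base;
#else
    /* Base is the highest address; a higher current address is "shorter". */
    return current > recorded_base ? current : recorded_base;
#endif
}
```

With a downward-growing stack, a recorded base of 0x1000 and a re-entry at
0x2000 yields 0x2000, so the marker never sees a top above the base.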

Re: Does anyone actually use threads with guile?

2008-11-13 Thread Linas Vepstas
2008/11/13 Andy Wingo [EMAIL PROTECTED]:
> For my part I apologize for not having the cycles

Fine, I'm hacking around it for now, but would like to see
something for 1.8.6.

> On Thu 13 Nov 2008 05:56, Linas Vepstas [EMAIL PROTECTED] writes:
>
>> Basically, at any given time, some thread might be
>> in a critical section. Some other thread may be
>> throwing an error for some utterly unrelated reason.
>> Yet, when the error is thrown, this critical section
>> check will trip, and it will do so for an utterly bogus
>> reason.  At least, that describes my case.
>>
>> Is there any reason at all not to remove this check
>> entirely? (at  libguile/throw.c line 695.)
>
> I think the idea behind the check sounds good -- it is incorrect to
> throw from within a critical section, and the check detects this.
>
> But the check is incorrect as you noticed, it should be checking if the
> current thread is in a critical section.

The patch below does this.

I do not understand how 'async' fits into the grand scheme
of things. From what I can tell, though, there won't be any
cases where the thrower will be in a different thread than
where scm_ithrow() will run.  So the patch should be good.

--linas

---
 libguile/throw.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: guile-1.8.5/libguile/throw.c
===
--- guile-1.8.5.orig/libguile/throw.c  2008-11-13 16:02:26.0 -0600
+++ guile-1.8.5/libguile/throw.c 2008-11-13 16:29:46.0 -0600
@@ -689,7 +689,7 @@ scm_ithrow (SCM key, SCM args, int noret
   SCM dynpair = SCM_UNDEFINED;
   SCM winds;

-  if (scm_i_critical_section_level)
+  if (SCM_I_CURRENT_THREAD->block_asyncs)
 {
   fprintf (stderr, "throw from within critical section.\n");
   abort ();
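
To illustrate the difference between a process-wide check and a per-thread
one, here is a minimal sketch. It is not Guile's implementation: the counter
and the three function names are hypothetical, and C11 _Thread_local is used
in place of Guile's per-thread structure.

```c
#include <stdbool.h>

/* Hypothetical per-thread nesting counter: each thread gets its own
   copy, so one thread's critical section is invisible to the others. */
static _Thread_local int critical_depth = 0;

/* Hypothetical helpers, not part of Guile. */
void critical_enter(void)
{
    critical_depth++;
}

void critical_leave(void)
{
    if (critical_depth > 0)
        critical_depth--;
}

/* True only if the calling thread itself is inside a critical section;
   a global counter would also report true for unrelated threads. */
bool throw_is_unsafe(void)
{
    return critical_depth > 0;
}
```

A thread that throws while some other thread holds a critical section sees
its own counter at zero, so the check no longer trips for an unrelated reason.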




crash in gc with upside-down stack

2008-11-12 Thread Linas Vepstas
Here's another one, I'm trying to dig into this:

It's more or less the same crash as the one reported at:

http://bugs.gentoo.org/228097
and
http://www.mail-archive.com/bug-guile@gnu.org/msg04568.html

My stack below.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf5333b90 (LWP 20587)]
0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
435   SCM obj = * (SCM *) &x[m];
Current language:  auto; currently c
(gdb) bt
#0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
    at gc-mark.c:435
#1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
#2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
#3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598
#4  0xf7710f4d in scm_gc_for_newcell (freelist=0xf779b76c,
    free_cells=0x1228e9b0)
    at gc.c:509
#5  0xf7768bd8 in scm_c_catch (tag=0x104, body=0xf76f3830 <c_body>,
    body_data=0xf528, handler=0xf76f3850 <c_handler>,
    handler_data=0xf528,
    pre_unwind_handler=0xf77683e0 <scm_handle_by_message_noexit>,
    pre_unwind_handler_data=0x0) at ../libguile/inline.h:186
#6  0xf76f3cf2 in scm_i_with_continuation_barrier (body=0xf76f3830 <c_body>,
    body_data=0xf528, handler=0xf76f3850 <c_handler>,
    handler_data=0xf528,
    pre_unwind_handler=0xf77683e0 <scm_handle_by_message_noexit>,
    pre_unwind_handler_data=0x0) at continuations.c:326
#7  0xf76f3dd3 in scm_c_with_continuation_barrier (
    func=0xf7767ab0 <do_thread_exit>, data=0x1228e938) at continuations.c:368
---Type <return> to continue, or q <return> to quit---
#8  0xf77678f9 in scm_i_with_guile_and_parent (func=0xf7767ab0 <do_thread_exit>,
    data=0x1228e938, parent=0x19f63670) at threads.c:695
#9  0xf77679ee in scm_with_guile (func=0xf7767ab0 <do_thread_exit>,
    data=0x1228e938) at threads.c:683
#10 0xf7767a43 in on_thread_exit (v=0x1228e938) at threads.c:505
#11 0xf7d7abb0 in __nptl_deallocate_tsd ()
   from /lib/tls/i686/cmov/libpthread.so.0
#12 0xf7d7b509 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#13 0xf7b79e5e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb)

I've seen this twice now in two days, but it's not readily reproducible.
By plugging the insanely large n into a hex calculator, you'll see it's
actually 0xfffsomething. Looking carefully near threads.c:1375 seems to imply
that stack top and stack bottom are reversed. So I added a printf at that
location, and tried to reproduce the crash. Several gazillion print
statements later, no crash.

I suspect that this is some sort of thread-race condition; I think it
happens when I am defining some functions from several different
threads at once. It seems *not* to occur once I get into hard-core
computations-- i.e. it happens no later than the first few dozen gc's.

This is on guile-1.8.5, --with-threads, on Ubuntu, Intel (actually AMD64 cpu.)

--linas




Re: crash in gc with upside-down stack

2008-11-12 Thread Linas Vepstas
Some minor updates:

2008/11/11 Linas Vepstas [EMAIL PROTECTED]:

> My stack below.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xf5333b90 (LWP 20587)]
> 0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782) at gc-mark.c:435
> 435   SCM obj = * (SCM *) &x[m];
> Current language:  auto; currently c
> (gdb) bt
> #0  0xf7711ce3 in scm_mark_locations (x=0xf5333110, n=4294966782)
>     at gc-mark.c:435
> #1  0xf7766a12 in scm_threads_mark_stacks () at threads.c:1375
> #2  0xf7711d38 in scm_mark_all () at gc-mark.c:82
> #3  0xf7710d33 in scm_i_gc (what=0xf778602e "cells") at gc.c:598

My current code reproduces this fairly readily, I am seeing
it many dozens/hundreds of times a day.

I tweaked guile to check that the stack bounds are in order,
and to print an error message when they are not, and then to
just troop on -- and so I see dozens/hundreds of prints.
When the stack bounds are reversed, the difference
is *always* 58 bytes; and in fact, the two bad stack
bounds are always the same.

It appears to happen *only* when I have multiple threads
all trying to define functions at the same time, it never
happens when one thread goes off to do some heavy
computing.

--linas