"Julian Graham" <[EMAIL PROTECTED]> writes: > Hi Neil, > >> 1. Embarassingly - given that I already said "Nice fix" to this - I'm >> afraid I can't now see exactly why this is needed. > > Argh, you're right -- when I first noticed this behavior, I was so > astonished to see my logs showing threads entering and leaving guile > mode during GC that my first move was to try and prevent this. When > my changes got rid of this behavior, I assumed everything was > hunky-dory. However, when pressed to explain the details, I added > more logging, which showed that errant thread ultimately did go to > sleep at the proper time -- it just never woke up when the > wake_up_cond was broadcast on. > > My current inclination is that the problem lies with sleeping on the > global wake_up_cond -- each thread calls pthread_cond_wait with its > own, thread-specific heap_mutex, the result of which is undefined, or > so say the glibc docs.
Agreed.  All the examples I've seen use the same mutex for all threads
that wait on the cond var.

> I'm testing a fix now that uses a mutex reserved for this purpose
> instead.

OK.  While looking through the docs, though, and playing with possible
solutions, I noted a couple of other pitfalls (which the current code
also appears to suffer from).

1. pthread_cond_wait() returning does not necessarily mean that the
   cond var was signalled.  Apparently pthread_cond_wait() can return
   early (a "spurious wakeup"), for example because of an interrupt.

2. If two threads are using pthread_cond_wait and pthread_cond_signal
   to communicate, and are using the cond var itself as a state
   indication, they have to be certain that the pthread_cond_wait
   starts before the pthread_cond_signal; otherwise it won't work.

The practical impact of these is that one shouldn't use the cond var
itself as an indication of "reached so-and-so state".  Instead, one
can represent the state using an explicit variable, which is protected
by the associated mutex, and then interpret the cond var as indicating
simply that the variable _might_ have changed.

In our case, I think the state variable could be
scm_i_thread_go_to_sleep, protected by thread_admin_mutex.  Here's a
possible solution based on this, but it isn't yet complete, because it
doesn't explain how num_guile_threads_awake is calculated.  (And I
have to go to bed!)

scm_i_thread_sleep_for_gc ()
{
  scm_i_thread *t = suspend ();

  pthread_mutex_lock (&thread_admin_mutex);
  if (scm_i_thread_go_to_sleep)
    {
      /* Tell the thread doing the GC that one more thread is now
         asleep.  */
      num_guile_threads_awake--;
      pthread_cond_signal (&going_to_sleep_cond);
      /* Trust only the state variable; the cond var just says that it
         may have changed.  */
      while (scm_i_thread_go_to_sleep)
        {
          pthread_cond_wait (&wake_up_cond, &thread_admin_mutex);
        }
      num_guile_threads_awake++;
    }
  pthread_mutex_unlock (&thread_admin_mutex);

  resume (t);
}

scm_i_thread_put_to_sleep ()
{
  pthread_mutex_lock (&thread_admin_mutex);
  scm_i_thread_go_to_sleep = 1;
  /* Wait until every other Guile thread has reported in.  The mutex
     stays locked across the GC, until scm_i_thread_wake_up releases
     it.  */
  while (num_guile_threads_awake > 0)
    {
      pthread_cond_wait (&going_to_sleep_cond, &thread_admin_mutex);
    }
}

scm_i_thread_wake_up ()
{
  scm_i_thread_go_to_sleep = 0;
  pthread_mutex_unlock (&thread_admin_mutex);
  pthread_cond_broadcast (&wake_up_cond);
}

> So why hasn't this been reported before?  I'm not really sure, except
> that based on my logs, a GC involving more than two threads (one
> thread stays awake, of course, to manage the collection) is kind of
> rare.  It doesn't even necessarily happen during an entire run of my
> SRFI-18 test suite, which lasts for several seconds and is fairly
> multi-threaded.

Not sure what you mean here.  Surely if there are >2 threads, they all
have to go to sleep before GC can proceed?

>> Is that right?  I think you suggested in one of your previous emails
>> that it might be possible for thread A to enter and leave guile mode
>> multiple times, but I don't see how that is possible.
>
> It *is* possible, because a thread can enter and leave guile mode and
> do a fair number of things without SCM_TICK getting called.  I don't
> know if that's significant or not.

That may mean that we need some more SCM_TICK calls.  What kind of
processing was the thread doing?

>> 2. Should admin_mutex be locked in scm_c_thread_exited_p()?  I think
>> it should.  (This was equally wrong when using thread_admin_mutex,
>> of course; your patch doesn't make anything worse, but it's worth
>> fixing this in passing if you agree.)
>
> Sure -- wouldn't hurt.  I'll include that with whatever ends up in
> the final "bug" patch.

Thanks.
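For what it's worth, here is a minimal, self-contained sketch of that
predicate-plus-cond-var pattern, outside of Guile entirely (the names
- ready, lock, waiter and so on - are made up for the illustration,
not taken from our code).  The waiter only ever trusts the state
variable, and re-checks it around every pthread_cond_wait, so both a
spurious wakeup and a signal that happens before the wait starts are
handled correctly:

/* Illustrative example only; compile with: cc example.c -lpthread  */

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* the explicit state variable */

static void *
waiter (void *arg)
{
  pthread_mutex_lock (&lock);
  /* The cond var only means "ready may have changed", so keep
     waiting until the state variable itself says we can go on.  */
  while (!ready)
    pthread_cond_wait (&cond, &lock);
  pthread_mutex_unlock (&lock);
  printf ("waiter: saw ready\n");
  return NULL;
}

int
main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, waiter, NULL);

  /* Change the state under the mutex, then tell waiters that it has
     changed.  */
  pthread_mutex_lock (&lock);
  ready = 1;
  pthread_cond_signal (&cond);
  pthread_mutex_unlock (&lock);

  pthread_join (t, NULL);
  return 0;
}

Note that if main's signal fires before the waiter has reached
pthread_cond_wait, the waiter still proceeds, because it finds ready
already set and never blocks.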
> Apologies that it takes me so long to reply to these emails.  Blame
> the overhead of looping my test code all night?

No need to apologize there!  My time at the moment is pretty limited
too, so if you replied any quicker, you'd then just be waiting for me
(even more)!

Regards,
        Neil