"Julian Graham" <[EMAIL PROTECTED]> writes:

> Hi Neil,
>
>> 1. Embarrassingly - given that I already said "Nice fix" to this - I'm
>> afraid I can't now see exactly why this is needed.
>
> Argh, you're right -- when I first noticed this behavior, I was so
> astonished to see my logs showing threads entering and leaving guile
> mode during GC that my first move was to try to prevent it.  When my
> changes got rid of that behavior, I assumed everything was
> hunky-dory.  However, when pressed to explain the details, I added
> more logging, which showed that the errant thread ultimately did go
> to sleep at the proper time -- it just never woke up when
> wake_up_cond was broadcast.
>
> My current inclination is that the problem lies with sleeping on the
> global wake_up_cond -- each thread calls pthread_cond_wait on it with
> its own thread-specific heap_mutex, and according to the glibc docs
> the result of waiting on one cond var with different mutexes is
> undefined.

Agreed.  All the examples I've seen use the same mutex for all
threads that wait on the cond var; POSIX says explicitly that the
effect of using more than one mutex for concurrent waits on the same
cond var is undefined.

>  I'm testing a fix now that uses a mutex reserved for this purpose
> instead.

OK.  While looking through the docs and playing with possible
solutions, though, I noticed a couple of other pitfalls (which the
current code also appears to suffer from).

1. pthread_cond_wait() returning does not necessarily mean that the
   cond var was signalled.  POSIX explicitly allows such "spurious
   wakeups"; for example, pthread_cond_wait() can return early when
   the waiting thread is interrupted by a signal.

2. If two threads are using pthread_cond_wait and pthread_cond_signal
   to communicate, and using the cond var itself as a state
   indication, they have to be certain that the pthread_cond_wait
   starts before the pthread_cond_signal; a signal sent while nobody
   is waiting is simply lost.

The practical impact of these is that one shouldn't use the cond_var
itself as an indication of "reached so-and-so state".  Instead, one
can represent the state using an explicit variable, which is protected
by the associated mutex, and then interpret the cond_var as indicating
simply that the variable _might_ have changed.
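
In generic terms the pattern looks like this (illustrative names
only, not anything from Guile):

#include <pthread.h>

static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cond = PTHREAD_COND_INITIALIZER;
static int state_reached = 0;    /* the explicit state variable */

void
wait_for_state ()
{
  pthread_mutex_lock (&state_mutex);
  /* Loop: a return from pthread_cond_wait only means the variable
     _might_ have changed (pitfall 1), and if the signal came before
     we got here, the variable is already set and we never wait at
     all (pitfall 2).  */
  while (!state_reached)
    pthread_cond_wait (&state_cond, &state_mutex);
  pthread_mutex_unlock (&state_mutex);
}

void
announce_state ()
{
  pthread_mutex_lock (&state_mutex);
  state_reached = 1;
  pthread_cond_signal (&state_cond);
  pthread_mutex_unlock (&state_mutex);
}

Note also that every waiter here uses the same state_mutex, which
avoids the undefined behaviour you ran into.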

In our case, I think the state variable could be
scm_i_thread_go_to_sleep, protected by thread_admin_mutex.  Here's a
possible solution based on this, but it isn't yet complete, because it
doesn't show how num_guile_threads_awake is initialized and
maintained.  (And I have to go to bed!)

void
scm_i_thread_sleep_for_gc ()
{
  scm_i_thread *t = suspend ();

  pthread_mutex_lock (&thread_admin_mutex);
  if (scm_i_thread_go_to_sleep)
  {
    /* Tell the GC thread that one more thread has gone to sleep.  */
    num_guile_threads_awake--;
    pthread_cond_signal (&going_to_sleep_cond);

    /* Loop, because a return from pthread_cond_wait only means that
       scm_i_thread_go_to_sleep _might_ have changed.  */
    while (scm_i_thread_go_to_sleep)
    {
      pthread_cond_wait (&wake_up_cond, &thread_admin_mutex);
    }
    num_guile_threads_awake++;
  }
  pthread_mutex_unlock (&thread_admin_mutex);

  resume (t);
}

void
scm_i_thread_put_to_sleep ()
{
  pthread_mutex_lock (&thread_admin_mutex);
  scm_i_thread_go_to_sleep = 1;
  while (num_guile_threads_awake > 0)
  {
    pthread_cond_wait (&going_to_sleep_cond, &thread_admin_mutex);
  }
  /* Deliberately returns with thread_admin_mutex still locked.  */
}

void
scm_i_thread_wake_up ()
{
  /* thread_admin_mutex is still held here (from put_to_sleep), so
     this write to the state variable is properly protected.  */
  scm_i_thread_go_to_sleep = 0;
  pthread_mutex_unlock (&thread_admin_mutex);
  pthread_cond_broadcast (&wake_up_cond);
}
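
(Note the asymmetry in the last two functions:
scm_i_thread_put_to_sleep returns with thread_admin_mutex still
locked, and scm_i_thread_wake_up unlocks it, so the mutex is held
for the whole of the GC.  A useful side effect is that a thread
reaching scm_i_thread_sleep_for_gc while collection is already in
progress simply blocks on the mutex until the GC is over, at which
point it sees scm_i_thread_go_to_sleep == 0 and carries straight
on.)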

> So why hasn't this been reported before?  I'm not really sure, except
> that based on my logs, a GC involving more than two threads (one
> thread stays awake, of course, to manage the collection) is kind of
> rare.  It doesn't even necessarily happen during an entire run of my
> SRFI-18 test suite, which lasts for several seconds and is fairly
> multi-threaded.

Not sure what you mean here.  Surely if there are >2 threads, they all
have to go to sleep before GC can proceed?

>> Is that right?  I think you suggested in one of your previous emails
>> that it might be possible for thread A to enter and leave guile mode
>> multiple times, but I don't see how that is possible.
>
> It *is* possible, because a thread can enter and leave guile mode and
> do a fair number of things without SCM_TICK getting called.  I don't
> know if that's significant or not.

That may mean that we need some more SCM_TICK calls.  What kind of
processing was the thread doing?

>> 2. Should admin_mutex be locked in scm_c_thread_exited_p()?  I think
>> it should.  (This was equally wrong when using thread_admin_mutex, of
>> course; your patch doesn't make anything worse, but it's worth fixing
>> this in passing if you agree.)
>
> Sure -- wouldn't hurt.  I'll include that with whatever ends up in the
> final "bug" patch.

Thanks.

> Apologies that it takes me so long to reply to these emails.  Blame
> the overhead of looping my test code all night?

No need to apologize there!  My time at the moment is pretty limited
too, so if you replied any quicker, you'd then just be waiting for me
(even more)!

Regards,
        Neil


