Hello Juan Jose.
 
It's enough to have a fix for the old libc only.
 
It's a good idea, thanks. I will only need to think about race conditions, for example if the GC starts after sigwait returns but before we have called
GC_register_my_thread. Or does this code not allocate any heap memory, so that it does not need to be protected by the GC?
 
Best regards,
- Anton
 
 
20.01.2011, 00:16, "Juan Jose Garcia-Ripoll" <juanjose.garciarip...@googlemail.com>:
I only have one suggestion, which is to temporarily deregister that thread so that the garbage collector does not suspend it. Something like


        /* Waiting may fail! */
        int status;
        GC_unregister_my_thread();
        status = sigwait(&handled_set, &signo);
        if (status == 0) {
            if (interrupt_signal == signo)
                goto RETURN;
            signal_code = call_handler(lisp_signal_handler, signo,
                           NULL, NULL);
            if (!Null(signal_code)) {
                mp_process_run_function(3, @'si::handle-signal',
                            @'si::handle-signal',
                            signal_code);
            }
        }
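        {
            /* GC_register_my_thread expects a struct GC_stack_base,
               not an arbitrary stack address. */
            struct GC_stack_base stack_base;
            GC_get_stack_base(&stack_base);
            GC_register_my_thread(&stack_base);
        }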

Unfortunately this cannot be used when the library works as expected (sigwait does not block all signals), because some interrupt handlers may need the garbage collector to work.

Juanjo

On Wed, Jan 19, 2011 at 7:21 PM, Anton Vodonosov <avodono...@yandex.ru> wrote:
Hello.

I am building ECL for glibc-2.2.5. With that old glibc version
a deadlock occurs whenever garbage collection starts.

I have worked out how it happens.

I am not sure whether you want to fix it, because the libc version is old,
but maybe you can advise how I can work around it.

How it happens. Two parts are involved:

1. The Boehm-Demers-Weiser GC tries to stop all the threads before
  performing garbage collection (this is called "stop world").
  This is implemented by sending a SIG_SUSPEND signal to
  every thread. The signal handler in every thread then
  reports "ok, I am stopped" to the thread which wants to perform
  the garbage collection, and then waits until the GC instructs
  it to restart.

  The "I am stopped" confirmation is sent via a
  semaphore: sem_post(&GC_suspend_ack_sem).

  The GC expects this from every thread. It performs
  sem_wait(&GC_suspend_ack_sem) once for each thread
  that was notified by the SIG_SUSPEND signal.

  The corresponding code is in src/gc/pthread_stop_world.c:
  the function GC_stop_world, which calls GC_suspend_all.
  The signal handler behavior is implemented in
  GC_suspend_handler_inner.

2. ECL has a special thread which handles all the signals
  not handled by other threads.

  See its implementation in the function
  asynchronous_signal_servicing_thread, file src/c/unixint.d.

  It is an endless loop of
     sigwait(<signals blocked in other threads>);
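
To make part 1 concrete, here is a minimal sketch of the suspend/acknowledge handshake using plain POSIX threads. It is not the code from pthread_stop_world.c: the names demo_stop_and_restart_world, DEMO_SIG_SUSPEND and DEMO_SIG_RESTART are made up for illustration, and SIGUSR1/SIGUSR2 only stand in for the signals the GC really uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-ins for the GC's suspend and restart signals. */
#define DEMO_SIG_SUSPEND SIGUSR1
#define DEMO_SIG_RESTART SIGUSR2
#define DEMO_MAX_THREADS 16

static sem_t demo_suspend_ack_sem;   /* counts "I am stopped" replies */
static atomic_int demo_done;         /* tells the worker threads to exit */

static void demo_restart_handler(int signo) { (void)signo; }

/* Runs in each worker when the "collector" asks it to stop. */
static void demo_suspend_handler(int signo)
{
    sigset_t wait_mask;
    int saved_errno = errno;
    (void)signo;
    sem_post(&demo_suspend_ack_sem);          /* "ok, I am stopped" */
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, DEMO_SIG_RESTART);
    sigsuspend(&wait_mask);                   /* sleep until restarted */
    errno = saved_errno;
}

static void *demo_worker(void *arg)
{
    struct timespec pause = {0, 1000000};     /* 1 ms */
    (void)arg;
    while (!atomic_load(&demo_done))
        nanosleep(&pause, NULL);
    return NULL;
}

/* Stops nthreads workers, then restarts them.
   Returns the number of acknowledgements received, or -1 on bad input. */
int demo_stop_and_restart_world(int nthreads)
{
    pthread_t threads[DEMO_MAX_THREADS];
    struct sigaction sa;
    int i, rc, acks = 0;

    if (nthreads < 1 || nthreads > DEMO_MAX_THREADS)
        return -1;
    atomic_store(&demo_done, 0);
    if (sem_init(&demo_suspend_ack_sem, 0, 0) != 0)
        return -1;

    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = demo_restart_handler;
    sigaction(DEMO_SIG_RESTART, &sa, NULL);

    /* Keep the restart signal blocked inside the suspend handler, so a
       restart sent early stays pending until sigsuspend unblocks it. */
    sigaddset(&sa.sa_mask, DEMO_SIG_RESTART);
    sa.sa_handler = demo_suspend_handler;
    sigaction(DEMO_SIG_SUSPEND, &sa, NULL);

    for (i = 0; i < nthreads; i++) {
        rc = pthread_create(&threads[i], NULL, demo_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }

    /* "Stop world": signal every thread, then wait for one ack each. */
    for (i = 0; i < nthreads; i++)
        pthread_kill(threads[i], DEMO_SIG_SUSPEND);
    for (i = 0; i < nthreads; i++) {
        while (sem_wait(&demo_suspend_ack_sem) == -1 && errno == EINTR)
            continue;
        acks++;
    }

    /* A real collector would do its work here, then restart the world. */
    for (i = 0; i < nthreads; i++)
        pthread_kill(threads[i], DEMO_SIG_RESTART);
    atomic_store(&demo_done, 1);
    for (i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    sem_destroy(&demo_suspend_ack_sem);
    return acks;
}
```

Calling demo_stop_and_restart_world(3) should return 3. If any one thread never runs its suspend handler, the sem_wait loop never finishes, which is the hang described in this message.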

The deadlock is caused by the difference in sigwait behavior
between the old libc and the contemporary libc.

Namely, what happens when the asynchronous_signal_servicing_thread
is waiting in sigwait(<signals blocked in other threads>),
and some signal _not_ from this set arrives? In particular, when
the GC sends the SIG_SUSPEND signal.

The contemporary libc calls the signal handler. The old libc
doesn't call the signal handler; sigwait just blocks
the signals other than those it waits for.

As a result, with the old libc the sem_post(&GC_suspend_ack_sem)
is not performed by the asynchronous_signal_servicing_thread,
therefore the GC waits on the semaphore forever.
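
The difference can be checked with a small probe, sketched below under some assumptions: the name probe_sigwait_runs_other_handlers is hypothetical, SIGUSR1 plays the role of a signal in the sigwait set, SIGUSR2 plays the role of SIG_SUSPEND, and the 100 ms settle delay is only a best-effort way to let the waiter reach sigwait first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <time.h>

/* Hypothetical stand-ins: one signal in the sigwait set, one outside it. */
#define PROBE_SIG_WAITED SIGUSR1
#define PROBE_SIG_OTHER  SIGUSR2

static sem_t probe_ready_sem;        /* waiter is about to call sigwait */
static sem_t probe_handler_ran_sem;  /* handler of the other signal ran */

static void probe_other_handler(int signo)
{
    int saved_errno = errno;
    (void)signo;
    sem_post(&probe_handler_ran_sem);
    errno = saved_errno;
}

static void *probe_waiter(void *arg)
{
    sigset_t *waited = arg;
    sigset_t other;
    int signo = 0;

    /* Make sure the signal outside the wait set is deliverable here. */
    sigemptyset(&other);
    sigaddset(&other, PROBE_SIG_OTHER);
    pthread_sigmask(SIG_UNBLOCK, &other, NULL);

    sem_post(&probe_ready_sem);
    sigwait(waited, &signo);         /* returns only for PROBE_SIG_WAITED */
    return NULL;
}

/* Returns 1 if a handler for a signal outside the sigwait set runs while
   the thread is blocked in sigwait, 0 if it does not run within one
   second, and -1 on setup failure. */
int probe_sigwait_runs_other_handlers(void)
{
    struct timespec settle = {0, 100 * 1000 * 1000};
    struct timespec deadline;
    struct sigaction sa, old_sa;
    sigset_t waited, old_mask;
    pthread_t waiter;
    int rc, ran;

    if (sem_init(&probe_ready_sem, 0, 0) != 0)
        return -1;
    if (sem_init(&probe_handler_ran_sem, 0, 0) != 0)
        return -1;

    sa.sa_handler = probe_other_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(PROBE_SIG_OTHER, &sa, &old_sa);

    /* Block the waited signal before creating the thread, so the new
       thread inherits the mask, as sigwait requires. */
    sigemptyset(&waited);
    sigaddset(&waited, PROBE_SIG_WAITED);
    pthread_sigmask(SIG_BLOCK, &waited, &old_mask);

    rc = pthread_create(&waiter, NULL, probe_waiter, &waited);
    assert(rc == 0);
    (void)rc;

    while (sem_wait(&probe_ready_sem) == -1 && errno == EINTR)
        continue;
    nanosleep(&settle, NULL);        /* let the waiter enter sigwait */

    pthread_kill(waiter, PROBE_SIG_OTHER);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    do {
        rc = sem_timedwait(&probe_handler_ran_sem, &deadline);
    } while (rc == -1 && errno == EINTR);
    ran = (rc == 0);

    /* Release the waiter and restore the caller's signal state. */
    pthread_kill(waiter, PROBE_SIG_WAITED);
    pthread_join(waiter, NULL);
    pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(PROBE_SIG_OTHER, &old_sa, NULL);
    sem_destroy(&probe_ready_sem);
    sem_destroy(&probe_handler_ran_sem);
    return ran;
}
```

On a contemporary libc I would expect this to return 1. A result of 0 would match the old behavior described above, where the handler is held back until sigwait returns.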

ECL hangs the first time the GC is invoked, for example on
(MAKE-ARRAY 3000000).

What would be the easiest way to work around this problem?

Best regards,
- Anton





------------------------------------------------------------------------------
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand
malware threats, the impact they can have on your business, and how you
can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list



--
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com
