Hi all, I finally took a shovel and dug in deep into #877. It's a very tricky one and I think it's been so elusive due to the fact that the bug *probably* only crops up on machines with multiple CPU (cores).
At least it's relatively easy to detect the error and clean up the child, and abort the test run with a failure, so the test part of the attached patch is definitely an improvement. The patch also includes a fix for runtime.c for what seems to be the cause. I noticed that whenever it would fail, the test would print only letters, no numbers or dots. So, it looked like the child was no longer responding to the parent's signals OR spinning in its threads which print dots and underscores. This seems to indicate it is flooded with signals or something. The icky signal queueing code in runtime.c takes care not to queue up more than 100 signals, and it doesn't queue up timer interrupts in an attempt not to flood itself, but this doesn't help you when someone else is flooding you from the outside. So now it will drop signals more aggressively, by ignoring signals that have already been queued. Hopefully this will help. At least I have tried repeatedly on the call-cc.org server and I can no longer trigger the failure situation (which was relatively easy to reproduce on that machine, it would fail about once every 5 runs). If that's not enough we could make a queue the size of the total number of available signals. This would be slightly faster, but could also be an extremely big vector on some operating systems(?). I think something like that has been suggested before by Jerry. For now it doesn't necessary or likely that so many *different* signals would arrive at the same time, not even in case of an attack: you can't send signals to processes you don't own. Sorry for rambling. Cheers, Peter -- http://www.more-magic.net
>From 18dacdb6da22fab19f2e4fefc557d5abd24f2396 Mon Sep 17 00:00:00 2001 From: Peter Bex <peter....@xs4all.nl> Date: Sun, 5 Jan 2014 20:38:43 +0100 Subject: [PATCH] Fix race condition in #877. Don't allow a flood of identical signals to hog the signal queue. In the test, just to be sure, first set up a signal that the child can send the parent when it's ready. Then start sending signals to the child, so we know the child managed to set up all its handlers. Finally, wait for acknowledgement of child shutdown so we can detect whether the error persists. --- runtime.c | 13 +++++++-- tests/signal-tests.scm | 72 ++++++++++++++++++++++++++++++++---------------- 2 files changed, 59 insertions(+), 26 deletions(-) diff --git a/runtime.c b/runtime.c index e91d444..b82612d 100644 --- a/runtime.c +++ b/runtime.c @@ -4459,9 +4459,16 @@ C_regparm void C_fcall C_raise_interrupt(int reason) #endif interrupt_time = C_cpu_milliseconds(); pending_interrupts[ pending_interrupts_count++ ] = reason; - } else if(reason != C_TIMER_INTERRUPT_NUMBER && - pending_interrupts_count < MAX_PENDING_INTERRUPTS) { - /* drop signals if too many */ + } else if(pending_interrupts_count < MAX_PENDING_INTERRUPTS) { + int i; + /* + * Drop signals if too many, but don't queue up multiple entries + * for the same signal. + */ + for (i = 0; i < pending_interrupts_count; ++i) { + if (pending_interrupts[i] == reason) + return; + } pending_interrupts[ pending_interrupts_count++ ] = reason; } } diff --git a/tests/signal-tests.scm b/tests/signal-tests.scm index 3d685e2..0289fdc 100644 --- a/tests/signal-tests.scm +++ b/tests/signal-tests.scm @@ -9,6 +9,12 @@ (use posix srfi-18 extras) +(define all-go? (make-parameter #f)) + +;; This is set before starting the child to avoid the race condition +;; from #877. The child itself overwrites these signal handlers +;; before sending the "all go" signal (usr1) to the parent. +(set-signal-handler! signal/usr1 (lambda (sig) (all-go? #t))) (define received1 0) (define received2 0) @@ -29,41 +35,61 @@ (define (fini _) (printf "~%child terminating, received: ~a USR1, ~a USR2~%" received1 received2) + (thread-sleep! 0.5) + (process-signal (parent-process-id) signal/usr1) (exit)) (define (child) (print "child started") - (thread-start! + (thread-start! (lambda () (do () (#f) (thread-sleep! 0.5) (tick #\_)))) (set-signal-handler! signal/usr1 handler) - (set-signal-handler! signal/usr2 handler) - (set-signal-handler! signal/term fini) - (do () (#f) + (set-signal-handler! signal/usr2 handler) + (set-signal-handler! signal/term fini) + (process-signal (parent-process-id) signal/usr1) + (do () (#f) (thread-sleep! 1) (tick #\.))) (let ((pid (process-fork child)) (sent1 0) (sent2 0)) - (sleep 1) - (print "sending signals to " pid) - (do ((i 1000 (sub1 i))) - ((zero? i)) - (thread-sleep! (/ (random 10) 1000)) - (do ((j (random 4) (sub1 j))) - ((zero? j)) - (case (random 2) - ((0) - (tick #\A) - (set! sent1 (add1 sent1)) - (process-signal pid signal/usr1)) - ((1) - (tick #\B) - (set! sent2 (add1 sent2)) - (process-signal pid signal/usr2))))) - (printf "~%signals sent: ~a USR1, ~a USR2~%" sent1 sent2) - (print "terminating child process ...") - (process-signal pid)) + (print "Sleeping until child wakes us up") ; signal *should* interrupt the sleep + (print "would have slept for " (sleep 5) " more seconds") + (cond ((all-go?) + (print "sending signals to " pid) + (do ((i 1000 (sub1 i))) + ((zero? i)) + (thread-sleep! (/ (random 10) 1000)) + (do ((j (random 4) (sub1 j))) + ((zero? j)) + (case (random 2) + ((0) + (tick #\A) + (set! sent1 (add1 sent1)) + (process-signal pid signal/usr1)) + ((1) + (tick #\B) + (set! sent2 (add1 sent2)) + (process-signal pid signal/usr2))))) + (printf "~%signals sent: ~a USR1, ~a USR2~%" sent1 sent2) + (print "terminating child process ...") + (all-go? #f) + (print "Sending signal and waiting for acknowledgement from child") + (process-signal pid signal/term) + (unless (all-go?) ; There's a bit of a race condition here, but that's okay + (print "Would've slept for " (sleep 5) " more seconds")) + (cond ((all-go?) + (print "Everything is ok!") + (exit 0)) + (else + (print "ERROR! Did not receive acknowledgement of child shutdown within 5 seconds, or another process awoke us") + (print "Attempting to kill child forcefully via SIGKILL") + (process-signal pid signal/kill) + (exit 1)))) + (else (print "ERROR! Did not receive a signal from child within 10 seconds, or another process awoke us") + (print "terminating child process ...") + (exit 1)))) -- 1.7.10.4
_______________________________________________ Chicken-hackers mailing list Chicken-hackers@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-hackers