Hi all,

I finally took a shovel and dug in deep into #877.
It's a very tricky one and I think it's been so elusive due
to the fact that the bug *probably* only crops up on machines
with multiple CPU (cores).

At least it's relatively easy to detect the error and clean up
the child, and abort the test run with a failure, so the test part
of the attached patch is definitely an improvement.  The patch
also includes a fix for runtime.c for what seems to be the cause.

I noticed that whenever it would fail, the test would print only
letters, no numbers or dots.  So, it looked like the child was no
longer responding to the parent's signals OR spinning in its threads
which print dots and underscores.  This seems to indicate it is flooded
with signals or something.  The icky signal queueing code in runtime.c
takes care not to queue up more than 100 signals, and it doesn't queue
up timer interrupts in an attempt not to flood itself, but this doesn't
help you when someone else is flooding you from the outside.  So now
it will drop signals more aggressively, by ignoring signals that have
already been queued.  Hopefully this will help.  At least I have tried
repeatedly on the call-cc.org server and I can no longer trigger the
failure situation (which was relatively easy to reproduce on that
machine, it would fail about once every 5 runs).

If that's not enough we could make a queue the size of the total number
of available signals.  This would be slightly faster, but could also
be an extremely big vector on some operating systems(?).  I think
something like that has been suggested before by Jerry.  For now it
doesn't necessary or likely that so many *different* signals would
arrive at the same time, not even in case of an attack: you can't send
signals to processes you don't own.

Sorry for rambling.

Cheers,
Peter
-- 
http://www.more-magic.net
>From 18dacdb6da22fab19f2e4fefc557d5abd24f2396 Mon Sep 17 00:00:00 2001
From: Peter Bex <peter....@xs4all.nl>
Date: Sun, 5 Jan 2014 20:38:43 +0100
Subject: [PATCH] Fix race condition in #877.

Don't allow a flood of identical signals to hog the signal queue.

In the test, just to be sure, first set up a signal that the child can
send the parent when it's ready.  Then start sending signals to the child,
so we know the child managed to set up all its handlers.

Finally, wait for acknowledgement of child shutdown so we can detect
whether the error persists.
---
 runtime.c              |   13 +++++++--
 tests/signal-tests.scm |   72 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/runtime.c b/runtime.c
index e91d444..b82612d 100644
--- a/runtime.c
+++ b/runtime.c
@@ -4459,9 +4459,16 @@ C_regparm void C_fcall C_raise_interrupt(int reason)
 #endif
       interrupt_time = C_cpu_milliseconds();
       pending_interrupts[ pending_interrupts_count++ ] = reason;
-    } else if(reason != C_TIMER_INTERRUPT_NUMBER &&
-              pending_interrupts_count < MAX_PENDING_INTERRUPTS) {
-      /* drop signals if too many */
+    } else if(pending_interrupts_count < MAX_PENDING_INTERRUPTS) {
+      int i;
+      /*
+       * Drop signals if too many, but don't queue up multiple entries
+       * for the same signal.
+       */
+      for (i = 0; i < pending_interrupts_count; ++i) {
+        if (pending_interrupts[i] == reason)
+          return;
+      }
       pending_interrupts[ pending_interrupts_count++ ] = reason;
     }
   }
diff --git a/tests/signal-tests.scm b/tests/signal-tests.scm
index 3d685e2..0289fdc 100644
--- a/tests/signal-tests.scm
+++ b/tests/signal-tests.scm
@@ -9,6 +9,12 @@
 
 (use posix srfi-18 extras)
 
+(define all-go? (make-parameter #f))
+
+;; This is set before starting the child to avoid the race condition
+;; from #877.  The child itself overwrites these signal handlers
+;; before sending the "all go" signal (usr1) to the parent.
+(set-signal-handler! signal/usr1 (lambda (sig) (all-go? #t)))
 
 (define received1 0)
 (define received2 0)
@@ -29,41 +35,61 @@
 (define (fini _)
   (printf "~%child terminating, received: ~a USR1, ~a USR2~%"
     received1 received2)
+  (thread-sleep! 0.5)
+  (process-signal (parent-process-id) signal/usr1)
   (exit))
 
 (define (child)
   (print "child started")
-  (thread-start! 
+  (thread-start!
    (lambda ()
      (do () (#f)
        (thread-sleep! 0.5)
        (tick #\_))))
   (set-signal-handler! signal/usr1 handler)
-  (set-signal-handler! signal/usr2 handler)   
-  (set-signal-handler! signal/term fini)   
-  (do () (#f) 
+  (set-signal-handler! signal/usr2 handler)
+  (set-signal-handler! signal/term fini)
+  (process-signal (parent-process-id) signal/usr1)
+  (do () (#f)
     (thread-sleep! 1)
     (tick #\.)))
 
 (let ((pid (process-fork child))
       (sent1 0)
       (sent2 0))
-  (sleep 1)
-  (print "sending signals to " pid)
-  (do ((i 1000 (sub1 i)))
-      ((zero? i))
-    (thread-sleep! (/ (random 10) 1000))
-    (do ((j (random 4) (sub1 j)))
-       ((zero? j))
-      (case (random 2)
-       ((0) 
-        (tick #\A)
-        (set! sent1 (add1 sent1))
-        (process-signal pid signal/usr1))
-       ((1) 
-        (tick #\B)
-        (set! sent2 (add1 sent2))
-        (process-signal pid signal/usr2)))))
-  (printf "~%signals sent: ~a USR1, ~a USR2~%" sent1 sent2)
-  (print "terminating child process ...")
-  (process-signal pid))
+  (print "Sleeping until child wakes us up") ; signal *should* interrupt the 
sleep
+  (print "would have slept for " (sleep 5) " more seconds")
+  (cond ((all-go?)
+        (print "sending signals to " pid)
+        (do ((i 1000 (sub1 i)))
+            ((zero? i))
+          (thread-sleep! (/ (random 10) 1000))
+          (do ((j (random 4) (sub1 j)))
+              ((zero? j))
+            (case (random 2)
+              ((0)
+               (tick #\A)
+               (set! sent1 (add1 sent1))
+               (process-signal pid signal/usr1))
+              ((1)
+               (tick #\B)
+               (set! sent2 (add1 sent2))
+               (process-signal pid signal/usr2)))))
+        (printf "~%signals sent: ~a USR1, ~a USR2~%" sent1 sent2)
+        (print "terminating child process ...")
+        (all-go? #f)
+        (print "Sending signal and waiting for acknowledgement from child")
+        (process-signal pid signal/term)
+        (unless (all-go?) ; There's a bit of a race condition here, but that's 
okay
+          (print "Would've slept for " (sleep 5) " more seconds"))
+        (cond ((all-go?)
+               (print "Everything is ok!")
+               (exit 0))
+              (else
+               (print "ERROR! Did not receive acknowledgement of child 
shutdown within 5 seconds, or another process awoke us")
+               (print "Attempting to kill child forcefully via SIGKILL")
+               (process-signal pid signal/kill)
+               (exit 1))))
+       (else (print "ERROR! Did not receive a signal from child within 10 
seconds, or another process awoke us")
+             (print "terminating child process ...")
+             (exit 1))))
-- 
1.7.10.4

_______________________________________________
Chicken-hackers mailing list
Chicken-hackers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-hackers

Reply via email to