I'd like to thank everyone who replied on these bugs over the last
couple years.  Especially the people who prodded me into trying out
kernel 2.6.18-6.  I've tracked down the bug, or at least I think maybe
I have.  There is definitely a bug here, but it seems to be failing
even in cases where it "ought to" work; I suspect that 2.6.18-6 is also
a bit broken.

  The problem stems from the fact that the input thread, for its entire
execution, holds a lock on a mutex.  The purpose of this mutex is to
guard a condition variable and an associated member variable of the
input thread, both of which are used to synchronize the input thread and
the foreground code that actually reads from the keyboard.  Reading
input in the foreground was introduced in the patch that introduced this
bug (previously we tried to read in the background; it turns out that
curses updates the display when you read keyboard input, so this led to
simultaneous calls to refresh() and corrupted output).

  Unfortunately, the threads wrapper used in cwidget does not provide
any way to push thread cleanup handlers for mutex locks, because the
authors of the POSIX threads spec cunningly made it impossible to embed
these in an RAII C++ object due to their use of a macro.  You probably
see where this is going...

  Because of the flakiness of pthread_cancel(), cwidget mostly avoids
it.  However, it is used in a few places, including when we shut down
the input thread.  IIRC, this is done because it's necessary to
interrupt a select(), and cancel() is the main tool at our disposal to
do that.  The problem is that, in the words of pthread_cleanup_push(3):

    ...if a thread exits OR IS CANCELLED while it owns a locked mutex,
       the mutex will remain locked forever and prevent other threads
       from executing normally.  (emphasis mine)

  Recall that the only reason this mutex is locked is that we need it
for the condition variable.  But if you look at the code of the input
thread, you'll see that the condition variable is only accessed in one
sub-clause of that code.  In fact, of the three cancellation points in
that function (counting select(), which isn't one on Linux, and the
bracketing pthread_testcancel() calls, which are), only one of them
occurs in the area where the mutex actually has to be locked, and that's
the line where we actually wait on the condition variable.  And it turns
out that wait() *DOES* handle cleanup from cancellation properly.

  One unexplained question is why the bug hit even when aptitude
suspended in response to a keystroke.  When aptitude was waiting for
a keystroke, the input thread would be blocked in select(), so no
cleanup handlers were installed and the scenario above makes sense.
But when aptitude was waiting for a keystroke, I thought that the
input thread would still be blocked waiting on the input event
condition, and as I noted earlier, that should install a proper cleanup
handler.  Another question is why we didn't see this bug in 2.6.26.
I can't think of anything, except that maybe the implementation of
mutexes changed somehow so that they get unlocked when a thread is
canceled.

  So, the long and the short of it is: if we just lock the mutex when
we actually need it, that is, when we're about to send a get_input_event
to the main thread, there's no danger of being canceled and leaving the
mutex locked.  I've attached a patch that does this; on my computer, it
eliminates the hang.  I'll ask the release managers for permission to
apply this patch in Lenny.

  Now I just need to reboot into 2.6.26 so that my wireless works and I
can actually send this mail.

  Daniel
diff --git a/src/cwidget/toplevel.cc b/src/cwidget/toplevel.cc
index c912ad6..e0b6ab2 100644
--- a/src/cwidget/toplevel.cc
+++ b/src/cwidget/toplevel.cc
@@ -1,6 +1,6 @@
 // toplevel.cc
 //
-//  Copyright 1999-2005, 2007-2008 Daniel Burrows
+//  Copyright 1999-2005, 2007-2009 Daniel Burrows
 //
 //  This program is free software; you can redistribute it and/or modify
 //  it under the terms of the GNU General Public License as published by
@@ -446,7 +446,6 @@ namespace cwidget
 
       void operator()()
       {
-	threads::mutex::lock l(input_event_mutex);
 	input_event_fired = false;
 
 	// Important note: this routine only blocks indefinitely in
@@ -480,6 +479,7 @@ namespace cwidget
 	      }
 	    else
 	      {
+		threads::mutex::lock l(input_event_mutex);
 		post_event(new get_input_event(input_event_mutex,
 					       input_event_fired,
 					       input_event_condition));

Reply via email to