[PATCH] Thread.critical= deadlocks entire system
------------------------------------------------
Key: JRUBY-1157
URL: http://jira.codehaus.org/browse/JRUBY-1157
Project: JRuby
Issue Type: Bug
Components: Core Classes/Modules
Affects Versions: JRuby 1.0.0
Environment: Suspected: any; tested on: Mac OS X 10.4.9, Java
1.5.0_07, JRuby 1.0.0 (Ruby 1.8.5).
Reporter: Andrew Geweke
Priority: Critical
Summary: a bug in Thread.critical= causes the entire system to deadlock (all
Ruby threads); this happens with very high probability if given sufficient
concurrency (a large number of threads). The synchronization happening at the
Java layer in the implementation of Thread.critical is incorrect, and results
in this deadlock. The attached patch fixes it.
I discovered this by running ~10 threads, each of which made many calls to the
Logger (without actually outputting anything); Logger uses a Mutex, which in
turn uses Thread.critical=. The threads deadlocked almost without fail after
only about 10-20 seconds of runtime on my box. (Obviously, your results may
vary wildly, as this is all about thread timing.)
Specifically, org.jruby.internal.runtime.ThreadService.{getCritical(),
setCritical(boolean)} have the following synchronization bug: they all hold the
Java monitor on the singleton ThreadService object, then call into the
ReentrantLock meant to implement Thread.critical. But if the call into the
ReentrantLock ever blocks, nobody can ever unlock it, because the thread
blocked on the ReentrantLock will hold the Java monitor on the ThreadService,
which is required to be held in order to unlock the ReentrantLock. (Blocking on
the ReentrantLock will unlock *it*, but not monitors held higher up the call
chain -- like the one on the ThreadService.)
There was also an apparent race condition in ThreadService.waitForCritical()
(it tested whether the current thread held the critical lock, then tried to
lock and unlock it, but there was no guard around these; hence, the test could
execute and pass, only to have the test no longer be true by the time it got to
the lock/unlock clause); my reworked synchronization fixes this, too.
Full description, from the comments in the supplied testcase:
Background:
- In MRI, setting 'Thread.critical = true' means no other threads can possibly
be scheduled, period. (This is because MRI uses 'green threads', meaning it
implements its own thread scheduler, and the OS knows nothing about the
threads in the process; as such, it can guarantee this easily.) Thus it is
simply not possible for another thread to set Thread.critical to anything,
period, until the first thread sets 'Thread.critical = false'.
- In JRuby, Ruby threads are mapped directly onto Java threads. Because of
this,
JRuby cannot actually guarantee that no other threads are scheduled once you
set 'Thread.critical = true'; all it can do is block other threads as soon
as
they call one of a certain set of library functions that explicitly check
the
value of 'Thread.critical'.
The bug involves two threads:
1. Thread A calls 'Thread.critical = true'.
2. Thread B is then selected to run by the OS's scheduler. In MRI, this is not
possible, but in JRuby, it is.
3. Thread B calls 'Thread.critical = true'. This correctly does the right thing,
which is to block until Thread A sets 'Thread.critical = false'. However,
here is where the bug occurred: this would grab a monitor (an ordinary Java
monitor) that was required to do anything at all with the 'critical lock',
and hold it while it was blocked on a *different* lock waiting for thread A
to set 'Thread.critical = false'. (Because it is blocked on a different lock,
Java's synchronization primitives do not automatically release the monitor
this thread has acquited.)
4. Thread A calls 'Thread.critical = false'. However, in order to change the
value, it needs to acquire the monitor that thread A holds.
5. Deadlock.
The fix was to rework the synchronization so that the same monitor is used for
both locking the value of Thread.critical and to block until it's set to 'false'
when required, above. This test case demonstrates that the bug is fixed, and
makes sure it never recurs.
The bug was nasty because lots of synchronization objects (e.g., Mutex) use
Thread.critical in their implementation, and it ends up just being a matter of
time until the entire system deadlocks. (In particular, even just calls to
Logger from different threads will do this, as it uses a Mutex, which, in
turn, uses Thread.critical.)
Unfortunately, this test case simply *hangs* when the bug is present, as that's
the consequence of the bug (it's a deadlock). Further, we can't do anything
about it, as using another thread to try to time out the test just exacerbates
the problem and deadlocks itself. Alas, that's all we can do, but at least a
hung test is likely to get noticed when compared to no test at all. :-)
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.codehaus.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe from this list please visit:
http://xircles.codehaus.org/manage_email