[PATCH] Thread.critical= deadlocks entire system
------------------------------------------------

                 Key: JRUBY-1157
                 URL: http://jira.codehaus.org/browse/JRUBY-1157
             Project: JRuby
          Issue Type: Bug
          Components: Core Classes/Modules
    Affects Versions: JRuby 1.0.0
         Environment: Suspected: any; tested on: Mac OS X 10.4.9, Java 
1.5.0_07, JRuby 1.0.0 (Ruby 1.8.5).
            Reporter: Andrew Geweke
            Priority: Critical


Summary: a bug in Thread.critical= causes the entire system to deadlock (all 
Ruby threads); this happens with very high probability if given sufficient 
concurrency (a large number of threads). The synchronization happening at the 
Java layer in the implementation of Thread.critical is incorrect, and results 
in this deadlock. The attached patch fixes it.

I discovered this by running ~10 threads, each of which made many calls to the 
Logger (without actually outputting anything); Logger uses a Mutex, which in 
turn uses Thread.critical=. The threads deadlocked almost without fail after 
only about 10-20 seconds of runtime on my box. (Obviously, your results may 
vary wildly, as this is all about thread timing.)

Specifically, org.jruby.internal.runtime.ThreadService.{getCritical(), 
setCritical(boolean)} have the following synchronization bug: they all hold the 
Java monitor on the singleton ThreadService object, then call into the 
ReentrantLock meant to implement Thread.critical. But if the call into the 
ReentrantLock ever blocks, nobody can ever unlock it, because the thread 
blocked on the ReentrantLock will hold the Java monitor on the ThreadService, 
which is required to be held in order to unlock the ReentrantLock. (Blocking on 
the ReentrantLock will unlock *it*, but not monitors held higher up the call 
chain -- like the one on the ThreadService.)

There was also an apparent race condition in ThreadService.waitForCritical() 
(it tested whether the current thread held the critical lock, then tried to 
lock and unlock it, but there was no guard around these; hence, the test could 
execute and pass, only to have the test no longer be true by the time it got to 
the lock/unlock clause); my reworked synchronization fixes this, too.


Full description, from the comments in the supplied testcase:

Background:

  - In MRI, setting 'Thread.critical = true' means no other threads can possibly
    be scheduled, period. (This is because MRI uses 'green threads', meaning it
    implements its own thread scheduler, and the OS knows nothing about the
    threads in the process; as such, it can guarantee this easily.) Thus it is
    simply not possible for another thread to set Thread.critical to anything,
    period, until the first thread sets 'Thread.critical = false'.

  - In JRuby, Ruby threads are mapped directly onto Java threads. Because of 
this,
    JRuby cannot actually guarantee that no other threads are scheduled once you
    set 'Thread.critical = true'; all it can do is block other threads as soon 
as
    they call one of a certain set of library functions that explicitly check 
the
    value of 'Thread.critical'.

The bug involves two threads:

1. Thread A calls 'Thread.critical = true'.
2. Thread B is then selected to run by the OS's scheduler. In MRI, this is not
   possible, but in JRuby, it is.
3. Thread B calls 'Thread.critical = true'. This correctly does the right thing,
   which is to block until Thread A sets 'Thread.critical = false'. However,
   here is where the bug occurred: this would grab a monitor (an ordinary Java
   monitor) that was required to do anything at all with the 'critical lock',
   and hold it while it was blocked on a *different* lock waiting for thread A
   to set 'Thread.critical = false'. (Because it is blocked on a different lock,
   Java's synchronization primitives do not automatically release the monitor
   this thread has acquited.)
4. Thread A calls 'Thread.critical = false'. However, in order to change the
   value, it needs to acquire the monitor that thread A holds.
5. Deadlock.

The fix was to rework the synchronization so that the same monitor is used for
both locking the value of Thread.critical and to block until it's set to 'false'
when required, above. This test case demonstrates that the bug is fixed, and
makes sure it never recurs.

The bug was nasty because lots of synchronization objects (e.g., Mutex) use
Thread.critical in their implementation, and it ends up just being a matter of
time until the entire system deadlocks. (In particular, even just calls to
Logger from different threads will do this, as it uses a Mutex, which, in
turn, uses Thread.critical.)

Unfortunately, this test case simply *hangs* when the bug is present, as that's
the consequence of the bug (it's a deadlock). Further, we can't do anything
about it, as using another thread to try to time out the test just exacerbates
the problem and deadlocks itself. Alas, that's all we can do, but at least a
hung test is likely to get noticed when compared to no test at all. :-)


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.codehaus.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe from this list please visit:

    http://xircles.codehaus.org/manage_email

Reply via email to