2008/10/6 James Abley <[EMAIL PROTECTED]>:
> Hi,
>
> I've seen some liveness failures in DefaultISMLocking, where our
> webapp is unresponsive (thread dumps will follow tomorrow / later
> today depending on your timezone). The list of suspect causes
> for this problem currently stands at this:
>
> 1. JRockit JVM does not honour finally blocks.
> 2. Bug in concurrent-utils.
> 3. Bug in Jackrabbit code.
> 4. Bug in our code calling Jackrabbit.
> 5. Door number 3.
>
> 1. is obviously a frightening thought and cannot be the problem - just
> listing the obvious.
> 2. is highly unlikely. It's a very widely used library written and
> reviewed by some very smart people.
> 3. is possible, but fairly unlikely. A problem would presumably have
> been reported by someone else and a reasonable number of people are
> using Jackrabbit without ever seeing this problem.
> 4. Fewer people are using our code than the Jackrabbit code, so this is
> most likely where the problem lies. Further analysis of the thread
> dumps is required to see what's going on.
> 5. Or something I've not thought of yet.
>
> I've not yet done sufficient analysis to determine whether it is a
> deadlock, missed notification or some other reason for the application
> becoming unresponsive. From my reading of the Jackrabbit code, it
> looks fine in terms of locks being acquired and then released in a
> finally block. One question I do have, though, is that the lock
> acquisition code all uses the blocking form of trying to acquire the
> lock; i.e. in DefaultISMLocking:
>
>     rwLock.writeLock().acquire();
>
> and
>
>     rwLock.readLock().acquire();
>
> These methods can potentially wait for ever (and that is what they
> look like doing, since the thread dumps we have seem to indicate that
> no thread is making progress over a 5 minute timeframe). Is there any
> particular reason why the timeout version isn't used? i.e.
>
>     rwLock.writeLock().attempt(10000);
>
> and
>
>     rwLock.readLock().attempt(10000);
>
> Again, from my static analysis of the code, this should allow an
> exception to safely propagate and my application would fail / display
> an error message to the customer, but would not require the servlet
> container to be restarted. To my mind, that would be a safer
> implementation?
>
> I plan on trying to write a test to recreate the problem (which to
> date I think we've only seen on JRockit JVMs, hence my listing of that
> as a possible issue), and then putting in an implementation of
> ISMLocking using the Java 5 java.util.concurrent primitives with the
> timeout versions of the methods being used. But I was just curious as
> to what the list might think about this issue?
>
> Cheers,
>
> James
>
Attaching thread dumps. There are two files. The first one is the full
dump; in the second one I've removed all of the threads which were stuck
in our code, queued up behind "[STUCK] ExecuteThread: '26' for queue:
'weblogic.kernel.Default (self-tuning)'". Those threads aren't listed in
the JRockit blocked chains, since they are using java.util.concurrent
Lock rather than the synchronized keyword. I've removed them because
ExecuteThread '26' is stuck waiting for a notification in
DefaultISMLocking, so they don't add any information about the problem.

1. Am I asking this in the correct place, or should this be on the dev
list? Just wanted to confirm.

2. Thinking about the problem a little more over the last couple of
days, I can see an argument for not using the timeout versions of the
API. If thread A is trying to get a resource that is locked by thread B
and A gets to the point where it could time out, what is best? Hanging
while waiting for a notification that is never going to come, or
throwing a timeout exception and allowing the client to retry acquiring
a resource that is never going to become available?

3. I'm still trying to write a test that reproduces the problem.

Cheers,

James
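For reference, here is a minimal sketch of what the timed acquisition discussed above might look like with the Java 5 java.util.concurrent primitives. The class TimedRwLockSketch and its methods are hypothetical names made up for illustration; this is not Jackrabbit's ISMLocking interface, nor a proposed patch. Note that Lock.tryLock(timeout, unit) returns false on timeout rather than throwing, so the sketch converts that result into a TimeoutException itself.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical helper (not part of Jackrabbit): acquires a read or
 * write lock with a bounded wait instead of blocking indefinitely.
 */
final class TimedRwLockSketch {

    private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final long timeoutMillis;

    TimedRwLockSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the held read lock; the caller must unlock() it in a finally block. */
    Lock acquireRead() throws InterruptedException, TimeoutException {
        return acquire(rwLock.readLock(), "read");
    }

    /** Returns the held write lock; the caller must unlock() it in a finally block. */
    Lock acquireWrite() throws InterruptedException, TimeoutException {
        return acquire(rwLock.writeLock(), "write");
    }

    private Lock acquire(Lock lock, String kind)
            throws InterruptedException, TimeoutException {
        // tryLock reports a timeout by returning false, so the result
        // must be checked; ignoring it would proceed without the lock.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("Timed out after " + timeoutMillis
                    + " ms waiting for the " + kind + " lock");
        }
        return lock;
    }
}
```

A caller would then do something like `Lock l = locking.acquireWrite(); try { ... } finally { l.unlock(); }`. This only bounds the wait: if the lock holder never releases, every retry will time out as well, which is the trade-off raised in point 2.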
