2008/10/7 James Abley <[EMAIL PROTECTED]>:
> 2008/10/6 James Abley <[EMAIL PROTECTED]>:
>> Hi,
>>
>> I've seen some liveness failures in DefaultISMLocking, where our
>> webapp is unresponsive, and have thread dumps (which will follow
>> tomorrow / later today depending on your timezone). The list of
>> suspect causes for this problem currently stands at this:
>>
>> 1. JRockit JVM does not honour finally blocks.
>> 2. Bug in concurrent-utils.
>> 3. Bug in Jackrabbit code.
>> 4. Bug in our code calling Jackrabbit.
>> 5. Door number 3.
>>
>> 1. is obviously a frightening thought and cannot be the problem - just
>> listing the obvious.
>> 2. is highly unlikely. It's a very widely used library written and
>> reviewed by some very smart people.
>> 3. is possible, but fairly unlikely. A problem would presumably have
>> been reported by someone else and a reasonable number of people are
>> using Jackrabbit without ever seeing this problem.
>> 4. Fewer people are using our code than the Jackrabbit code, so this is
>> most likely where the problem lies. Further analysis of the thread
>> dumps is required to see what's going on.
>> 5. Or something I've not thought of yet.
>>
>> I've not yet done sufficient analysis to determine whether it is a
>> deadlock, missed notification or some other reason for the application
>> becoming unresponsive. From my reading of the Jackrabbit code, it
>> looks fine in terms of locks being acquired and then released in a
>> finally block. One question I do have, though, is that the lock
>> acquisition code all uses the blocking form of trying to acquire the
>> lock; i.e. in DefaultISMLocking:
>>
>> rwLock.writeLock().acquire();
>>
>> and
>>
>> rwLock.readLock().acquire();
>>
>> These methods can potentially wait forever (and that is what they
>> look like they are doing, since the thread dumps we have seem to
>> indicate that no thread is making progress over a 5 minute
>> timeframe). Is there any particular reason why the timeout version
>> isn't used? i.e.
>>
>> rwLock.writeLock().attempt(10000);
>>
>> and
>>
>> rwLock.readLock().attempt(10000);
>>
>> Again, from my static analysis of the code, this should allow an
>> exception to safely propagate and my application would fail / display
>> an error message to the customer, but would not require the servlet
>> container to be restarted. To my mind, that would be a safer
>> implementation?
>>
>> I plan on trying to write a test to recreate the problem (which to
>> date I think we've only seen on JRockit JVMs, hence my listing of that
>> as a possible issue), and then putting in an implementation of
>> ISMLocking using the Java 5 java.util.concurrent primitives with the
>> timeout versions of the methods being used. But I was just curious as
>> to what the list might think about this issue?
>>
>> Cheers,
>>
>> James
>>
>
> Attaching thread dumps. There are two files. The first one is the full
> dump; in the second one I've removed all of the threads which were
> stuck in our code, queued up behind "[STUCK] ExecuteThread: '26' for
> queue: 'weblogic.kernel.Default (self-tuning)'". Those threads aren't
> listed in the JRockit blocked chains, since they are using
> java.util.concurrent Lock rather than the synchronized keyword.
> I've removed them since ExecuteThread '26' is stuck
> waiting for a notification in DefaultISMLocking, and so they don't add
> any information about the problem.
>
> 1. Am I asking this in the correct place, or should this be on the dev
> list? Just wanted to confirm.
> 2. Thinking about the problem a little more over the last couple of
> days, I can see an argument for not using the timing-out versions of
> the API. If thread A is trying to get a resource that is locked by
> thread B and A gets to the point where it could time out, what is best?
> Hanging waiting for a notification that is never going to come, or
> throwing a timeout exception and allowing the client to retry
> acquiring a resource that is never going to become available?
> 3. I'm still trying to write a test that reproduces the problem.
>
> Cheers,
>
> James
>
Zip attachment was stripped off. Try this:

http://pastebin.com/m7d46791d

The larger thread dump won't go into the paste bin, so shout if you
think it's worth having.

I saw that 1.4.6 mentioned a problem with XA and deadlocking in this
area. But since I can't reliably replicate the problem, it's hard to
tell whether that will fix it.

Cheers,

James
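
For anyone following along, here is a minimal sketch of the timed-acquisition
idea from the quoted message, written against the Java 5
java.util.concurrent locks rather than concurrent-utils. It is not
Jackrabbit code and does not implement the real ISMLocking interface; the
class name TimedRwLock and its acquireRead / acquireWrite methods are
hypothetical names chosen for illustration, and the timeout value is up to
the caller.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical helper (not part of Jackrabbit): a read/write lock whose
// acquire methods give up after a fixed timeout instead of blocking forever.
final class TimedRwLock {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final long timeoutMillis;

    TimedRwLock(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the held read lock; the caller must unlock() it in a finally block.
    Lock acquireRead() throws InterruptedException, TimeoutException {
        return acquire(rwLock.readLock(), "read");
    }

    // Returns the held write lock; the caller must unlock() it in a finally block.
    Lock acquireWrite() throws InterruptedException, TimeoutException {
        return acquire(rwLock.writeLock(), "write");
    }

    private Lock acquire(Lock lock, String kind)
            throws InterruptedException, TimeoutException {
        // tryLock with a timeout plays the role of attempt(msecs) from the mail.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException(
                "Timed out after " + timeoutMillis + " ms waiting for " + kind + " lock");
        }
        return lock;
    }
}
```

A caller would still release in a finally block, the same pattern
DefaultISMLocking follows today. The only difference is that a stuck lock
surfaces as a TimeoutException the application can report, rather than as a
thread that never returns. Whether that is actually better when the lock
holder is never going to release is the open question in point 2 above.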
