Re: Liveness failures in DefaultISMLocking

James Abley Fri, 10 Oct 2008 03:31:57 -0700

2008/10/7 James Abley <[EMAIL PROTECTED]>:
> 2008/10/6 James Abley <[EMAIL PROTECTED]>:
>> Hi,
>>
>> I've seen some liveness failures in DefaultISMLocking, where our
>> webapp becomes unresponsive. Thread dumps will follow tomorrow /
>> later today, depending on your timezone. The list of suspect causes
>> for this problem currently stands at this:
>>
>> 1. JRockit JVM does not honour finally blocks.
>> 2. Bug in concurrent-utils.
>> 3. Bug in Jackrabbit code.
>> 4. Bug in our code calling Jackrabbit.
>> 5. Door number 3.
>>
>> 1. is obviously a frightening thought and cannot be the problem - just
>> listing the obvious.
>> 2. is highly unlikely. It's a very widely used library written and
>> reviewed by some very smart people.
>> 3. is possible, but fairly unlikely. A problem would presumably have
>> been reported by someone else and a reasonable number of people are
>> using Jackrabbit without ever seeing this problem.
>> 4. Fewer people are using our code than the Jackrabbit code, so this is
>> most likely where the problem lies. Further analysis of the thread
>> dumps is required to see what's going on.
>> 5. Or something I've not thought of yet.
>>
>> I've not yet done sufficient analysis to determine whether it is a
>> deadlock, missed notification or some other reason for the application
>> becoming unresponsive. From my reading of the Jackrabbit code, it
>> looks fine in terms of locks being acquired and then released in a
>> finally block. One question I do have, though: the lock
>> acquisition code all uses the blocking form of acquiring the
>> lock, i.e. in DefaultISMLocking:
>>
>> rwLock.writeLock().acquire();
>>
>> and
>>
>> rwLock.readLock().acquire();
>>
>> These methods can potentially wait for ever (and that is what they
>> look like doing, since the thread dumps we have seem to indicate that
>> no thread is making progress over a 5 minute timeframe). Is there any
>> particular reason why the timeout version isn't used?  i.e.
>>
>> rwLock.writeLock().attempt(10000);
>>
>> and
>>
>> rwLock.readLock().attempt(10000);
>>
>> Again, from my static analysis of the code, this should allow an
>> exception to safely propagate and my application would fail / display
>> an error message to the customer, but would not require the servlet
>> container to be restarted. To my mind, that would be a safer
>> implementation?
>>
>> I plan on trying to write a test to recreate the problem (which to
>> date I think we've only seen on JRockit JVMs, hence my listing of that
>> as a possible issue), and then putting in an implementation of
>> ISMLocking using the Java 5 java.util.concurrent primitives with the
>> timeout versions of the methods being used. But I was just curious as
>> to what the list might think about this issue?
>>
>> Cheers,
>>
>> James
>>
>
> Attaching thread dumps. There are two files. The first one is the full
> dump; in the second one I've removed all of the threads which were
> stuck in our code, queued up behind "[STUCK] ExecuteThread: '26' for
> queue: 'weblogic.kernel.Default (self-tuning)'". Those threads aren't
> listed in the JRockit blocked chains, since they are using
> java.util.concurrent Lock rather than the synchronized keyword.
> I've removed them since ExecuteThread '26' is stuck
> waiting for a notification in DefaultISMLocking, and so they don't add
> any information about the problem.
>
> 1. Am I asking this in the correct place, or should this be on the dev
> list? Just wanted to confirm.
> 2. Thinking about the problem a little more over the last couple of
> days, I can see an argument for not using the timing out versions of
> the API. If thread A is trying to get a resource that is locked by
> thread B and A gets to the point where it could timeout, what is best?
> Hanging waiting for a notification that is never going to come, or
> throwing a timeout exception and allowing the client to retry
> acquiring a resource that is never going to become available?
> 3. I'm still trying to write a test that reproduces the problem.
>
> Cheers,
>
> James
>


Zip attachment was stripped off. Try this:

http://pastebin.com/m7d46791d

The larger thread dump won't go into the paste bin, so shout if you
think it's worth having.

I saw that 1.4.6 mentioned a problem with XA and deadlocking in this
area. But since I can't reliably replicate the problem, it's hard to
tell whether that will fix it.
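
To make the timeout idea from my first mail a bit more concrete, here
is a rough sketch using the Java 5 java.util.concurrent locks. It is
not Jackrabbit's ISMLocking interface and not a patch: the class name
TimedRwLock, the LockTimeoutException type and the timeout value are
all made up for illustration. The one thing worth noting is that
tryLock(timeout, unit) returns false on timeout rather than throwing,
so the wrapper has to check the result and raise the exception itself.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical sketch only: a read/write lock wrapper that gives up
 * after a timeout instead of blocking indefinitely.
 */
class TimedRwLock {

    /** Hypothetical exception raised when the lock is not obtained in time. */
    static class LockTimeoutException extends Exception {
        LockTimeoutException(String message) {
            super(message);
        }
    }

    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final long timeoutMillis;

    TimedRwLock(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the held read lock; the caller must unlock() it in a finally block. */
    Lock acquireReadLock() throws InterruptedException, LockTimeoutException {
        return acquire(rwLock.readLock(), "read");
    }

    /** Returns the held write lock; the caller must unlock() it in a finally block. */
    Lock acquireWriteLock() throws InterruptedException, LockTimeoutException {
        return acquire(rwLock.writeLock(), "write");
    }

    private Lock acquire(Lock lock, String kind)
            throws InterruptedException, LockTimeoutException {
        // tryLock reports a timeout by returning false, so convert that
        // into an exception the caller cannot silently ignore.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new LockTimeoutException(
                    "Timed out after " + timeoutMillis + " ms waiting for "
                    + kind + " lock");
        }
        return lock;
    }
}
```

A caller would do something like "Lock l = locking.acquireWriteLock();
try { ... } finally { l.unlock(); }". Whether failing the request is
actually better than hanging is the open question in point 2 above;
this only shows that the failure can be made to surface.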

Cheers,

James
