Now tracking this issue as
https://issues.apache.org/jira/browse/JCR-929



Ian Boston wrote:
I think I have found where the problem is...

the HTTP threads appear to block waiting in AbstractJournal.lockAndSync(), while the ClusterNode thread waits in LockManagerImpl.acquire();


Since both HTTP threads are trying to acquire, and this doesn't happen in a non-clustered deployment, I am going to guess that the thread spinning in LockManagerImpl.acquire() has already got a WriterLock, hence it blocks the HTTP threads.

I can't quite see why the acquire spins forever; I'll put some more debug in.

Ian





Thread http-8580-Processor24 waiting by [EMAIL PROTECTED] ::WAITING
     at java.lang.Object.wait(Object.java:-2)
     at java.lang.Object.wait(Object.java:474)
     at EDU.oswego.cs.dl.util.concurrent.WriterPreferenceReadWriteLock$WriterLock.acquire(null:-1)
     at org.apache.jackrabbit.core.journal.AbstractJournal.lockAndSync(AbstractJournal.java:228)
     at org.apache.jackrabbit.core.journal.DefaultRecordProducer.append(DefaultRecordProducer.java:51)

Thread http-8580-Processor23 waiting by [EMAIL PROTECTED] ::WAITING
     at java.lang.Object.wait(Object.java:-2)
     at java.lang.Object.wait(Object.java:474)
     at EDU.oswego.cs.dl.util.concurrent.WriterPreferenceReadWriteLock$WriterLock.acquire(null:-1)
     at org.apache.jackrabbit.core.journal.AbstractJournal.lockAndSync(AbstractJournal.java:228)
     at org.apache.jackrabbit.core.journal.DefaultRecordProducer.append(DefaultRecordProducer.java:51)

Thread ClusterNode-localhost2 waiting by [EMAIL PROTECTED] ::WAITING
     at java.lang.Object.wait(Object.java:-2)
     at java.lang.Object.wait(Object.java:474)
     at EDU.oswego.cs.dl.util.concurrent.ReentrantLock.acquire(null:-1)
     at org.apache.jackrabbit.core.lock.LockManagerImpl.acquire(LockManagerImpl.java:599)
     at org.apache.jackrabbit.core.lock.LockManagerImpl.nodeAdded(LockManagerImpl.java:838)
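
The dumps above are consistent with a classic lock-ordering (ABBA) deadlock: the ClusterNode thread would be holding the journal's read/write lock while waiting for the lock manager's mutex, and an HTTP thread would be holding that mutex while waiting for the journal's write lock. The truncated stacks don't prove this, but a minimal, self-contained sketch of the pattern (the lock and thread names below are illustrative only, not Jackrabbit's actual code) looks like this:

import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration of the ABBA lock-ordering pattern the dumps suggest.
// ReentrantReadWriteLock stands in for the journal's writer-preference
// read/write lock, ReentrantLock for the lock manager's internal mutex.
public class AbbaDeadlockSketch {

    private static final ReentrantReadWriteLock journalLock =
            new ReentrantReadWriteLock();
    private static final ReentrantLock lockManagerLock = new ReentrantLock();

    public static void main(String[] args) {
        // "HTTP" thread: takes the lock-manager mutex, then wants the
        // journal write lock.
        Thread httpThread = new Thread(() -> {
            lockManagerLock.lock();
            try {
                pause(100); // widen the race window
                journalLock.writeLock().lock();  // blocks: cluster thread holds a read lock
                journalLock.writeLock().unlock();
            } finally {
                lockManagerLock.unlock();
            }
        }, "http-worker");

        // "ClusterNode" thread: holds the journal lock, then wants the
        // lock-manager mutex.
        Thread clusterThread = new Thread(() -> {
            journalLock.readLock().lock();
            try {
                pause(100);
                lockManagerLock.lock();          // blocks: http thread holds the mutex
                lockManagerLock.unlock();
            } finally {
                journalLock.readLock().unlock();
            }
        }, "cluster-sync");

        httpThread.start();
        clusterThread.start();
        // With the pauses in place both threads block forever: neither can
        // release its first lock until it acquires the other thread's lock.
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}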


Dominique Pfister wrote:
Hi Ian,

Have you been able to generate a thread dump of the stalled node at the
moment it stops responding? That might help...

Kind regards
Dominique
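
For reference, a thread dump can be captured by sending SIGQUIT (kill -3) to the JVM process, or programmatically from inside the webapp. A minimal sketch using only the standard Thread.getAllStackTraces() API (nothing Jackrabbit-specific) might look like this:

import java.util.Map;

// Minimal sketch: print a stack trace for every live thread in this JVM.
// Standard JDK API only; nothing here is specific to Jackrabbit.
public class ThreadDumpSketch {

    public static void dump() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.println("Thread " + t.getName() + " state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("     at " + frame);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        dump();
    }
}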

On 5/15/07, Ian Boston <[EMAIL PROTECTED]> wrote:
Hi,

I've been doing some testing of a 2 node jackrabbit cluster using 1.3
(with the JCR-915 patch), but I am getting some strange behavior.

I use OSX Finder to mount a DAV service from each node and then upload
lots of files to each DAV mount at the same time. All goes OK for the
first few thousand files, and then one of the nodes stops responding to that
session. The other node continues and finishes.

Eventually OSX disconnects the stalled node.

When I try the port of the apparently stalled cluster node, it still
responds, though with some strange behaviour.

A remount attempt responds with a 401 and forces a basic-auth login, but
stalls after that point (the URL points to the base of a workspace).

If I open Firefox and access the DAV servlet, I can navigate down the
directory tree, but if I try to refresh any JCR folder or JCR file that
I have already visited (since the cluster node has been up), Firefox
spins forever.

I have put a deadlock detector class into both nodes (a Java class that
looks for deadlocks through JMX) but it doesn't detect anything.
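
For context, a JMX-based detector of that kind typically polls ThreadMXBean.findMonitorDeadlockedThreads(); the sketch below is a generic reconstruction, not the actual class referred to above. One caveat: that call only reports threads deadlocked on object monitors, so locks built on Object.wait()/notify(), such as the EDU.oswego util.concurrent classes in the dumps above, would not be reported, which could explain why the detector stays silent.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Generic sketch of a polling JMX deadlock detector (not the actual class
// mentioned above). It only reports monitor deadlocks; threads parked in
// Object.wait() by wait/notify-based locks will not show up here.
public class DeadlockDetectorSketch implements Runnable {

    private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long[] deadlocked = threadBean.findMonitorDeadlockedThreads();
            if (deadlocked != null) {
                for (ThreadInfo info : threadBean.getThreadInfo(deadlocked)) {
                    System.err.println("Deadlocked: " + info.getThreadName()
                            + " waiting on " + info.getLockName()
                            + " held by " + info.getLockOwnerName());
                }
            }
            try {
                Thread.sleep(5000); // poll every five seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new DeadlockDetectorSketch(), "deadlock-detector").start();
    }
}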

I have also used JProfiler connected to one node, but it never detects a
deadlock.

I have tried all of this in single-node mode, with no Journal or
ClusterNode, and have not been able to re-create the problem (yet).

The one thing that I have seen in JProfiler is threads blocked waiting
for an ItemState monitor inside Jackrabbit, but never for more than 500 ms.

I am using the standard DatabaseJournal and the
SimpleDbPersistenceManager; however, I see the same happening with the
FileJournal.

Any ideas? I might put some very simple debug in near the monitor that
was blocking for 500 ms.

I did search JIRA but couldn't find anything that was a close match.


Ian



