Please find attached a write-up that briefly explains the problem and its causes, and
suggests several solutions.
The Problem
If there are two or more cluster nodes with a HASingleton service enabled and the
master service instance shuts down (explicit node shutdown, for example), there is a
noticeable delay until another node takes over and starts the service instance. More
precisely, it takes 60 seconds for the transfer to complete.
The problem does not show up in JBoss 3.2.3 (JGroups 2.2.0) but shows up in JBoss
3.2.4 (JGroups 2.2.4).
The Cause
The sequence of events in the HASingleton layer during an explicit node shutdown is
as follows: the HASingletonController is notified to shut down.
HAServiceMBeanSupport.stopService() executes and, while unregistering the DRM
listener, it calls "_remove" asynchronously on the cluster, so the call reaches the
future master too. On the future master, DistributedReplicantManager._remove()
triggers a local master election. The HASingletonController instance realizes that it
will become master, and in one of the subsequent steps it synchronously calls
"_stopOldMaster" on the partition. This distributed RPC blocks and exits only by
timeout (60 seconds).
I was able to replicate the problem independently in JGroups. The deadlock shows up
every time I use nested distributed calls. The test setup consists of two
RpcDispatchers A and B, each dispatcher being able to handle innerMethod() and
outerMethod(). innerMethod() is simple (just a System.out.println(), for example).
outerMethod() internally calls innerMethod() as a group RPC:
public void innerMethod() {
    System.out.println("innerMethod()");
}

public void outerMethod() {
    // synchronous group RPC: blocks until all members reply or the timeout expires
    rpcDispatcher.callRemoteMethods(null,
                                    new MethodCall("innerMethod", new Object[0]),
                                    GroupRequest.GET_ALL,
                                    60000);
}
From A, I call callRemoteMethod(B, "outerMethod"). The following things happen:
A                                     B
----------------------------------    ----------------------------------
callRemoteMethod(B, "outerMethod")
                                      outerMethod() executes
                                      outerMethod() calls innerMethod()
                                        on the group; the group call
                                        never returns, except by timeout
innerMethod() executes
.....
                                      innerMethod() group call times out
                                        after 60000 ms
The bottom-most cause of the problem is the fact that MessageDispatcher uses only one
thread ("MessageDispatcher up processing thread") to handle incoming RPC requests and
to make nested calls. Once a nested call is made, the thread blocks on a mutex and is
never woken up to handle the response to the nested call until the timeout expires.
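The single-thread deadlock can be modeled in plain java.util.concurrent, without any
JGroups code (the class and method names below are mine, for illustration only): one
single-threaded executor plays the role of the up processing thread, and a nested
"call" submits a second task to the same executor while the first task blocks on it.

```java
import java.util.concurrent.*;

// Minimal model of the deadlock: a single "up processing thread" must both
// run outerMethod() and deliver the response to the nested innerMethod() call.
public class SingleThreadDeadlock {

    static String run() throws Exception {
        // the lone up-processing thread
        ExecutorService upThread = Executors.newSingleThreadExecutor();
        try {
            Future<String> outer = upThread.submit(() -> {
                // nested "group call": queued on the same single thread
                Future<String> inner = upThread.submit(() -> "innerMethod()");
                try {
                    // blocks: the only thread that could run the inner task
                    // is this one, so the call can only exit by timeout
                    return "outer got " + inner.get(1, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    return "nested call timed out";
                }
            });
            return outer.get();
        } finally {
            upThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "nested call timed out"
    }
}
```

The 1-second timeout stands in for the 60-second group RPC timeout; the structure of
the hang is the same.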
Solutions
1. A quick and temporary solution is to make the "_stopOldMaster" distributed RPC
call in HASingletonSupport.partitionTopologyChanged() asynchronous. In the situation
presented above, the result does not matter anyway, since the master instance is
shutting down or is down already. If another service instance in the cluster (not
the master) goes down or up, that shouldn't be a reason to switch the master, so
again, it shouldn't matter anyway. This works; however, it just hides the symptoms.
2. Fix JGroups. One idea is to have RpcDispatcher use a thread per incoming call,
possibly backed by a thread pool. This way, the deadlock problem goes away. I am
still looking at JGroups 2.2.0 and trying to understand why nested distributed calls
work for that release. I will come up with a solution and a test case.
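In the same toy model as above (again my own names, not the JGroups API), the
thread-per-call idea amounts to handing each incoming request to a pool instead of
executing it on the single up thread; a blocked outer call then no longer prevents
the nested call from being served:

```java
import java.util.concurrent.*;

// Model of the proposed fix: with more than one dispatch thread, the nested
// call runs on a second worker while the first worker blocks, so the outer
// call completes instead of timing out.
public class ThreadPerRequestFix {

    static String run() throws Exception {
        // requests are dispatched to a pool, not to a single up thread
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            Future<String> outer = workers.submit(() -> {
                // nested "group call": picked up by a second worker
                Future<String> inner = workers.submit(() -> "innerMethod()");
                return "outer got " + inner.get(1, TimeUnit.SECONDS);
            });
            return outer.get();
        } finally {
            workers.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "outer got innerMethod()"
    }
}
```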
Cheers,
Ovidiu
View the original post :
http://www.jboss.org/index.html?module=bb&op=viewtopic&p=3845542#3845542
_______________________________________________
JBoss-Development mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/jboss-development