Please find attached a write-up that briefly explains the problem and its causes, and
suggests several solutions.
The Problem
If there are two or more cluster nodes with a HASingleton service enabled and the
master service instance shuts down (explicit node shutdown, for example), there is a
noticeable delay until another node takes over and starts the service instance. More
precisely, it takes 60 seconds for the transfer to complete.
The problem does not show up in JBoss 3.2.3 (JGroups 2.2.0) but shows up in JBoss
3.2.4 (JGroups 2.2.4).
The Cause
The sequence of events in the HASingleton layer during an explicit node shutdown is
as follows: the HASingletonController is notified to shut down.
HAServiceMBeanSupport.stopService() executes and, while unregistering the DRM
listener, it calls "_remove" asynchronously on the cluster, so the call reaches the
future master too. On the future master, DistributedReplicantManager._remove()
triggers a local master election. The HASingletonController instance realizes that it
will become master, and in one of the subsequent steps it synchronously calls
"_stopOldMaster" on the partition. This distributed RPC blocks and exits only by
timeout (60 seconds).
I was able to replicate the problem independently in JGroups. The deadlock shows up
every time I use nested distributed calls. The test setup consists of two
RpcDispatchers A and B, each dispatcher being able to handle innerMethod() and
outerMethod(). innerMethod() is simple (just a System.out.println(), for example).
outerMethod() internally calls innerMethod() as a group RPC:
public void innerMethod() {
    System.out.println("innerMethod()");
}

public void outerMethod() {
    // synchronous group RPC: blocks until all members reply or the timeout expires
    rpcDispatcher.callRemoteMethods(null,
                                    new MethodCall("innerMethod", new Object[0]),
                                    GroupRequest.GET_ALL,
                                    60000);
}
From A, I call callRemoteMethod(B, "outerMethod"). The following things happen:
A                                     B
----------------------------------    ----------------------------------
callRemoteMethod(B, "outerMethod")
                                      outerMethod() executes
                                      outerMethod() calls innerMethod()
                                        on the group; the group call
                                        never returns, except by timeout
innerMethod() executes
.....
                                      innerMethod() group call times out
                                        after 60000 ms
The bottom-most cause of the problem is the fact that MessageDispatcher uses only one
thread ("MessageDispatcher up processing thread") to handle incoming RPC requests and
to make nested calls. Once a nested call is made, the thread blocks on a mutex and is
never woken up to handle the response to the nested call until the timeout expires.
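The single-thread deadlock can be modeled in plain java.util.concurrent, without any
JGroups code (the class and method names below are mine, for illustration only): one
single-threaded executor plays the role of the up processing thread, and a nested
"call" submits a second task to the same executor while the first task blocks on it.

```java
import java.util.concurrent.*;

// Minimal model of the deadlock: a single "up processing thread" must both
// run outerMethod() and deliver the response to the nested innerMethod() call.
public class SingleThreadDeadlock {

    static String run() throws Exception {
        // the lone up-processing thread
        ExecutorService upThread = Executors.newSingleThreadExecutor();
        try {
            Future<String> outer = upThread.submit(() -> {
                // nested "group call": queued on the same single thread
                Future<String> inner = upThread.submit(() -> "innerMethod()");
                try {
                    // blocks: the only thread that could run the inner task
                    // is this one, so the call can only exit by timeout
                    return "outer got " + inner.get(1, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    return "nested call timed out";
                }
            });
            return outer.get();
        } finally {
            upThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "nested call timed out"
    }
}
```

The 1-second timeout stands in for the 60-second group RPC timeout; the structure of
the hang is the same.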
Solutions
1. A quick and temporary solution is to make the "_stopOldMaster" distributed RPC
call in HASingletonSupport.partitionTopologyChanged() asynchronous. In the situation
presented above, the result does not matter anyway, since the master instance is
shutting down or is down already. If another service instance in the cluster (not
the master) goes down or up, that shouldn't be a reason to switch the master, so
again, it shouldn't matter anyway. This works; however, it just hides the symptoms.
2. Fix JGroups. One idea is to have RpcDispatcher use a thread per incoming call,
possibly backed by a thread pool. This way, the deadlock problem goes away. I am
still looking at JGroups 2.2.0 and trying to understand why nested distributed calls
work for that release. I will come up with a solution and a test case.
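In the same toy model as above (again my own names, not the JGroups API), the
thread-per-call idea amounts to handing each incoming request to a pool instead of
executing it on the single up thread; a blocked outer call then no longer prevents
the nested call from being served:

```java
import java.util.concurrent.*;

// Model of the proposed fix: with more than one dispatch thread, the nested
// call runs on a second worker while the first worker blocks, so the outer
// call completes instead of timing out.
public class ThreadPerRequestFix {

    static String run() throws Exception {
        // requests are dispatched to a pool, not to a single up thread
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            Future<String> outer = workers.submit(() -> {
                // nested "group call": picked up by a second worker
                Future<String> inner = workers.submit(() -> "innerMethod()");
                return "outer got " + inner.get(1, TimeUnit.SECONDS);
            });
            return outer.get();
        } finally {
            workers.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "outer got innerMethod()"
    }
}
```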
Cheers,
Ovidiu
View the original post :
http://www.jboss.org/index.html?module=bb&op=viewtopic&p=3845542#3845542
_______________________________________________
JBoss-Development mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/jboss-development