Re: Clustering (long)

Andy Piper Tue, 02 Aug 2005 05:07:21 -0700

Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

I agree on the SPoF thing - but I think you misunderstand my Coordinator arch. I do not have a single static Coordinator node, but a dynamic Coordinator role, into which a node may be elected. Thus every node is a potential Coordinator. If the elected Coordinator dies, another is immediately elected. The election strategy is pluggable, although it will probably end up being hardwired to "oldest-cluster-member". The reason behind this is that re-laying out your cluster is much simpler if it is done in a single vm. I originally tried to do it in multiple vms, each taking responsibility for pieces of the cluster, but if the vms' views are not completely in sync, things get very hairy, and completely in sync is an expensive thing to achieve - and would introduce a cluster-wide single point of contention. So I do it in a single vm, as fast as I can, with fail over, in case that vm evaporates. Does that sound better than the scenario that you had in mind?
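
For illustration only, an "oldest-cluster-member" election could be sketched roughly as below. This is not WADI code: `Member`, `joined_at` and `elect_coordinator` are hypothetical names, and the sketch assumes every node holds the same membership view, which is the very assumption questioned in the reply that follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Hypothetical membership record, not a WADI type.
    name: str
    joined_at: float  # join time, assumed identical in every node's view


def elect_coordinator(view):
    """Return the oldest member in this node's view, or None if the view is empty."""
    if not view:
        return None
    # Tie-break on name so nodes holding the same view pick the same member.
    return min(view, key=lambda m: (m.joined_at, m.name))


view = [Member("red", 10.0), Member("green", 5.0), Member("blue", 7.5)]
coordinator = elect_coordinator(view)

# The elected Coordinator dies: drop it from the view and elect again.
survivors = [m for m in view if m != coordinator]
successor = elect_coordinator(survivors)
```

The election itself is trivial; the difficulty lies in guaranteeing that all nodes evaluate it over the same view.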

This is exactly the "hard" computer science problem that you shouldn't be trying to solve if at all possible. It's hard because network partitions or hung processes (think GC) make it very easy for your colleagues to think you are dead when you do not share that view. The result is two processes that each think they are the coordinator, and anarchy can ensue (commonly called split-brain syndrome). I can point you at papers if you want, but I really suggest that you aim for an implementation that is independent of a central coordinator. Note that a central coordinator is necessary if you want to implement a strongly-consistent in-memory database, but this is not usually a requirement for session replication, say.

http://research.microsoft.com/Lampson/58-Consensus/Abstract.html gives a good introduction to some of these things. I also presented at JavaOne on related issues, you should be able to download the presentation from dev2dev.bea.com at some point (not there yet - I just checked).

The Coordinator is not there to support session replication, but rather the management of the distributed map (map of which a few buckets live on each node) which is used by WADI to discover very efficiently whether a session exists and where it is located. This map must be rearranged, in the most efficient way possible, each time a node joins or leaves the cluster.
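
As a rough sketch of what "rearranged in the most efficient way possible" might mean, the function below re-lays out a fixed set of buckets over the live nodes while moving as few buckets as it can. It is an assumption-laden illustration, not WADI's algorithm: `rebalance`, the list-of-owners representation and the even-quota policy are all hypothetical.

```python
from collections import Counter


def rebalance(owners, nodes):
    """owners[b] is the node owning bucket b (or None). Returns a new layout
    spread evenly over `nodes`, keeping each bucket in place where possible."""
    live = sorted(set(nodes))
    if not live:
        raise ValueError("cannot lay out buckets over an empty cluster")
    base, extra = divmod(len(owners), len(live))
    held = Counter(o for o in owners if o in live)
    # Nodes already holding the most buckets get the larger quotas,
    # so fewer buckets have to move.
    by_load = sorted(live, key=lambda n: (-held[n], n))
    quota = {n: base + (1 if i < extra else 0) for i, n in enumerate(by_load)}

    new_owners = [None] * len(owners)
    kept = Counter()
    orphans = []
    for bucket, owner in enumerate(owners):
        if owner in quota and kept[owner] < quota[owner]:
            new_owners[bucket] = owner  # bucket stays where it is
            kept[owner] += 1
        else:
            orphans.append(bucket)      # owner left, or is over quota
    for node in by_load:
        while kept[node] < quota[node]:
            new_owners[orphans.pop()] = node
            kept[node] += 1
    return new_owners


def moved(before, after):
    """Count buckets whose owner changed between two layouts."""
    return sum(1 for x, y in zip(before, after) if x != y)


layout = rebalance([None] * 12, ["a", "b", "c"])
after_join = rebalance(layout, ["a", "b", "c", "d"])   # node d joins
after_leave = rebalance(after_join, ["a", "c", "d"])   # node b leaves
```

Doing this in one place, on one view of the membership, is what keeps the computation this simple.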

Understood. Once you have a fault-tolerant singleton coordinator you can solve lots of interesting problems, it's just hard and often not worth the effort or the expense (typical implementations involve HA HW or an HA DB or at least 3 server processes).
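
One common guard against two coordinators is a strict-majority rule: a node may only act as coordinator while it can see more than half of the configured cluster. The sketch below is a generic illustration under that assumption, not something prescribed in this thread; `has_quorum` and the node names are hypothetical.

```python
def has_quorum(reachable, cluster_size):
    """True if the reachable set (this node included) is a strict majority
    of the configured cluster size."""
    return len(set(reachable)) > cluster_size // 2


cluster = {"a", "b", "c"}

# A network partition splits the cluster into two sides.
side_one = {"a"}
side_two = {"b", "c"}

may_coordinate = {
    "side_one": has_quorum(side_one, len(cluster)),
    "side_two": has_quorum(side_two, len(cluster)),
}
```

With only two processes neither side of a partition holds a strict majority, so three is the smallest cluster that can keep a coordinator through a single failure. The rule does not by itself handle a paused process that wakes up still believing it is the coordinator.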

Replication is NYI - but I'm running a few mental background threads that suggest that an extension to the index will mean that it associates the session's id not just to its current location, but also to the location of a number of replicants. I also have ideas on how a session might choose nodes into which it will place its replicants and how I can avoid the primary session copy ever being colocated with a replicant (potential SPoF - if you only have one replicant), etc...
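
A minimal sketch of such an extended index entry, with replicants never placed on the primary's node, might look like this. The mail does not say how nodes would be chosen; ranking candidates by a hash of session id and node name (rendezvous-style hashing) is purely my assumption, as are the names `choose_replica_nodes` and `index`.

```python
import hashlib


def choose_replica_nodes(session_id, primary, nodes, count=1):
    """Pick `count` distinct nodes for replicants, never the primary's node."""
    candidates = sorted(n for n in set(nodes) if n != primary)
    if len(candidates) < count:
        raise ValueError("not enough nodes to keep replicants off the primary")

    def score(node):
        # Deterministic per (session, node) ranking; an assumed policy.
        return hashlib.sha1(f"{session_id}:{node}".encode("utf-8")).hexdigest()

    return sorted(candidates, key=score)[:count]


nodes = ["node1", "node2", "node3"]
replicas = choose_replica_nodes("abc123", "node1", nodes, count=2)

# Hypothetical index entry: session id -> primary location plus replicant locations.
index = {"abc123": {"primary": "node1", "replicas": replicas}}
```

Refusing to place a replicant when no other node is available makes the colocation case an explicit error rather than a silent SPoF.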


Right, definitely something you want to avoid.

Yes, I can see that happening - I have an improvement (NYI) to WADI's evacuation strategy (how sessions are evacuated when a node wishes to leave). Each session will be evacuated to the node which owns the bucket into which its id hashes. This is because colocation of the session with the bucket allows many messages concerned with its future destruction and relocation to be optimised away. Future requests falling elsewhere but needing this session should, in the most efficient case, be relocated to this same node, otherwise the session may be relocated, but at a cost...
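
The evacuation rule described above can be sketched as follows. This is an illustration, not WADI's implementation: the CRC32 hash, the function names and the requirement that the bucket map has already been re-laid out without the leaving node are all assumptions.

```python
import zlib


def bucket_for(session_id, num_buckets):
    """Map a session id to a bucket number (CRC32 is an assumed hash choice)."""
    return zlib.crc32(session_id.encode("utf-8")) % num_buckets


def evacuation_target(session_id, bucket_owners, leaving):
    """Return the node owning the bucket that the session id hashes into."""
    owner = bucket_owners[bucket_for(session_id, len(bucket_owners))]
    if owner == leaving:
        # Assumption: buckets are moved off the leaving node before sessions are.
        raise ValueError("bucket map still assigns this bucket to the leaving node")
    return owner


# Bucket map after node "c" has handed its buckets to "a" and "b".
bucket_owners = ["a", "b", "a", "b"]
target = evacuation_target("session-42", bucket_owners, leaving="c")
```

Because the session ends up on the node that owns its index bucket, a lookup for the session and the session itself are then a local hop apart.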

How do you relocate the request? Many HW load-balancers do not support this (or else it requires using proprietary APIs), so you probably have to count on moving sessions in the normal failover case.

I would be very grateful for any thoughts or feedback that you could give me. I hope to get much more information about WADI into the wiki over the next few weeks. That should help generate more discussion, although I would be more than happy for people to ask me questions here on Geronimo-dev because this will give me an idea of what documentation I should write and how existing documentation may be lacking or misleading.

I guess my general comment would be that you might find it better to think specifically about the end-user problem you are trying to solve (say session replication) and work towards a solution based on that. Most short-cuts / optimizations that vendors make are specific to the problem domain and do not generally apply to all clustering problems.


Hope this helps

andy
