Checked in a small test app that allows this stuff to be taken for a
spin.
http://svn.apache.org/repos/asf/geronimo/sandbox/failover/
Hopefully we can use it as a tool to start getting some feedback.
Some of the parts of it are getting reworked, but should be runnable
soon.
-David
On Sep 12, 2008, at 5:43 PM, David Blevins wrote:
I've added some functionality to OpenEJB trunk which has been
enabled in Geronimo trunk. Here's an overview of how it works:
DISCOVERY
What we have going on from a tech perspective is each server sends
and receives a multicast heartbeat. Each multicast packet contains
a single URI that advertises a service, its group, and its
location. Say for example "cluster1:ejb:ejbd://thehost:4201". We
can definitely explore the SLP format as Alan suggests.
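To make the heartbeat format concrete, here is a minimal sketch of pulling the three parts (group, service type, location) out of such a URI. The class and method names are illustrative stand-ins, not OpenEJB's actual API:

```java
// Parses a discovery heartbeat URI of the form "group:service:location",
// e.g. "cluster1:ejb:ejbd://thehost:4201". Illustrative only.
public class HeartbeatUri {
    public final String group;    // cluster group, e.g. "cluster1"
    public final String service;  // service type, e.g. "ejb"
    public final String location; // connectable endpoint, e.g. "ejbd://thehost:4201"

    public HeartbeatUri(String group, String service, String location) {
        this.group = group;
        this.service = service;
        this.location = location;
    }

    public static HeartbeatUri parse(String uri) {
        // Split on the first two colons only; the location itself contains colons.
        int first = uri.indexOf(':');
        int second = uri.indexOf(':', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("Expected group:service:location, got: " + uri);
        }
        return new HeartbeatUri(
                uri.substring(0, first),
                uri.substring(first + 1, second),
                uri.substring(second + 1));
    }
}
```

Because the whole advertisement fits in one small string, a single multicast packet carries everything a peer needs to know.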
There are other advantages of the simple, unchanging, URI style.
The URI is essentially stateless as there is no "i'm alive" URI or
an "i'm dead" URI, there is simply a URI for each service a server
offers and its presence on the network indicates its availability
and its absence indicates the service is no longer available. In
this way the issues with UDP being unordered and unreliable melt
away as state is no longer a concern and packet sizes are always
small. Complicated libraries that ride atop UDP and attempt to
offer reliability (retransmission) and ordering on UDP can be
avoided. UDP/Multicast is only used for discovery and from there on
out critical information is transmitted over TCP/IP which is
obviously going to do a better job at ensuring reliability and
ordering.
On the client side of things, a special "multicast://" URL can be
used in the InitialContext properties to signify that multicast
should be used to seed the connection process. Such as:
  Properties properties = new Properties();
  properties.setProperty(Context.INITIAL_CONTEXT_FACTORY,
      "org.apache.openejb.client.RemoteInitialContextFactory");
  properties.setProperty(Context.PROVIDER_URL,
      "multicast://239.255.2.3:6142");
  InitialContext remoteContext = new InitialContext(properties);
The URL has optional query parameters such as "schemes" and "group"
and "timeout" which allow you to zero in on a particular type of
service of a particular cluster group as well as set how long you
are willing to wait in the discovery process till finally giving
up. The first matching service that it sees "flowing" around on the
UDP stream is the one it picks and sticks to for that and subsequent
requests, ensuring UDP is only used when there are no other servers
to talk to.
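As a sketch of how those query options might be picked apart on the client side, the standard java.net.URI class already handles the "multicast://" scheme's host, port, and query string. The option names (schemes, group, timeout) come from the description above; the parser itself is illustrative, not the actual client code:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: split a multicast PROVIDER_URL's query string
// into its options, e.g. schemes, group, timeout.
public class MulticastUrl {
    public static Map<String, String> options(String url) {
        URI uri = URI.create(url);
        Map<String, String> opts = new HashMap<>();
        if (uri.getQuery() != null) {
            for (String pair : uri.getQuery().split("&")) {
                int eq = pair.indexOf('=');
                opts.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return opts;
    }
}
```

So a PROVIDER_URL like "multicast://239.255.2.3:6142?schemes=ejbd&group=cluster1&timeout=5000" would narrow discovery to ejbd services in cluster1, giving up after five seconds.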
FAILOVER
On each request to the server, the client sends the version number
associated with the list of servers in the cluster it is aware of.
Initially this version will be zero and the list will be empty.
Only when the server sees the client has an old list will the server
send the updated list. This is an important distinction as the list
(ClusterMetaData) is not transmitted back and forth on every
request, only on change. If the membership of the cluster is stable
there is essentially no clustering overhead to the protocol -- 8
byte overhead to each request and 1 byte on each response -- so you
will *not* see an exponential slowdown in response times the more
members are added to the cluster. This new list takes effect for
all proxies that share the same ServerMetaData data. Internally we
key the ClusterMetaData by ServerMetaData. I originally had the
version be a simple "increment by one" strategy, but eventually went
with the value of System.currentTimeMillis(), for two reasons.
First, it's possible more than one server is reachable via the
ServerMetaData (i.e. multicast://) and each server has its own list
and version number. Secondly, if a server were restarted, an
incrementing version number would go back to zero and the client
could be stuck thinking it has a more current list than the server.
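The version-gated exchange can be sketched as follows. ClusterDirectory and its method names are hypothetical stand-ins for the real ClusterMetaData handling; the point is that nothing rides on the wire when the client is already current:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the version-gated list exchange: the server only ships the
// cluster list when the client's version is stale. Names are hypothetical.
public class ClusterDirectory {
    private List<String> servers = new ArrayList<>();
    private long version = 0; // 0 = empty initial list, matching a new client

    public synchronized void update(List<String> newServers) {
        servers = new ArrayList<>(newServers);
        // currentTimeMillis survives restarts better than increment-by-one:
        // a restarted server won't hand out a version lower than before.
        version = System.currentTimeMillis();
    }

    // Returns the list only if the client's copy is out of date; null means
    // "you're current", so a stable cluster adds almost nothing per request.
    public synchronized List<String> updateIfStale(long clientVersion) {
        return clientVersion == version ? null : new ArrayList<>(servers);
    }

    public synchronized long version() {
        return version;
    }
}
```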
When a server shuts down, further connections are refused, existing
connections not in mid-request are closed, and any remaining
connections are closed immediately after completion of the request in
progress, so clients can fail over gracefully to the next server in
the list. If a server crashes, requests are retried on the next
server in the list. This failover pattern is followed until there are no more
servers in the list at which point the client attempts a final
multicast search (if it was created with a multicast PROVIDER_URL)
before abandoning the request and throwing an exception to the
caller. Currently, the failover is ordered but could very easily be
made random. The multicast discovery aspect of the client adds a
nice randomness to the selection of the first server that is perhaps
somewhat "just". Theoretically, servers that are under more load
will send out fewer heartbeats than servers with no load. This may
not happen as theory dictates, but certainly as we get more ejb
statistic data wired into the server functionality we can pursue
deliberate heartbeat throttling techniques that might make that
theory really sing in practice.
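The retry pattern described above can be sketched as a simple loop. The Invoker interface is an illustrative stand-in for the client's request machinery, not the OpenEJB client API:

```java
import java.util.List;

// Sketch of the ordered failover loop: try each known server in turn,
// then give up. Invoker is a hypothetical stand-in for the real client.
public class Failover {
    interface Invoker {
        String invoke(String server) throws Exception;
    }

    public static String request(List<String> servers, Invoker invoker) throws Exception {
        Exception last = null;
        for (String server : servers) {
            try {
                return invoker.invoke(server); // success: stick with this server
            } catch (Exception e) {
                last = e; // refused or crashed: retry on the next in the list
            }
        }
        // A client created with a multicast PROVIDER_URL would attempt one
        // final multicast search here before abandoning the request.
        throw new Exception("no servers available", last);
    }
}
```

Making the failover random instead of ordered would amount to shuffling the list before the loop.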
GERONIMO
On the G side of things, the multicast functionality has been copied
into Geronimo. Still need to get it updated to the latest changes.
We'll eventually want OpenEJB getting notifications from the
Geronimo version instead of using its own. Once that is done we
can remove the dep on the openejb-multicast jar. For the moment I
just tucked the multicast server implementation into the
EjbDaemonGBean as a temporary solution. A tricky thing is that when
we get that set up as its own server component, we won't want the
port offset and the hostname to affect the multicast host and port.
The combination of the multicast host and port essentially creates a
"topic" that all members of the network can listen to and write
messages to. So any servers that are in the same cluster will need
to listen on the same host/port.
We could really use a GUI for this stuff too. Is there anyone out
there with a few spare cycles who wants to write up a trivial little
"show me the servers on the cluster" kind of portlet for the console?
-David