Hi Dan,
Dan Creswell wrote:
> Hi all,
> It started with a discussion under the "Javaspaces.notify() not reliable"
> conversation and I've now had a bit more time to formulate my thoughts.
> Without this extra feature we do something like the following in the client:
> (1) Set up a watchdog timer with a suitable expiry.
> (2) On receiving a remote event, reset our watchdog timer.
> (3) If the timer expires, check to see if our source is still alive and
> check to see if we might've missed an event.
> What's being proposed, if I understand correctly, is that the source, if
> it's alive and hasn't generated events in a particular time period,
> confirms that by posting a SourceAliveRemoteEvent to the client.
The idea has 3 aspects:
1) the SourceAliveRemoteEvent (SARE) protocol is triggered by a QoS
invocation constraint set upon registration;
2) the source must send a SARE as the first event (this is helpful in
finding out whether callbacks are possible);
3) the source should send a SARE in case a certain amount of time has
elapsed since the last remote event was sent.
Below I will try to clarify why I consider this to have advantages over
performing a ping.
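To make the contrast concrete, here is a minimal sketch of the
client-side watchdog pattern described above, with the SARE variant
folded in. All names are mine, the listener would still have to be
exported as a Jini remote object, and I use plain java.util.concurrent
rather than any particular timer utility:

  import java.rmi.RemoteException;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.ScheduledFuture;
  import java.util.concurrent.TimeUnit;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  public class WatchdogListener implements RemoteEventListener {
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();
      private final long timeoutMillis;
      private ScheduledFuture<?> watchdog;

      public WatchdogListener(long timeoutMillis) {
          this.timeoutMillis = timeoutMillis;
          reset();                     // (1) arm the watchdog at registration
      }

      public synchronized void notify(RemoteEvent ev)
              throws UnknownEventException, RemoteException {
          reset();                     // (2) any event, a SARE included,
                                       //     proves the source is alive
          // ... hand ordinary events to the application, swallow bare SAREs ...
      }

      private synchronized void reset() {
          if (watchdog != null) watchdog.cancel(false);
          watchdog = timer.schedule(this::expired, timeoutMillis,
                                    TimeUnit.MILLISECONDS);
      }

      private void expired() {
          // (3) without SARE: ping the source and check sequence numbers
          // for gaps; with SARE: a QoS promise has been broken, so one can
          // go straight for a backup service or raise an alarm.
      }
  }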
> This would potentially change the above client code to reset the timer
> on just a SourceAliveRemoteEvent (SARE).
> Things of note:
> (1) The original solution places the responsibility and load on the
> client (bar the pinging of the server). This naturally scales out quite
> well as the server only has to respond to pings and chances are a client
> only maintains timers for a few services. If client timeouts are tuned
> appropriately to event frequency/typical pause, pings will be rare.
The SARE protocol is 'triggered' based on a QoS invocation constraint,
i.e. only clients that have an interest in SAREs will register for
receiving them with their event registration. A server won't be sending
SAREs to those who have shown no interest. Also, the constraint can be
rejected in case the timeout period requested is too small and the
server wants to refuse, i.e. the server has a say in the 'tuning'.
Preventing a client from invoking ping because it sets a very small
time-out seems much harder to control.
I must say it really depends on what the ping constitutes whether I
would call ping a trivial operation for the server.
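For what it's worth, here is a purely hypothetical sketch of what such a
constraint could look like. net.jini.core.constraint.InvocationConstraint
is just a marker interface, so the class only has to carry the requested
maximum silence period; whether the server honours it, or refuses the
registration because the period is too small, remains a server-side
decision:

  import java.io.Serializable;

  import net.jini.core.constraint.InvocationConstraint;

  /** Hypothetical constraint: "send me a SARE if you have been silent". */
  public final class SourceAliveInterval
          implements InvocationConstraint, Serializable {
      private static final long serialVersionUID = 1L;

      /** Maximum silence in milliseconds before a SARE must be sent. */
      private final long maxSilenceMillis;

      public SourceAliveInterval(long maxSilenceMillis) {
          if (maxSilenceMillis <= 0)
              throw new IllegalArgumentException("period must be positive");
          this.maxSilenceMillis = maxSilenceMillis;
      }

      public long getMaxSilenceMillis() {
          return maxSilenceMillis;
      }
  }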
> (2) The new solution places much of the responsibility with the server.
> I believe there may be a scaling problem here. In contrast to the
> client-side approach a server might have a large number of clients to
> cope with. This potentially means the server has significant load
> tracking a large number of timer events for all its clients and posting
> SAREs in addition to what it already does.
There's no denying the proposal brings additional complexity to those
services that wish to support the constraint.
I've been implementing SARE in Seven last week and I have it working.
The event framework became more complex, although due to experience in
building a few similar mechanisms at the application layer I was able
to make some optimizations in the code. These give me the impression
the overhead is quite minimal, assuming a time-out is used that relates
to the average expected event rate.
Therefore I'm not that afraid of scalability issues, given that the
time-out period is expected to be in line with, and probably larger
than, the interval at which you will be sending events. Or in other
words, the time-out is likely only small in case you expect a high
remote event frequency, meaning SAREs won't be sent that often; and if
they are, your server is likely capable of dealing with large numbers
of events anyway. And on the positive side, one must find a proper
usage for all these multi-core/CMT CPUs coming our way.
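To give an impression of the bookkeeping involved, here is a rough
sketch along the lines of what I did in Seven, though all names below
are made up and the real code differs (for one, the sweep task should
be cancelled when the registration's lease expires). The point is that
the ordinary event path only stamps a volatile field, and a SARE goes
out only if the source has stayed silent:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.atomic.AtomicLong;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;

  /** Hypothetical SARE type: a RemoteEvent carrying no payload. */
  class SourceAliveRemoteEvent extends RemoteEvent {
      SourceAliveRemoteEvent(Object source, long eventID, long seqNum) {
          super(source, eventID, seqNum, null);   // no handback
      }
  }

  /** One event registration that asked for the SARE constraint. */
  public class SareRegistration {
      private static final ScheduledExecutorService SWEEPER =
          Executors.newSingleThreadScheduledExecutor();

      private final Object source;
      private final RemoteEventListener listener;
      private final long eventID;
      private final long silenceMillis;
      private final AtomicLong seqNum = new AtomicLong();
      private volatile long lastSentMillis = System.currentTimeMillis();

      public SareRegistration(Object source, RemoteEventListener listener,
                              long eventID, long silenceMillis) {
          this.source = source;
          this.listener = listener;
          this.eventID = eventID;
          this.silenceMillis = silenceMillis;
          SWEEPER.scheduleAtFixedRate(this::sweep, silenceMillis,
                                      silenceMillis, TimeUnit.MILLISECONDS);
      }

      /** The ordinary event path calls this after each delivery; that
          stamp is all the extra cost SARE adds to the common case. */
      public void eventDelivered() {
          lastSentMillis = System.currentTimeMillis();
      }

      private void sweep() {
          if (System.currentTimeMillis() - lastSentMillis < silenceMillis)
              return;                    // recent traffic, nothing to do
          try {
              // I assume here that SAREs share the registration's event ID
              // and sequence; the real protocol would have to pin that down.
              listener.notify(new SourceAliveRemoteEvent(
                  source, eventID, seqNum.incrementAndGet()));
              lastSentMillis = System.currentTimeMillis();
          } catch (Exception e) {
              // delivery failed: a candidate for cancelling the registration
          }
      }
  }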
> (3) The only difference between old and new approach from a client
> coding perspective is what causes a reset of the watchdog timer.
For a client it seems to me SARE is easier than performing a remote
method invocation (ping) that might take some time to return. With SARE
I expect none to a minimal number of ordinary remote method invocations
(pings) to take place, so clients are less likely to have to take the
additional roundtrip time of these calls (and the possibility of them
timing out) into account; the calls are the exception and not the norm.
In the ping case your timer will probably hand off to something that
performs the ping asynchronously, to prevent it from interfering with
the timer itself.
When your watchdog goes off with SARE you know some QoS criterion
hasn't been met by your source, versus having to go figure out whether
it did send events which haven't arrived. In many cases with SARE you
won't perform a request to your source: you might go straight for a
backup service and ignore the service altogether, or you ring the alarm
bell of some Network Operations Center. But of course there will be
cases where you want to be a bit more persistent about your event
registration.
> (4) SAREs, like any other event, can be lost - if one is lost the
> client watchdog will trigger just as it would in the old approach,
> given sufficient time between RemoteEvents.
Indeed it is possible a SARE will be lost. Although for most types of
services I've coded (no multiple hops and no event payload provided by
"I mess up the codebase" clients), the chance that a SARE will be
forever lost due to a transitory failure is one I consider small
compared to the other expected failures.
> (5) If the source has sent events but they've been lost it won't send
> a SARE and, again, the client watchdog will timeout and ping.
>
> Based on the above it seems to me that whilst a SARE might save a few
> pings there's additional complexity and greater server load. If I've
> missed some subtleties, please shout because right now I don't see
> enough benefit in this to justify the "pain".
So far I'm not sure what exactly you mean by a 'ping' in the above. Is
it just a way to check whether the service is alive, or do you envision
more: something that has a correlation with the event registration and
internal event framework and that can say meaningful things about its
ability to deliver events? If it is only something to check whether the
service is alive/reachable, I consider SARE a much richer concept for
getting info about the ability to deliver events, also because it
follows the exact route of event delivery. Ping doesn't follow the
invocation path of Jini Distributed Events, which (especially where
security and network topology are concerned) might be failing precisely
because of these differences.
In the proposal I also use SARE as the first event to be sent, to
verify whether a callback is possible, so besides being a 'source
alive' signal it also serves another purpose, namely to find out
whether event delivery can work at all.
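And a sketch of that first-event handshake, again with names made up by
me: the registration request is only granted once an initial SARE has
made it through to the listener, so a broken callback path fails fast
instead of surfacing later:

  import java.rmi.RemoteException;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  public class CallbackProbe {
      /** Throws if the listener cannot be reached over the event path;
          the caller would then fail the registration attempt itself. */
      public static void probe(Object source, RemoteEventListener listener,
                               long eventID)
              throws UnknownEventException, RemoteException {
          // In the proposal this very first event would be the SARE itself.
          listener.notify(new RemoteEvent(source, eventID, 0L, null));
      }
  }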
One thing we haven't covered yet is that a ping for reachability might
be successful even while the source is not able to deliver events in a
timely fashion due to being overloaded, deadlocks, etc., whereas SARE
will show the source is not able to deliver events properly. As such it
tells me more about the state the event producer is in and its ability
to serve me.
To conclude, a ping (assuming its simple and generic form) doesn't give
me enough information about the capabilities of a source to deliver
events, whereas SARE can do this better. Yes, it will lead to
complexity at the server and maybe a slight reduction in scalability,
but in most cases also a simplification at the client side and the
ability to get indications you won't be able to get with ping.
I'm not saying this is the only way, but to me it represents a pattern
I have often used and see value in being part of the standard toolbox;
the same might go for mechanisms to test for reachability/availability
(the ones Dennis mentioned).
My hope is that the common patterns people use can be
standardized/formalized so that we see more support for them, through
frameworks, utilities or whatever form people like to see or that fits
them. But at least that way they don't stay proprietary in many small
corners of the Jini empire.
--
Mark