Comments interspersed...
Cheers,

Greg.

On Tue, 2010-10-26 at 15:00, Christopher Dolan wrote:
> What lease timeouts are people using with Reggie?
>
> My project currently uses a 10 minute timeout. We chose that value as a
> balance between 1) wanting to know quickly when a service crashes and 2)
> performance concerns with the Reggie implementation.

I'm curious what the performance issue is that you've seen.

> I've become dissatisfied with that compromise, however, particularly in
> cases where the service is actually live but the registrar has gone bad
> or a clock sync anomaly has occurred (both cause false negatives).

Remember, it's OK to have more than one registrar on the network for a
given group. That's why JoinManager and ServiceDiscoveryManager look for
all the registrars they can find. This is built into Jini!

> Ideally I would like to disentangle the notion of an expected service
> and the liveness of that service. That is, I would like to be able to
> query the registrar separately for all of the services that are supposed
> to be running and all of the services that are actually running right
> now.

You may be artificially conflating the ideas of "liveness" and service
discovery (and then you're quite wisely saying you'd like to separate
them). The fact that a service is registered with one or more registrars,
and that someone is renewing the leases for those registrations, does not
indicate in any reliable way that the service is alive.

Consider, for instance, a service that registers a smart proxy that does
some calculation, but does it locally after the proxy is loaded at the
client (and properly prepared, etc.). The "proxy", which is really a
dynamically-loaded utility class, will happily continue to work until it
is garbage collected, even if the service provider that registered it has
long since shut down.
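To make that concrete, here's a rough sketch of the situation (all the names here are mine, purely illustrative): a "smart proxy" that does its work entirely locally, so it keeps answering long after the service that registered it is gone.

```java
import java.io.Serializable;

// Hypothetical service interface; names are illustrative only.
interface Averager {
    double average(double[] values);
}

// A "smart proxy" that does the calculation entirely locally once it has
// been downloaded by the client.  Nothing here ever talks back to the
// service that registered it, so it keeps working after that service has
// shut down.
class LocalAveragerProxy implements Averager, Serializable {
    public double average(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return values.length == 0 ? 0.0 : sum / values.length;
    }
}

public class SmartProxyDemo {
    public static void main(String[] args) {
        Averager proxy = new LocalAveragerProxy();
        // The registering service could be long dead at this point; the
        // dynamically-loaded class still answers happily.
        System.out.println(proxy.average(new double[] {1.0, 2.0, 3.0}));
    }
}
```

A lease renewal on such a registration tells you nothing about whether the registering JVM still exists.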
In another case, a service might register a proxy with Reggie, then hand
off the lease renewal to another entity, which renews the lease whether
the service is "live" or not. This could be a LeaseRenewalManager object
in the same JVM as the service, or in some cases you might have handed
off to a whole different service (e.g. the Lease Renewal Service). In
either case, you can have a dead service that cheerfully renews its lease
whenever it runs out. By the way, this problem is not just a Jini thing;
corner an embedded hardware guy and ask him whether it's alright to reset
a watchdog timer in an interrupt-driven timer service routine.

Determining liveness is a difficult problem. Determining deadness is
somewhat easier, but also complicated. If you try to utilize a service,
and it fails with a definite (i.e. non-communications-related) error, you
can be pretty sure the service is dead. At the very least, you know you
need to look for another service, because even if it is just a
communications error, the service is "dead to you".

If the service has a registration in a lookup registrar, is it live? Not
necessarily, for the reasons mentioned above. If I've used the service
recently, is it live? Not necessarily, because it might have died just
after it finished answering my last request. You might try to set up a
"heartbeat" message of some kind, where the service sends a message to
subscribers, or maybe a multicast packet, at some interval. If you've
received a message within the last interval, is the service alive? Again,
not necessarily, because it might have died just after its previous
heartbeat, or it might have a poor "heart" implementation that keeps
beating after its "brain" is dead (see the watchdog timer example above).
So, at best, any supervision scheme can put bounds on how much time
elapses before you find out that a service is dead, but it can't ensure
liveness.
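The strongest claim a heartbeat scheme can make is "no beat within the interval, so presumed dead." A minimal sketch (class and method names are mine, not any Jini API) makes the asymmetry visible:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal heartbeat supervisor sketch.  It bounds how long a dead
// service goes unnoticed, but a recent beat never proves the service is
// still alive -- it might have died right after beating.
class HeartbeatMonitor {
    private final long intervalMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitor(long intervalMillis) { this.intervalMillis = intervalMillis; }

    void beat(String serviceId, long now) { lastBeat.put(serviceId, now); }

    // "Presumed dead" is the strongest claim we can make: true only means
    // we have not heard a beat within the interval.
    boolean presumedDead(String serviceId, long now) {
        Long last = lastBeat.get(serviceId);
        return last == null || now - last > intervalMillis;
    }
}

public class HeartbeatDemo {
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.beat("svc-1", 0);
        System.out.println(monitor.presumedDead("svc-1", 5_000));   // within interval
        System.out.println(monitor.presumedDead("svc-1", 20_000));  // interval exceeded
    }
}
```

Note that even `presumedDead == false` is just "heard from it recently", which is exactly the watchdog-timer caveat above.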
Sometimes people will say "but I can renew my RemoteEventListener lease
with the service, so the service must be live." However, the same thing
applies; the lease is for the convenience of the lessor (really, to make
sure it has a time boundary after which it can clear out your stuff), and
the lessor may have delegated the handling of its leases to a third party.

> Take for example a collection of redundant services intended to be used
> round-robin.

This is actually a great use case for JavaSpaces. In an interview with
Bill Venners, Ken Arnold talked about a JavaSpace entry being a "remote
procedure call to nowhere in particular"
(http://www.artima.com/intv/swayP.html). I've always liked that analogy.
In fact, you get better than round-robin scheduling, because faster
processors will automatically retrieve more work from the space, giving
optimal load balancing.

That still leaves the JavaSpace as a single point of failure. It would be
nice to have a clustered JavaSpaces implementation (I might be wrong, but
I think GigaSpaces is such a thing). I've also thought about implementing
a smart proxy that could do clustered load balancing. The servers would
then have to instigate some kind of liveness self-monitoring and master
election (a Paxos-protocol sort of thing) and coordinate with the smart
proxies to distribute the requests. However, I've never implemented it.

> I want clients to prefer to contact only the services
> known to be alive to avoid TCP timeouts. But if the registrar thinks
> all of them are down, I still want clients to try to contact them just
> in case the registrar is wrong. So, I don't want the services to be
> removed from the LookupCache completely.

What I've normally done is get a proxy, then continue using that proxy
until it fails to work, then flush it from the LookupCache and go find
another proxy. The assumption, thus, is that a service is live until
proven otherwise.
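In skeleton form, that "live until proven otherwise" loop looks something like this (EchoService and the queue of proxies are stand-ins of my own invention for a Jini proxy and a LookupCache, not real API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative stand-in for a Jini service proxy; the throws clause
// plays the role of a RemoteException from a real remote call.
interface EchoService {
    String echo(String msg) throws Exception;
}

public class UseUntilDeadDemo {
    public static void main(String[] args) throws Exception {
        // Pretend the first proxy discovered is dead and the second works.
        Deque<EchoService> cache = new ArrayDeque<>();
        cache.add(msg -> { throw new Exception("connection refused"); });
        cache.add(msg -> "echo: " + msg);

        String result = null;
        while (!cache.isEmpty()) {
            EchoService proxy = cache.peek();
            try {
                result = proxy.echo("hello");   // keep using this proxy...
                break;
            } catch (Exception e) {
                cache.poll();                   // ...until it fails, then flush it
            }
        }
        System.out.println(result);
    }
}
```

The client never asks "is it alive?" up front; it just pays the cost of one failed call when a proxy turns out to be dead, then moves on.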
> I've considered adding an Entry to the service's attributeSets that says
> if the service is alive, and setting the registration lease duration to
> be very long. In that case, I would need to alter Reggie to fill in
> that attribute as "missing" when a service failed to check in before a
> liveness timeout but not actually cancel the service lease. With an
> implementation like that, it would be trivial for me to pick out the
> live services with a simple ServiceItemFilter on the LookupCache.

Personally, I wouldn't alter Reggie; you could probably implement a
"liveness-monitoring" service instead. Keep in mind the liveness caveats
above, however. I'm not sure the extra complexity gains much utility over
a "use the proxy til it's dead" approach.

> Another idea is to implement this client side: use a short lease timeout
> with Reggie but add some longer-term caching to the LookupCache. In
> that case, a serviceRemoved() from a registrar would simply flag the
> ServiceItemReg as not alive. The service would not be removed from the
> LookupCache, however, until N hours after it was removed from the last
> registrar.
>
> Has anybody else had similar thoughts? What compromises, extensions
> and/or architectures have you chosen as a result?
>
> Chris

--
Greg Trasuk, President
StratusCom Manufacturing Systems Inc. - We use information technology to
solve business problems on your plant floor.
http://stratuscom.com
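P.S. - Your client-side idea could be sketched roughly like this (all names are mine and purely illustrative; a real version would wrap a LookupCache): on serviceRemoved(), flag the entry "not alive" instead of dropping it, and only evict it after a longer grace period.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a stale-tolerant client-side cache: removed services are
// flagged not-alive but kept around as last-resort candidates until a
// grace period expires, in case the registrar was wrong.
class StaleTolerantCache {
    static final class Entry {
        boolean alive = true;
        long removedAt = -1;
    }

    private final long graceMillis;
    private final Map<String, Entry> services = new HashMap<>();

    StaleTolerantCache(long graceMillis) { this.graceMillis = graceMillis; }

    void serviceAdded(String id) { services.put(id, new Entry()); }

    void serviceRemoved(String id, long now) {
        Entry e = services.get(id);
        if (e != null) { e.alive = false; e.removedAt = now; }  // flag, don't drop
    }

    // Evict only after the grace period; until then the service remains a
    // candidate that clients may still try.
    void expire(long now) {
        services.values().removeIf(e -> !e.alive && now - e.removedAt > graceMillis);
    }

    boolean isCandidate(String id) { return services.containsKey(id); }

    boolean isBelievedAlive(String id) {
        Entry e = services.get(id);
        return e != null && e.alive;
    }
}

public class StaleCacheDemo {
    public static void main(String[] args) {
        StaleTolerantCache cache = new StaleTolerantCache(3_600_000); // 1h grace
        cache.serviceAdded("svc-1");
        cache.serviceRemoved("svc-1", 0);
        cache.expire(1_000);
        System.out.println(cache.isBelievedAlive("svc-1")); // false: prefer others
        System.out.println(cache.isCandidate("svc-1"));     // true: still try it
        cache.expire(4_000_000);
        System.out.println(cache.isCandidate("svc-1"));     // false: grace expired
    }
}
```

Clients would prefer the "believed alive" set and fall back to the remaining candidates, which gives you the behaviour you describe without touching Reggie.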
