Comments interspersed...
Cheers,

Greg.

On Tue, 2010-10-26 at 15:00, Christopher Dolan wrote:
> What lease timeouts are people using with Reggie?
>
> My project currently uses a 10 minute timeout. We chose that value as a
> balance between 1) wanting to know quickly when a service crashes and 2)
> performance concerns with the Reggie implementation.

I'm curious what the performance issue is that you've seen.

> I've become dissatisfied with that compromise, however, particularly in
> cases where the service is actually live but the registrar has gone bad
> or a clock sync anomaly has occurred (both cause false negatives).

Remember, it's OK to have more than one registrar on the network for a
given group. That's why JoinManager and ServiceDiscoveryManager look for
all the registrars they can find. This is built into Jini!

> Ideally I would like to disentangle the notion of an expected service
> and the liveness of that service. That is, I would like to be able to
> query the registrar separately for all of the services that are supposed
> to be running and all of the services that are actually running right
> now.

You may be artificially conflating the ideas of "liveness" and service
discovery (and then you're quite wisely saying you'd like to separate
them). The fact that a service is registered with one or more registrars,
and that someone is renewing the leases for those registrations, does not
indicate in any reliable way that the service is alive.

Consider, for instance, a service that registers a smart proxy that does
some calculation, but does it locally after the proxy is loaded at the
client (and properly prepared, etc.). The "proxy", which is really a
dynamically-loaded utility class, will happily continue to work until it
is garbage collected, even if the service provider that registered it has
long since shut down.
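To make that concrete, here's a rough sketch of the situation (all the names here are mine, purely illustrative): a "smart proxy" that does its work entirely locally, so it keeps answering long after the service that registered it is gone.

```java
import java.io.Serializable;

// Hypothetical service interface; names are illustrative only.
interface Averager {
    double average(double[] values);
}

// A "smart proxy" that does the calculation entirely locally once it has
// been downloaded by the client.  Nothing here ever talks back to the
// service that registered it, so it keeps working after that service has
// shut down.
class LocalAveragerProxy implements Averager, Serializable {
    public double average(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return values.length == 0 ? 0.0 : sum / values.length;
    }
}

public class SmartProxyDemo {
    public static void main(String[] args) {
        Averager proxy = new LocalAveragerProxy();
        // The registering service could be long dead at this point; the
        // dynamically-loaded class still answers happily.
        System.out.println(proxy.average(new double[] {1.0, 2.0, 3.0}));
    }
}
```

A lease renewal on such a registration tells you nothing about whether the registering JVM still exists.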
In another case, a service might register a proxy with Reggie, then hand
off the lease renewal to another entity, which renews the lease whether
the service is "live" or not. This could be a LeaseRenewalManager object
in the same JVM as the service, or in some cases you might have handed
off to a whole different service (e.g. the Lease Renewal Service). In
either case, you can have a dead service that cheerfully renews its lease
whenever it runs out. By the way, this problem is not just a Jini thing;
corner an embedded hardware guy and ask him whether it's alright to reset
a watchdog timer in an interrupt-driven timer service routine.

Determining liveness is a difficult problem. Determining deadness is
somewhat easier, but also complicated. If you try to utilize a service,
and it fails with a definite (i.e. non-communications-related) error, you
can be pretty sure the service is dead. At the very least, you know you
need to look for another service, because even if it is just a
communications error, the service is "dead to you".

If the service has a registration in a lookup registrar, is it live? Not
necessarily, for the reasons mentioned above. If I've used the service
recently, is it live? Not necessarily, because it might have died just
after it finished answering my last request. You might try to set up a
"heartbeat" message of some kind, where the service sends a message to
subscribers, or maybe a multicast packet, at some interval. If you've
received a message within the last interval, is the service alive? Again,
not necessarily, because it might have died just after its previous
heartbeat, or it might have a poor "heart" implementation that keeps
beating after its "brain" is dead (see the watchdog timer example above).
So, at best, any supervision scheme can put bounds on how much time
elapses before you find out that a service is dead, but it can't ensure
liveness.
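The strongest claim a heartbeat scheme can make is "no beat within the interval, so presumed dead." A minimal sketch (class and method names are mine, not any Jini API) makes the asymmetry visible:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal heartbeat supervisor sketch.  It bounds how long a dead
// service goes unnoticed, but a recent beat never proves the service is
// still alive -- it might have died right after beating.
class HeartbeatMonitor {
    private final long intervalMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitor(long intervalMillis) { this.intervalMillis = intervalMillis; }

    void beat(String serviceId, long now) { lastBeat.put(serviceId, now); }

    // "Presumed dead" is the strongest claim we can make: true only means
    // we have not heard a beat within the interval.
    boolean presumedDead(String serviceId, long now) {
        Long last = lastBeat.get(serviceId);
        return last == null || now - last > intervalMillis;
    }
}

public class HeartbeatDemo {
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.beat("svc-1", 0);
        System.out.println(monitor.presumedDead("svc-1", 5_000));   // within interval
        System.out.println(monitor.presumedDead("svc-1", 20_000));  // interval exceeded
    }
}
```

Note that even `presumedDead == false` is just "heard from it recently", which is exactly the watchdog-timer caveat above.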
Sometimes people will say "but I can renew my RemoteEventListener lease
with the service, so the service must be live." However, the same thing
applies; the lease is for the convenience of the lessor (really, to make
sure it has a time boundary after which it can clear out your stuff), and
the lessor may have delegated the handling of its leases to a third party.

> Take for example a collection of redundant services intended to be used
> round-robin.

This is actually a great use case for JavaSpaces. In an interview with
Bill Venners, Ken Arnold talked about a JavaSpace entry being a "remote
procedure call to nowhere in particular"
(http://www.artima.com/intv/swayP.html). I've always liked that analogy.
In fact, you get better than round-robin scheduling, because faster
processors will automatically retrieve more work from the space, giving
optimal load balancing.

That still leaves the JavaSpace as a single point of failure. It would be
nice to have a clustered JavaSpaces implementation (I might be wrong, but
I think GigaSpaces is such a thing). I've also thought about implementing
a smart proxy that could do clustered load balancing. The servers would
then have to instigate some kind of liveness self-monitoring and master
election (a Paxos-protocol sort of thing) and coordinate with the smart
proxies to distribute the requests. However, I've never implemented it.

> I want clients to prefer to contact only the services
> known to be alive to avoid TCP timeouts. But if the registrar thinks
> all of them are down, I still want clients to try to contact them just
> in case the registrar is wrong. So, I don't want the services to be
> removed from the LookupCache completely.

What I've normally done is get a proxy, then continue using that proxy
until it fails to work, then flush it from the LookupCache and go find
another proxy. The assumption, thus, is that a service is live until
proven otherwise.
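In skeleton form, that "live until proven otherwise" loop looks something like this (EchoService and the queue of proxies are stand-ins of my own invention for a Jini proxy and a LookupCache, not real API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative stand-in for a Jini service proxy; the throws clause
// plays the role of a RemoteException from a real remote call.
interface EchoService {
    String echo(String msg) throws Exception;
}

public class UseUntilDeadDemo {
    public static void main(String[] args) throws Exception {
        // Pretend the first proxy discovered is dead and the second works.
        Deque<EchoService> cache = new ArrayDeque<>();
        cache.add(msg -> { throw new Exception("connection refused"); });
        cache.add(msg -> "echo: " + msg);

        String result = null;
        while (!cache.isEmpty()) {
            EchoService proxy = cache.peek();
            try {
                result = proxy.echo("hello");   // keep using this proxy...
                break;
            } catch (Exception e) {
                cache.poll();                   // ...until it fails, then flush it
            }
        }
        System.out.println(result);
    }
}
```

The client never asks "is it alive?" up front; it just pays the cost of one failed call when a proxy turns out to be dead, then moves on.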
> I've considered adding an Entry to the service's attributeSets that says
> if the service is alive, and setting the registration lease duration to
> be very long. In that case, I would need to alter Reggie to fill in
> that attribute as "missing" when a service failed to check in before a
> liveness timeout but not actually cancel the service lease. With an
> implementation like that, it would be trivial for me to pick out the
> live services with a simple ServiceItemFilter on the LookupCache.

Personally, I wouldn't alter Reggie; you could probably implement a
"liveness-monitoring" service instead. Keep in mind the liveness caveats
above, however. I'm not sure the extra complexity gains much utility over
a "use the proxy til it's dead" approach.

> Another idea is to implement this client side: use a short lease timeout
> with Reggie but add some longer-term caching to the LookupCache. In
> that case, a serviceRemoved() from a registrar would simply flag the
> ServiceItemReg as not alive. The service would not be removed from the
> LookupCache, however, until N hours after it was removed from the last
> registrar.
>
> Has anybody else had similar thoughts? What compromises, extensions
> and/or architectures have you chosen as a result?
>
> Chris

--
Greg Trasuk, President
StratusCom Manufacturing Systems Inc. - We use information technology to
solve business problems on your plant floor.
http://stratuscom.com
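P.S. - Your client-side idea could be sketched roughly like this (all names are mine and purely illustrative; a real version would wrap a LookupCache): on serviceRemoved(), flag the entry "not alive" instead of dropping it, and only evict it after a longer grace period.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a stale-tolerant client-side cache: removed services are
// flagged not-alive but kept around as last-resort candidates until a
// grace period expires, in case the registrar was wrong.
class StaleTolerantCache {
    static final class Entry {
        boolean alive = true;
        long removedAt = -1;
    }

    private final long graceMillis;
    private final Map<String, Entry> services = new HashMap<>();

    StaleTolerantCache(long graceMillis) { this.graceMillis = graceMillis; }

    void serviceAdded(String id) { services.put(id, new Entry()); }

    void serviceRemoved(String id, long now) {
        Entry e = services.get(id);
        if (e != null) { e.alive = false; e.removedAt = now; }  // flag, don't drop
    }

    // Evict only after the grace period; until then the service remains a
    // candidate that clients may still try.
    void expire(long now) {
        services.values().removeIf(e -> !e.alive && now - e.removedAt > graceMillis);
    }

    boolean isCandidate(String id) { return services.containsKey(id); }

    boolean isBelievedAlive(String id) {
        Entry e = services.get(id);
        return e != null && e.alive;
    }
}

public class StaleCacheDemo {
    public static void main(String[] args) {
        StaleTolerantCache cache = new StaleTolerantCache(3_600_000); // 1h grace
        cache.serviceAdded("svc-1");
        cache.serviceRemoved("svc-1", 0);
        cache.expire(1_000);
        System.out.println(cache.isBelievedAlive("svc-1")); // false: prefer others
        System.out.println(cache.isCandidate("svc-1"));     // true: still try it
        cache.expire(4_000_000);
        System.out.println(cache.isCandidate("svc-1"));     // false: grace expired
    }
}
```

Clients would prefer the "believed alive" set and fall back to the remaining candidates, which gives you the behaviour you describe without touching Reggie.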
