What lease timeouts are people using with Reggie? My project currently uses a 10-minute timeout. We chose that value as a balance between 1) wanting to know quickly when a service crashes and 2) performance concerns with the Reggie implementation.
I've become dissatisfied with that compromise, however, particularly in cases where the service is actually live but the registrar has gone bad or a clock sync anomaly has occurred (both cause false negatives).

Ideally I would like to disentangle the notion of an expected service from the liveness of that service. That is, I would like to be able to query the registrar separately for all of the services that are supposed to be running and all of the services that are actually running right now.

Take for example a collection of redundant services intended to be used round-robin. I want clients to prefer the services known to be alive, to avoid TCP timeouts. But if the registrar thinks all of them are down, I still want clients to try to contact them, just in case the registrar is wrong. So I don't want the services to be removed from the LookupCache completely.

I've considered adding an Entry to the service's attributeSets that says whether the service is alive, and setting the registration lease duration to be very long. In that case, I would need to alter Reggie to set that attribute to "missing" when a service fails to check in before a liveness timeout, without actually cancelling the service lease. With an implementation like that, it would be trivial for me to pick out the live services with a simple ServiceItemFilter on the LookupCache.

Another idea is to implement this client side: use a short lease timeout with Reggie but add some longer-term caching to the LookupCache. In that case, a serviceRemoved() from a registrar would simply flag the ServiceItemReg as not alive. The service would not be removed from the LookupCache, however, until N hours after it was removed from the last registrar.

Has anybody else had similar thoughts? What compromises, extensions and/or architectures have you chosen as a result?

Chris
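
P.S. To make the client-side idea a bit more concrete, here is a rough sketch of the bookkeeping I have in mind. It deliberately does not touch any Jini classes: the class name LingeringServiceCache, its methods, and the use of a plain String as the service ID are all made up for illustration. In a real client I would expect markAlive() and markRemoved() to be driven from a ServiceDiscoveryListener's serviceAdded() and serviceRemoved() callbacks, but that wiring is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical cache that keeps "expected" services after the registrar drops them. */
class LingeringServiceCache {
    private static final class Entry {
        boolean alive;
        long removedAtMillis; // only meaningful while alive == false
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();
    private final long retentionMillis; // the "N hours" from the post, in milliseconds

    LingeringServiceCache(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Record that some registrar currently reports this service. */
    synchronized void markAlive(String serviceId) {
        entries.computeIfAbsent(serviceId, k -> new Entry()).alive = true;
    }

    /** Record that the last registrar dropped this service; keep it, flagged as not alive. */
    synchronized void markRemoved(String serviceId, long nowMillis) {
        Entry e = entries.get(serviceId);
        if (e != null && e.alive) {
            e.alive = false;
            e.removedAtMillis = nowMillis;
        }
    }

    /** Forget services that have been flagged not alive for at least the retention period. */
    synchronized void purge(long nowMillis) {
        entries.values().removeIf(
            e -> !e.alive && nowMillis - e.removedAtMillis >= retentionMillis);
    }

    /** Services to try, in order: those believed alive first, then the rest as a fallback. */
    synchronized List<String> candidates() {
        List<String> live = new ArrayList<>();
        List<String> fallback = new ArrayList<>();
        for (Map.Entry<String, Entry> me : entries.entrySet()) {
            (me.getValue().alive ? live : fallback).add(me.getKey());
        }
        live.addAll(fallback);
        return live;
    }
}
```

A round-robin client would walk candidates() in order, so it only pays the TCP timeout on a flagged service when nothing believed alive is left. Timestamps are passed in rather than read from the system clock so the retention logic can be exercised without waiting.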
