On 14 June 2012 15:37, Gregg Wonderly <[email protected]> wrote:

> If you use a smart proxy, put the lease renewal call inside the smart
> proxy, and register a listener, you can see the renewal fail.  But you
> still have to know what that means based on how the service and the lease
> interact.  To get a legitimate, two-way liveness test, you really have to
> have a conversation with the server from the client, and have a view of
> the endpoint activities on the server.
>
> There are lots of ways to engineer this, and leasing, transactions, or
> both can be part of the solution.  But in the end, you must decide what
> you need to know, and then think through what you are expecting vs what
> is actually achievable using the facilities you can deploy.
>
> Most of the time, the true test is simply to use the endpoint(s) end to
> end by making a call from the client to the service; that call is the
> liveness test.
>
> Given all the possible forms of partial failure that can occur in a
> distributed system, you can't rely on detached functionality, such as
> leases, as the "only" way to know that something is working on the other
> end.
>

Indeed, options are ultimately limited by the fact that one cannot tell the
difference between genuine machine failure and slowness due to excessive
load, packet loss, or network breakage (there is a proof of this; I think
it's due to Lynch, presumably the Fischer-Lynch-Paterson impossibility
result, but...).

One often tackles this sort of problem with a Failure Detector (
http://www.cs.cornell.edu/home/sam/FDpapers.html). Leases are somewhat
related in that they help form a view that something is wrong; what they
don't (and can't) tell you is _what_ is wrong. They essentially rely on a
form of active ping (the extension of the lease) to detect failure. Most
importantly, the lease forms a contract between client and server such that
_both_ can make an independent assumption about failure/loss after a period
of time.
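
For what it's worth, the net.jini.lease utilities make the active-ping half
of this easy to wire up. A minimal sketch, assuming "lease" came back from
some earlier registration with the service:

import net.jini.core.lease.Lease;
import net.jini.lease.LeaseListener;
import net.jini.lease.LeaseRenewalEvent;
import net.jini.lease.LeaseRenewalManager;

// lease is assumed to have come from registering with the service.
LeaseRenewalManager lrm = new LeaseRenewalManager();
lrm.renewUntil(lease, Lease.FOREVER, new LeaseListener() {
    public void notify(LeaseRenewalEvent e) {
        // The renewal failed before the desired expiration. All we
        // learn is that the contract lapsed -- not whether the cause
        // was a dead server, load, or a partition.
        System.err.println("lease lost: " + e.getException());
    }
});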

When one detects a failure, one can attempt to diagnose more accurately
what is broken, but it's tricky. Let's say we want to connect to a server
using a TCP-based protocol. When connecting, we can fail for several
reasons, including packet loss, excessive server load, or simply because
the server's pending-connection backlog is full. Deducing which of those
is the culprit is much more a debugging exercise than something one
attempts to deal with in the system code.
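
Even at the raw socket level, the best one can do is bound the wait and
then classify the exception, not the cause. A sketch (host and port made
up):

import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

Socket s = new Socket();
try {
    // Bound the wait ourselves rather than relying on the OS default.
    s.connect(new InetSocketAddress("some-host", 4160), 2000);
} catch (SocketTimeoutException e) {
    // No response within 2s: a dead host, packet loss, and a full
    // listen backlog all look identical from here.
} catch (ConnectException e) {
    // Active refusal (RST): the host is up but nothing is listening.
} catch (IOException e) {
    // Something else again: routing, DNS and friends.
} finally {
    try { s.close(); } catch (IOException ignored) {}
}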

To summarise:

(1) Build a model that can eventually deduce there has been a failure of
some sort.
(2) Build a recovery model that, given a failure, can restore whatever
state is required and continue to make progress.
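
As a toy illustration of (1) feeding (2), assuming a cheap remote ping()
on the service and a suspect() hook that kicks off recovery (both made up
here):

import java.rmi.RemoteException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final AtomicInteger missed = new AtomicInteger();
ScheduledExecutorService exec =
    Executors.newSingleThreadScheduledExecutor();
exec.scheduleAtFixedRate(new Runnable() {
    public void run() {
        try {
            service.ping();        // any cheap end-to-end call will do
            missed.set(0);
        } catch (RemoteException e) {
            if (missed.incrementAndGet() >= 3) {
                suspect(service);  // hand off to the recovery model (2)
            }
        }
    }
}, 0, 2, TimeUnit.SECONDS);

Three missed pings and a two-second period are pulled out of the air; the
right numbers depend entirely on how slow "slow" is allowed to be in your
deployment.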

In many cases, one can solve much of the problem by pushing state over to
the client (e.g. recent innovations in browsers), leaving the server
stateless, but there are applications where that isn't viable. In those
cases it gets harder, requiring disk replication and the like.

Certainly there is a generic model for notify as provided in the spec and
implemented in JavaSpaces, LUS etc. That may be useful. In the particular
case of this problem:

"At present moment, under Windows XP for both client and server, the
ConnectException takes exactly* 21 seconds* to be thrown. Do you know the
reason for this value?"

The implication is that client and server have previously had an
established TCP connection, and the failing notify() is attempting to open
a fresh one. I would suspect the Windows network stack's connect
behaviour: XP retransmits the initial SYN twice, doubling the timeout from
its default of 3 seconds on each attempt, so the connect is abandoned
after 3 + 6 + 12 = 21 seconds.

Also:

"System.setProperty("sun.rmi.transport.tcp.responseTimeout","2000");
System.setProperty("sun.rmi.transport.tcp.handshakeTimeout","2000");
System.setProperty("sun.rmi.transport.tcp.readTimeout","2000");
System.setProperty("sun.rmi.transport.connectionTimeout","2000");
System.setProperty("sun.rmi.transport.proxy.connectTimeout ","2000");
"

Have you configured a JRMP (i.e. native JDK RMI) transport, or are you
using JERI? I can't recall just how many of those settings are honoured by
JERI, so they may be having no effect at all. (Note too the stray trailing
space in "sun.rmi.transport.proxy.connectTimeout ", which would stop that
property matching anything even under JRMP.)
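
If the proxies are JERI proxies, the portable way to bound connection
establishment is via constraints rather than sun.rmi.* properties. A
sketch, assuming the listener proxy implements RemoteMethodControl
(proxies from BasicJeriExporter with BasicILFactory normally do):

import net.jini.constraint.BasicMethodConstraints;
import net.jini.core.constraint.ConnectionRelativeTime;
import net.jini.core.constraint.InvocationConstraints;
import net.jini.core.constraint.RemoteMethodControl;
import net.jini.core.event.RemoteEventListener;

if (listener instanceof RemoteMethodControl) {
    // Require connection establishment to complete within 2 seconds.
    InvocationConstraints c = new InvocationConstraints(
        new ConnectionRelativeTime(2000), null);
    listener = (RemoteEventListener) ((RemoteMethodControl) listener)
        .setConstraints(new BasicMethodConstraints(c));
}

After that, a notify() on the constrained proxy should fail fast when the
connection cannot be established in time, which is rather closer to what
the original question is after.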

Cheers,

Dan.

<snip>


>
> >>
> >> ________________________________________
> >> From: Gregg Wonderly [[email protected]]
> >> Sent: Wednesday, June 13, 2012 3:53 PM
> >> To: [email protected]
> >> Subject: Re: Client timeouts and remote calls
> >>
> >> There are timeouts that you can change in your Configuration to
> >> control how long the waits last.  If it's important that everyone
> >> agree on the values being changed, you could include the use of a
> >> transaction, so that if one client dies in the middle everyone can
> >> revert, and you can retry to get things to a sane state.
> >>
> >> This is important if the data the clients receive controls how they
> >> interact with the service.  But you can otherwise just do what you are
> >> doing, without a transaction.  If you turn on DGC, or use a Lease on
> >> the client-received endpoint, then you might be able to know that a
> >> client is actually gone, rather than just temporarily unreachable.
> >>
> >> Gregg
> >>
> >> On Jun 13, 2012, at 2:13 PM, Sergio Aguilera Cazorla wrote:
> >>
> >>> Hello,
> >>>
> >>> I have a question regarding client-side timeouts in Jini / Apache
> >>> River. I am finishing a program where a certain number of clients can
> >>> obtain a proxy and set/get properties (values) from an exported class
> >>> on a server. Each client becomes a RemoteEventListener of the server,
> >>> so each time a property is changed, the server calls notify() on ALL
> >>> clients to make them aware that a property has changed (and all
> >>> clients update their data).
> >>>
> >>> This architecture performs great if client programs finish in a
> >>> "graceful" way, because I have a register/unregister mechanism that
> >>> lets the server keep an updated list of "alive" clients. However, if
> >>> client machines "die suddenly", the server will be unaware and will
> >>> try to call notify() the next time that call is needed. Example
> >>> (setSomething is a remote method on the server):
> >>>
> >>> public void setSomething(String param) {
> >>>     <do the Stuff>
> >>>     RemoteEvent ev = <proper RemoteEvent object>;
> >>>     // Iterate via an explicit Iterator so a dead listener can be
> >>>     // removed without a ConcurrentModificationException.
> >>>     for (Iterator<RemoteEventListener> it = listeners.iterator();
> >>>             it.hasNext(); ) {
> >>>         RemoteEventListener l = it.next();
> >>>         try {
> >>>             l.notify(ev);
> >>>         } catch (Exception e) {
> >>>             it.remove();
> >>>         }
> >>>     }
> >>> }
> >>>
> >>> I'm sure you see where I want to go: if some clients in the list died
> >>> suddenly, notify() will still be called on them. A ConnectException
> >>> is thrown and the client is removed properly, but... it takes a long
> >>> time for the exception to be thrown! Do you know how to control this
> >>> situation?
> >>>
> >>> Thanks in advance!
> >>>
> >>> ADDITIONAL DATA:
> >>> I have tried setting the following RMI system properties, and they
> >>> didn't work:
> >>> System.setProperty("sun.rmi.transport.tcp.responseTimeout","2000");
> >>> System.setProperty("sun.rmi.transport.tcp.handshakeTimeout","2000");
> >>> System.setProperty("sun.rmi.transport.tcp.readTimeout","2000");
> >>> System.setProperty("sun.rmi.transport.connectionTimeout","2000");
> >>> System.setProperty("sun.rmi.transport.proxy.connectTimeout ","2000");
> >>>
> >>> At the present moment, under Windows XP for both client and server,
> >>> the ConnectException takes exactly *21 seconds* to be thrown. Do you
> >>> know the reason for this value?
> >>>
> >>> --
> >>> *Sergio Aguilera*
> >>
> >
>
>
