Re: API and terms: idle-time-out and heartbeat intervals.

Robbie Gemmell Wed, 28 Sep 2016 14:33:47 -0700

The spec unfortunately says what it says, and even if there were scope
to update it, I'm not sure there is a way to change it to address the
main issue (which regardless of any other clarity issues, is I think
just that it only says you SHOULD advertise half your 'actual
timeout') that wouldn't result in an 'incompatible change' of
behaviour.


Knowing the exact enforced timeout could certainly be clearer in some
ways, but if the period you had to send after wasnt defined (e.g half)
then it would essentially leave you with the same problem as now:
deciding exactly how early you should send heartbeat frames to avoid a
[more-precisely-known] timeout.

With what the spec says we effectively have a lower-bound on the
actual-timeout (if cases the peer didnt half their actual-timeout
before advertising a number) and an upper bound (if they did) due to
the use of SHOULD as opposed to MUST or some other more fixed
definition of behaviour. So we currently try to send frames often
enough to satisfy that lower bound, i.e the number we received. That
essentially reduces it to the the same problem as if we were trying to
satisfy a theoretical exactly-known actual-timeout (just using a
smaller number that we dont want to use, because we followed the specs
recommendation). Currently we do this by sending frames after half the
advertised period, i.e. what we know is actually a quarter of the
actual-timeout enforced by peers such as proton which are following
the specs recommendations. We could choose to do it less often than
that by reducing how conservative it is, e.g use 75+% of advertised
value instead for example, which should have no impact in the cases
where peers had followed the recommendation (as we would still be
sending more than twice as quickly as needed to satisfy their
actual-timeout), but would put us closer to spurious timeout ( if e.g
there was a delay in delivering/processing the heartbeat) against any
peers that didn't follow the recommendation.

On 28 September 2016 at 21:26, Justin Ross <[email protected]> wrote:
> IMO, the overall picture is simpler, and easier to explain to third
> parties, if we go the way Ken suggested.  When a remote peer sends you an
> idle timeout value, it is an expression of an (actual, not simply
> "advertised") guarantee - "I will expire the connection after X time
> without receiving a frame from you".
>
> We could also legitimately go the direction you suggest.  *But* its name is
> "idle timeout".  We can't easily change the name.  I think we should take
> the spec text that goes with the name, and the behavior of our components,
> firmly in the direction Ken suggests.
>
> Off topic: why is this on the dev list?
>
>
> On Wed, Sep 28, 2016 at 12:36 PM, Alan Conway <[email protected]> wrote:
>
>> On Wed, 2016-09-28 at 10:13 -0400, Ken Giusti wrote:
>> > I've had a hand in the way Proton/C interprets the meaning of 'idle-
>> > timeout' and I've never liked the solution.  I think Proton/C's
>> > behavior is not 'pessimistic' as much as it is 'conservative' for the
>> > sake of interoperability.  This, unfortunately ends up with a
>> > needless idle frame chattiness when both ends are Proton-based.
>> >
>> > ----- Original Message -----
>> > >
>> > > From: "Rob Godfrey" <[email protected]>
>> > > To: "qpid" <[email protected]>
>> > > Sent: Wednesday, September 28, 2016 6:19:05 AM
>> > > Subject: Re: API and terms: idle-time-out and heartbeat intervals.
>> > >
>> > > I agree that specifying that the communicated figure should be
>> > > "half"
>> > > the "actual" timeout was a mistake.
>> > >
>> > > What the spec should have tried to communicate is that the sender
>> > > should communicate a value somewhat less than the period it uses to
>> > > determine that the connection has actually timed-out to allow for
>> > > the
>> > > receiver to process and emit a heartbeat frame.
>> >
>> >
>> > Wouldn't it be much clearer to simply send the _actual_ idle timeout
>> > value?
>>
>> My read is that is exactly what it does: It sends the max time that the
>> *sender* of frames may be idle. The receiver of frames SHOULD be more
>> patient than that. The wording of the "discussion" around it and the
>> choice of terms is a bit cloudy but, the text that describes idle-time-
>> out seems clear enough: it is the max interval between sending frames.
>> The frame receiver SHOULD wait longer that that before closing, and 2x
>> seems a reasonable suggestion, but that's for the impl to decide.
>>
>> It's weird that it says "idle-time-out should be half the threshold"
>> instead of "the threshold should be twice the idle-time-out" but it's
>> logically equivalent.
>>
>>
>> > Having the spec suggest "communicating a value *somewhat less*"
>>
>> The wording is odd but the semantics are you communicate *exactly* the
>> max frame delay you want and then you SHOULD set your connection close
>> threshold to something bigger. The other end doesn't need to know how
>> much bigger, they just need to know what rate to send frames.
>>
>> > [emphasis mine] leaves the implementation open for interpretation -
>> > which is exactly how we got into this mess in the first
>> > place.  Developers are a smart bunch - they know that keep alive
>> > traffic will have to be sent frequently enough to prevent idle
>> > timeout.
>> >
>> >
>> > >
>> > >  Similarly the sender
>> > > should ensure that a frame has been emitted well within the timeout
>> > > period to allow for any communication / processing delay.
>> >
>> > Agreed - perfectly acceptable for the spec to point this out.
>> >
>> > >
>> > >  In practice
>> > > these "wiggle room" factors should not be determined by the
>> > > application level timeout setting but by sensible calculations on
>> > > transport delay variance / processing time, etc...  these
>> > > calculation
>> > > may differ between different use-cases / environments (for example
>> > > in
>> > > a low latency / real-time environment you may be able to make hard
>> > > guarantees about the number of milliseconds that communication /
>> > > processing delay will take... on the other hand if you are using an
>> > > interpreted language with stop-the-world garbage collection you may
>> > > not be able to say much better than the delay should be less than
>> > > 30s
>> > > or whatever).
>> > >
>> >
>> > Yes - very important things to keep in mind when implementing
>> > this.  But the spec shouldn't be making these suggestions for
>> > different implementation options. The spec should be as concise as
>> > possible about the mandated behavior, and leave the implementation to
>> > the developers.
>> >
>> > >
>> > > I think application level APIs should be in terms of the timeouts
>> > > that
>> > > will affect the application.  The AMQP library should be massaging
>> > > those numbers in such a way that they can fulfil the application
>> > > requirements.
>> > >
>> >
>> > Agreed.  Now, is there _any_ way we can suggest an update to the
>> > spec?  Perhaps an errata, etc?
>> >
>> > >
>> > > -- Rob
>> > >
>> > > On 28 September 2016 at 10:42, Robbie Gemmell <robbie.gemmell@gmail
>> > > .com>
>> > > wrote:
>> > > >
>> > > > On 27 September 2016 at 22:24, Alan Conway <[email protected]>
>> > > > wrote:
>> > > > >
>> > > > > On Tue, 2016-09-27 at 15:37 -0400, Alan Conway wrote:
>> > > > > >
>> > > > > > I want to clarify and document the meaning of these terms for
>> > > > > > our
>> > > > > > APIs,
>> > > > > > presently I can't find anywhere where they are documented
>> > > > > > clearly.
>> > > > > >
>> > > > > > The AMQP spec says: "Each peer has its own (independent) idle
>> > > > > > timeout.
>> > > > > > At connection open each peer communicates the maximum
>> > > > > > period between activity (frames) on the connection that it
>> > > > > > desires
>> > > > > > from
>> > > > > > its partner.The open frame carries the idletime-out
>> > > > > > field for this purpose. To avoid spurious timeouts, the value
>> > > > > > in
>> > > > > > idle-
>> > > > > > time-out SHOULD be half the peer’s
>> > > > > > actual timeout threshold."
>> > > > > >
>> > > > > > In other words: if I send you an "open" frame with idle-time-
>> > > > > > out=N
>> > > > > > that
>> > > > > > means *you* should not wait for longer than N milliseconds to
>> > > > > > send a
>> > > > > > frame to me. It does not mean *I* will close the connection
>> > > > > > after N
>> > > > > > milliseconds, I SHOULD be more patient and wait for N*2 ms to
>> > > > > > avoid
>> > > > > > closing prematurely due to minor timing wobbles.
>> > > > > >
>> > > > > > I think the choice of name is slightly ambiguous but the spec
>> > > > > > is
>> > > > > > clear
>> > > > > > on the semantics, so it's important to document it to remove
>> > > > > > the
>> > > > > > ambiguity.
>> > > > > >
>> > > > > > Anybody disagree?
>> > > > > >
>> > > > >
>> > > > > Sigh. Sadly proton-C interprets "idle-timeout" differently
>> > > > > depending on
>> > > > > which end of the connection you are on:
>> > > > >
>> > > > >       // as per the recommendation in the spec, advertise half
>> > > > > our
>> > > > >       // actual timeout to the remote
>> > > > >       const pn_millis_t idle_timeout = transport-
>> > > > > >local_idle_timeout
>> > > > >           ? (transport->local_idle_timeout/2)
>> > > > >           : 0;
>> > > > >
>> > > > > So in proton, pn_set_idle_timeout does NOT mean set the AMQP
>> > > > > idle-
>> > > > > timeout value, it means set the local "receive timeout" value
>> > > > > and send
>> > > > > half that as the AMQP "send timeout" for the peer.
>> > > > >
>> > > > > I'm tempted to use a new term in the Go API: "heartbeat". To me
>> > > > > that
>> > > > > clearly means the "send timeout" (hearts beat, they don't
>> > > > > listen for
>> > > > > beats) so it coincides with the meaning of the AMQP "idle-
>> > > > > timeout", but
>> > > > > without the ambiguity that is exacerbated by proton
>> > > > > interpreting it
>> > > > > both ways.
>> > > > >
>> > > > >
>> > > >
>> > > > Proton may seem to behave differently on each end, but I don't
>> > > > think
>> > > > its necessarily a bad thing that it does, and it is also I think
>> > > > largely just reflecting an annoying bit in the spec around this
>> > > > where
>> > > > different behaviours are allowed for, whereas it would be easier
>> > > > if it
>> > > > had less wiggle room.
>> > > >
>> > > > The transport setter/getter for the local timeout takes the
>> > > > 'actual
>> > > > timeout' and then sends half of it as the advertised value in the
>> > > > Open
>> > > > sent. This makes a certain amount of sense since it ensures that
>> > > > appropriate behaviour is actually satisfied, rather than
>> > > > expecting the
>> > > > user to ensure they only give half the value they really want for
>> > > > their actual timeout. The getter for the remote timeout value on
>> > > > the
>> > > > other hand returns the advertised value from the Open that is
>> > > > received. I expect it does that since it cant actually ever
>> > > > return the
>> > > > remotes 'actual timeout' without making an assumption, i.e that
>> > > > they
>> > > > did in fact advertise half (or less) of their actual timeout,
>> > > > which
>> > > > the spec only says that they SHOULD do.
>> > > >
>> > > > Yes the local setter taking the advertised value may have been
>> > > > better
>> > > > for method consistency with the remote getter. On the other hand,
>> > > > sending of necessary heartbeats is handled directly by the
>> > > > transport
>> > > > during the tick process, so users may not necessarily even use
>> > > > the
>> > > > getter themselves, and proton uses that remote value internally
>> > > > by
>> > > > pessimistically halfing it to account for the case that folks on
>> > > > the
>> > > > other end did not advertise half their actual timeout (since the
>> > > > spec
>> > > > doesnt require that they do). Side note: proton could arguably be
>> > > > less
>> > > > pessimistic here and go for say a percentage much nearer the full
>> > > > advertised value, but then you'd probably need to start guaging
>> > > > how
>> > > > close is too close.
>> > > >
>> > > > I think ensuring the doccumentation on the methods is clear what
>> > > > they
>> > > > do is sufficient enough here. I actually prefer idle-timeout as
>> > > > an
>> > > > name rather than heartbeat due to the way this all works. Since
>> > > > you
>> > > > only tell the other side [half] your timeout, you dont actually
>> > > > have
>> > > > direct control over when they send any needed empty frames to
>> > > > satisfy
>> > > > it (as the above shows, we might send them more often than they
>> > > > require) and 'heartbeat' might seem to imply that you do, and
>> > > > possibly
>> > > > even that they need be sent at that period all the time even
>> > > > despite
>> > > > regular traffic, which is not the case.
>> > > >
>> > > > Robbie
>> > > >
>> > > > ---------------------------------------------------------------
>> > > > ------
>> > > > To unsubscribe, e-mail: [email protected]
>> > > > For additional commands, e-mail: [email protected]
>> > > >
>> > >
>> > > -----------------------------------------------------------------
>> > > ----
>> > > To unsubscribe, e-mail: [email protected]
>> > > For additional commands, e-mail: [email protected]
>> > >
>> > >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: API and terms: idle-time-out and heartbeat intervals.

Reply via email to