Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-08 Thread Thilo Molitor
Georg, you seem to forget the push that is involved.

I think most of the problems you describe would go away if the client sent
a (XEP-0198) ping upon receiving a push and, if that ping does not succeed,
decided that the connection is dead.
That way it won't stick to an old half-dead connection, but will still use the
old connection if it remained usable.

No second stream needed in this scenario.
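
A minimal sketch of that client-side check, in Python and under stated
assumptions: `send_ack_request` and `reconnect_and_resume` are hypothetical
callbacks (not taken from any real XMPP library), and the 10-second timeout is
an arbitrary illustrative value. The first callback is assumed to send an
XEP-0198 <r/> and return once the matching <a/> has arrived.

```python
import asyncio
from typing import Awaitable, Callable


async def handle_push(
    send_ack_request: Callable[[], Awaitable[None]],
    reconnect_and_resume: Callable[[], Awaitable[None]],
    ack_timeout: float = 10.0,
) -> str:
    """On a push wake-up, probe the old stream before giving up on it."""
    try:
        # Hypothetical helper: sends <r/> and completes when <a/> arrives.
        await asyncio.wait_for(send_ack_request(), timeout=ack_timeout)
    except (asyncio.TimeoutError, ConnectionError):
        # No answer in time (or the socket failed): treat the old
        # connection as dead, open a new one and 0198-resume there.
        await reconnect_and_resume()
        return "resumed"
    # The old connection answered, so it is still usable: keep it.
    return "kept"
```

The point of the sketch is only the ordering: the old connection is tried
first, and a new one is opened only after the probe has failed.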

- tmolitor


On Friday, 6 November 2020, 18:41:15 CET, Georg Lukas wrote:
> Hi Dave,
> 
> * Dave Cridland [2020-11-04 12:48]:
> > TL;DR: When the session has a ping timeout, do push notifications, but
> > otherwise leave it open - mobile clients will often recover after several
> > minutes have passed.
> 
> That's a great analysis and I haven't considered this situation yet, but
> your proposal sounds very logical to me.
> 
> I see two potential issues:
> 
> the client needs to mirror this logic as well, and stick to an old TCP
> session for a much longer time. I fear there will be some pathological
> situations where it will render the client effectively disconnected,
> e.g. when the old connection gets blackholed, but a new one could be
> established any time.
> 
> Furthermore, some long time ago I've seen situations where a TCP
> connection had a very hard time recovering from intermittent packet
> loss, where the connection's latency remained very high and throughput
> very low. Did you experience this effect as well, or is this just a
> faint memory from times of much worse congestion control algorithms?
> 
> To solve either problem, I can imagine that a client *could* open a
> second stream to the server in parallel to the existing one, and if the
> second stream completes authentication, then just 0198-resume there.
> 
> But this is also going to increase medium contention, and it's even more
> sockets for the client to juggle around.
> 
> 
> Georg


___
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
___


Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-06 Thread Dave Cridland
On Fri, 6 Nov 2020 at 17:41, Georg Lukas wrote:

> Hi Dave,
>
> * Dave Cridland [2020-11-04 12:48]:
> > TL;DR: When the session has a ping timeout, do push notifications, but
> > otherwise leave it open - mobile clients will often recover after several
> > minutes have passed.
>
> That's a great analysis and I haven't considered this situation yet, but
> your proposal sounds very logical to me.
>
> I see two potential issues:
>
> the client needs to mirror this logic as well, and stick to an old TCP
> session for a much longer time. I fear there will be some pathological
> situations where it will render the client effectively disconnected,
> e.g. when the old connection gets blackholed, but a new one could be
> established any time.
>
>
I'm not sure that's true - primarily because a client is free to decide to
try a new connection at any time, an option the server does not have
available.

The client also is more likely to know what's happening at the physical
layer (or what passes for the physical layer these days).

We've broadly not seen problems with existing sessions on the same medium
(ie, without a Wifi/4G change) where a new session would solve the problem.


> Furthermore, some long time ago I've seen situations where a TCP
> connection had a very hard time recovering from intermittent packet
> loss, where the connection's latency remained very high and throughput
> very low. Did you experience this effect as well, or is this just a
> faint memory from times of much worse congestion control algorithms?
>
> To solve either problem, I can imagine that a client *could* open a
> second stream to the server in parallel to the existing one, and if the
> second stream completes authentication, then just 0198-resume there.
>
> But this is also going to increase medium contention, and it's even more
> sockets for the client to juggle around.


I think if the client has got as far as deciding the TCP socket is
unrecoverable at its end, it can just spin up a new session.

If the existing connection is potentially recoverable but cannot currently
be pinged, the client will not be able to establish a new one either, and
vice versa.

Dave.


Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-06 Thread Georg Lukas
Hi Dave,

* Dave Cridland [2020-11-04 12:48]:
> TL;DR: When the session has a ping timeout, do push notifications, but
> otherwise leave it open - mobile clients will often recover after several
> minutes have passed.

That's a great analysis and I haven't considered this situation yet, but
your proposal sounds very logical to me.

I see two potential issues:

the client needs to mirror this logic as well, and stick to an old TCP
session for a much longer time. I fear there will be some pathological
situations where it will render the client effectively disconnected,
e.g. when the old connection gets blackholed, but a new one could be
established any time.

Furthermore, some long time ago I've seen situations where a TCP
connection had a very hard time recovering from intermittent packet
loss, where the connection's latency remained very high and throughput
very low. Did you experience this effect as well, or is this just a
faint memory from times of much worse congestion control algorithms?

To solve either problem, I can imagine that a client *could* open a
second stream to the server in parallel to the existing one, and if the
second stream completes authentication, then just 0198-resume there.

But this is also going to increase medium contention, and it's even more
sockets for the client to juggle around.


Georg




Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-05 Thread Holger Weiß
* Thilo Molitor [2020-11-05 10:11]:
> > Our proposal is that when a session is found to be unresponsive, the server
> > starts sending push notifications for unacknowledged (and future) messages,
> > but otherwise leaves the session live when resumable. Only after a
> > significantly longer timeout should the TCP session be terminated (and at
> > that point destroy the session entirely).
>
> Like Marvin explained, prosody already does something like this.
> The default setting for `smacks_max_ack_delay` is 30 seconds [1].
> A push will be generated if an ack is pending for more than 30 seconds and
> the outgoing XEP-0198 queue is not empty.
> Every new stanza added to the queue after the timeout has expired will
> generate an additional push.
> The standard TCP timeout of prosody is usually much higher --> standard
> prosody seems to already follow your suggestions pretty well :)
> 
> Not sure how ejabberd handles this, though.

The TCP connection is explicitly closed if an ACK request times out.
Initially, ejabberd didn't do that.  We ran into issues with keeping the
connection open in this state, but that was 5 years ago and I can hardly
remember the details of issues I ran into 5 weeks ago, sigh ...

Holger


Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-05 Thread Thilo Molitor
Hi Dave.

> Our proposal is that when a session is found to be unresponsive, the server
> starts sending push notifications for unacknowledged (and future) messages,
> but otherwise leaves the session live when resumable. Only after a
> significantly longer timeout should the TCP session be terminated (and at
> that point destroy the session entirely).
Like Marvin explained, prosody already does something like this.
The default setting for `smacks_max_ack_delay` is 30 seconds [1].
A push will be generated if an ack is pending for more than 30 seconds and the 
outgoing XEP-0198 queue is not empty.
Every new stanza added to the queue after the timeout has expired will generate
an additional push.
The standard TCP timeout of prosody is usually much higher --> standard 
prosody seems to already follow your suggestions pretty well :)
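
The behaviour described above can be sketched roughly as follows. This is a toy
Python model and not Prosody's actual implementation: the `OutgoingQueue` class,
its method names and the `now` timestamps are all assumptions made for
illustration; only the 30-second default comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_ACK_DELAY = 30.0  # seconds; mirrors the default quoted above


@dataclass
class OutgoingQueue:
    """Toy model of one session's unacknowledged outgoing stanzas."""

    unacked: List[str] = field(default_factory=list)
    waiting_since: Optional[float] = None  # when the oldest unacked stanza was queued
    delayed: bool = False                  # True once the ack delay has expired
    pushes_sent: int = 0

    def enqueue(self, stanza: str, now: float) -> None:
        self.unacked.append(stanza)
        if self.waiting_since is None:
            self.waiting_since = now
        if self.delayed:
            # Ack already overdue: every further stanza triggers another push.
            self.pushes_sent += 1

    def tick(self, now: float) -> None:
        """Periodic check; fires the first push once the ack is overdue."""
        if self.delayed or not self.unacked or self.waiting_since is None:
            return
        if now - self.waiting_since > MAX_ACK_DELAY:
            self.delayed = True
            self.pushes_sent += 1

    def ack_all(self) -> None:
        """Simplification: a real <a/> carries a count and may ack only part of the queue."""
        self.unacked.clear()
        self.waiting_since = None
        self.delayed = False
```

Note that nothing in this model closes the connection; the TCP timeout is a
separate, longer timer, which is what makes the two-stage behaviour possible.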

Not sure how ejabberd handles this, though.

- tmolitor

[1] https://modules.prosody.im/mod_smacks.html


On Wednesday, 4 November 2020, 14:17:00 CET, Marvin W wrote:
> Hi Dave,
> 
> Thanks for your message. From my experience with mobile phone networks
> when traveling in Germany (not sure if it applies in other countries, as
> German mobile networks are far below average in my experience), I can
> confirm that temporary connectivity loss is not handled perfectly well
> in some scenarios (although I am not sure if this is a server or client
> side issue).
> 
> On 04.11.20 12:46, Dave Cridland wrote:
> > Our proposal is that when a session is found to be unresponsive, the
> > server starts sending push notifications for unacknowledged (and
> > future) messages, but otherwise leaves the session live when
> > resumable. Only after a significantly longer timeout should the TCP
> > session be terminated (and at that point destroy the session
> > entirely).
> 
> FWIW, this is within the bounds of the current specification. XEP-0357
> leaves it completely open which events warrant a push notification.
> 
> The prosody mod_cloud_notify already is supposed to send push
> notifications when the session is still considered "live", but the
> outgoing message was not ack-ed using XEP-0198 for a certain time (can
> be configured). When using this together with an increased connection
> timeout (which is configurable in prosody's advanced network
> configuration as per ) you should
> be able to realize something that is pretty close to your suggestion.
> 
> Marvin




Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-04 Thread Ruslan N. Marchenko
On Wednesday, 04.11.2020, at 11:46 +, Dave Cridland wrote:
> 
> Due to network analysis (and "thanks" to a bug in the server which
> caused some useful logging), we were able to examine not only when
> sessions went into the unresponsive state, but also when the client
> subsequently sent traffic on that session. This often happened well
> after the session had fallen into the resumable state - this resulted
> in an error, as the session had been closed.
> 
> Having seen the result of this in the logging of the server, we
> followed up by looking for the same logging output on the production
> system, where the majority of users are using WiFi or 4G within
> hospitals. Coverage is often poor, and the WiFi overused, so
> clinicians often operate on a weak 4G signal, or highly contended
> WiFi. Think FOSDEM.
> 
> Again, we observed clients recovering sometimes well after the ping
> timeout had triggered. Had these clients been able to, they could
> have continued to use the same TCP session without any disruption
> (or, for that matter, any additional RTTs re-establishing).
> 
> The usual approach here seems to be to increase the timeout required
> to move a session from "live" to "unresponsive" when pinged. However,
> this has the effect of delaying push notifications while the session
> is, in effect, in limbo.
> 
> Our proposal is that when a session is found to be unresponsive, the
> server starts sending push notifications for unacknowledged (and
> future) messages, but otherwise leaves the session live when
> resumable. Only after a significantly longer timeout should the TCP
> session be terminated (and at that point destroy the session
> entirely).
> 

Matches my observations [1] as well. If the session is not too active,
TCP recovery is instant: all the snd/rcv buffers are flushed, then the
queues are flushed, and everything carries on as if nothing happened.

> This means that a client recovering network after several minutes
> will find the connection still live (in effect), whereas if it never
> recovers, it will still get the push notifications in a timely
> manner.
> 
> There are likely to be downsides with this approach; particularly
> presence state will be badly affected. PSA could help here. Overall,
> though, we believe that this will substantially improve the effective
> performance of C2S over high latency, high contention links.

I'm leaning towards ignoring the timers altogether and only caring about
how this affects UX. If TCP is still holding up, let it be; if it got
EOF/EOS/Timeout (from whichever side), let's just reconnect and resume -
we're reconnecting continuously anyway.

[1] https://github.com/TelepathyIM/wocky/issues/14#issuecomment-720091807


Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-04 Thread Marvin W
Hi Dave,

Thanks for your message. From my experience with mobile phone networks
when traveling in Germany (not sure if it applies in other countries, as
German mobile networks are far below average in my experience), I can
confirm that temporary connectivity loss is not handled perfectly well
in some scenarios (although I am not sure if this is a server or client
side issue).

On 04.11.20 12:46, Dave Cridland wrote:
> Our proposal is that when a session is found to be unresponsive, the
> server starts sending push notifications for unacknowledged (and
> future) messages, but otherwise leaves the session live when
> resumable. Only after a significantly longer timeout should the TCP
> session be terminated (and at that point destroy the session
> entirely).

FWIW, this is within the bounds of the current specification. XEP-0357
leaves it completely open which events warrant a push notification.

The prosody mod_cloud_notify already is supposed to send push
notifications when the session is still considered "live", but the
outgoing message was not ack-ed using XEP-0198 for a certain time (can
be configured). When using this together with an increased connection
timeout (which is configurable in prosody's advanced network
configuration as per ) you should
be able to realize something that is pretty close to your suggestion.

Marvin


Re: [Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-04 Thread Guus der Kinderen
Hi Dave,

Thanks for sharing this. To verify that I got it right, can I dumb your
suggestions down by summarizing them as:

   - Increase the timeout after which a connection is considered
   unrecoverably dead (to ... how many minutes?)
   - After a period of inactivity that's a lot shorter than the timeout
   mentioned above (presumably around the existing timeout value) start
   generating push notifications

Regards,

  Guus

On Wed, 4 Nov 2020 at 12:48, Dave Cridland wrote:

> Hey all,
>
> We (that is, myself and others from Forward Clinical Ltd, my employer)
> have been doing some extensive work to support high latency networks such
> as Satellite Links, in relation to our work with UK Defence Medical
> Services. Our "long thin" links cover the C2S link.
>
> We believe these findings are more generally useful than just SATCOM - in
> particular, we think these will help with the adverse network conditions
> found in hospitals (where people keep putting in lifts and lots of cables,
> giving lots of blackspots), and general applicability with mobile use of
> XMPP.
>
> TL;DR: When the session has a ping timeout, do push notifications, but
> otherwise leave it open - mobile clients will often recover after several
> minutes have passed.
>
> We assume that established sessions may be in several connectivity states
> from the point of view of the server, typically:
>
> "Live" - a session is genuinely live and can be used for communication.
> "Unresponsive" - the session has a TCP connection associated with it, but
> it is unresponsive to pings etc.
> "Resumable" - the session has no TCP session, but 198 resumption was
> negotiated and the session remains available.
>
> We expect that the majority of servers will immediately move a session
> detected as unresponsive into the resumable state by closing the TCP
> session, and starting a (relatively short) timeout.
>
> In the process of doing so, unacknowledged stanzas will be processed for
> push notifications etc as needed, and errors will be sent as appropriate.
>
> Due to network analysis (and "thanks" to a bug in the server which caused
> some useful logging), we were able to examine not only when sessions went
> into the unresponsive state, but also when the client subsequently sent
> traffic on that session. This often happened well after the session had
> fallen into the resumable state - this resulted in an error, as the session
> had been closed.
>
> Having seen the result of this in the logging of the server, we followed
> up by looking for the same logging output on the production system, where
> the majority of users are using WiFi or 4G within hospitals. Coverage is
> often poor, and the WiFi overused, so clinicians often operate on a weak 4G
> signal, or highly contended WiFi. Think FOSDEM.
>
> Again, we observed clients recovering sometimes well after the ping
> timeout had triggered. Had these clients been able to, they could have
> continued to use the same TCP session without any disruption (or, for that
> matter, any additional RTTs re-establishing).
>
> The usual approach here seems to be to increase the timeout required to
> move a session from "live" to "unresponsive" when pinged. However, this has
> the effect of delaying push notifications while the session is, in effect,
> in limbo.
>
> Our proposal is that when a session is found to be unresponsive, the
> server starts sending push notifications for unacknowledged (and future)
> messages, but otherwise leaves the session live when resumable. Only after
> a significantly longer timeout should the TCP session be terminated (and at
> that point destroy the session entirely).
>
> This means that a client recovering network after several minutes will
> find the connection still live (in effect), whereas if it never recovers,
> it will still get the push notifications in a timely manner.
>
> There are likely to be downsides with this approach; particularly presence
> state will be badly affected. PSA could help here. Overall, though, we
> believe that this will substantially improve the effective performance of
> C2S over high latency, high contention links.
>
> I hope this is useful!
>
> Dave.


[Standards] Use of XEP-0198 resumption under adverse network conditions

2020-11-04 Thread Dave Cridland
Hey all,

We (that is, myself and others from Forward Clinical Ltd, my employer) have
been doing some extensive work to support high latency networks such as
Satellite Links, in relation to our work with UK Defence Medical Services.
Our "long thin" links cover the C2S link.

We believe these findings are more generally useful than just SATCOM - in
particular, we think these will help with the adverse network conditions
found in hospitals (where people keep putting in lifts and lots of cables,
giving lots of blackspots), and general applicability with mobile use of
XMPP.

TL;DR: When the session has a ping timeout, do push notifications, but
otherwise leave it open - mobile clients will often recover after several
minutes have passed.

We assume that established sessions may be in several connectivity states
from the point of view of the server, typically:

"Live" - a session is genuinely live and can be used for communication.
"Unresponsive" - the session has a TCP connection associated with it, but
it is unresponsive to pings etc.
"Resumable" - the session has no TCP session, but 198 resumption was
negotiated and the session remains available.

We expect that the majority of servers will immediately move a session
detected as unresponsive into the resumable state by closing the TCP
session, and starting a (relatively short) timeout.

In the process of doing so, unacknowledged stanzas will be processed for
push notifications etc as needed, and errors will be sent as appropriate.

Due to network analysis (and "thanks" to a bug in the server which caused
some useful logging), we were able to examine not only when sessions went
into the unresponsive state, but also when the client subsequently sent
traffic on that session. This often happened well after the session had
fallen into the resumable state - this resulted in an error, as the session
had been closed.

Having seen the result of this in the logging of the server, we followed up
by looking for the same logging output on the production system, where the
majority of users are using WiFi or 4G within hospitals. Coverage is often
poor, and the WiFi overused, so clinicians often operate on a weak 4G
signal, or highly contended WiFi. Think FOSDEM.

Again, we observed clients recovering sometimes well after the ping timeout
had triggered. Had these clients been able to, they could have continued to
use the same TCP session without any disruption (or, for that matter, any
additional RTTs re-establishing).

The usual approach here seems to be to increase the timeout required to
move a session from "live" to "unresponsive" when pinged. However, this has
the effect of delaying push notifications while the session is, in effect,
in limbo.

Our proposal is that when a session is found to be unresponsive, the server
starts sending push notifications for unacknowledged (and future) messages,
but otherwise leaves the session live when resumable. Only after a
significantly longer timeout should the TCP session be terminated (and at
that point destroy the session entirely).

This means that a client recovering network after several minutes will find
the connection still live (in effect), whereas if it never recovers, it
will still get the push notifications in a timely manner.
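
As a rough illustration of the proposal, here is a small Python sketch of the
server-side session lifecycle with two separate timers. It is a sketch under
assumptions, not any server's implementation: the `Session` class, the method
names and both timeout values (30 seconds and 10 minutes) are made up for the
example, and a real server would also have to handle actual 0198 resumption
and presence.

```python
from enum import Enum


class State(Enum):
    LIVE = "live"
    UNRESPONSIVE = "unresponsive"  # TCP kept open, pushes being sent
    DESTROYED = "destroyed"        # TCP closed, session gone


PING_TIMEOUT = 30.0       # assumed: silence before pushes start
TEARDOWN_TIMEOUT = 600.0  # assumed: the "significantly longer" timeout


class Session:
    def __init__(self, now: float) -> None:
        self.state = State.LIVE
        self.last_heard = now  # last time anything arrived from the client

    @property
    def push_active(self) -> bool:
        # Pushes go out only while the session is in limbo; once it is
        # destroyed, delivery is no longer this session's concern.
        return self.state is State.UNRESPONSIVE

    def on_client_traffic(self, now: float) -> None:
        """Any data from the client on the old TCP session revives it."""
        if self.state is State.DESTROYED:
            raise RuntimeError("session no longer exists")
        self.last_heard = now
        self.state = State.LIVE

    def on_timer(self, now: float) -> None:
        """Periodic check driven by the server's clock."""
        if self.state is State.DESTROYED:
            return
        silent_for = now - self.last_heard
        if silent_for >= TEARDOWN_TIMEOUT:
            self.state = State.DESTROYED
        elif silent_for >= PING_TIMEOUT:
            self.state = State.UNRESPONSIVE
```

The design choice being illustrated is that the push decision and the teardown
decision hang off different timers, so shortening one does not force
shortening the other.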

There are likely to be downsides with this approach; particularly presence
state will be badly affected. PSA could help here. Overall, though, we
believe that this will substantially improve the effective performance of
C2S over high latency, high contention links.

I hope this is useful!

Dave.