Hey all,

We (that is, myself and others from Forward Clinical Ltd, my employer) have been doing some extensive work to support high-latency networks such as satellite links, in relation to our work with UK Defence Medical Services. Our "long thin" links cover the C2S link.

We believe these findings are more generally useful than just SATCOM. In particular, we think they will help with the adverse network conditions found in hospitals (where people keep putting in lifts and lots of cables, giving lots of blackspots), and that they apply generally to mobile use of XMPP.

TL;DR: When the session has a ping timeout, do push notifications, but otherwise leave it open. Mobile clients will often recover after several minutes have passed.

We assume that established sessions may be in several connectivity states from the point of view of the server, typically:

"Live" - the session is genuinely live and can be used for communication.
"Unresponsive" - the session has a TCP connection associated with it, but it is unresponsive to pings etc.
"Resumable" - the session has no TCP session, but XEP-0198 resumption was negotiated and the session remains available.

We expect that the majority of servers will immediately move a session detected as unresponsive into the resumable state by closing the TCP session and starting a (relatively short) timeout. In the process of doing so, unacknowledged stanzas will be processed for push notifications etc as needed, and errors will be sent as appropriate.

Through network analysis (and "thanks" to a bug in the server which caused some useful logging), we were able to examine not only when sessions went into the unresponsive state, but also when the client subsequently sent traffic on that session. This often happened well after the session had fallen into the resumable state, which resulted in an error, as the session had been closed.

Having seen the result of this in the logging of the server, we followed up by looking for the same logging output on the production system, where the majority of users are using WiFi or 4G within hospitals. Coverage is often poor, and the WiFi overused, so clinicians often operate on a weak 4G signal or highly contended WiFi. Think FOSDEM.
Again, we observed clients recovering, sometimes well after the ping timeout had triggered. Had these clients been able to, they could have continued to use the same TCP session without any disruption (or, for that matter, any additional RTTs spent re-establishing).

The usual approach here seems to be to increase the timeout required to move a session from "live" to "unresponsive" when pinged. However, this has the effect of delaying push notifications while the session is, in effect, in limbo.

Our proposal is that when a session is found to be unresponsive, the server starts sending push notifications for unacknowledged (and future) messages, but otherwise leaves the session and its TCP connection open rather than moving it to the resumable state. Only after a significantly longer timeout should the TCP session be terminated (and at that point the session should be destroyed entirely).

This means that a client recovering network after several minutes will find the connection still live (in effect), whereas if it never recovers, it will still get the push notifications in a timely manner.

There are likely to be downsides with this approach; in particular, presence state will be badly affected. PSA could help here.

Overall, though, we believe that this will substantially improve the effective performance of C2S over high-latency, high-contention links.

I hope this is useful!

Dave.
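
For illustration, the proposed behaviour could be sketched roughly as below. This is a toy model in Python, not taken from any real server: the class and method names, the two timeout values, and the use of "silence since the last client traffic" as a stand-in for a failed ping are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    LIVE = auto()          # client is answering
    UNRESPONSIVE = auto()  # ping timed out, TCP session deliberately kept open
    DESTROYED = auto()     # long timeout expired, TCP closed, session gone


@dataclass
class Session:
    # Illustrative values only, not recommendations.
    ping_timeout: float = 30.0
    destroy_timeout: float = 600.0
    state: State = State.LIVE
    last_heard: float = 0.0
    unacked: list = field(default_factory=list)  # sent but not yet acknowledged
    pushed: list = field(default_factory=list)   # handed to push notifications

    def send(self, stanza):
        if self.state is State.DESTROYED:
            raise RuntimeError("session destroyed")
        self.unacked.append(stanza)
        # While unresponsive, future messages also trigger a push.
        if self.state is State.UNRESPONSIVE:
            self.pushed.append(stanza)

    def on_client_traffic(self, now, acked=0):
        # Any traffic on the still-open TCP session makes it live again.
        if self.state is State.DESTROYED:
            return
        self.last_heard = now
        del self.unacked[:acked]
        self.state = State.LIVE

    def tick(self, now):
        silent = now - self.last_heard
        if self.state is State.LIVE and silent >= self.ping_timeout:
            # Start pushing for unacknowledged messages, but do NOT close TCP.
            self.state = State.UNRESPONSIVE
            self.pushed.extend(self.unacked)
        if self.state is State.UNRESPONSIVE and silent >= self.destroy_timeout:
            self.state = State.DESTROYED
```

In this model, a client that goes quiet for five minutes and then sends traffic simply returns to the live state on the same session, while any messages sent in the meantime have already been pushed.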