As Ian writes, the problem is on the server, not at the client. If the client wakes up with something to send after a long silence, it can decide to just resume the connection. But the server can't. If the client connection is dropped, the server is stuck.

The current solution is to use keep-alives. But this is painful for both clients and servers. For clients, it means that each of the 17 messenger applications on the phone sends its own keep-alive, waking up the radio and draining the battery every time. For servers, it means receiving messages from every client every 15 seconds, even if a client only has actual messages to receive every 15 minutes, which increases CPU load and power consumption. Not great.

There are a few alternatives. The client could use protocols like PCP or UPnP IGD to open a port in the local NAT. That's fine if the local router supports it. It can work very well if the network supports IPv6 and the client just needs to set a pinhole in the local firewall. But it will not work if the local ISP is using some combination of IPv4 and Carrier Grade NAT, unless the CGNAT supports PCP and the client has a plausible way to discover the address of the CGNAT. Maybe the IETF could work on that, but I am not holding my breath.
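To make the PCP case concrete, here is a minimal sketch of an RFC 6887 MAP request, hand-packed rather than taken from any real PCP library, assuming a UDP mapping for an IPv4 client behind the NAT:

```python
import os
import socket
import struct

def build_pcp_map_request(client_ip: str, internal_port: int,
                          lifetime: int = 900, protocol: int = 17) -> bytes:
    """Build a PCP (RFC 6887) MAP request asking the NAT to hold a
    mapping open for `lifetime` seconds. protocol 17 = UDP."""
    # IPv4 addresses are carried as IPv4-mapped IPv6 (::ffff:a.b.c.d).
    client_v4 = socket.inet_pton(socket.AF_INET, client_ip)
    client_v6 = b"\x00" * 10 + b"\xff\xff" + client_v4
    # Common header: version 2, R=0 (request) + opcode 1 (MAP),
    # 16 reserved bits, requested lifetime, client IP.
    header = struct.pack("!BBHI", 2, 1, 0, lifetime) + client_v6
    # MAP opcode body: 96-bit nonce, protocol, 24 reserved bits,
    # internal port, then suggested external port/IP left zero so
    # the NAT picks them.
    nonce = os.urandom(12)
    map_body = nonce + struct.pack("!B3xHH", protocol, internal_port, 0) + b"\x00" * 16
    return header + map_body
```

The request would be sent over UDP to port 5351 of the default gateway, which is exactly where the discovery problem above bites: behind a CGNAT, the default gateway is usually not the PCP server.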

Another way to reduce the impact on the client is to make sure that all applications doing keep-alives do it at exactly the same time. If they do, then the radio wakes up only once, sends a train of messages, and maybe waits for the ACKs. Not perfect, but at least it preserves the battery a bit. Of course, that solution does not help the server at all.
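The alignment trick is simple to state: instead of each app arming a timer one interval after its own last activity, everyone rounds up to a shared grid, so all timers fire in the same radio wake-up. A sketch, with the 15-second interval purely illustrative:

```python
import math

def aligned_next_keepalive(now: float, interval: float = 15.0) -> float:
    """Return the next keep-alive deadline aligned to a shared grid
    (multiples of `interval` since the epoch). Two applications that
    call this within the same interval get the same deadline, so the
    radio wakes up once for both."""
    return math.ceil(now / interval) * interval
```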

Yet another solution that has been tried is to have a system-level process do the keep-alives on behalf of all applications in the box. I won't go into the details, but we could maybe do a variant of that with MASQUE. Have the client use MASQUE for all outgoing connections, connecting to a MASQUE server outside the CGNAT. Then the client only needs to keep the MASQUE session alive -- one keep-alive instead of N. The end-to-end QUIC session could use IPv6 and long idle timers. Maybe something we could actually ship!
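Back-of-the-envelope, the win is just the number of NAT bindings that need refreshing. A hypothetical tally (the function name and parameters are mine, not from any spec):

```python
def keepalives_per_hour(n_flows: int, nat_timeout_s: float,
                        tunneled: bool) -> int:
    """Keep-alive messages per hour needed to hold NAT state open.
    Direct: each flow refreshes its own NAT binding. Via a proxy
    tunnel that terminates outside the CGNAT: only the single tunnel
    session's binding needs refreshing."""
    bindings = 1 if tunneled else n_flows
    return int(bindings * 3600 // nat_timeout_s)
```

With 17 apps and a 30-second binding timeout, that is the difference between thousands of wake-ups an hour and a hundred or so.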

-- Christian Huitema

On 9/1/2024 1:00 PM, Ian Swett wrote:
This is a real problem, but I'm unsure what the best way to approach it is.

I think you're suggesting that a large server operator could try to infer
NAT timeouts for clients of different IP prefixes and communicate that to
the client as a suggested keepalive/ping timeout? I'm curious how one would
infer NAT timeouts. Our servers detect a dead connection, but I'm not sure
how to tell what the reason was, and more specifically whether it was a NAT
timeout. Sometimes devices just drop off the network.
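One way such inference might work, purely as a sketch: record the idle gap that preceded each unexplained connection death, bucket by prefix, and take a conservative low percentile, accepting that devices simply leaving the network pollute the samples. All names and thresholds here are invented for illustration:

```python
from collections import defaultdict
from ipaddress import ip_network

class NatTimeoutEstimator:
    """Per-/24 estimate of NAT binding timeout, inferred from the idle
    gap observed just before a connection went unreachable. The low
    percentile is deliberately conservative because some failures are
    devices dropping off the network, not NAT expiry."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record_failure(self, client_ip: str, idle_before_failure_s: float):
        prefix = ip_network(f"{client_ip}/24", strict=False)
        self.samples[prefix].append(idle_before_failure_s)

    def suggested_keepalive(self, client_ip: str, default: float = 30.0) -> float:
        prefix = ip_network(f"{client_ip}/24", strict=False)
        data = sorted(self.samples.get(prefix, []))
        if len(data) < 10:            # not enough evidence; fall back
            return default
        p10 = data[len(data) // 10]   # roughly the 10th percentile
        return max(5.0, p10 * 0.8)    # keep-alive a bit under the estimate
```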

As you may know, Chrome will send a PING as a keepalive after 15 seconds of
idle, but only if there are outstanding requests (i.e., hanging GETs). The
number was chosen somewhat arbitrarily and is certainly not optimal, but it
did fix some use cases where hanging GETs were otherwise failing.
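That policy reduces to a two-condition check; a sketch of the behavior as described, not Chrome's actual code:

```python
def should_send_ping(idle_s: float, outstanding_requests: int,
                     keepalive_s: float = 15.0) -> bool:
    """PING only when the connection has been idle past the threshold
    AND a request is still in flight (e.g. a hanging GET) that the
    keep-alive is protecting. An idle connection with no outstanding
    requests is left alone to save battery."""
    return outstanding_requests > 0 and idle_s >= keepalive_s
```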

Thanks, Ian

On Wed, Jul 24, 2024 at 5:55 PM Martin Thomson <[email protected]> wrote:

The intent of the idle timeout was to have that reflect *endpoint*
policy.  That is, it is independent of path.

It's certainly very interesting to consider what you might do about paths
and keep-alives (or not).  But that's a separable problem.  Having a way
for endpoints to share their information about timeouts might work, but I
worry that that will lead to wasteful keepalive traffic.  How would we
ensure that keepalives are not wasteful?

Is there a better way, such as a quick connection continuation?

On Wed, Jul 24, 2024, at 11:24, Lucas Pardue wrote:
Hi folks,

Wearing no hats.

There's been some chatter this week during IETF about selecting QUIC
idle timeouts in the face of Internet paths that might have shorter
timeouts, such as NAT.

This isn't necessarily a new topic; there's past work that's been done
on measurements and attempts to capture that in IETF documents. For
example, Lars highlighted a study of home gateway characteristics from
2010 [1]. Then there's RFC 4787 [2], and our very own RFC 9308 [3].

There's likely other work that's happened in the meantime that has
provided further insights.

All the discussion got me wondering whether there might be room for a
QUIC extension that could hint at the path timeout to the peer. For
instance, as a server operator, I might have a wide view of network
characteristics that a client doesn't. Sending keepalive pings from the
server is possible but it might not be in the client's interest to
force it to ACK them, especially if there are power saving
considerations that would be hard for the server to know. Instead, a
hint to the peer would allow it to decide what to do. That could allow
us to maintain large QUIC idle timeouts, as befits the application use
case, while adapting to the needs of the path for improved connection
reliability.

Such an extension could provide a hint for each and every path, and would
therefore benefit multipath, which has some additional per-path idle
timeout considerations [4].
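For concreteness, such a hint could ride in a frame or transport parameter as ordinary QUIC varints (RFC 9000, Section 16). The frame type below is entirely made up for illustration; only the varint encoding is real:

```python
def encode_varint(v: int) -> bytes:
    """QUIC variable-length integer (RFC 9000, Section 16): the top
    two bits of the first byte give the length (1, 2, 4, or 8 bytes)."""
    if v < 2**6:
        return v.to_bytes(1, "big")
    if v < 2**14:
        return (v | (1 << 14)).to_bytes(2, "big")
    if v < 2**30:
        return (v | (2 << 30)).to_bytes(4, "big")
    return (v | (3 << 62)).to_bytes(8, "big")

# Hypothetical frame type -- no such frame is defined anywhere.
HINT_FRAME_TYPE = 0x3f5a

def encode_path_timeout_hint(timeout_ms: int) -> bytes:
    """Hypothetical PATH_TIMEOUT_HINT frame: the frame type followed
    by the sender's best estimate of the path's binding timeout in
    milliseconds, both as QUIC varints."""
    return encode_varint(HINT_FRAME_TYPE) + encode_varint(timeout_ms)
```

The receiver would treat the value purely as advice, which is the point: the peer with the better view of the path shares what it knows, and the other endpoint decides whether a keep-alive is worth its battery.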

Thoughts?

[1] - https://dl.acm.org/doi/10.1145/1879141.1879174
[2] - https://www.rfc-editor.org/rfc/rfc4787.html
[3] - https://www.rfc-editor.org/rfc/rfc9308.html#section-3.2
[4] -

https://www.ietf.org/archive/id/draft-ietf-quic-multipath-10.html#name-idle-timeout


