TJ wrote:
client SYN > server LISTENING
client < SYN ACK server SYN_RECEIVED (time-out 3s)
                 server: inet_rsk(req)->acked = 1

client ACK > server (discarded)

client < SYN ACK (DUP) server (time-out 6s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 12s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 24s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 48s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 96s)
client ACK (DUP) > server (discarded)

server: half-open socket closed.

With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
mechanism, the handshake eventually fails once the SYN ACK retries and
their time-outs are exhausted.
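
For reference, the listener is set up along these lines -- a minimal
sketch rather than my actual code, with the helper name and backlog
made up; the setsockopt(2) value is the defer time-out in seconds:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int listen_deferred(unsigned short port, int defer_secs)
{
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        /* Don't wake accept() until data arrives on the new connection
         * (or the SYN ACK retry machinery shown above gives up). */
        if (setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                       &defer_secs, sizeof(defer_secs)) < 0 ||
            bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 128) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}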

There is a case for arguing that, when TCP_DEFER_ACCEPT is enabled, the
kernel should be operating in an enhanced handshaking mode rather than
an alternative one, and should therefore accept *both* RFC 793-style
completion and TCP_DEFER_ACCEPT-style completion. I've been unable to
find a specification or RFC for implementing TCP_DEFER_ACCEPT (a.k.a.
BSD's SO_ACCEPTFILTER) that would give me firm guidance.

It seems incorrect to penalise a client that is trying to complete the
handshake according to the RFC 793 specification, especially as the
client has no way of knowing ahead of time whether or not the server is
operating in deferred-accept mode.

Interesting problem. TCP_DEFER_ACCEPT does not conform to any standard I'm aware of. (In fact, I'd say it's in violation of RFC 793.) The implementation does exactly what it claims, though -- it "allows a listener to be awakened only when data arrives on the socket."

I think a more useful spec might have been "allows a listener to be awakened only when data arrives on the socket, unless the specified timeout has expired." Once the timeout expires, it should process the embryonic connection as if TCP_DEFER_ACCEPT is not set. Unfortunately, I don't think we can retroactively change this definition, as an application might depend on data being available and do a non-blocking read() after the accept(), expecting data to be there. Is this worth trying to fix?
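
To make that concrete, here's the kind of application code I have in mind (hypothetical -- the helper and its signature are made up, not taken from any real application). With the current semantics, accept() on a deferred listener only returns once data has been queued on the new socket, so the non-blocking read can assume it won't see EAGAIN:

#include <sys/types.h>
#include <sys/socket.h>

/* Accept a deferred connection and immediately read the request that
 * TCP_DEFER_ACCEPT guarantees is already in the receive queue. */
static int accept_and_read(int listen_fd, char *buf, size_t len,
                           ssize_t *nread)
{
        int fd = accept(listen_fd, NULL, NULL);

        if (fd < 0)
                return -1;

        /* Non-blocking read: with today's behaviour this never returns
         * EAGAIN, because accept() only fired after data arrived. */
        *nread = recv(fd, buf, len, MSG_DONTWAIT);
        return fd;
}

If deferred connections were instead handed up with no data once the timeout expired, *nread here could come back as -1/EAGAIN, which is exactly the breakage I'm worried about.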

Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs sit in the backlog for the full defer timeout, even if they've received data, which is not really the right thing to do.

I've attached a patch implementing this suggestion (compile tested only -- I think I got the logic right but it's late ;). Kind of ugly, and uses up a bit in struct inet_request_sock. Maybe can be done better...

  -John
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 62daf21..f9f64a5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
                                sack_ok    : 1,
                                wscale_ok  : 1,
                                ecn_ok     : 1,
-                               acked      : 1;
+                               acked      : 1,
+                               deferred   : 1;
        struct ip_options       *opt;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..cad2490 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
        ireq->snd_wscale = rx_opt->snd_wscale;
        ireq->wscale_ok = rx_opt->wscale_ok;
        ireq->acked = 0;
+       ireq->deferred = 0;
        ireq->ecn_ok = 0;
        ireq->rmt_port = tcp_hdr(skb)->source;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..1207fb8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
                }
        }
 
-       if (queue->rskq_defer_accept)
-               max_retries = queue->rskq_defer_accept;
-
        budget = 2 * (lopt->nr_table_entries / (timeout / interval));
        i = lopt->clock_hand;
 
@@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
                while ((req = *reqp) != NULL) {
                        if (time_after_eq(now, req->expires)) {
                                if ((req->retrans < thresh ||
-                                    (inet_rsk(req)->acked && req->retrans < max_retries))
+                                    (inet_rsk(req)->acked && req->retrans < max_retries) ||
+                                    (inet_rsk(req)->deferred && req->retrans <
+                                     queue->rskq_defer_accept + max_retries))
                                    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
                                        unsigned long timeo;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a12b08f..c4867f3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -637,8 +637,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 
                /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
                if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
-                   TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-                       inet_rsk(req)->acked = 1;
+                   TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1 &&
+                   !inet_rsk(req)->acked && req->retrans <
+                   inet_csk(sk)->icsk_accept_queue.rskq_defer_accept) {
+                       inet_rsk(req)->deferred = 1;
                        return NULL;
                }
 
@@ -686,6 +688,9 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
        listen_overflow:
                if (!sysctl_tcp_abort_on_overflow) {
                        inet_rsk(req)->acked = 1;
+                       /* If deferred, ACK must contain data.  Shortcut defer. */
+                       if (inet_rsk(req)->deferred)
+                               req->retrans = inet_csk(sk)->icsk_accept_queue.rskq_defer_accept;
                        return NULL;
                }
 
