Hello,
a while ago I wrote a simple network load generator to inject datagrams or
frames into a network at maximum rate. Maybe I was mistaken, but I expected
the socket's send operation to block once the transmitting network device
becomes saturated (regardless of whether UDP or PF_PACKET is used).
However, sometimes the send operation just returned ENOBUFS immediately
instead of blocking.
If I understood Wright & Stevens' TCP/IP Illustrated, Vol. 2 correctly, BSD
(at least 4.4 BSD Lite 1) never throttles a UDP sender, since it does not
account the bytes to be transmitted against any queue on the egress path
that it could block on. Linux, on the other hand, does so in certain cases
(details below). Even though I found out about the implementation details,
I would still like to know whether there is any specification or common
agreement on the semantics of a socket send operation blocking (back
pressure) when the network device is saturated.
Please keep me in CC since I lurk and am not subscribed at the moment.
In order to understand why and under what circumstances the send blocks or
fails, I dug into the protocol stack code. The corresponding call traces
look as follows (Linux 2.6; similar in 2.4):
sock_sendmsg
  __sock_sendmsg
    socket->ops->sendmsg: e.g. inet_sendmsg or packet_sendmsg

either:

inet_sendmsg
  sock->sk_prot->sendmsg: e.g. udp_sendmsg
    udp_sendmsg
      ip_append_data
        sock_alloc_send_skb
          sock_alloc_send_pskb
            sock_wait_for_wmem

or:

packet_sendmsg
  sock_alloc_send_skb
    sock_alloc_send_pskb
      sock_wait_for_wmem
Now this is where a process might block, if the socket send buffer is full
(atomic_read(&sk->sk_wmem_alloc) >= sk->sk_sndbuf). Suppose sndbuf is
large enough, so it won't block. Then the allocated sk_buff is processed
further in udp_sendmsg or packet_sendmsg and finally finds its way into the
device queue. Since we are using an unreliable (transport) protocol, the
sk_buffs are not actually stored in the socket send buffer (there is no
need to keep them for possible retransmissions). They are only accounted
against sndbuf, but they are stored in the device queue:
dev_queue_xmit
  q->enqueue: e.g. pfifo_fast_enqueue
    pfifo_fast_enqueue
This is where the sk_buff may be dropped, if the device queue is full
(list->qlen >= qdisc->dev->tx_queue_len). Suppose this bad case(?)
happens; the code path then returns NET_XMIT_DROP. packet_sendmsg
converts this via net_xmit_errno into -ENOBUFS and finally returns it as
the result of the socket send operation to the calling user process. The
same thing, with the same effect, happens for UDP messages.
This means that there are cases where a socket send operation may just
not block and instead immediately return ENOBUFS. If a process wants to
inject messages into the network at maximum line speed (or whatever less
the NIC supports), this in turn leads to "busy sending". Even worse, I
couldn't come up with a sane configuration of sndbuf and txqueuelen that
would prevent this possibly unexpected behavior. If there were only one
socket transmitting over a certain network device, and all frames were
MTU-sized, you could roughly configure sndbuf <= txqueuelen * MTU, so that
the bytes admitted by the send buffer never amount to more frames than the
device queue can hold. For a fixed number of sockets we could use
sndbuf <= txqueuelen * MTU / #sockets. But this breaks as soon as frames
are smaller than the MTU, or as soon as you have arbitrarily many sockets
transmitting over the same device.
In other words, this all happens because sndbuf accounts in bytes while
the device queue measures in frames. Frames can have arbitrary size
within an interval given by the network technology, so there is no fixed
relation between the two measurements.
I'd be interested in any opinions on the above mentioned effect.
Thanks,
Steffen.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html