I will first briefly outline how tun/tap device queueing works.
Each tap device has two ends: one is the network-device lookalike
'tap0', and the other is the fd obtained by opening '/dev/net/tun' and
issuing a few ioctls. With UML, the host kernel has the 'tap0' device,
and the UML process holds the '/dev/net/tun' side and presents it as
'eth0' inside the UML kernel. Right now, we are only interested in how
packets travel from the host to the guest, i.e. packets sent to the
'tap0' device travelling to the UML kernel.
There are two queues involved in this process. One is the normal
network device send queue attached to 'tap0', which can be
manipulated with QoS tools such as 'tc'. I will call this
'txqueue'. The other is an internal queue of the tun/tap driver: an
skb queue stored in tun_struct as readq. I will call this simply
'readq'.
There are two alternatives for how packets are queued, controlled by
the IFF_ONE_QUEUE flag when creating the tap device. By default,
IFF_ONE_QUEUE is off. In that case, packets are first queued to
'readq'. When 'readq' grows to 10 packets, netif_stop_queue is called
on the device, which causes 'txqueue' to start accumulating the
following packets. When packets are read from the fd side, the queue
is started again, and 'readq' starts filling again. So this
alternative should keep the queue mostly on the device side and let
the normal QoS routines handle packet dropping and such.
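The default two-queue behaviour can be captured in a small userspace model that only counts packets. This is a simplified sketch, not the tun driver's code: READQ_STOP is the 10-packet threshold described above, the names are hypothetical, and it assumes a plain FIFO qdisc that tail-drops once 'txqueue' reaches txqueuelen.

```c
#include <stdbool.h>

/* readq length at which netif_stop_queue() is modelled as being called */
#define READQ_STOP 10

struct two_queue_model {
    int readq;       /* packets in the driver's readq, waiting for read() */
    int txqueue;     /* packets held back in the device send queue */
    int txqueuelen;  /* send queue limit (assumed FIFO tail drop) */
    bool stopped;    /* true while the device queue is stopped */
    int dropped;     /* packets dropped by the send queue */
};

/* While the device queue runs, move packets from txqueue into readq,
 * stopping the queue once readq reaches READQ_STOP. */
static void tq_run(struct two_queue_model *m)
{
    while (!m->stopped && m->txqueue > 0) {
        m->txqueue--;
        m->readq++;
        if (m->readq >= READQ_STOP)
            m->stopped = true;
    }
}

/* Host side: one packet is sent to tap0. */
void tq_send(struct two_queue_model *m)
{
    if (m->txqueue >= m->txqueuelen) {
        m->dropped++;
        return;
    }
    m->txqueue++;
    tq_run(m);
}

/* fd side: one packet is read, which restarts the device queue. */
bool tq_read(struct two_queue_model *m)
{
    if (m->readq == 0)
        return false;
    m->readq--;
    m->stopped = false;
    tq_run(m);
    return true;
}
```

After 25 sends with txqueuelen 1000, the model holds 10 packets in readq and 15 in txqueue with the queue stopped. Each read pulls one packet across, so in this model the backlog only moves as fast as the fd side reads.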
The other alternative, when IFF_ONE_QUEUE is set, uses just a single
queue. Packets are always queued to 'readq', and the net device queue
is never stopped. If 'readq' grows to the interface's 'txqueuelen',
packets are simply tail dropped. This means that the normal QoS tools
cannot affect how packets are dropped, and the packet queue is
'hidden' inside the kernel, as there is no simple way to see it.
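For comparison, the IFF_ONE_QUEUE mode reduces to a single counter with tail drop. Again this is a simplified sketch with hypothetical names, not driver code. The drops happen inside the model's "driver", which is why a qdisc configured with 'tc' never gets to see or influence them.

```c
#include <stdbool.h>

struct one_queue_model {
    int readq;       /* the only queue: packets waiting for read() */
    int txqueuelen;  /* readq limit in this mode */
    int dropped;     /* tail drops, invisible to the device qdisc */
};

/* Host side: every packet goes straight to readq; the device queue is
 * never stopped. Returns false if the packet was tail dropped. */
bool oq_send(struct one_queue_model *m)
{
    if (m->readq >= m->txqueuelen) {
        m->dropped++;
        return false;
    }
    m->readq++;
    return true;
}

/* fd side: read one packet if any is queued. */
bool oq_read(struct one_queue_model *m)
{
    if (m->readq == 0)
        return false;
    m->readq--;
    return true;
}
```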
Now, here comes the actual bug report.
If the UML kernel is sent more packets than it can handle, and
IFF_ONE_QUEUE is not set (as it isn't by default), packets first fill
'readq' and then start amassing in 'txqueue', as they should. But
when no more packets are being sent, the queue does not start
shrinking. If no packets are sent to the device, the queue stays there
indefinitely. Each time a single packet is sent, the queue decreases
by 10 packets (so 11 packets in total arrive on the guest side).
If I understood the code correctly, UML uses SIGIO to trigger packet
reading. So either SIGIO is not delivered properly in the two-queue
case, or UML's SIGIO handling is broken.
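For reference, a generic sketch of SIGIO-driven reading looks like the following. It is not UML's actual code. It uses the standard fcntl() calls (F_SETOWN, O_ASYNC, O_NONBLOCK), waits with sigtimedwait() instead of a signal handler so it can be tested deterministically, and the helper names are hypothetical. Because a single SIGIO can stand for several queued packets, the reader keeps reading until EAGAIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Ask the kernel to send SIGIO to this process when fd becomes
 * readable, and make reads non-blocking so fd can be drained. */
int enable_sigio(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}

/* Read until the fd is empty. Returns the number of successful reads;
 * stops at EAGAIN (nothing queued), EOF, or any other error. */
int drain_fd(int fd, char *buf, size_t len)
{
    int reads = 0;
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n > 0) {
            reads++;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        return reads;
    }
}

/* Wait up to timeout_sec for SIGIO, then drain fd. The caller must
 * already have SIGIO blocked with sigprocmask(). Returns -1 if no
 * SIGIO arrived in time. */
int wait_sigio_and_drain(int fd, char *buf, size_t len, int timeout_sec)
{
    sigset_t set;
    struct timespec ts = { .tv_sec = timeout_sec, .tv_nsec = 0 };

    sigemptyset(&set);
    sigaddset(&set, SIGIO);
    if (sigtimedwait(&set, NULL, &ts) != SIGIO)
        return -1;
    return drain_fd(fd, buf, len);
}
```

To try it without a tap device, a pipe can stand in for the tun fd: block SIGIO, call enable_sigio() on the read end, write a few bytes, and wait_sigio_and_drain() with a 1-byte buffer reports one read per byte.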
If IFF_ONE_QUEUE is set, everything works fine: packets are delivered
in a timely manner and the queue never stalls.
As a workaround, I've added the IFF_ONE_QUEUE option to all the
places that open a tap device (namely tunctl, uml_net and
uml_router). The patch is attached. Still, the underlying problem
should be fixed once it is found.
-- Naked
Index: uml-utilities-20040406/uml_net/tuntap.c
===================================================================
--- uml-utilities-20040406.orig/uml_net/tuntap.c
+++ uml-utilities-20040406/uml_net/tuntap.c
@@ -44,7 +44,7 @@
return(-1);
}
memset(ifr, 0, sizeof(*ifr));
- ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
+ ifr->ifr_flags = IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE;
ifr->ifr_name[0] = '\0';
if(ioctl(tap_fd, TUNSETIFF, (void *) ifr) < 0){
output_errno(output, "TUNSETIFF : ");
Index: uml-utilities-20040406/tunctl/tunctl.c
===================================================================
--- uml-utilities-20040406.orig/tunctl/tunctl.c
+++ uml-utilities-20040406/tunctl/tunctl.c
@@ -81,7 +81,7 @@
memset(&ifr, 0, sizeof(ifr));
- ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE;
strncpy(ifr.ifr_name, tun, sizeof(ifr.ifr_name) - 1);
if(ioctl(tap_fd, TUNSETIFF, (void *) &ifr) < 0){
perror("TUNSETIFF");
Index: uml-utilities-20040406/uml_router/tuntap.c
===================================================================
--- uml-utilities-20040406.orig/uml_router/tuntap.c
+++ uml-utilities-20040406/uml_router/tuntap.c
@@ -28,7 +28,7 @@
return(-1);
}
memset(&ifr, 0, sizeof(ifr));
- ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE;
strncpy(ifr.ifr_name, dev, sizeof(ifr.ifr_name) - 1);
if(ioctl(fd, TUNSETIFF, (void *) &ifr) < 0){
perror("TUNSETIFF failed");