Package: qemu-system-x86
Version: 1:7.2+dfsg-7+deb12u5
Severity: normal
X-Debbugs-Cc: g...@libero.it

I believe I spotted a race condition in virtio-net or qemu/kvm (but
only when virtio-net is involved).

To reproduce it, one needs a virtualization environment similar to this:

Host:
- debian 12 x86_64
- caching name server listening on 127.0.0.1

Guest:
- linux/musl or linux/glibc or freebsd or openbsd
- kvm acceleration
- virtio netdev, configured in (default) user-mode
- /etc/resolv.conf:
    nameserver 10.0.2.2         i.e. the caching dns in the host
    nameserver 192.168.1.123    non existent

and run the attached program in the guest.

The program opens a UDP socket, sends out a bunch of (dns) requests,
poll()s on the socket, and then receives the responses.

If a delay is inserted between the sendto() calls, the (unique) response
from the host is received correctly:

    $ ./a.out 10.0.2.2 >/dev/null # to warm up the host cache
    $ ./a.out 10.0.2.2 delay 192.168.1.123
    poll: 1 1 1
    recvfrom() 45
    <response packet>
    recvfrom() -1

If the sendto()s are performed in quick succession, the response packet
gets lost:

    $ ./a.out 10.0.2.2 192.168.1.123
    poll: 0 1 0
    recvfrom() -1
    recvfrom() -1

A tcpdump capture on the host side shows no difference between the two cases.

Tcpdump on the guest side is another story: in the good case, it looks like
this

7:32:44.332 IP 10.0.2.15.43276 > 10.0.2.2.53: 33452+ A? example.com. (29)
7:32:44.333 IP 10.0.2.2.53 > 10.0.2.15.43276: 33452 1/0/0 A 93.184.216.34 (45)
7:32:44.349 IP 10.0.2.15.43276 > 192.168.1.123.53: 33452+ A? example.com. (29)

while in the bad case it looks like this

7:32:55.358 IP 10.0.2.15.46537 > 10.0.2.2.53: 33452+ A? example.com. (29)
7:32:55.358 IP 10.0.2.15.46537 > 192.168.1.123.53: 33452+ A? example.com. (29)
7:32:55.358 IP *127.0.0.1*.53 > 10.0.2.15.46537: 33452 1/0/0 A 93.184.216.34 (45)

where the response packet has the wrong source IP (127.0.0.1 instead of
10.0.2.2).

This looks like a failure of the NAT layer, but it does not happen when
the guest uses another emulated network device. I don't know whether that
is because the relevant code is in virtio-net or because other devices add
overhead that masks the issue.

There's nothing special about port 53: I was just investigating
a weird failure in name resolution in a MUSL based guest
(https://www.openwall.com/lists/musl/2024/02/17/3) and wrote the program
to mimic MUSL resolver's behaviour.

The program succeeds or fails in the same way with a different port, and in
all the guests I tried (as long as the emulated network device is virtio-net).

To see the issue, it's important that the response to the first request
is so fast that it's simultaneous with the second request.

Best regards,
        g.b.


-- System Information:
Debian Release: 12.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.1.0-18-amd64 (SMP w/4 CPU threads; PREEMPT)
Locale: LANG=C, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: sysvinit (via /sbin/init)
LSM: AppArmor: enabled

Versions of packages qemu-system-x86 depends on:
ii  ipxe-qemu           1.0.0+git-20190125.36a4c85-5.1
ii  libaio1             0.3.113-4
ii  libbpf1             1:1.1.0-1
ii  libc6               2.36-9+deb12u4
ii  libcapstone4        4.0.2-5
ii  libfdt1             1.6.1-4+b1
ii  libfuse3-3          3.14.0-4
ii  libgcc-s1           12.2.0-14
ii  libglib2.0-0        2.74.6-2
ii  libgmp10            2:6.2.1+dfsg1-1.1
ii  libgnutls30         3.7.9-2+deb12u2
ii  libhogweed6         3.8.1-2
ii  libibverbs1         44.0-2
ii  libjpeg62-turbo     1:2.1.5-2
ii  libnettle8          3.8.1-2
ii  libnuma1            2.0.16-1
ii  libpixman-1-0       0.42.2-1
ii  libpmem1            1.12.1-2
ii  libpng16-16         1.6.39-2
ii  librdmacm1          44.0-2
ii  libsasl2-2          2.1.28+dfsg-10
ii  libseccomp2         2.5.4-1+b3
ii  libslirp0           4.7.0-1
ii  libudev1            252.22-1~deb12u1
ii  liburing2           2.3-3
ii  libvdeplug2         4.0.1-4
ii  libzstd1            1.5.4+dfsg2-5
ii  qemu-system-common  1:7.2+dfsg-7+deb12u5
ii  qemu-system-data    1:7.2+dfsg-7+deb12u5
ii  seabios             1.16.2-1
ii  zlib1g              1:1.2.13.dfsg-1

Versions of packages qemu-system-x86 recommends:
ii  ovmf              2022.11-6+deb12u1
pn  qemu-block-extra  <none>
ii  qemu-system-gui   1:7.2+dfsg-7+deb12u5
ii  qemu-utils        1:7.2+dfsg-7+deb12u5

Versions of packages qemu-system-x86 suggests:
pn  samba  <none>
pn  vde2   <none>

-- no debconf information
#include <stdio.h>
#include <time.h>
#include <poll.h>
#include <assert.h>
#include <string.h>

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Print len bytes of s; non-printable bytes and backslashes as octal escapes. */
static void dump(const char *s, size_t len) {
    while (len--) {
        char t = *s++;
        if (' ' <= t && t <= '~' && t != '\\')
            printf("%c", t);
        else
            printf("\\%o", t & 0xff);
    }
    printf("\n");
}

int main(int argc, char *argv[]) {
    int sock, rv, n;
    /* DNS query: id 33452, recursion desired, one question (example.com A IN) */
    const char req[] =
        "\202\254\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1";
    struct timespec delay_l = { 1, 0 }; /* 1 sec */
    struct pollfd pfs;
    struct sockaddr_in me = { 0 };

    sock = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
                  IPPROTO_IP);
    assert(sock >= 0);

    me.sin_family = AF_INET;
    me.sin_port = 0;
    me.sin_addr.s_addr = inet_addr("0.0.0.0");
    rv = bind(sock, (struct sockaddr *) &me, sizeof me);
    assert(0 == rv);

    for (n = 1; n < argc; n++) {
        if (0 == strcmp("delay", argv[n])) {
            struct timespec delay_s = { 0, (1 << 24) }; /* ~ 16 msec */
            nanosleep(&delay_s, NULL);
        } else {
            struct sockaddr_in dst = { 0 };
            dst.sin_family = AF_INET;
            dst.sin_port = htons(53);
            dst.sin_addr.s_addr = inet_addr(argv[n]);
            rv = sendto(sock, req, sizeof req - 1, MSG_NOSIGNAL,
                        (struct sockaddr *) &dst, sizeof dst);
            assert(rv >= 0);
        }
    }

    nanosleep(&delay_l, NULL);
    pfs.fd = sock;
    pfs.events = POLLIN;
    rv = poll(&pfs, 1, 2000);
    printf("poll: %d %d %d\n", rv, pfs.events, pfs.revents);

    for (n = 1; n < argc; n++) {
        char resp[4000];
        if (0 == strcmp("delay", argv[n]))
            continue;
        rv = recvfrom(sock, resp, sizeof resp, 0, NULL, NULL);
        printf("recvfrom() %d\n", rv);
        if (rv > 0)
            dump(resp, rv);
    }

    return 0;
}
