Dropping misc@ from cc

Am 01.05.23 um 02:08 schrieb Aaron Mason:
I can reproduce it with this in QEMU 8.0 in Winders (thanks Antun who
sent something like this to the bugs@ list):

qemu-system-x86_64 -accel whpx,kernel-irqchip=off -machine q35 \
    -cpu EPYC-Rome,-monitor -m 8g -smp 6,sockets=1,cores=6 \
    -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22 -vga virtio \
    -drive if=virtio,file=miniroot73.img -device virtio-scsi-pci,id=scsi

It is probably depending on timing and the timing on my machine is different.

The temporary workaround patch results in a booting system.

I fear that just returning from vioscsi_req_done may cause data corruption sometimes.

I enabled debugging on the vioscsi driver, rebuilt the RAMDISK kernel
with those drivers enabled, and got this:

vioscsi0 at virtio1: qsize 128
scsibus0 at vioscsi0: 255 targets
vioscsi_req_get: 0xfffffd803f80d338
vioscsi_scsi_cmd: enter
vioscsi_scsi_cmd: polling...
vioscsi_scsi_cmd: polling timeout

The actual problem is here. One request times out, but the driver does not tell qemu that it should abort the request. The queue entry then gets reused and the two responses from qemu overwrite each other. You could try if increasing the timeout here to e.g. 10000 helps:

        if (ISSET(xs->flags, SCSI_POLL)) {
                DPRINTF("vioscsi_scsi_cmd: polling...\n");
                int timeout = 1000;
                do {
                        virtio_poll_intr(vsc);
                        if (vr->vr_xs != xs)
                                break;
                        delay(1000);
                } while (--timeout > 0);
                if (vr->vr_xs == xs) {
                        // TODO(matthew): Abort the request.
                        xs->error = XS_TIMEOUT;
                        xs->resid = xs->datalen;
                        DPRINTF("vioscsi_scsi_cmd: polling timeout\n");
                        scsi_done(xs);
                }

Unfortunately, it order to properly abort the request, quite a bit of infrastructure related to the control queue is still missing in the driver.

vioscsi_scsi_cmd: done (timeout=0)
vioscsi_scsi_cmd: enter
vioscsi_scsi_cmd: polling...
vioscsi_vq_done: enter
vioscsi_vq_done: slot=127
vioscsi_req_done: enter vr: 0xfffffd803f80d338 xs: 0xfffffd803f8a5e58
vioscsi_req_done: done 0, 2, 0
vioscsi_vq_done: slot=127
vioscsi_req_done: enter vr: 0xfffffd803f80d338 xs: 0x0
uvm_fault(0xffffffff813ec2e0, 0x8, 0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff810e6190 cs 8 rflags 10286 cr2 8 cpl e
rsp ffffffff81606670
gsbase 0xffffffff813dfff0  kgsbase 0x0
panic: trap type 6, code=0, pc=ffffffff810e6190

That "xs: 0x0" bit feels like a clue. It should be trivial to pick up
and handle, but what would be the correct way to handle that?

If I have it return if "xs" is found to be NULL, it continues - the
debugging suggests it goes through each possible target before
finishing up. I don't know if that's correct, but it seems to continue
booting after that even if my example didn't detect the drive with the
kernel I built (I used the RAMDISK kernel and it was pretty stripped
down).

I'm about to attempt a -STABLE build (I've got 7.3 installed and thus
can't yet build a snapshot, but I will do that if this test succeeds)
- here's the patch that hopefully fixes the problem. (and hopefully
gmail doesn't clobber the tabs)

Index: sys/dev/pv/vioscsi.c
===================================================================
RCS file: /cvs/src/sys/dev/pv/vioscsi.c,v
retrieving revision 1.30
diff -u -p -u -p -r1.30 vioscsi.c
--- sys/dev/pv/vioscsi.c 16 Apr 2022 19:19:59 -0000 1.30
+++ sys/dev/pv/vioscsi.c 25 Apr 2023 12:51:16 -0000
@@ -296,6 +296,7 @@ vioscsi_req_done(struct vioscsi_softc *s
    struct scsi_xfer *xs = vr->vr_xs;
    DPRINTF("vioscsi_req_done: enter vr: %p xs: %p\n", vr, xs);

+ if (xs == NULL) return;
    int isread = !!(xs->flags & SCSI_DATA_IN);
    bus_dmamap_sync(vsc->sc_dmat, vr->vr_control,
        offsetof(struct vioscsi_req, vr_req),






Reply via email to