On Mon, May 1, 2023 at 11:32 PM Stefan Fritsch <s...@sfritsch.de> wrote:
>
> Dropping misc@ from cc
>
> Am 01.05.23 um 02:08 schrieb Aaron Mason:
> > I can reproduce it with this in QEMU 8.0 in Winders (thanks Antun who
> > sent something like this to the bugs@ list):
> >
> > qemu-system-x86_64 -accel whpx,kernel-irqchip=off -machine q35 \
> >     -cpu EPYC-Rome,-monitor -m 8g -smp 6,sockets=1,cores=6 \
> >     -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22 -vga virtio \
> >     -drive if=virtio,file=miniroot73.img -device virtio-scsi-pci,id=scsi
>
> It is probably depending on timing and the timing on my machine is
> different.
>
> > The temporary workaround patch results in a booting system.
>
> I fear that just returning from vioscsi_req_done may cause data
> corruption sometimes.

Yeah I was concerned that such a blanket drop might result in, to
quote a certain G-Man, "unforeseen consequences" (mostly out of lack
of experience), hence why I stopped well short of declaring the
problem fixed.

>
> >>> I enabled debugging on the vioscsi driver, rebuilt the RAMDISK kernel
> >>> with those drivers enabled, and got this:
> >>>
> >>> vioscsi0 at virtio1: qsize 128
> >>> scsibus0 at vioscsi0: 255 targets
> >>> vioscsi_req_get: 0xfffffd803f80d338
> >>> vioscsi_scsi_cmd: enter
> >>> vioscsi_scsi_cmd: polling...
> >>> vioscsi_scsi_cmd: polling timeout
>
> The actual problem is here. One request times out, but the driver does
> not tell qemu that it should abort the request. The queue entry then
> gets reused and the two responses from qemu overwrite each other. You
> could try if increasing the timeout here to e.g. 10000 helps:
>
>          if (ISSET(xs->flags, SCSI_POLL)) {
>                  DPRINTF("vioscsi_scsi_cmd: polling...\n");
>                  int timeout = 1000;
>                  do {
>                          virtio_poll_intr(vsc);
>                          if (vr->vr_xs != xs)
>                                  break;
>                          delay(1000);
>                  } while (--timeout > 0);
>                  if (vr->vr_xs == xs) {
>                          // TODO(matthew): Abort the request.
>                          xs->error = XS_TIMEOUT;
>                          xs->resid = xs->datalen;
>                          DPRINTF("vioscsi_scsi_cmd: polling timeout\n");
>                          scsi_done(xs);
>                  }
>
> Unfortunately, it order to properly abort the request, quite a bit of
> infrastructure related to the control queue is still missing in the driver.

I'll give it a go and report back, thanks!

>
> >>> vioscsi_scsi_cmd: done (timeout=0)
> >>> vioscsi_scsi_cmd: enter
> >>> vioscsi_scsi_cmd: polling...
> >>> vioscsi_vq_done: enter
> >>> vioscsi_vq_done: slot=127
> >>> vioscsi_req_done: enter vr: 0xfffffd803f80d338 xs: 0xfffffd803f8a5e58
> >>> vioscsi_req_done: done 0, 2, 0
> >>> vioscsi_vq_done: slot=127
> >>> vioscsi_req_done: enter vr: 0xfffffd803f80d338 xs: 0x0
> >>> uvm_fault(0xffffffff813ec2e0, 0x8, 0, 1) -> e
> >>> fatal page fault in supervisor mode
> >>> trap type 6 code 0 rip ffffffff810e6190 cs 8 rflags 10286 cr2 8 cpl e
> >>> rsp ffffffff81606670
> >>> gsbase 0xffffffff813dfff0  kgsbase 0x0
> >>> panic: trap type 6, code=0, pc=ffffffff810e6190
> >>>
> >>> That "xs: 0x0" bit feels like a clue. It should be trivial to pick up
> >>> and handle, but what would be the correct way to handle that?
> >>>
> >>> If I have it return if "xs" is found to be NULL, it continues - the
> >>> debugging suggests it goes through each possible target before
> >>> finishing up. I don't know if that's correct, but it seems to continue
> >>> booting after that even if my example didn't detect the drive with the
> >>> kernel I built (I used the RAMDISK kernel and it was pretty stripped
> >>> down).
> >>>
> >>> I'm about to attempt a -STABLE build (I've got 7.3 installed and thus
> >>> can't yet build a snapshot, but I will do that if this test succeeds)
> >>> - here's the patch that hopefully fixes the problem. (and hopefully
> >>> gmail doesn't clobber the tabs)
> >>>
> >>> Index: sys/dev/pv/vioscsi.c
> >>> ===================================================================
> >>> RCS file: /cvs/src/sys/dev/pv/vioscsi.c,v
> >>> retrieving revision 1.30
> >>> diff -u -p -u -p -r1.30 vioscsi.c
> >>> --- sys/dev/pv/vioscsi.c 16 Apr 2022 19:19:59 -0000 1.30
> >>> +++ sys/dev/pv/vioscsi.c 25 Apr 2023 12:51:16 -0000
> >>> @@ -296,6 +296,7 @@ vioscsi_req_done(struct vioscsi_softc *s
> >>>     struct scsi_xfer *xs = vr->vr_xs;
> >>>     DPRINTF("vioscsi_req_done: enter vr: %p xs: %p\n", vr, xs);
> >>>
> >>> + if (xs == NULL) return;
> >>>     int isread = !!(xs->flags & SCSI_DATA_IN);
> >>>     bus_dmamap_sync(vsc->sc_dmat, vr->vr_control,
> >>>         offsetof(struct vioscsi_req, vr_req),
> >>>
> >>>
> >
> >
> >



-- 
Aaron Mason - Programmer, open source addict
I've taken my software vows - for beta or for worse

Reply via email to