completion timeouts with pin-based interrupts in QEMU hw/nvme
Hi all (linux-nvme, qemu-devel, maintainers), On QEMU riscv64, which does not use MSI/MSI-X and thus relies on pin-based interrupts, I'm seeing occasional completion timeouts, i.e. nvme nvme0: I/O 333 QID 1 timeout, completion polled To rule out issues with shadow doorbells (which have been a source of frustration in the past), those are disabled. FWIW I'm also seeing the issue with shadow doorbells. diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index f25cc2c235e9..28d8e7f4b56c 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->mdts = n->params.mdts; id->ver = cpu_to_le32(NVME_SPEC_VER); id->oacs = -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF); +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT); id->cntrltype = 0x1; /* I captured a trace from QEMU when this happens: pci_nvme_mmio_write addr 0x1008 data 0x4e size 4 pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78 pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324 pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5 pci_nvme_map_addr addr 0x80aca000 len 4096 pci_nvme_map_addr addr 0x80ac9000 len 4096 pci_nvme_map_addr addr 0x80ac8000 len 4096 pci_nvme_map_addr addr 0x80ac7000 len 4096 pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242 pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29 pci_nvme_map_addr addr 0x80ae6000 len 4096 pci_nvme_map_addr addr 0x80ae5000 len 4096 pci_nvme_map_addr addr 0x80ae4000 len 4096 pci_nvme_map_addr addr 0x80ae3000 len 4096 pci_nvme_map_addr addr 0x80ae2000 len 4096 pci_nvme_map_addr addr 0x80ae1000 len 4096 pci_nvme_map_addr addr 0x80ae len 4096 pci_nvme_map_addr addr 0x80adf000 len 4096 pci_nvme_map_addr addr 0x80ade000 len 4096 pci_nvme_map_addr addr 0x80add000 len 4096 pci_nvme_map_addr addr 0x80adc000 len 4096 pci_nvme_map_addr addr 0x80adb000 len 4096 pci_nvme_map_addr addr 0x80ada000 len 4096 pci_nvme_map_addr addr 0x80ad9000 len 4096 pci_nvme_map_addr addr 0x80ad8000 len 4096 pci_nvme_map_addr addr 0x80ad7000 len 4096 pci_nvme_map_addr addr 0x80ad6000 len 4096 pci_nvme_map_addr addr 0x80ad5000 len 4096 pci_nvme_map_addr addr 0x80ad4000 len 4096 pci_nvme_map_addr addr 0x80ad3000 len 4096 pci_nvme_map_addr addr 0x80ad2000 len 4096 pci_nvme_map_addr addr 0x80ad1000 len 4096 pci_nvme_map_addr addr 0x80ad len 4096 pci_nvme_map_addr addr 0x80acf000 len 4096 pci_nvme_map_addr addr 0x80ace000 len 4096 pci_nvme_map_addr addr 0x80acd000 len 4096 pci_nvme_map_addr addr 0x80acc000 len 4096 pci_nvme_map_addr addr 0x80acb000 len 4096 pci_nvme_rw_cb cid 4428 blk 'd0' pci_nvme_rw_complete_cb cid 4428 blk 'd0' pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0 [1]: pci_nvme_irq_pin pulsing IRQ pin pci_nvme_rw_cb cid 4429 blk 'd0' pci_nvme_rw_complete_cb cid 4429 blk 'd0' pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0 [2]: pci_nvme_irq_pin pulsing IRQ pin [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4 [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77 TIMEOUT HERE (30s) --- [5]: pci_nvme_mmio_read addr 0x1c size 4 [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4 [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78 --- Interrupt deasserted (cq->tail == cq->head) [ 31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled Following the timeout, everything returns to "normal" and device/driver happily continues. The pin-based interrupt logic in hw/nvme seems sound enough to me, so I am wondering if there is something going on with the kernel driver (but I certainly do not rule out that hw/nvme is at fault here, since pin-based interrupts has also been a source of several issues in the past). What I'm thinking is that following the interrupt in [1], the driver picks up completion for cid 4428 but does not find cid 4429 in the queue since it has not been posted yet. Before getting a cq head doorbell write (which would cause the pin to be deasserted), the device posts the completion for cid 4429 which just keeps the interrupt asserted in [2]. The trace then shows the cq head doorbell update in [3,4] for cid 4428 and then we hit the timeout since the driver is not aware that cid 4429 has been posted in between this (why is it not aware of this?) Timing out, the driver then polls the queue and notices cid 4429 and updates the cq head doorbell in [5-7], causing the device to deassert the interrupt and we are "back in shape". I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0). Tested on QEMU v7.0.0 (to rule out all the shadow doorbell optimizations) as well as QE
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On 1/12/23 05:10, Klaus Jensen wrote: Hi all (linux-nvme, qemu-devel, maintainers), On QEMU riscv64, which does not use MSI/MSI-X and thus relies on pin-based interrupts, I'm seeing occasional completion timeouts, i.e. nvme nvme0: I/O 333 QID 1 timeout, completion polled To rule out issues with shadow doorbells (which have been a source of frustration in the past), those are disabled. FWIW I'm also seeing the issue with shadow doorbells. diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index f25cc2c235e9..28d8e7f4b56c 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->mdts = n->params.mdts; id->ver = cpu_to_le32(NVME_SPEC_VER); id->oacs = -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF); +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT); id->cntrltype = 0x1; /* I captured a trace from QEMU when this happens: pci_nvme_mmio_write addr 0x1008 data 0x4e size 4 pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78 pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324 pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5 pci_nvme_map_addr addr 0x80aca000 len 4096 pci_nvme_map_addr addr 0x80ac9000 len 4096 pci_nvme_map_addr addr 0x80ac8000 len 4096 pci_nvme_map_addr addr 0x80ac7000 len 4096 pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242 pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29 pci_nvme_map_addr addr 0x80ae6000 len 4096 pci_nvme_map_addr addr 0x80ae5000 len 4096 pci_nvme_map_addr addr 0x80ae4000 len 4096 pci_nvme_map_addr addr 0x80ae3000 len 4096 pci_nvme_map_addr addr 0x80ae2000 len 4096 pci_nvme_map_addr addr 0x80ae1000 len 4096 pci_nvme_map_addr addr 0x80ae len 4096 pci_nvme_map_addr addr 0x80adf000 len 4096 pci_nvme_map_addr addr 0x80ade000 len 4096 pci_nvme_map_addr addr 0x80add000 len 4096 pci_nvme_map_addr addr 0x80adc000 len 4096 pci_nvme_map_addr addr 0x80adb000 len 4096 pci_nvme_map_addr addr 0x80ada000 len 4096 pci_nvme_map_addr addr 0x80ad9000 len 4096 pci_nvme_map_addr addr 0x80ad8000 len 4096 pci_nvme_map_addr addr 0x80ad7000 len 4096 pci_nvme_map_addr addr 0x80ad6000 len 4096 pci_nvme_map_addr addr 0x80ad5000 len 4096 pci_nvme_map_addr addr 0x80ad4000 len 4096 pci_nvme_map_addr addr 0x80ad3000 len 4096 pci_nvme_map_addr addr 0x80ad2000 len 4096 pci_nvme_map_addr addr 0x80ad1000 len 4096 pci_nvme_map_addr addr 0x80ad len 4096 pci_nvme_map_addr addr 0x80acf000 len 4096 pci_nvme_map_addr addr 0x80ace000 len 4096 pci_nvme_map_addr addr 0x80acd000 len 4096 pci_nvme_map_addr addr 0x80acc000 len 4096 pci_nvme_map_addr addr 0x80acb000 len 4096 pci_nvme_rw_cb cid 4428 blk 'd0' pci_nvme_rw_complete_cb cid 4428 blk 'd0' pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0 [1]: pci_nvme_irq_pin pulsing IRQ pin pci_nvme_rw_cb cid 4429 blk 'd0' pci_nvme_rw_complete_cb cid 4429 blk 'd0' pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0 [2]: pci_nvme_irq_pin pulsing IRQ pin [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4 [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77 TIMEOUT HERE (30s) --- [5]: pci_nvme_mmio_read addr 0x1c size 4 [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4 [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78 --- Interrupt deasserted (cq->tail == cq->head) [ 31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled Following the timeout, everything returns to "normal" and device/driver happily continues. The pin-based interrupt logic in hw/nvme seems sound enough to me, so I am wondering if there is something going on with the kernel driver (but I certainly do not rule out that hw/nvme is at fault here, since pin-based interrupts has also been a source of several issues in the past). What I'm thinking is that following the interrupt in [1], the driver picks up completion for cid 4428 but does not find cid 4429 in the queue since it has not been posted yet. Before getting a cq head doorbell write (which would cause the pin to be deasserted), the device posts the completion for cid 4429 which just keeps the interrupt asserted in [2]. The trace then shows the cq head doorbell update in [3,4] for cid 4428 and then we hit the timeout since the driver is not aware that cid 4429 has been posted in between this (why is it not aware of this?) Timing out, the driver then polls the queue and notices cid 4429 and updates the cq head doorbell in [5-7], causing the device to deassert the interrupt and we are "back in shape". I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0). Tested on QEMU v7.0.0 (to rule out all the sha
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > am wondering if there is something going on with the kernel driver (but > I certainly do not rule out that hw/nvme is at fault here, since > pin-based interrupts has also been a source of several issues in the > past). Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? While probably not the "correct" thing to do, it has better results in my testing.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote: > On Jan 12 09:34, Keith Busch wrote: > > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: > > > > > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > > > am wondering if there is something going on with the kernel driver (but > > > I certainly do not rule out that hw/nvme is at fault here, since > > > pin-based interrupts has also been a source of several issues in the > > > past). > > > > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? > > While probably not the "correct" thing to do, it has better results in > > my testing. > > > > A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, > > diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c > index 03760ddeae8c..0fc46dcb9ec4 100644 > --- i/hw/nvme/ctrl.c > +++ w/hw/nvme/ctrl.c > @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) >return; >} >if (~intms & n->irq_status) { > +pci_irq_deassert(&n->parent_obj); >pci_irq_assert(&n->parent_obj); >} else { >pci_irq_deassert(&n->parent_obj); > > > seems to do the trick (pulse is the other way around, assert, then > deassert). > > Probably not the "correct" thing to do, but I'll take it since it seems > to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm > on ~20 runs now and have not encountered it. Yep, that looks good. It's sounding like something with the CPU irq handling is treating the level interrupt as edge triggered.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 12 09:34, Keith Busch wrote: > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: > > > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > > am wondering if there is something going on with the kernel driver (but > > I certainly do not rule out that hw/nvme is at fault here, since > > pin-based interrupts has also been a source of several issues in the > > past). > > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? > While probably not the "correct" thing to do, it has better results in > my testing. > A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c index 03760ddeae8c..0fc46dcb9ec4 100644 --- i/hw/nvme/ctrl.c +++ w/hw/nvme/ctrl.c @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) return; } if (~intms & n->irq_status) { +pci_irq_deassert(&n->parent_obj); pci_irq_assert(&n->parent_obj); } else { pci_irq_deassert(&n->parent_obj); seems to do the trick (pulse is the other way around, assert, then deassert). Probably not the "correct" thing to do, but I'll take it since it seems to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm on ~20 runs now and have not encountered it. I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS machine/board(s) are you testing? signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On 1/12/23 09:45, Klaus Jensen wrote: On Jan 12 09:34, Keith Busch wrote: On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: The pin-based interrupt logic in hw/nvme seems sound enough to me, so I am wondering if there is something going on with the kernel driver (but I certainly do not rule out that hw/nvme is at fault here, since pin-based interrupts has also been a source of several issues in the past). Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? While probably not the "correct" thing to do, it has better results in my testing. A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c index 03760ddeae8c..0fc46dcb9ec4 100644 --- i/hw/nvme/ctrl.c +++ w/hw/nvme/ctrl.c @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) return; } if (~intms & n->irq_status) { +pci_irq_deassert(&n->parent_obj); pci_irq_assert(&n->parent_obj); } else { pci_irq_deassert(&n->parent_obj); seems to do the trick (pulse is the other way around, assert, then deassert). Probably not the "correct" thing to do, but I'll take it since it seems to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm on ~20 runs now and have not encountered it. I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS machine/board(s) are you testing? See https://github.com/groeck/linux-build-test/blob/master/rootfs/mips/run-qemu-mips.sh and https://github.com/groeck/linux-build-test/blob/master/rootfs/mipsel/run-qemu-mipsel.sh I'll apply the change suggested above to my version of qemu (7.2.0) and give it a try. Guenter
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote: > On Jan 12 09:34, Keith Busch wrote: > > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: > > > > > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > > > am wondering if there is something going on with the kernel driver (but > > > I certainly do not rule out that hw/nvme is at fault here, since > > > pin-based interrupts has also been a source of several issues in the > > > past). > > > > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? > > While probably not the "correct" thing to do, it has better results in > > my testing. > > > > A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, > > diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c > index 03760ddeae8c..0fc46dcb9ec4 100644 > --- i/hw/nvme/ctrl.c > +++ w/hw/nvme/ctrl.c > @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) >return; >} >if (~intms & n->irq_status) { > +pci_irq_deassert(&n->parent_obj); >pci_irq_assert(&n->parent_obj); >} else { >pci_irq_deassert(&n->parent_obj); > > > seems to do the trick (pulse is the other way around, assert, then > deassert). > > Probably not the "correct" thing to do, but I'll take it since it seems > to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm > on ~20 runs now and have not encountered it. > > I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS > machine/board(s) are you testing? Could you try the below? --- diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 2c85de4700..521c3c80c1 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq) } } +static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq) +{ +if (!cq->irq_enabled) { +return; +} + +if (msix_enabled(&(n->parent_obj))) { +msix_notify(&(n->parent_obj), cq->vector); +return; +} + +pci_irq_pulse(&n->parent_obj); +} + static void nvme_req_clear(NvmeRequest *req) { req->ns = NULL; @@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val) } nvme_irq_deassert(n, cq); +} else { +/* + * Retrigger the irq just to make sure the host has no excuse for + * not knowing there's more work to complete on this CQ. + */ +nvme_irq_pulse(n, cq); } } else { /* Submission queue doorbell write */ --
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On 1/12/23 09:45, Klaus Jensen wrote: On Jan 12 09:34, Keith Busch wrote: On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: The pin-based interrupt logic in hw/nvme seems sound enough to me, so I am wondering if there is something going on with the kernel driver (but I certainly do not rule out that hw/nvme is at fault here, since pin-based interrupts has also been a source of several issues in the past). Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? While probably not the "correct" thing to do, it has better results in my testing. A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c index 03760ddeae8c..0fc46dcb9ec4 100644 --- i/hw/nvme/ctrl.c +++ w/hw/nvme/ctrl.c @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) return; } if (~intms & n->irq_status) { +pci_irq_deassert(&n->parent_obj); pci_irq_assert(&n->parent_obj); } else { pci_irq_deassert(&n->parent_obj); seems to do the trick (pulse is the other way around, assert, then deassert). Probably not the "correct" thing to do, but I'll take it since it seems to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm on ~20 runs now and have not encountered it. I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS machine/board(s) are you testing? So, for mipsel, two sets of results for the above: First, qemu v7.2 is already much better than qemu v7.1. With qemu v7.1, the boot test fails roughly every other test. Failure rate with qemu v7.2 is much less. Second, my nvme boot test with qemu 7.2 fails after ~5-10 iterations. After the above change, I did not see a single failure in 50 boot tests. I'll test the other suggested change next. Guenter
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On 1/12/23 11:27, Keith Busch wrote: On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote: On Jan 12 09:34, Keith Busch wrote: On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: The pin-based interrupt logic in hw/nvme seems sound enough to me, so I am wondering if there is something going on with the kernel driver (but I certainly do not rule out that hw/nvme is at fault here, since pin-based interrupts has also been a source of several issues in the past). Does it work if you change the pci_irq_assert() back to pci_irq_pulse()? While probably not the "correct" thing to do, it has better results in my testing. A simple s/pci_irq_assert/pci_irq_pulse broke the device. However, diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c index 03760ddeae8c..0fc46dcb9ec4 100644 --- i/hw/nvme/ctrl.c +++ w/hw/nvme/ctrl.c @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n) return; } if (~intms & n->irq_status) { +pci_irq_deassert(&n->parent_obj); pci_irq_assert(&n->parent_obj); } else { pci_irq_deassert(&n->parent_obj); seems to do the trick (pulse is the other way around, assert, then deassert). Probably not the "correct" thing to do, but I'll take it since it seems to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm on ~20 runs now and have not encountered it. I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS machine/board(s) are you testing? Could you try the below? This works as well: 50 iterations with no failures after applying the patch below on top of qemu v7.2.0. Tested with qemu-system-mipsel. Guenter --- diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 2c85de4700..521c3c80c1 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq) } } +static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq) +{ +if (!cq->irq_enabled) { +return; +} + +if (msix_enabled(&(n->parent_obj))) { +msix_notify(&(n->parent_obj), cq->vector); +return; +} + +pci_irq_pulse(&n->parent_obj); +} + static void nvme_req_clear(NvmeRequest *req) { req->ns = NULL; @@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val) } nvme_irq_deassert(n, cq); +} else { +/* + * Retrigger the irq just to make sure the host has no excuse for + * not knowing there's more work to complete on this CQ. + */ +nvme_irq_pulse(n, cq); } } else { /* Submission queue doorbell write */ --
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
+CC qemu pci maintainers Michael, Marcel, Do you have any comments on this thread? As you can see one solution is to simply deassert prior to asserting, the other is to reintroduce a pci_irq_pulse(). Both seem to solve the issue. On Jan 12 14:10, Klaus Jensen wrote: > Hi all (linux-nvme, qemu-devel, maintainers), > > On QEMU riscv64, which does not use MSI/MSI-X and thus relies on > pin-based interrupts, I'm seeing occasional completion timeouts, i.e. > > nvme nvme0: I/O 333 QID 1 timeout, completion polled > > To rule out issues with shadow doorbells (which have been a source of > frustration in the past), those are disabled. FWIW I'm also seeing the > issue with shadow doorbells. > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index f25cc2c235e9..28d8e7f4b56c 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice > *pci_dev) >id->mdts = n->params.mdts; >id->ver = cpu_to_le32(NVME_SPEC_VER); >id->oacs = > -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | > NVME_OACS_DBBUF); > +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT); >id->cntrltype = 0x1; > >/* > > > I captured a trace from QEMU when this happens: > > pci_nvme_mmio_write addr 0x1008 data 0x4e size 4 > pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78 > pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' > pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324 > pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 > num_prps 5 > pci_nvme_map_addr addr 0x80aca000 len 4096 > pci_nvme_map_addr addr 0x80ac9000 len 4096 > pci_nvme_map_addr addr 0x80ac8000 len 4096 > pci_nvme_map_addr addr 0x80ac7000 len 4096 > pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' > pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242 > pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 > num_prps 29 > pci_nvme_map_addr addr 0x80ae6000 len 4096 > pci_nvme_map_addr addr 0x80ae5000 len 4096 > pci_nvme_map_addr addr 0x80ae4000 len 4096 > pci_nvme_map_addr addr 0x80ae3000 len 4096 > pci_nvme_map_addr addr 0x80ae2000 len 4096 > pci_nvme_map_addr addr 0x80ae1000 len 4096 > pci_nvme_map_addr addr 0x80ae len 4096 > pci_nvme_map_addr addr 0x80adf000 len 4096 > pci_nvme_map_addr addr 0x80ade000 len 4096 > pci_nvme_map_addr addr 0x80add000 len 4096 > pci_nvme_map_addr addr 0x80adc000 len 4096 > pci_nvme_map_addr addr 0x80adb000 len 4096 > pci_nvme_map_addr addr 0x80ada000 len 4096 > pci_nvme_map_addr addr 0x80ad9000 len 4096 > pci_nvme_map_addr addr 0x80ad8000 len 4096 > pci_nvme_map_addr addr 0x80ad7000 len 4096 > pci_nvme_map_addr addr 0x80ad6000 len 4096 > pci_nvme_map_addr addr 0x80ad5000 len 4096 > pci_nvme_map_addr addr 0x80ad4000 len 4096 > pci_nvme_map_addr addr 0x80ad3000 len 4096 > pci_nvme_map_addr addr 0x80ad2000 len 4096 > pci_nvme_map_addr addr 0x80ad1000 len 4096 > pci_nvme_map_addr addr 0x80ad len 4096 > pci_nvme_map_addr addr 0x80acf000 len 4096 > pci_nvme_map_addr addr 0x80ace000 len 4096 > pci_nvme_map_addr addr 0x80acd000 len 4096 > pci_nvme_map_addr addr 0x80acc000 len 4096 > pci_nvme_map_addr addr 0x80acb000 len 4096 > pci_nvme_rw_cb cid 4428 blk 'd0' > pci_nvme_rw_complete_cb cid 4428 blk 'd0' > pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0 > [1]: pci_nvme_irq_pin pulsing IRQ pin > pci_nvme_rw_cb cid 4429 blk 'd0' > pci_nvme_rw_complete_cb cid 4429 blk 'd0' > pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0 > [2]: pci_nvme_irq_pin pulsing IRQ pin > [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4 > [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77 > TIMEOUT HERE (30s) --- > [5]: pci_nvme_mmio_read addr 0x1c size 4 > [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4 > [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78 > --- Interrupt deasserted (cq->tail == cq->head) > [ 31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled > > Following the timeout, everything returns to "normal" and device/driver > happily continues. > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > am wondering if there is something going on with the kernel driver (but > I certainly do not rule out that hw/nvme is at fault here, since > pin-based interrupts has also been a source of several issues in the > past). > > What I'm thinking is that following the interrupt in [1], the driver > picks up completion for cid 4428 but does not find cid 4429 in the queue > since it has not been posted yet. Before getting a cq head doorbell > write (which would cause the pin to be deasserted), the device posts the > completion for cid 4429 which just keeps the interrupt asserted in [2]. > The trace then shows the cq head doorbell update in [3,4] for cid 4428 > and then we hit the
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Fri, 13 Jan 2023 at 08:55, Klaus Jensen wrote: > > +CC qemu pci maintainers > > Michael, Marcel, > > Do you have any comments on this thread? As you can see one solution is > to simply deassert prior to asserting, the other is to reintroduce a > pci_irq_pulse(). Both seem to solve the issue. Both seem to be missing any analysis of "this is what is happening, this is where we differ from hardware, this is why this is the correct fix". We shouldn't put in random "this seems to happen to cause the guest to boot" fixes, please. thanks -- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 13 12:32, Peter Maydell wrote: > On Fri, 13 Jan 2023 at 08:55, Klaus Jensen wrote: > > > > +CC qemu pci maintainers > > > > Michael, Marcel, > > > > Do you have any comments on this thread? As you can see one solution is > > to simply deassert prior to asserting, the other is to reintroduce a > > pci_irq_pulse(). Both seem to solve the issue. > > Both seem to be missing any analysis of "this is what is > happening, this is where we differ from hardware, this > is why this is the correct fix". We shouldn't put in > random "this seems to happen to cause the guest to boot" > fixes, please. > No, I'd like to get to the bottom of this, which is why I'm reaching out to the pci maintainers to get an idea about if this is a general/known issue with pin based interrupts on pci. There are a fair amount of uses of pci_irq_pulse() still left in the tree. signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Fri, 13 Jan 2023 at 12:37, Klaus Jensen wrote: > There are a fair amount of uses of pci_irq_pulse() still left in the > tree. Are there? I feel like I'm missing something here: $ git grep pci_irq_pulse include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev) $ ...looks at first sight like an unused function we could delete. thanks -- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 13 12:42, Peter Maydell wrote: > On Fri, 13 Jan 2023 at 12:37, Klaus Jensen wrote: > > There are a fair amount of uses of pci_irq_pulse() still left in the > > tree. > > Are there? I feel like I'm missing something here: > $ git grep pci_irq_pulse > include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev) > $ > > ...looks at first sight like an unused function we could delete. > Oh! My mistake! I grepped for irq_pulse which matched qemu_irq_pulse(), doh. signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Fri, Jan 13, 2023 at 12:32:29PM +, Peter Maydell wrote: > On Fri, 13 Jan 2023 at 08:55, Klaus Jensen wrote: > > > > +CC qemu pci maintainers > > > > Michael, Marcel, > > > > Do you have any comments on this thread? As you can see one solution is > > to simply deassert prior to asserting, the other is to reintroduce a > > pci_irq_pulse(). Both seem to solve the issue. > > Both seem to be missing any analysis of "this is what is > happening, this is where we differ from hardware, this > is why this is the correct fix". We shouldn't put in > random "this seems to happen to cause the guest to boot" > fixes, please. It looks like these are being treated as edge instead of level interrupts so the work-around is to create more edges. I would definitely prefer a real fix, whether that's in the kernel or CPU emulation or somewhere else, but I'm not sure I'll have time to see it to completion. FWIW, x86 seems to handle legacy IRQs with NVMe as expected. It's actually easy enough for the DEASSERT to take so long that kernel reports the irq as "spurious" because it's spinning too often on the level.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 12 14:10, Klaus Jensen wrote: > Hi all (linux-nvme, qemu-devel, maintainers), > > On QEMU riscv64, which does not use MSI/MSI-X and thus relies on > pin-based interrupts, I'm seeing occasional completion timeouts, i.e. > > nvme nvme0: I/O 333 QID 1 timeout, completion polled > > To rule out issues with shadow doorbells (which have been a source of > frustration in the past), those are disabled. FWIW I'm also seeing the > issue with shadow doorbells. > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index f25cc2c235e9..28d8e7f4b56c 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice > *pci_dev) >id->mdts = n->params.mdts; >id->ver = cpu_to_le32(NVME_SPEC_VER); >id->oacs = > -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | > NVME_OACS_DBBUF); > +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT); >id->cntrltype = 0x1; > >/* > > > I captured a trace from QEMU when this happens: > > pci_nvme_mmio_write addr 0x1008 data 0x4e size 4 > pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78 > pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' > pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324 > pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 > num_prps 5 > pci_nvme_map_addr addr 0x80aca000 len 4096 > pci_nvme_map_addr addr 0x80ac9000 len 4096 > pci_nvme_map_addr addr 0x80ac8000 len 4096 > pci_nvme_map_addr addr 0x80ac7000 len 4096 > pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ' > pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242 > pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 > num_prps 29 > pci_nvme_map_addr addr 0x80ae6000 len 4096 > pci_nvme_map_addr addr 0x80ae5000 len 4096 > pci_nvme_map_addr addr 0x80ae4000 len 4096 > pci_nvme_map_addr addr 0x80ae3000 len 4096 > pci_nvme_map_addr addr 0x80ae2000 len 4096 > pci_nvme_map_addr addr 0x80ae1000 len 4096 > pci_nvme_map_addr addr 0x80ae len 4096 > pci_nvme_map_addr addr 0x80adf000 len 4096 > pci_nvme_map_addr addr 0x80ade000 len 4096 > pci_nvme_map_addr addr 0x80add000 len 4096 > pci_nvme_map_addr addr 0x80adc000 len 4096 > pci_nvme_map_addr addr 0x80adb000 len 4096 > pci_nvme_map_addr addr 0x80ada000 len 4096 > pci_nvme_map_addr addr 0x80ad9000 len 4096 > pci_nvme_map_addr addr 0x80ad8000 len 4096 > pci_nvme_map_addr addr 0x80ad7000 len 4096 > pci_nvme_map_addr addr 0x80ad6000 len 4096 > pci_nvme_map_addr addr 0x80ad5000 len 4096 > pci_nvme_map_addr addr 0x80ad4000 len 4096 > pci_nvme_map_addr addr 0x80ad3000 len 4096 > pci_nvme_map_addr addr 0x80ad2000 len 4096 > pci_nvme_map_addr addr 0x80ad1000 len 4096 > pci_nvme_map_addr addr 0x80ad len 4096 > pci_nvme_map_addr addr 0x80acf000 len 4096 > pci_nvme_map_addr addr 0x80ace000 len 4096 > pci_nvme_map_addr addr 0x80acd000 len 4096 > pci_nvme_map_addr addr 0x80acc000 len 4096 > pci_nvme_map_addr addr 0x80acb000 len 4096 > pci_nvme_rw_cb cid 4428 blk 'd0' > pci_nvme_rw_complete_cb cid 4428 blk 'd0' > pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0 > [1]: pci_nvme_irq_pin pulsing IRQ pin > pci_nvme_rw_cb cid 4429 blk 'd0' > pci_nvme_rw_complete_cb cid 4429 blk 'd0' > pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0 > [2]: pci_nvme_irq_pin pulsing IRQ pin > [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4 > [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77 > TIMEOUT HERE (30s) --- > [5]: pci_nvme_mmio_read addr 0x1c size 4 > [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4 > [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78 > --- Interrupt deasserted (cq->tail == cq->head) > [ 31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled > > Following the timeout, everything returns to "normal" and device/driver > happily continues. > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I > am wondering if there is something going on with the kernel driver (but > I certainly do not rule out that hw/nvme is at fault here, since > pin-based interrupts has also been a source of several issues in the > past). > > What I'm thinking is that following the interrupt in [1], the driver > picks up completion for cid 4428 but does not find cid 4429 in the queue > since it has not been posted yet. Before getting a cq head doorbell > write (which would cause the pin to be deasserted), the device posts the > completion for cid 4429 which just keeps the interrupt asserted in [2]. > The trace then shows the cq head doorbell update in [3,4] for cid 4428 > and then we hit the timeout since the driver is not aware that cid 4429 > has been posted in between this (why is it not aware of this?) Timing > out, the driver then polls the queue and notices cid 4429 and updates > the cq head doorbell in [5-7], cau
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote: > I noticed that the Linux driver does not use the INTMS/INTMC registers > to mask interrupts on the controller while processing CQEs. While not > required by the spec, it is *recommended* in setups not using MSI-X to > reduce the risk of spurious and/or missed interrupts. That's assuming completions are deferred to a bottom half. We don't do that by default in Linux nvme, though you can ask the driver to do that if you want. > With the patch below, running 100 boot iterations, no timeouts were > observed on QEMU emulated riscv64 or mips64. > > No changes are required in the QEMU hw/nvme interrupt logic. Yeah, I can see why: it forces the irq line to deassert then assert, just like we had forced to happen within the device side patches. Still, none of that is supposed to be necessary, but this idea of using these registers is probably fine. > static irqreturn_t nvme_irq(int irq, void *data) > +{ > + struct nvme_queue *nvmeq = data; > + struct nvme_dev *dev = nvmeq->dev; > + u32 mask = 1 << nvmeq->cq_vector; > + irqreturn_t ret = IRQ_NONE; > + DEFINE_IO_COMP_BATCH(iob); > + > + writel(mask, dev->bar + NVME_REG_INTMS); > + > + if (nvme_poll_cq(nvmeq, &iob)) { > + if (!rq_list_empty(iob.req_list)) > + nvme_pci_complete_batch(&iob); > + ret = IRQ_HANDLED; > + } > + > + writel(mask, dev->bar + NVME_REG_INTMC); > + > + return ret; > +} If threaded interrupts are used, you'll want to do the masking in nvme_irq_check(), then clear it in the threaded handler instead of doing both in the same callback. > +static irqreturn_t nvme_irq_msix(int irq, void *data) > { > struct nvme_queue *nvmeq = data; > DEFINE_IO_COMP_BATCH(iob); > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq) > { > struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev); > int nr = nvmeq->dev->ctrl.instance; > + irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq; > > if (use_threaded_interrupts) { > return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check, > - nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > + handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > } else { > - return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq, > + return pci_request_irq(pdev, nvmeq->cq_vector, handler, > NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > } > } > >
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote: [ ... ] > > I noticed that the Linux driver does not use the INTMS/INTMC registers > to mask interrupts on the controller while processing CQEs. While not > required by the spec, it is *recommended* in setups not using MSI-X to > reduce the risk of spurious and/or missed interrupts. > > With the patch below, running 100 boot iterations, no timeouts were > observed on QEMU emulated riscv64 or mips64. > > No changes are required in the QEMU hw/nvme interrupt logic. > Yes, but isn't that technically similar to dropping the interrupt request and raising it again, or in other words pulsing the interrupt ? I still don't understand why the (level triggered) interrupt pin would require pulsing in the first place. Thanks, Guenter > > diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c > index b13baccedb4a..75f6b87c4c3f 100644 > --- i/drivers/nvme/host/pci.c > +++ w/drivers/nvme/host/pci.c > @@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue > *nvmeq, > } > > static irqreturn_t nvme_irq(int irq, void *data) > +{ > + struct nvme_queue *nvmeq = data; > + struct nvme_dev *dev = nvmeq->dev; > + u32 mask = 1 << nvmeq->cq_vector; > + irqreturn_t ret = IRQ_NONE; > + DEFINE_IO_COMP_BATCH(iob); > + > + writel(mask, dev->bar + NVME_REG_INTMS); > + > + if (nvme_poll_cq(nvmeq, &iob)) { > + if (!rq_list_empty(iob.req_list)) > + nvme_pci_complete_batch(&iob); > + ret = IRQ_HANDLED; > + } > + > + writel(mask, dev->bar + NVME_REG_INTMC); > + > + return ret; > +} > + > +static irqreturn_t nvme_irq_msix(int irq, void *data) > { > struct nvme_queue *nvmeq = data; > DEFINE_IO_COMP_BATCH(iob); > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq) > { > struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev); > int nr = nvmeq->dev->ctrl.instance; > + irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq; > > if (use_threaded_interrupts) { > return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check, > - nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > + handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > } else { > - return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq, > + return pci_request_irq(pdev, nvmeq->cq_vector, handler, > NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > } > } > >
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote: > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote: > > I noticed that the Linux driver does not use the INTMS/INTMC registers > > to mask interrupts on the controller while processing CQEs. While not > > required by the spec, it is *recommended* in setups not using MSI-X to > > reduce the risk of spurious and/or missed interrupts. > > That's assuming completions are deferred to a bottom half. We don't do > that by default in Linux nvme, though you can ask the driver to do that > if you want. > > > With the patch below, running 100 boot iterations, no timeouts were > > observed on QEMU emulated riscv64 or mips64. > > > > No changes are required in the QEMU hw/nvme interrupt logic. > > Yeah, I can see why: it forces the irq line to deassert then assert, > just like we had forced to happen within the device side patches. Still, > none of that is supposed to be necessary, but this idea of using these > registers is probably fine. There is still no answer why this would be necessary in the first place, on either side. In my opinion, unless someone can confirm that the problem is seen with real hardware, we should assume that it happens on the qemu side and address it there. Guenter > > > static irqreturn_t nvme_irq(int irq, void *data) > > +{ > > + struct nvme_queue *nvmeq = data; > > + struct nvme_dev *dev = nvmeq->dev; > > + u32 mask = 1 << nvmeq->cq_vector; > > + irqreturn_t ret = IRQ_NONE; > > + DEFINE_IO_COMP_BATCH(iob); > > + > > + writel(mask, dev->bar + NVME_REG_INTMS); > > + > > + if (nvme_poll_cq(nvmeq, &iob)) { > > + if (!rq_list_empty(iob.req_list)) > > + nvme_pci_complete_batch(&iob); > > + ret = IRQ_HANDLED; > > + } > > + > > + writel(mask, dev->bar + NVME_REG_INTMC); > > + > > + return ret; > > +} > > If threaded interrupts are used, you'll want to do the masking in > nvme_irq_check(), then clear it in the threaded handler instead of doing > both in the same callback. > > > +static irqreturn_t nvme_irq_msix(int irq, void *data) > > { > > struct nvme_queue *nvmeq = data; > > DEFINE_IO_COMP_BATCH(iob); > > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue > > *nvmeq) > > { > > struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev); > > int nr = nvmeq->dev->ctrl.instance; > > + irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : > > nvme_irq; > > > > if (use_threaded_interrupts) { > > return pci_request_irq(pdev, nvmeq->cq_vector, > > nvme_irq_check, > > - nvme_irq, nvmeq, "nvme%dq%d", nr, > > nvmeq->qid); > > + handler, nvmeq, "nvme%dq%d", nr, > > nvmeq->qid); > > } else { > > - return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq, > > + return pci_request_irq(pdev, nvmeq->cq_vector, handler, > > NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid); > > } > > } > > > > > >
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Tue, 17 Jan 2023 at 16:10, Guenter Roeck wrote: > > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote: > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote: > > > I noticed that the Linux driver does not use the INTMS/INTMC registers > > > to mask interrupts on the controller while processing CQEs. While not > > > required by the spec, it is *recommended* in setups not using MSI-X to > > > reduce the risk of spurious and/or missed interrupts. > > > > That's assuming completions are deferred to a bottom half. We don't do > > that by default in Linux nvme, though you can ask the driver to do that > > if you want. > > > > > With the patch below, running 100 boot iterations, no timeouts were > > > observed on QEMU emulated riscv64 or mips64. > > > > > > No changes are required in the QEMU hw/nvme interrupt logic. > > > > Yeah, I can see why: it forces the irq line to deassert then assert, > > just like we had forced to happen within the device side patches. Still, > > none of that is supposed to be necessary, but this idea of using these > > registers is probably fine. > > There is still no answer why this would be necessary in the first place, > on either side. In my opinion, unless someone can confirm that the problem > is seen with real hardware, we should assume that it happens on the qemu > side and address it there. Sure, but that means identifying what the divergence between QEMU's implementation and the hardware is first. I don't want a fudged fix in QEMU's code any more than you want one in the kernel's driver code :-) thanks -- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Tue, Jan 17, 2023 at 04:18:14PM +, Peter Maydell wrote: > On Tue, 17 Jan 2023 at 16:10, Guenter Roeck wrote: > > > > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote: > > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote: > > > > I noticed that the Linux driver does not use the INTMS/INTMC registers > > > > to mask interrupts on the controller while processing CQEs. While not > > > > required by the spec, it is *recommended* in setups not using MSI-X to > > > > reduce the risk of spurious and/or missed interrupts. > > > > > > That's assuming completions are deferred to a bottom half. We don't do > > > that by default in Linux nvme, though you can ask the driver to do that > > > if you want. > > > > > > > With the patch below, running 100 boot iterations, no timeouts were > > > > observed on QEMU emulated riscv64 or mips64. > > > > > > > > No changes are required in the QEMU hw/nvme interrupt logic. > > > > > > Yeah, I can see why: it forces the irq line to deassert then assert, > > > just like we had forced to happen within the device side patches. Still, > > > none of that is supposed to be necessary, but this idea of using these > > > registers is probably fine. > > > > There is still no answer why this would be necessary in the first place, > > on either side. In my opinion, unless someone can confirm that the problem > > is seen with real hardware, we should assume that it happens on the qemu > > side and address it there. > > Sure, but that means identifying what the divergence > between QEMU's implementation and the hardware is first. I don't > want a fudged fix in QEMU's code any more than you want one in > the kernel's driver code :-) > I actually prefer it in qemu because that means I can test nvme support on all active LTS releases of the Linux kernel, but that is POV and secondary. This has been broken ever since I started testing nvme support with qemu, so it doesn't make much of a difference if fixing the problem for good takes a bit longer. Plus, I run my own version of qemu anyway, so carrying the fix (hack) in qemu doesn't make much of a difference for me. Anyway - any idea what to do to help figuring out what is happening ? Add tracing support to pci interrupt handling, maybe ? Guenter
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote: > Hi all (linux-nvme, qemu-devel, maintainers), > > On QEMU riscv64, which does not use MSI/MSI-X and thus relies on > pin-based interrupts, I'm seeing occasional completion timeouts, i.e. > > nvme nvme0: I/O 333 QID 1 timeout, completion polled I finally got a riscv setup recreating this, and I am not sure how you're getting a "completion polled" here. I find that the polling from here progresses the request state to IDLE, so the timeout handler moves on to aborting because it's expecting COMPLETED. The driver should not be doing that, so I'll fix that up while also investigating why the interrupt isn't retriggering as expected.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote: > Anyway - any idea what to do to help figuring out what is happening ? > Add tracing support to pci interrupt handling, maybe ? For intermittent bugs, I like recording the QEMU session under rr (using its chaos mode to provoke the failure if necessary) to get a recording that I can debug and re-debug at leisure. Usually you want to turn on/add tracing to help with this, and if the failure doesn't hit early in bootup then you might need to do a QEMU snapshot just before point-of-failure so you can run rr only on the short snapshot-to-failure segment. https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/ https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/ This gives you a debugging session from the QEMU side's perspective, of course -- assuming you know what the hardware is supposed to do you hopefully wind up with either "the guest software did X,Y,Z and we incorrectly did A" or else "the guest software did X,Y,Z, the spec says A is the right/a permitted thing but the guest got confused". If it's the latter then you have to look at the guest as a separate code analysis/debug problem. thanks -- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote: > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote: > > Anyway - any idea what to do to help figuring out what is happening ? > > Add tracing support to pci interrupt handling, maybe ? > > For intermittent bugs, I like recording the QEMU session under > rr (using its chaos mode to provoke the failure if necessary) to > get a recording that I can debug and re-debug at leisure. Usually > you want to turn on/add tracing to help with this, and if the > failure doesn't hit early in bootup then you might need to > do a QEMU snapshot just before point-of-failure so you can > run rr only on the short snapshot-to-failure segment. > > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/ > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/ > > This gives you a debugging session from the QEMU side's perspective, > of course -- assuming you know what the hardware is supposed to do > you hopefully wind up with either "the guest software did X,Y,Z > and we incorrectly did A" or else "the guest software did X,Y,Z, > the spec says A is the right/a permitted thing but the guest got confused". > If it's the latter then you have to look at the guest as a separate > code analysis/debug problem. Here's what I got, though I'm way out of my depth here. It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the interrupt after its first handling, which I think is expected. After claiming, QEMU masks the pending interrupt, lowering the level, though the device that raised it never deasserted.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
Klaus, This isn't going to help your issue, but there are at least two legacy irq bugs in the nvme qemu implementation. 1. The admin queue breaks if start with legacy and later initialize msix. 2. The legacy vector is shared among all queues, but it's being deasserted when the first queue's doorbell makes it empty. It should remain enabled if any cq is non-empty. I'll send you some patches for those later. Still working on the real problem.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote: > On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote: > > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote: > > > Anyway - any idea what to do to help figuring out what is happening ? > > > Add tracing support to pci interrupt handling, maybe ? > > > > For intermittent bugs, I like recording the QEMU session under > > rr (using its chaos mode to provoke the failure if necessary) to > > get a recording that I can debug and re-debug at leisure. Usually > > you want to turn on/add tracing to help with this, and if the > > failure doesn't hit early in bootup then you might need to > > do a QEMU snapshot just before point-of-failure so you can > > run rr only on the short snapshot-to-failure segment. > > > > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/ > > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/ > > > > This gives you a debugging session from the QEMU side's perspective, > > of course -- assuming you know what the hardware is supposed to do > > you hopefully wind up with either "the guest software did X,Y,Z > > and we incorrectly did A" or else "the guest software did X,Y,Z, > > the spec says A is the right/a permitted thing but the guest got confused". > > If it's the latter then you have to look at the guest as a separate > > code analysis/debug problem. > > Here's what I got, though I'm way out of my depth here. > > It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the > interrupt after its first handling, which I think is expected. After > claiming, QEMU masks the pending interrupt, lowering the level, though > the device that raised it never deasserted. I'm not sure if this is correct, but this is what I'm coming up with and appears to fix the problem on my setup. The hardware that sets the pending interrupt is going clear it, so I don't see why the interrupt controller is automatically clearing it when the host claims it. --- diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c index c2dfacf028..f8f7af08dc 100644 --- a/hw/intc/sifive_plic.c +++ b/hw/intc/sifive_plic.c @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size) uint32_t max_irq = sifive_plic_claimed(plic, addrid); if (max_irq) { -sifive_plic_set_pending(plic, max_irq, false); sifive_plic_set_claimed(plic, max_irq, true); } --
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 9:07 AM Keith Busch wrote: > > On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote: > > On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote: > > > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote: > > > > Anyway - any idea what to do to help figuring out what is happening ? > > > > Add tracing support to pci interrupt handling, maybe ? > > > > > > For intermittent bugs, I like recording the QEMU session under > > > rr (using its chaos mode to provoke the failure if necessary) to > > > get a recording that I can debug and re-debug at leisure. Usually > > > you want to turn on/add tracing to help with this, and if the > > > failure doesn't hit early in bootup then you might need to > > > do a QEMU snapshot just before point-of-failure so you can > > > run rr only on the short snapshot-to-failure segment. > > > > > > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/ > > > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/ > > > > > > This gives you a debugging session from the QEMU side's perspective, > > > of course -- assuming you know what the hardware is supposed to do > > > you hopefully wind up with either "the guest software did X,Y,Z > > > and we incorrectly did A" or else "the guest software did X,Y,Z, > > > the spec says A is the right/a permitted thing but the guest got > > > confused". > > > If it's the latter then you have to look at the guest as a separate > > > code analysis/debug problem. > > > > Here's what I got, though I'm way out of my depth here. > > > > It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the > > interrupt after its first handling, which I think is expected. After > > claiming, QEMU masks the pending interrupt, lowering the level, though > > the device that raised it never deasserted. > > I'm not sure if this is correct, but this is what I'm coming up with and > appears to fix the problem on my setup. The hardware that sets the > pending interrupt is going clear it, so I don't see why the interrupt > controller is automatically clearing it when the host claims it. > > --- > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c > index c2dfacf028..f8f7af08dc 100644 > --- a/hw/intc/sifive_plic.c > +++ b/hw/intc/sifive_plic.c > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr > addr, unsigned size) > uint32_t max_irq = sifive_plic_claimed(plic, addrid); > > if (max_irq) { > -sifive_plic_set_pending(plic, max_irq, false); > sifive_plic_set_claimed(plic, max_irq, true); > } > This change isn't correct. The PLIC spec (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf) states: """ On receiving a claim message, the PLIC core will atomically determine the ID of the highest-priority pending interrupt for the target and then clear down the corresponding source’s IP bit """ which is what we are doing here. We are clearing the pending interrupt inside the PLIC Alistair
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote: > On Thu, Jan 19, 2023 at 9:07 AM Keith Busch wrote: > > --- > > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c > > index c2dfacf028..f8f7af08dc 100644 > > --- a/hw/intc/sifive_plic.c > > +++ b/hw/intc/sifive_plic.c > > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr > > addr, unsigned size) > > uint32_t max_irq = sifive_plic_claimed(plic, addrid); > > > > if (max_irq) { > > -sifive_plic_set_pending(plic, max_irq, false); > > sifive_plic_set_claimed(plic, max_irq, true); > > } > > > > This change isn't correct. The PLIC spec > (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf) > states: > > """ > On receiving a claim message, the PLIC core will atomically determine > the ID of the highest-priority pending interrupt for the target and > then clear down the corresponding source’s IP bit > """ > > which is what we are doing here. We are clearing the pending interrupt > inside the PLIC Thanks for the link. That's very helpful. If you're familiar with this area, could you possibly clear up this part of that spec? " On receiving an interrupt completion message, if the interrupt is level-triggered and the interrupt is still asserted, a new interrupt request will be forwarded to the PLIC core. " Further up, it says the "interrupt gateway" is responsible for forwarding new interrupt requests while the level remains asserted, but it doesn't look like anything is handling that, which essentially turns this into an edge interrupt. Am I missing something, or is this really not being handled?
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: > > On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote: > > On Thu, Jan 19, 2023 at 9:07 AM Keith Busch wrote: > > > --- > > > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c > > > index c2dfacf028..f8f7af08dc 100644 > > > --- a/hw/intc/sifive_plic.c > > > +++ b/hw/intc/sifive_plic.c > > > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr > > > addr, unsigned size) > > > uint32_t max_irq = sifive_plic_claimed(plic, addrid); > > > > > > if (max_irq) { > > > -sifive_plic_set_pending(plic, max_irq, false); > > > sifive_plic_set_claimed(plic, max_irq, true); > > > } > > > > > > > This change isn't correct. The PLIC spec > > (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf) > > states: > > > > """ > > On receiving a claim message, the PLIC core will atomically determine > > the ID of the highest-priority pending interrupt for the target and > > then clear down the corresponding source’s IP bit > > """ > > > > which is what we are doing here. We are clearing the pending interrupt > > inside the PLIC > > Thanks for the link. That's very helpful. > > If you're familiar with this area, could you possibly clear up this part > of that spec? > > " > On receiving an interrupt completion message, if the interrupt is > level-triggered and the interrupt is still asserted, a new interrupt > request will be forwarded to the PLIC core. > " > > Further up, it says the "interrupt gateway" is responsible for > forwarding new interrupt requests while the level remains asserted, but > it doesn't look like anything is handling that, which essentially turns > this into an edge interrupt. Am I missing something, or is this really > not being handled? Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs internal GPIO lines to trigger an interrupt. So with the current setup we only support edge triggered interrupts. It looks like we also drop the pending bit if the original interrupt de-asserts, which appears to be incorrect as well. Alistair
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote: > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: > > > > Further up, it says the "interrupt gateway" is responsible for > > forwarding new interrupt requests while the level remains asserted, but > > it doesn't look like anything is handling that, which essentially turns > > this into an edge interrupt. Am I missing something, or is this really > > not being handled? > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs > internal GPIO lines to trigger an interrupt. So with the current setup > we only support edge triggered interrupts. Thanks for confirming! Klaus, I think we can justify introducing a work-around in the emulated device now. My previous proposal with pci_irq_pulse() is no good since it does assert+deassert, but it needs to be the other way around, so please don't considert that one. Also, we ought to revisit the intms/intmc usage in the linux driver for threaded interrupts.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 18 15:26, Keith Busch wrote: > Klaus, > > This isn't going to help your issue, but there are at least two legacy > irq bugs in the nvme qemu implementation. > > 1. The admin queue breaks if start with legacy and later initialize > msix. > Hmm. Interesting that we have not encountered this before - is this because the kernel will enable MSI-X early and use it for the admin queue immediately? > 2. The legacy vector is shared among all queues, but it's being > deasserted when the first queue's doorbell makes it empty. It should > remain enabled if any cq is non-empty. I was certain that we fixed this already in commit 83d7ed5c570 ("hw/nvme: fix pin-based interrupt behavior (again)")... signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 18 21:03, Keith Busch wrote: > On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote: > > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: > > > > > > Further up, it says the "interrupt gateway" is responsible for > > > forwarding new interrupt requests while the level remains asserted, but > > > it doesn't look like anything is handling that, which essentially turns > > > this into an edge interrupt. Am I missing something, or is this really > > > not being handled? > > > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs > > internal GPIO lines to trigger an interrupt. So with the current setup > > we only support edge triggered interrupts. > > Thanks for confirming! > > Klaus, > I think we can justify introducing a work-around in the emulated device > now. My previous proposal with pci_irq_pulse() is no good since it does > assert+deassert, but it needs to be the other way around, so please > don't considert that one. > > Also, we ought to revisit the intms/intmc usage in the linux driver for > threaded interrupts. +CC: qemu-riscv Keith, Thanks for digging into this! Yeah, you are probably right that we should only use the intms/intmc changes in the use_threaded_interrupts case, not in general. While my RFC patch does seem to "fix" this, it is just a workaround as your analysis indicate. signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Jan 19 08:28, Klaus Jensen wrote: > On Jan 18 21:03, Keith Busch wrote: > > On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote: > > > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: > > > > > > > > Further up, it says the "interrupt gateway" is responsible for > > > > forwarding new interrupt requests while the level remains asserted, but > > > > it doesn't look like anything is handling that, which essentially turns > > > > this into an edge interrupt. Am I missing something, or is this really > > > > not being handled? > > > > > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs > > > internal GPIO lines to trigger an interrupt. So with the current setup > > > we only support edge triggered interrupts. > > > > Thanks for confirming! > > > > Klaus, > > I think we can justify introducing a work-around in the emulated device > > now. My previous proposal with pci_irq_pulse() is no good since it does > > assert+deassert, but it needs to be the other way around, so please > > don't considert that one. > > > > Also, we ought to revisit the intms/intmc usage in the linux driver for > > threaded interrupts. > > +CC: qemu-riscv > > Keith, > > Thanks for digging into this! > > Yeah, you are probably right that we should only use the intms/intmc > changes in the use_threaded_interrupts case, not in general. While my > RFC patch does seem to "fix" this, it is just a workaround as your > analysis indicate. +CC: Philippe, I am observing these timeouts/aborts on mips as well, so I guess that emulation could suffer from the same issue? signature.asc Description: PGP signature
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, 19 Jan 2023 at 04:03, Keith Busch wrote: > > On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote: > > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: > > > > > > Further up, it says the "interrupt gateway" is responsible for > > > forwarding new interrupt requests while the level remains asserted, but > > > it doesn't look like anything is handling that, which essentially turns > > > this into an edge interrupt. Am I missing something, or is this really > > > not being handled? > > > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs > > internal GPIO lines to trigger an interrupt. So with the current setup > > we only support edge triggered interrupts. > > Thanks for confirming! > > Klaus, > I think we can justify introducing a work-around in the emulated device > now. My previous proposal with pci_irq_pulse() is no good since it does > assert+deassert, but it needs to be the other way around, so please > don't considert that one. No, please don't. The bug is in the risc-v interrupt controller, so fix the risc-v interrupt controller. It shouldn't be too difficult (you probably have to do something like what the Arm GIC implementation does, where when the guest dismisses the interrupt you look at the level to see if it needs to be re-pended.) Once "workarounds" go into QEMU device emulation that make it deviate from hardware behaviour, it's hard to get rid of them again, because nobody knows whether deployed guests now accidentally rely on the wrong behaviour. So the correct thing is to never put in the workaround in the first place. thanks -- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On 1/19/23 00:02, Klaus Jensen wrote: On Jan 19 08:28, Klaus Jensen wrote: On Jan 18 21:03, Keith Busch wrote: On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote: On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote: Further up, it says the "interrupt gateway" is responsible for forwarding new interrupt requests while the level remains asserted, but it doesn't look like anything is handling that, which essentially turns this into an edge interrupt. Am I missing something, or is this really not being handled? Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs internal GPIO lines to trigger an interrupt. So with the current setup we only support edge triggered interrupts. Thanks for confirming! Klaus, I think we can justify introducing a work-around in the emulated device now. My previous proposal with pci_irq_pulse() is no good since it does assert+deassert, but it needs to be the other way around, so please don't considert that one. Also, we ought to revisit the intms/intmc usage in the linux driver for threaded interrupts. +CC: qemu-riscv Keith, Thanks for digging into this! Yeah, you are probably right that we should only use the intms/intmc changes in the use_threaded_interrupts case, not in general. While my RFC patch does seem to "fix" this, it is just a workaround as your analysis indicate. > +CC: Philippe, I am observing these timeouts/aborts on mips as well, so I guess that emulation could suffer from the same issue? I suspect it does. In my case, I have an Ethernet controller on the same interrupt line as the NVME controller, and it looks like one of the interrupts gets lost if both controllers raise an interrupt at the same time. The suggested workarounds "fix" the problem, but that doesn't seem to be the correct solution. Guenter