subject:"completion timeouts with pin\-based interrupts in QEMU hw\/nvme"

completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Klaus Jensen

Hi all (linux-nvme, qemu-devel, maintainers),

On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
pin-based interrupts, I'm seeing occasional completion timeouts, i.e.

  nvme nvme0: I/O 333 QID 1 timeout, completion polled

To rule out issues with shadow doorbells (which have been a source of
frustration in the past), those are disabled. FWIW I'm also seeing the
issue with shadow doorbells.

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f25cc2c235e9..28d8e7f4b56c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
 id->oacs =
-cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
NVME_OACS_DBBUF);
+cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
 id->cntrltype = 0x1;

 /*


I captured a trace from QEMU when this happens:

pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
num_prps 5
pci_nvme_map_addr addr 0x80aca000 len 4096
pci_nvme_map_addr addr 0x80ac9000 len 4096
pci_nvme_map_addr addr 0x80ac8000 len 4096
pci_nvme_map_addr addr 0x80ac7000 len 4096
pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
num_prps 29
pci_nvme_map_addr addr 0x80ae6000 len 4096
pci_nvme_map_addr addr 0x80ae5000 len 4096
pci_nvme_map_addr addr 0x80ae4000 len 4096
pci_nvme_map_addr addr 0x80ae3000 len 4096
pci_nvme_map_addr addr 0x80ae2000 len 4096
pci_nvme_map_addr addr 0x80ae1000 len 4096
pci_nvme_map_addr addr 0x80ae len 4096
pci_nvme_map_addr addr 0x80adf000 len 4096
pci_nvme_map_addr addr 0x80ade000 len 4096
pci_nvme_map_addr addr 0x80add000 len 4096
pci_nvme_map_addr addr 0x80adc000 len 4096
pci_nvme_map_addr addr 0x80adb000 len 4096
pci_nvme_map_addr addr 0x80ada000 len 4096
pci_nvme_map_addr addr 0x80ad9000 len 4096
pci_nvme_map_addr addr 0x80ad8000 len 4096
pci_nvme_map_addr addr 0x80ad7000 len 4096
pci_nvme_map_addr addr 0x80ad6000 len 4096
pci_nvme_map_addr addr 0x80ad5000 len 4096
pci_nvme_map_addr addr 0x80ad4000 len 4096
pci_nvme_map_addr addr 0x80ad3000 len 4096
pci_nvme_map_addr addr 0x80ad2000 len 4096
pci_nvme_map_addr addr 0x80ad1000 len 4096
pci_nvme_map_addr addr 0x80ad len 4096
pci_nvme_map_addr addr 0x80acf000 len 4096
pci_nvme_map_addr addr 0x80ace000 len 4096
pci_nvme_map_addr addr 0x80acd000 len 4096
pci_nvme_map_addr addr 0x80acc000 len 4096
pci_nvme_map_addr addr 0x80acb000 len 4096
pci_nvme_rw_cb cid 4428 blk 'd0'
pci_nvme_rw_complete_cb cid 4428 blk 'd0'
pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[1]: pci_nvme_irq_pin pulsing IRQ pin
pci_nvme_rw_cb cid 4429 blk 'd0'
pci_nvme_rw_complete_cb cid 4429 blk 'd0'
pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[2]: pci_nvme_irq_pin pulsing IRQ pin
[3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
[4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
 TIMEOUT HERE (30s) ---
[5]: pci_nvme_mmio_read addr 0x1c size 4
[6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
[7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
--- Interrupt deasserted (cq->tail == cq->head)
[   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled

Following the timeout, everything returns to "normal" and device/driver
happily continues.

The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).

What I'm thinking is that following the interrupt in [1], the driver
picks up completion for cid 4428 but does not find cid 4429 in the queue
since it has not been posted yet. Before getting a cq head doorbell
write (which would cause the pin to be deasserted), the device posts the
completion for cid 4429 which just keeps the interrupt asserted in [2].
The trace then shows the cq head doorbell update in [3,4] for cid 4428
and then we hit the timeout since the driver is not aware that cid 4429
has been posted in between this (why is it not aware of this?) Timing
out, the driver then polls the queue and notices cid 4429 and updates
the cq head doorbell in [5-7], causing the device to deassert the
interrupt and we are "back in shape".

I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
optimizations) as well as QE

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck


On 1/12/23 05:10, Klaus Jensen wrote:

Hi all (linux-nvme, qemu-devel, maintainers),

On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
pin-based interrupts, I'm seeing occasional completion timeouts, i.e.

   nvme nvme0: I/O 333 QID 1 timeout, completion polled

To rule out issues with shadow doorbells (which have been a source of
frustration in the past), those are disabled. FWIW I'm also seeing the
issue with shadow doorbells.

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f25cc2c235e9..28d8e7f4b56c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
 id->oacs =
-cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
NVME_OACS_DBBUF);
+cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
 id->cntrltype = 0x1;

 /*


I captured a trace from QEMU when this happens:

pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
num_prps 5
pci_nvme_map_addr addr 0x80aca000 len 4096
pci_nvme_map_addr addr 0x80ac9000 len 4096
pci_nvme_map_addr addr 0x80ac8000 len 4096
pci_nvme_map_addr addr 0x80ac7000 len 4096
pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
num_prps 29
pci_nvme_map_addr addr 0x80ae6000 len 4096
pci_nvme_map_addr addr 0x80ae5000 len 4096
pci_nvme_map_addr addr 0x80ae4000 len 4096
pci_nvme_map_addr addr 0x80ae3000 len 4096
pci_nvme_map_addr addr 0x80ae2000 len 4096
pci_nvme_map_addr addr 0x80ae1000 len 4096
pci_nvme_map_addr addr 0x80ae len 4096
pci_nvme_map_addr addr 0x80adf000 len 4096
pci_nvme_map_addr addr 0x80ade000 len 4096
pci_nvme_map_addr addr 0x80add000 len 4096
pci_nvme_map_addr addr 0x80adc000 len 4096
pci_nvme_map_addr addr 0x80adb000 len 4096
pci_nvme_map_addr addr 0x80ada000 len 4096
pci_nvme_map_addr addr 0x80ad9000 len 4096
pci_nvme_map_addr addr 0x80ad8000 len 4096
pci_nvme_map_addr addr 0x80ad7000 len 4096
pci_nvme_map_addr addr 0x80ad6000 len 4096
pci_nvme_map_addr addr 0x80ad5000 len 4096
pci_nvme_map_addr addr 0x80ad4000 len 4096
pci_nvme_map_addr addr 0x80ad3000 len 4096
pci_nvme_map_addr addr 0x80ad2000 len 4096
pci_nvme_map_addr addr 0x80ad1000 len 4096
pci_nvme_map_addr addr 0x80ad len 4096
pci_nvme_map_addr addr 0x80acf000 len 4096
pci_nvme_map_addr addr 0x80ace000 len 4096
pci_nvme_map_addr addr 0x80acd000 len 4096
pci_nvme_map_addr addr 0x80acc000 len 4096
pci_nvme_map_addr addr 0x80acb000 len 4096
pci_nvme_rw_cb cid 4428 blk 'd0'
pci_nvme_rw_complete_cb cid 4428 blk 'd0'
pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[1]: pci_nvme_irq_pin pulsing IRQ pin
pci_nvme_rw_cb cid 4429 blk 'd0'
pci_nvme_rw_complete_cb cid 4429 blk 'd0'
pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[2]: pci_nvme_irq_pin pulsing IRQ pin
[3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
[4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
 TIMEOUT HERE (30s) ---
[5]: pci_nvme_mmio_read addr 0x1c size 4
[6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
[7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
--- Interrupt deasserted (cq->tail == cq->head)
[   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled

Following the timeout, everything returns to "normal" and device/driver
happily continues.

The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).

What I'm thinking is that following the interrupt in [1], the driver
picks up completion for cid 4428 but does not find cid 4429 in the queue
since it has not been posted yet. Before getting a cq head doorbell
write (which would cause the pin to be deasserted), the device posts the
completion for cid 4429 which just keeps the interrupt asserted in [2].
The trace then shows the cq head doorbell update in [3,4] for cid 4428
and then we hit the timeout since the driver is not aware that cid 4429
has been posted in between this (why is it not aware of this?) Timing
out, the driver then polls the queue and notices cid 4429 and updates
the cq head doorbell in [5-7], causing the device to deassert the
interrupt and we are "back in shape".

I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
Tested on QEMU v7.0.0 (to rule out all the sha

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).

Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.

Yep, that looks good. It's sounding like something with the CPU irq
handling is treating the level interrupt as edge triggered.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Klaus Jensen

On Jan 12 09:34, Keith Busch wrote:
> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > 
> > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > am wondering if there is something going on with the kernel driver (but
> > I certainly do not rule out that hw/nvme is at fault here, since
> > pin-based interrupts has also been a source of several issues in the
> > past).
> 
> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> While probably not the "correct" thing to do, it has better results in
> my testing.
> 

A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);

seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?

signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck


On 1/12/23 09:45, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


See
https://github.com/groeck/linux-build-test/blob/master/rootfs/mips/run-qemu-mips.sh
and
https://github.com/groeck/linux-build-test/blob/master/rootfs/mipsel/run-qemu-mipsel.sh

I'll apply the change suggested above to my version of qemu (7.2.0)
and give it a try.

Guenter

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

Could you try the below?

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
+{
+if (!cq->irq_enabled) {
+return;
+}
+
+if (msix_enabled(&(n->parent_obj))) {
+msix_notify(&(n->parent_obj), cq->vector);
+return;
+}
+
+pci_irq_pulse(&n->parent_obj);
+}
+
 static void nvme_req_clear(NvmeRequest *req)
 {
 req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, 
int val)
 }
 
 nvme_irq_deassert(n, cq);
+} else {
+/*
+ * Retrigger the irq just to make sure the host has no excuse for
+ * not knowing there's more work to complete on this CQ.
+ */
+nvme_irq_pulse(n, cq);
 }
 } else {
 /* Submission queue doorbell write */
--

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck


On 1/12/23 09:45, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


So, for mipsel, two sets of results for the above:

First, qemu v7.2 is already much better than qemu v7.1. With qemu v7.1,
the boot test fails roughly every other test. Failure rate with qemu v7.2
is much less.

Second, my nvme boot test with qemu 7.2 fails after ~5-10 iterations.
After the above change, I did not see a single failure in 50 boot tests.

I'll test the other suggested change next.

Guenter

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck


On 1/12/23 11:27, Keith Busch wrote:

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


Could you try the below?



This works as well: 50 iterations with no failures after applying the patch
below on top of qemu v7.2.0. Tested with qemu-system-mipsel.

Guenter


---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
  }
  }
  
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)

+{
+if (!cq->irq_enabled) {
+return;
+}
+
+if (msix_enabled(&(n->parent_obj))) {
+msix_notify(&(n->parent_obj), cq->vector);
+return;
+}
+
+pci_irq_pulse(&n->parent_obj);
+}
+
  static void nvme_req_clear(NvmeRequest *req)
  {
  req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, 
int val)
  }
  
  nvme_irq_deassert(n, cq);

+} else {
+/*
+ * Retrigger the irq just to make sure the host has no excuse for
+ * not knowing there's more work to complete on this CQ.
+ */
+nvme_irq_pulse(n, cq);
  }
  } else {
  /* Submission queue doorbell write */
--

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Klaus Jensen

+CC qemu pci maintainers

Michael, Marcel,

Do you have any comments on this thread? As you can see one solution is
to simply deassert prior to asserting, the other is to reintroduce a
pci_irq_pulse(). Both seem to solve the issue.

On Jan 12 14:10, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> To rule out issues with shadow doorbells (which have been a source of
> frustration in the past), those are disabled. FWIW I'm also seeing the
> issue with shadow doorbells.
> 
>   diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>   index f25cc2c235e9..28d8e7f4b56c 100644
>   --- a/hw/nvme/ctrl.c
>   +++ b/hw/nvme/ctrl.c
>   @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> *pci_dev)
>id->mdts = n->params.mdts;
>id->ver = cpu_to_le32(NVME_SPEC_VER);
>id->oacs =
>   -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
> NVME_OACS_DBBUF);
>   +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
>id->cntrltype = 0x1;
> 
>/*
> 
> 
> I captured a trace from QEMU when this happens:
> 
> pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
> pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
> pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
> pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
> num_prps 5
> pci_nvme_map_addr addr 0x80aca000 len 4096
> pci_nvme_map_addr addr 0x80ac9000 len 4096
> pci_nvme_map_addr addr 0x80ac8000 len 4096
> pci_nvme_map_addr addr 0x80ac7000 len 4096
> pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
> pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
> num_prps 29
> pci_nvme_map_addr addr 0x80ae6000 len 4096
> pci_nvme_map_addr addr 0x80ae5000 len 4096
> pci_nvme_map_addr addr 0x80ae4000 len 4096
> pci_nvme_map_addr addr 0x80ae3000 len 4096
> pci_nvme_map_addr addr 0x80ae2000 len 4096
> pci_nvme_map_addr addr 0x80ae1000 len 4096
> pci_nvme_map_addr addr 0x80ae len 4096
> pci_nvme_map_addr addr 0x80adf000 len 4096
> pci_nvme_map_addr addr 0x80ade000 len 4096
> pci_nvme_map_addr addr 0x80add000 len 4096
> pci_nvme_map_addr addr 0x80adc000 len 4096
> pci_nvme_map_addr addr 0x80adb000 len 4096
> pci_nvme_map_addr addr 0x80ada000 len 4096
> pci_nvme_map_addr addr 0x80ad9000 len 4096
> pci_nvme_map_addr addr 0x80ad8000 len 4096
> pci_nvme_map_addr addr 0x80ad7000 len 4096
> pci_nvme_map_addr addr 0x80ad6000 len 4096
> pci_nvme_map_addr addr 0x80ad5000 len 4096
> pci_nvme_map_addr addr 0x80ad4000 len 4096
> pci_nvme_map_addr addr 0x80ad3000 len 4096
> pci_nvme_map_addr addr 0x80ad2000 len 4096
> pci_nvme_map_addr addr 0x80ad1000 len 4096
> pci_nvme_map_addr addr 0x80ad len 4096
> pci_nvme_map_addr addr 0x80acf000 len 4096
> pci_nvme_map_addr addr 0x80ace000 len 4096
> pci_nvme_map_addr addr 0x80acd000 len 4096
> pci_nvme_map_addr addr 0x80acc000 len 4096
> pci_nvme_map_addr addr 0x80acb000 len 4096
> pci_nvme_rw_cb cid 4428 blk 'd0'
> pci_nvme_rw_complete_cb cid 4428 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [1]: pci_nvme_irq_pin pulsing IRQ pin
> pci_nvme_rw_cb cid 4429 blk 'd0'
> pci_nvme_rw_complete_cb cid 4429 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [2]: pci_nvme_irq_pin pulsing IRQ pin
> [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
> [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
>  TIMEOUT HERE (30s) ---
> [5]: pci_nvme_mmio_read addr 0x1c size 4
> [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
> [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
> --- Interrupt deasserted (cq->tail == cq->head)
> [   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> Following the timeout, everything returns to "normal" and device/driver
> happily continues.
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).
> 
> What I'm thinking is that following the interrupt in [1], the driver
> picks up completion for cid 4428 but does not find cid 4429 in the queue
> since it has not been posted yet. Before getting a cq head doorbell
> write (which would cause the pin to be deasserted), the device posts the
> completion for cid 4429 which just keeps the interrupt asserted in [2].
> The trace then shows the cq head doorbell update in [3,4] for cid 4428
> and then we hit the

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Peter Maydell

On Fri, 13 Jan 2023 at 08:55, Klaus Jensen  wrote:
>
> +CC qemu pci maintainers
>
> Michael, Marcel,
>
> Do you have any comments on this thread? As you can see one solution is
> to simply deassert prior to asserting, the other is to reintroduce a
> pci_irq_pulse(). Both seem to solve the issue.

Both seem to be missing any analysis of "this is what is
happening, this is where we differ from hardware, this
is why this is the correct fix". We shouldn't put in
random "this seems to happen to cause the guest to boot"
fixes, please.

thanks
-- PMM

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Klaus Jensen

On Jan 13 12:32, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen  wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see one solution is
> > to simply deassert prior to asserting, the other is to reintroduce a
> > pci_irq_pulse(). Both seem to solve the issue.
> 
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.
> 

No, I'd like to get to the bottom of this, which is why I'm reaching out
to the pci maintainers to get an idea about if this is a general/known
issue with pin based interrupts on pci.

There are a fair amount of uses of pci_irq_pulse() still left in the
tree.


signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Peter Maydell

On Fri, 13 Jan 2023 at 12:37, Klaus Jensen  wrote:
> There are a fair amount of uses of pci_irq_pulse() still left in the
> tree.

Are there? I feel like I'm missing something here:
$ git grep pci_irq_pulse
include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev)
$

...looks at first sight like an unused function we could delete.

thanks
-- PMM

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Klaus Jensen

On Jan 13 12:42, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 12:37, Klaus Jensen  wrote:
> > There are a fair amount of uses of pci_irq_pulse() still left in the
> > tree.
> 
> Are there? I feel like I'm missing something here:
> $ git grep pci_irq_pulse
> include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev)
> $
> 
> ...looks at first sight like an unused function we could delete.
> 

Oh! My mistake! I grepped for irq_pulse which matched qemu_irq_pulse(),
doh.


signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Keith Busch

On Fri, Jan 13, 2023 at 12:32:29PM +, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen  wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see one solution is
> > to simply deassert prior to asserting, the other is to reintroduce a
> > pci_irq_pulse(). Both seem to solve the issue.
> 
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.

It looks like these are being treated as edge instead of level
interrupts so the work-around is to create more edges. I would
definitely prefer a real fix, whether that's in the kernel or CPU
emulation or somewhere else, but I'm not sure I'll have time to see it
to completion.

FWIW, x86 seems to handle legacy IRQs with NVMe as expected. It's
actually easy enough for the DEASSERT to take so long that kernel
reports the irq as "spurious" because it's spinning too often on the
level.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-16 Thread Klaus Jensen

On Jan 12 14:10, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> To rule out issues with shadow doorbells (which have been a source of
> frustration in the past), those are disabled. FWIW I'm also seeing the
> issue with shadow doorbells.
> 
>   diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>   index f25cc2c235e9..28d8e7f4b56c 100644
>   --- a/hw/nvme/ctrl.c
>   +++ b/hw/nvme/ctrl.c
>   @@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> *pci_dev)
>id->mdts = n->params.mdts;
>id->ver = cpu_to_le32(NVME_SPEC_VER);
>id->oacs =
>   -cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
> NVME_OACS_DBBUF);
>   +cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
>id->cntrltype = 0x1;
> 
>/*
> 
> 
> I captured a trace from QEMU when this happens:
> 
> pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
> pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
> pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
> pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
> num_prps 5
> pci_nvme_map_addr addr 0x80aca000 len 4096
> pci_nvme_map_addr addr 0x80ac9000 len 4096
> pci_nvme_map_addr addr 0x80ac8000 len 4096
> pci_nvme_map_addr addr 0x80ac7000 len 4096
> pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
> pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
> num_prps 29
> pci_nvme_map_addr addr 0x80ae6000 len 4096
> pci_nvme_map_addr addr 0x80ae5000 len 4096
> pci_nvme_map_addr addr 0x80ae4000 len 4096
> pci_nvme_map_addr addr 0x80ae3000 len 4096
> pci_nvme_map_addr addr 0x80ae2000 len 4096
> pci_nvme_map_addr addr 0x80ae1000 len 4096
> pci_nvme_map_addr addr 0x80ae len 4096
> pci_nvme_map_addr addr 0x80adf000 len 4096
> pci_nvme_map_addr addr 0x80ade000 len 4096
> pci_nvme_map_addr addr 0x80add000 len 4096
> pci_nvme_map_addr addr 0x80adc000 len 4096
> pci_nvme_map_addr addr 0x80adb000 len 4096
> pci_nvme_map_addr addr 0x80ada000 len 4096
> pci_nvme_map_addr addr 0x80ad9000 len 4096
> pci_nvme_map_addr addr 0x80ad8000 len 4096
> pci_nvme_map_addr addr 0x80ad7000 len 4096
> pci_nvme_map_addr addr 0x80ad6000 len 4096
> pci_nvme_map_addr addr 0x80ad5000 len 4096
> pci_nvme_map_addr addr 0x80ad4000 len 4096
> pci_nvme_map_addr addr 0x80ad3000 len 4096
> pci_nvme_map_addr addr 0x80ad2000 len 4096
> pci_nvme_map_addr addr 0x80ad1000 len 4096
> pci_nvme_map_addr addr 0x80ad len 4096
> pci_nvme_map_addr addr 0x80acf000 len 4096
> pci_nvme_map_addr addr 0x80ace000 len 4096
> pci_nvme_map_addr addr 0x80acd000 len 4096
> pci_nvme_map_addr addr 0x80acc000 len 4096
> pci_nvme_map_addr addr 0x80acb000 len 4096
> pci_nvme_rw_cb cid 4428 blk 'd0'
> pci_nvme_rw_complete_cb cid 4428 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [1]: pci_nvme_irq_pin pulsing IRQ pin
> pci_nvme_rw_cb cid 4429 blk 'd0'
> pci_nvme_rw_complete_cb cid 4429 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [2]: pci_nvme_irq_pin pulsing IRQ pin
> [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
> [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
>  TIMEOUT HERE (30s) ---
> [5]: pci_nvme_mmio_read addr 0x1c size 4
> [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
> [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
> --- Interrupt deasserted (cq->tail == cq->head)
> [   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> Following the timeout, everything returns to "normal" and device/driver
> happily continues.
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).
> 
> What I'm thinking is that following the interrupt in [1], the driver
> picks up completion for cid 4428 but does not find cid 4429 in the queue
> since it has not been posted yet. Before getting a cq head doorbell
> write (which would cause the pin to be deasserted), the device posts the
> completion for cid 4429 which just keeps the interrupt asserted in [2].
> The trace then shows the cq head doorbell update in [3,4] for cid 4428
> and then we hit the timeout since the driver is not aware that cid 4429
> has been posted in between this (why is it not aware of this?) Timing
> out, the driver then polls the queue and notices cid 4429 and updates
> the cq head doorbell in [5-7], cau

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-16 Thread Keith Busch

On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.

That's assuming completions are deferred to a bottom half. We don't do
that by default in Linux nvme, though you can ask the driver to do that
if you want.
 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
>
> No changes are required in the QEMU hw/nvme interrupt logic.

Yeah, I can see why: it forces the irq line to deassert then assert,
just like we had forced to happen within the device side patches. Still,
none of that is supposed to be necessary, but this idea of using these
registers is probably fine.

>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +   struct nvme_queue *nvmeq = data;
> +   struct nvme_dev *dev = nvmeq->dev;
> +   u32 mask = 1 << nvmeq->cq_vector;
> +   irqreturn_t ret = IRQ_NONE;
> +   DEFINE_IO_COMP_BATCH(iob);
> +
> +   writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +   if (nvme_poll_cq(nvmeq, &iob)) {
> +   if (!rq_list_empty(iob.req_list))
> +   nvme_pci_complete_batch(&iob);
> +   ret = IRQ_HANDLED;
> +   }
> +
> +   writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +   return ret;
> +}

If threaded interrupts are used, you'll want to do the masking in
nvme_irq_check(), then clear it in the threaded handler instead of doing
both in the same callback.

> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
> struct nvme_queue *nvmeq = data;
> DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
> struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> int nr = nvmeq->dev->ctrl.instance;
> +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
> if (use_threaded_interrupts) {
> return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -   nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +   handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> } else {
> -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> }
>  }
> 
>

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck

On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
[ ... ]
> 
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.
> 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
> 
> No changes are required in the QEMU hw/nvme interrupt logic.
> 

Yes, but isn't that technically similar to dropping the
interrupt request and raising it again, or in other words
pulsing the interrupt ?

I still don't understand why the (level triggered) interrupt
pin would require pulsing in the first place.

Thanks,
Guenter

> 
> diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c
> index b13baccedb4a..75f6b87c4c3f 100644
> --- i/drivers/nvme/host/pci.c
> +++ w/drivers/nvme/host/pci.c
> @@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue 
> *nvmeq,
>  }
> 
>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +   struct nvme_queue *nvmeq = data;
> +   struct nvme_dev *dev = nvmeq->dev;
> +   u32 mask = 1 << nvmeq->cq_vector;
> +   irqreturn_t ret = IRQ_NONE;
> +   DEFINE_IO_COMP_BATCH(iob);
> +
> +   writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +   if (nvme_poll_cq(nvmeq, &iob)) {
> +   if (!rq_list_empty(iob.req_list))
> +   nvme_pci_complete_batch(&iob);
> +   ret = IRQ_HANDLED;
> +   }
> +
> +   writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +   return ret;
> +}
> +
> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
> struct nvme_queue *nvmeq = data;
> DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
> struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> int nr = nvmeq->dev->ctrl.instance;
> +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
> if (use_threaded_interrupts) {
> return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -   nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +   handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> } else {
> -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> }
>  }
> 
>

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck

On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > to mask interrupts on the controller while processing CQEs. While not
> > required by the spec, it is *recommended* in setups not using MSI-X to
> > reduce the risk of spurious and/or missed interrupts.
> 
> That's assuming completions are deferred to a bottom half. We don't do
> that by default in Linux nvme, though you can ask the driver to do that
> if you want.
>  
> > With the patch below, running 100 boot iterations, no timeouts were
> > observed on QEMU emulated riscv64 or mips64.
> >
> > No changes are required in the QEMU hw/nvme interrupt logic.
> 
> Yeah, I can see why: it forces the irq line to deassert then assert,
> just like we had forced to happen within the device side patches. Still,
> none of that is supposed to be necessary, but this idea of using these
> registers is probably fine.

There is still no answer why this would be necessary in the first place,
on either side. In my opinion, unless someone can confirm that the problem
is seen with real hardware, we should assume that it happens on the qemu
side and address it there.

Guenter

> 
> >  static irqreturn_t nvme_irq(int irq, void *data)
> > +{
> > +   struct nvme_queue *nvmeq = data;
> > +   struct nvme_dev *dev = nvmeq->dev;
> > +   u32 mask = 1 << nvmeq->cq_vector;
> > +   irqreturn_t ret = IRQ_NONE;
> > +   DEFINE_IO_COMP_BATCH(iob);
> > +
> > +   writel(mask, dev->bar + NVME_REG_INTMS);
> > +
> > +   if (nvme_poll_cq(nvmeq, &iob)) {
> > +   if (!rq_list_empty(iob.req_list))
> > +   nvme_pci_complete_batch(&iob);
> > +   ret = IRQ_HANDLED;
> > +   }
> > +
> > +   writel(mask, dev->bar + NVME_REG_INTMC);
> > +
> > +   return ret;
> > +}
> 
> If threaded interrupts are used, you'll want to do the masking in
> nvme_irq_check(), then clear it in the threaded handler instead of doing
> both in the same callback.
> 
> > +static irqreturn_t nvme_irq_msix(int irq, void *data)
> >  {
> > struct nvme_queue *nvmeq = data;
> > DEFINE_IO_COMP_BATCH(iob);
> > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue 
> > *nvmeq)
> >  {
> > struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> > int nr = nvmeq->dev->ctrl.instance;
> > +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : 
> > nvme_irq;
> > 
> > if (use_threaded_interrupts) {
> > return pci_request_irq(pdev, nvmeq->cq_vector, 
> > nvme_irq_check,
> > -   nvme_irq, nvmeq, "nvme%dq%d", nr, 
> > nvmeq->qid);
> > +   handler, nvmeq, "nvme%dq%d", nr, 
> > nvmeq->qid);
> > } else {
> > -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> > +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> > NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > }
> >  }
> > 
> > 
> 
>

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Peter Maydell

On Tue, 17 Jan 2023 at 16:10, Guenter Roeck  wrote:
>
> On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > to mask interrupts on the controller while processing CQEs. While not
> > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > reduce the risk of spurious and/or missed interrupts.
> >
> > That's assuming completions are deferred to a bottom half. We don't do
> > that by default in Linux nvme, though you can ask the driver to do that
> > if you want.
> >
> > > With the patch below, running 100 boot iterations, no timeouts were
> > > observed on QEMU emulated riscv64 or mips64.
> > >
> > > No changes are required in the QEMU hw/nvme interrupt logic.
> >
> > Yeah, I can see why: it forces the irq line to deassert then assert,
> > just like we had forced to happen within the device side patches. Still,
> > none of that is supposed to be necessary, but this idea of using these
> > registers is probably fine.
>
> There is still no answer why this would be necessary in the first place,
> on either side. In my opinion, unless someone can confirm that the problem
> is seen with real hardware, we should assume that it happens on the qemu
> side and address it there.

Sure, but that means identifying what the divergence
between QEMU's implementation and the hardware is first. I don't
want a fudged fix in QEMU's code any more than you want one in
the kernel's driver code :-)

thanks
-- PMM

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck

On Tue, Jan 17, 2023 at 04:18:14PM +, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 16:10, Guenter Roeck  wrote:
> >
> > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > > to mask interrupts on the controller while processing CQEs. While not
> > > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > > reduce the risk of spurious and/or missed interrupts.
> > >
> > > That's assuming completions are deferred to a bottom half. We don't do
> > > that by default in Linux nvme, though you can ask the driver to do that
> > > if you want.
> > >
> > > > With the patch below, running 100 boot iterations, no timeouts were
> > > > observed on QEMU emulated riscv64 or mips64.
> > > >
> > > > No changes are required in the QEMU hw/nvme interrupt logic.
> > >
> > > Yeah, I can see why: it forces the irq line to deassert then assert,
> > > just like we had forced to happen within the device side patches. Still,
> > > none of that is supposed to be necessary, but this idea of using these
> > > registers is probably fine.
> >
> > There is still no answer why this would be necessary in the first place,
> > on either side. In my opinion, unless someone can confirm that the problem
> > is seen with real hardware, we should assume that it happens on the qemu
> > side and address it there.
> 
> Sure, but that means identifying what the divergence
> between QEMU's implementation and the hardware is first. I don't
> want a fudged fix in QEMU's code any more than you want one in
> the kernel's driver code :-)
> 

I actually prefer it in qemu because that means I can test nvme support
on all active LTS releases of the Linux kernel, but that is POV and
secondary. This has been broken ever since I started testing nvme
support with qemu, so it doesn't make much of a difference if fixing
the problem for good takes a bit longer. Plus, I run my own version
of qemu anyway, so carrying the fix (hack) in qemu doesn't make much
of a difference for me.

Anyway - any idea what to do to help figuring out what is happening ?
Add tracing support to pci interrupt handling, maybe ?

Guenter

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Keith Busch

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how
you're getting a "completion polled" here. I find that the polling from
here progresses the request state to IDLE, so the timeout handler moves
on to aborting because it's expecting COMPLETED. The driver should not
be doing that, so I'll fix that up while also investigating why the
interrupt isn't retriggering as expected.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Peter Maydell

On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> Anyway - any idea what to do to help figuring out what is happening ?
> Add tracing support to pci interrupt handling, maybe ?

For intermittent bugs, I like recording the QEMU session under
rr (using its chaos mode to provoke the failure if necessary) to
get a recording that I can debug and re-debug at leisure. Usually
you want to turn on/add tracing to help with this, and if the
failure doesn't hit early in bootup then you might need to
do a QEMU snapshot just before point-of-failure so you can
run rr only on the short snapshot-to-failure segment.

https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/

This gives you a debugging session from the QEMU side's perspective,
of course -- assuming you know what the hardware is supposed to do
you hopefully wind up with either "the guest software did X,Y,Z
and we incorrectly did A" or else "the guest software did X,Y,Z,
the spec says A is the right/a permitted thing but the guest got confused".
If it's the latter then you have to look at the guest as a separate
code analysis/debug problem.

thanks
-- PMM

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch

On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> > Anyway - any idea what to do to help figuring out what is happening ?
> > Add tracing support to pci interrupt handling, maybe ?
> 
> For intermittent bugs, I like recording the QEMU session under
> rr (using its chaos mode to provoke the failure if necessary) to
> get a recording that I can debug and re-debug at leisure. Usually
> you want to turn on/add tracing to help with this, and if the
> failure doesn't hit early in bootup then you might need to
> do a QEMU snapshot just before point-of-failure so you can
> run rr only on the short snapshot-to-failure segment.
> 
> https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> 
> This gives you a debugging session from the QEMU side's perspective,
> of course -- assuming you know what the hardware is supposed to do
> you hopefully wind up with either "the guest software did X,Y,Z
> and we incorrectly did A" or else "the guest software did X,Y,Z,
> the spec says A is the right/a permitted thing but the guest got confused".
> If it's the latter then you have to look at the guest as a separate
> code analysis/debug problem.

Here's what I got, though I'm way out of my depth here.

It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the
interrupt after its first handling, which I think is expected. After
claiming, QEMU masks the pending interrupt, lowering the level, though
the device that raised it never deasserted.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch

Klaus,

This isn't going to help your issue, but there are at least two legacy
irq bugs in the nvme qemu implementation.

1. The admin queue breaks if start with legacy and later initialize
msix.

2. The legacy vector is shared among all queues, but it's being
deasserted when the first queue's doorbell makes it empty. It should
remain enabled if any cq is non-empty.

I'll send you some patches for those later. Still working on the real
problem.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch

On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote:
> > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> > > Anyway - any idea what to do to help figuring out what is happening ?
> > > Add tracing support to pci interrupt handling, maybe ?
> > 
> > For intermittent bugs, I like recording the QEMU session under
> > rr (using its chaos mode to provoke the failure if necessary) to
> > get a recording that I can debug and re-debug at leisure. Usually
> > you want to turn on/add tracing to help with this, and if the
> > failure doesn't hit early in bootup then you might need to
> > do a QEMU snapshot just before point-of-failure so you can
> > run rr only on the short snapshot-to-failure segment.
> > 
> > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> > 
> > This gives you a debugging session from the QEMU side's perspective,
> > of course -- assuming you know what the hardware is supposed to do
> > you hopefully wind up with either "the guest software did X,Y,Z
> > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > the spec says A is the right/a permitted thing but the guest got confused".
> > If it's the latter then you have to look at the guest as a separate
> > code analysis/debug problem.
> 
> Here's what I got, though I'm way out of my depth here.
> 
> It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the
> interrupt after its first handling, which I think is expected. After
> claiming, QEMU masks the pending interrupt, lowering the level, though
> the device that raised it never deasserted.

I'm not sure if this is correct, but this is what I'm coming up with and
appears to fix the problem on my setup. The hardware that sets the
pending interrupt is going clear it, so I don't see why the interrupt
controller is automatically clearing it when the host claims it.

---
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..f8f7af08dc 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, 
unsigned size)
 uint32_t max_irq = sifive_plic_claimed(plic, addrid);
 
 if (max_irq) {
-sifive_plic_set_pending(plic, max_irq, false);
 sifive_plic_set_claimed(plic, max_irq, true);
 }
 
--

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Alistair Francis

On Thu, Jan 19, 2023 at 9:07 AM Keith Busch  wrote:
>
> On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> > On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote:
> > > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> > > > Anyway - any idea what to do to help figuring out what is happening ?
> > > > Add tracing support to pci interrupt handling, maybe ?
> > >
> > > For intermittent bugs, I like recording the QEMU session under
> > > rr (using its chaos mode to provoke the failure if necessary) to
> > > get a recording that I can debug and re-debug at leisure. Usually
> > > you want to turn on/add tracing to help with this, and if the
> > > failure doesn't hit early in bootup then you might need to
> > > do a QEMU snapshot just before point-of-failure so you can
> > > run rr only on the short snapshot-to-failure segment.
> > >
> > > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> > >
> > > This gives you a debugging session from the QEMU side's perspective,
> > > of course -- assuming you know what the hardware is supposed to do
> > > you hopefully wind up with either "the guest software did X,Y,Z
> > > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > > the spec says A is the right/a permitted thing but the guest got 
> > > confused".
> > > If it's the latter then you have to look at the guest as a separate
> > > code analysis/debug problem.
> >
> > Here's what I got, though I'm way out of my depth here.
> >
> > It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the
> > interrupt after its first handling, which I think is expected. After
> > claiming, QEMU masks the pending interrupt, lowering the level, though
> > the device that raised it never deasserted.
>
> I'm not sure if this is correct, but this is what I'm coming up with and
> appears to fix the problem on my setup. The hardware that sets the
> pending interrupt is going clear it, so I don't see why the interrupt
> controller is automatically clearing it when the host claims it.
>
> ---
> diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> index c2dfacf028..f8f7af08dc 100644
> --- a/hw/intc/sifive_plic.c
> +++ b/hw/intc/sifive_plic.c
> @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> addr, unsigned size)
>  uint32_t max_irq = sifive_plic_claimed(plic, addrid);
>
>  if (max_irq) {
> -sifive_plic_set_pending(plic, max_irq, false);
>  sifive_plic_set_claimed(plic, max_irq, true);
>  }
>

This change isn't correct. The PLIC spec
(https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
states:

"""
On receiving a claim message, the PLIC core will atomically determine
the ID of the highest-priority pending interrupt for the target and
then clear down the corresponding source’s IP bit
"""

which is what we are doing here. We are clearing the pending interrupt
inside the PLIC

Alistair

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch

On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 9:07 AM Keith Busch  wrote:
> > ---
> > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > index c2dfacf028..f8f7af08dc 100644
> > --- a/hw/intc/sifive_plic.c
> > +++ b/hw/intc/sifive_plic.c
> > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> > addr, unsigned size)
> >  uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> >
> >  if (max_irq) {
> > -sifive_plic_set_pending(plic, max_irq, false);
> >  sifive_plic_set_claimed(plic, max_irq, true);
> >  }
> >
> 
> This change isn't correct. The PLIC spec
> (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> states:
> 
> """
> On receiving a claim message, the PLIC core will atomically determine
> the ID of the highest-priority pending interrupt for the target and
> then clear down the corresponding source’s IP bit
> """
> 
> which is what we are doing here. We are clearing the pending interrupt
> inside the PLIC

Thanks for the link. That's very helpful.

If you're familiar with this area, could you possibly clear up this part
of that spec?

"
On receiving an interrupt completion message, if the interrupt is
level-triggered and the interrupt is still asserted, a new interrupt
request will be forwarded to the PLIC core.
"

Further up, it says the "interrupt gateway" is responsible for
forwarding new interrupt requests while the level remains asserted, but
it doesn't look like anything is handling that, which essentially turns
this into an edge interrupt. Am I missing something, or is this really
not being handled?

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Alistair Francis

On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
>
> On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 9:07 AM Keith Busch  wrote:
> > > ---
> > > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > > index c2dfacf028..f8f7af08dc 100644
> > > --- a/hw/intc/sifive_plic.c
> > > +++ b/hw/intc/sifive_plic.c
> > > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> > > addr, unsigned size)
> > >  uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> > >
> > >  if (max_irq) {
> > > -sifive_plic_set_pending(plic, max_irq, false);
> > >  sifive_plic_set_claimed(plic, max_irq, true);
> > >  }
> > >
> >
> > This change isn't correct. The PLIC spec
> > (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> > states:
> >
> > """
> > On receiving a claim message, the PLIC core will atomically determine
> > the ID of the highest-priority pending interrupt for the target and
> > then clear down the corresponding source’s IP bit
> > """
> >
> > which is what we are doing here. We are clearing the pending interrupt
> > inside the PLIC
>
> Thanks for the link. That's very helpful.
>
> If you're familiar with this area, could you possibly clear up this part
> of that spec?
>
> "
> On receiving an interrupt completion message, if the interrupt is
> level-triggered and the interrupt is still asserted, a new interrupt
> request will be forwarded to the PLIC core.
> "
>
> Further up, it says the "interrupt gateway" is responsible for
> forwarding new interrupt requests while the level remains asserted, but
> it doesn't look like anything is handling that, which essentially turns
> this into an edge interrupt. Am I missing something, or is this really
> not being handled?

Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
internal GPIO lines to trigger an interrupt. So with the current setup
we only support edge triggered interrupts.

It looks like we also drop the pending bit if the original interrupt
de-asserts, which appears to be incorrect as well.

Alistair

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch

On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
> >
> > Further up, it says the "interrupt gateway" is responsible for
> > forwarding new interrupt requests while the level remains asserted, but
> > it doesn't look like anything is handling that, which essentially turns
> > this into an edge interrupt. Am I missing something, or is this really
> > not being handled?
> 
> Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
> internal GPIO lines to trigger an interrupt. So with the current setup
> we only support edge triggered interrupts.

Thanks for confirming!

Klaus,
I think we can justify introducing a work-around in the emulated device
now. My previous proposal with pci_irq_pulse() is no good since it does
assert+deassert, but it needs to be the other way around, so please
don't considert that one.

Also, we ought to revisit the intms/intmc usage in the linux driver for
threaded interrupts.

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Klaus Jensen

On Jan 18 15:26, Keith Busch wrote:
> Klaus,
> 
> This isn't going to help your issue, but there are at least two legacy
> irq bugs in the nvme qemu implementation.
> 
> 1. The admin queue breaks if start with legacy and later initialize
> msix.
> 

Hmm. Interesting that we have not encountered this before - is this
because the kernel will enable MSI-X early and use it for the admin
queue immediately?

> 2. The legacy vector is shared among all queues, but it's being
> deasserted when the first queue's doorbell makes it empty. It should
> remain enabled if any cq is non-empty.

I was certain that we fixed this already in commit 83d7ed5c570
("hw/nvme: fix pin-based interrupt behavior (again)")...

signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Klaus Jensen

On Jan 18 21:03, Keith Busch wrote:
> On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
> > >
> > > Further up, it says the "interrupt gateway" is responsible for
> > > forwarding new interrupt requests while the level remains asserted, but
> > > it doesn't look like anything is handling that, which essentially turns
> > > this into an edge interrupt. Am I missing something, or is this really
> > > not being handled?
> > 
> > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
> > internal GPIO lines to trigger an interrupt. So with the current setup
> > we only support edge triggered interrupts.
> 
> Thanks for confirming!
> 
> Klaus,
> I think we can justify introducing a work-around in the emulated device
> now. My previous proposal with pci_irq_pulse() is no good since it does
> assert+deassert, but it needs to be the other way around, so please
> don't considert that one.
> 
> Also, we ought to revisit the intms/intmc usage in the linux driver for
> threaded interrupts.

+CC: qemu-riscv

Keith,

Thanks for digging into this!

Yeah, you are probably right that we should only use the intms/intmc
changes in the use_threaded_interrupts case, not in general. While my
RFC patch does seem to "fix" this, it is just a workaround as your
analysis indicate.


signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-19 Thread Klaus Jensen

On Jan 19 08:28, Klaus Jensen wrote:
> On Jan 18 21:03, Keith Busch wrote:
> > On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
> > > >
> > > > Further up, it says the "interrupt gateway" is responsible for
> > > > forwarding new interrupt requests while the level remains asserted, but
> > > > it doesn't look like anything is handling that, which essentially turns
> > > > this into an edge interrupt. Am I missing something, or is this really
> > > > not being handled?
> > > 
> > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
> > > internal GPIO lines to trigger an interrupt. So with the current setup
> > > we only support edge triggered interrupts.
> > 
> > Thanks for confirming!
> > 
> > Klaus,
> > I think we can justify introducing a work-around in the emulated device
> > now. My previous proposal with pci_irq_pulse() is no good since it does
> > assert+deassert, but it needs to be the other way around, so please
> > don't considert that one.
> > 
> > Also, we ought to revisit the intms/intmc usage in the linux driver for
> > threaded interrupts.
> 
> +CC: qemu-riscv
> 
> Keith,
> 
> Thanks for digging into this!
> 
> Yeah, you are probably right that we should only use the intms/intmc
> changes in the use_threaded_interrupts case, not in general. While my
> RFC patch does seem to "fix" this, it is just a workaround as your
> analysis indicate.

+CC: Philippe,

I am observing these timeouts/aborts on mips as well, so I guess that
emulation could suffer from the same issue?


signature.asc
Description: PGP signature

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-19 Thread Peter Maydell

On Thu, 19 Jan 2023 at 04:03, Keith Busch  wrote:
>
> On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
> > >
> > > Further up, it says the "interrupt gateway" is responsible for
> > > forwarding new interrupt requests while the level remains asserted, but
> > > it doesn't look like anything is handling that, which essentially turns
> > > this into an edge interrupt. Am I missing something, or is this really
> > > not being handled?
> >
> > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
> > internal GPIO lines to trigger an interrupt. So with the current setup
> > we only support edge triggered interrupts.
>
> Thanks for confirming!
>
> Klaus,
> I think we can justify introducing a work-around in the emulated device
> now. My previous proposal with pci_irq_pulse() is no good since it does
> assert+deassert, but it needs to be the other way around, so please
> don't considert that one.

No, please don't. The bug is in the risc-v interrupt controller,
so fix the risc-v interrupt controller. It shouldn't be too difficult
(you probably have to do something like what the Arm GIC implementation
does, where when the guest dismisses the interrupt you look at the level
to see if it needs to be re-pended.)

Once "workarounds" go into QEMU device emulation that make it
deviate from hardware behaviour, it's hard to get rid of them
again, because nobody knows whether deployed guests now accidentally
rely on the wrong behaviour. So the correct thing is to never
put in the workaround in the first place.

thanks
-- PMM

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-19 Thread Guenter Roeck


On 1/19/23 00:02, Klaus Jensen wrote:

On Jan 19 08:28, Klaus Jensen wrote:

On Jan 18 21:03, Keith Busch wrote:

On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:

On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:


Further up, it says the "interrupt gateway" is responsible for
forwarding new interrupt requests while the level remains asserted, but
it doesn't look like anything is handling that, which essentially turns
this into an edge interrupt. Am I missing something, or is this really
not being handled?


Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
internal GPIO lines to trigger an interrupt. So with the current setup
we only support edge triggered interrupts.


Thanks for confirming!

Klaus,
I think we can justify introducing a work-around in the emulated device
now. My previous proposal with pci_irq_pulse() is no good since it does
assert+deassert, but it needs to be the other way around, so please
don't considert that one.

Also, we ought to revisit the intms/intmc usage in the linux driver for
threaded interrupts.


+CC: qemu-riscv

Keith,

Thanks for digging into this!

Yeah, you are probably right that we should only use the intms/intmc
changes in the use_threaded_interrupts case, not in general. While my
RFC patch does seem to "fix" this, it is just a workaround as your
analysis indicate. >

+CC: Philippe,

I am observing these timeouts/aborts on mips as well, so I guess that
emulation could suffer from the same issue?


I suspect it does. In my case, I have an Ethernet controller on the same
interrupt line as the NVME controller, and it looks like one of the
interrupts gets lost if both controllers raise an interrupt at the same
time. The suggested workarounds "fix" the problem, but that doesn't seem
to be the correct solution.

Guenter

completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

35 matches

Site Navigation

Mail list logo

Footer information