Re: [PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Jinhao Fan
at 1:37 PM, Klaus Jensen  wrote:

> On Oct 21 10:37, Jinhao Fan wrote:
>> at 7:35 PM, Klaus Jensen  wrote:
>> 
>>> Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
>>> updates") had the unintended effect of disabling batching of CQEs.
>>> 
>>> This patch changes the sq/cq timers to bottom halves and instead of
>>> calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
>>> defer the call.
>> 
>> This change is definitely desired.
>> 
>>>                  | iops
>>>   ---------------+------
>>>    baseline      | 138k
>>>    +cqe batching | 233k
>> 
>> What is the queue depth config for this test?
> 
> That's 64. My box isn't nearly as beefy as yours ;)

:). The improvement looks great.

Reviewed-by: Jinhao Fan 
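
For context on why moving to a bottom half restores batching: a BH that is
scheduled several times before it runs still only runs once, so all CQEs
enqueued within one main-loop iteration are posted by a single
nvme_post_cqes() call and a single interrupt. A minimal sketch of the API the
patch relies on (function names as used by the patch; the rest is illustrative,
treat exact headers and signatures as approximate):

/* Minimal sketch of QEMU bottom-half usage -- illustrative only; see the
 * QEMU headers for the authoritative API. */
#include "qemu/osdep.h"
#include "qemu/main-loop.h"

static void post_cqes_cb(void *opaque)
{
    /* Runs once, later in the event loop, regardless of how many times the
     * BH was scheduled in the meantime -- this is what coalesces CQEs. */
}

static void example(void)
{
    QEMUBH *bh = qemu_bh_new(post_cqes_cb, NULL);

    qemu_bh_schedule(bh);   /* first completion schedules the BH */
    qemu_bh_schedule(bh);   /* subsequent completions coalesce (no-op) */

    qemu_bh_delete(bh);     /* on queue teardown, as the patch does */
}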




Re: [PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Klaus Jensen
On Oct 21 10:37, Jinhao Fan wrote:
> at 7:35 PM, Klaus Jensen  wrote:
> 
> > Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
> > updates") had the unintended effect of disabling batching of CQEs.
> > 
> > This patch changes the sq/cq timers to bottom halves and instead of
> > calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
> > defer the call.
> 
> This change is definitely desired.
> 
> > 
> >                  | iops
> >   ---------------+------
> >    baseline      | 138k
> >    +cqe batching | 233k
> 
> What is the queue depth config for this test?
> 

That's 64. My box isn't nearly as beefy as yours ;)




Re: [PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Jinhao Fan
at 7:35 PM, Klaus Jensen  wrote:

> Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
> updates") had the unintended effect of disabling batching of CQEs.
> 
> This patch changes the sq/cq timers to bottom halves and instead of
> calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
> defer the call.

This change is definitely desired.

> 
>                  | iops
>   ---------------+------
>    baseline      | 138k
>    +cqe batching | 233k

What is the queue depth config for this test?




Re: [PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Keith Busch
On Thu, Oct 20, 2022 at 01:35:38PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
> updates") had the unintended effect of disabling batching of CQEs.
> 
> This patch changes the sq/cq timers to bottom halves and instead of
> calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
> defer the call.

Nice change, looks good! Timers never did seem to be the best fit for
this.

Reviewed-by: Keith Busch 



Re: [PATCH 2/2] thread-pool: use ThreadPool from the running thread

2022-10-20 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Mon, Oct 03, 2022 at 10:52:33AM +0200, Emanuele Giuseppe Esposito wrote:
> > 
> > 
> > > On 30/09/2022 at 17:45, Kevin Wolf wrote:
> > >> On 30.09.2022 at 14:17, Emanuele Giuseppe Esposito wrote:
> > >>> On 29/09/2022 at 17:30, Kevin Wolf wrote:
> >  On 09.06.2022 at 15:44, Emanuele Giuseppe Esposito wrote:
> >  Remove usage of aio_context_acquire by always submitting work items
> >  to the current thread's ThreadPool.
> > 
> >  Signed-off-by: Paolo Bonzini 
> >  Signed-off-by: Emanuele Giuseppe Esposito 
> > >>>
> > >>> The thread pool is used by things outside of the file-* block drivers,
> > >>> too. Even outside the block layer. Not all of these seem to submit work
> > >>> in the same thread.
> > >>>
> > >>>
> > >>> For example:
> > >>>
> > >>> postcopy_ram_listen_thread() -> qemu_loadvm_state_main() ->
> > >>> qemu_loadvm_section_start_full() -> vmstate_load() ->
> > >>> vmstate_load_state() -> spapr_nvdimm_flush_post_load(), which has:
> > >>>
> > >>> ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
>  ^^^
> 
> aio_get_thread_pool() isn't thread safe either:
> 
>   ThreadPool *aio_get_thread_pool(AioContext *ctx)
>   {
>   if (!ctx->thread_pool) {
>   ctx->thread_pool = thread_pool_new(ctx);
> 
> 
> Two threads could race in aio_get_thread_pool().
> 
> I think post-copy is broken here: it's calling code that was only
> designed to be called from the main loop thread.
> 
> I have CCed Juan and David.
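
To make the race concrete, here is a minimal sketch (illustrative only; not the
actual QEMU code and not necessarily the right fix, and it assumes the relevant
QEMU headers). The `lock` member and the qemu_mutex_* usage are assumptions for
the example:

/*
 * With the unguarded code quoted above, two threads can both observe
 * ctx->thread_pool == NULL, both call thread_pool_new(), and one assignment
 * overwrites the other, so a pool can be leaked and callers may briefly see
 * different pools. A guarded variant -- assuming a suitable 'lock' member,
 * which is hypothetical here -- could look like:
 */
ThreadPool *aio_get_thread_pool_guarded(AioContext *ctx)
{
    qemu_mutex_lock(&ctx->lock);
    if (!ctx->thread_pool) {
        ctx->thread_pool = thread_pool_new(ctx);
    }
    qemu_mutex_unlock(&ctx->lock);
    return ctx->thread_pool;
}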

In theory the path that you describe there shouldn't happen - although
there is perhaps not enough protection on the load side to stop it
happening if presented with a bad stream.
This is documented in docs/devel/migration.rst under 'Destination
behaviour'; but to recap, during postcopy load we have a problem that we
need to be able to load incoming iterative (i.e. RAM) pages during the
loading of normal devices, because the loading of a device may access
RAM that's not yet been transferred.

To do that, the device state of all the non-iterative devices (which I
think includes your spapr_nvdimm) is serialised into a separate
migration stream and sent as a 'package'.

We read the package off the stream on the main thread, but don't process
it until we fire off the 'listen' thread - which you spotted the
creation of above; the listen thread now takes over reading the
migration stream to process RAM pages, and since it's in the same
format, it calls qemu_loadvm_state_main() - but it doesn't expect
any devices in that other than the RAM devices; it's just expecting RAM.

In parallel with that, the main thread carries on loading the contents
of the 'package' - and that contains your spapr_nvdimm device (and any
other 'normal' devices); but that's OK because that's the main thread.

Now if something was very broken and sent a header for the spapr-nvdimm
down the main thread rather than into the package then, yes, we'd
trigger your case, but that shouldn't happen.

Dave

> > >>> ...
> > >>> thread_pool_submit_aio(pool, flush_worker_cb, state,
> > >>>spapr_nvdimm_flush_completion_cb, state);
> > >>>
> > >>> So it seems to me that we may be submitting work for the main thread
> > >>> from a postcopy migration thread.
> > >>>
> > >>> I believe the other direct callers of thread_pool_submit_aio() all
> > >>> submit work for the main thread and also run in the main thread.
> > >>>
> > >>>
> > >>> For thread_pool_submit_co(), pr_manager_execute() calls it with the pool
> > >>> it gets passed as a parameter. This is still bdrv_get_aio_context(bs) in
> > >>> hdev_co_ioctl() and should probably be changed the same way as for the
> > >>> AIO call in file-posix, i.e. use qemu_get_current_aio_context().
> > >>>
> > >>>
> > >>> We could consider either asserting in thread_pool_submit_aio() that we
> > >>> are really in the expected thread, or like I suggested for LinuxAio drop
> > >>> the pool parameter and always get it from the current thread (obviously
> > >>> this is only possible if migration could in fact schedule the work on
> > >>> its current thread - if it schedules it on the main thread and then
> > >>> exits the migration thread (which destroys the thread pool), that
> > >>> wouldn't be good).
> > >>
> > >> Dumb question: why not extend the already-existing pool->lock to cover
> > >> also the necessary fields like pool->head that are accessed by other
> > >> threads (only case I could find with thread_pool_submit_aio is the one
> > >> you pointed above)?
> > > 
> > > Other people are more familiar with this code, but I believe this could
> > > have performance implications. I seem to remember that this code is
> > > careful to avoid locking to synchronise between worker threads and the
> > > main thread.
> > > 
> > > But looking at the patch again, I have actually a dumb question, too:
>

RE: [RFC 0/7] migration patches for VFIO

2022-10-20 Thread Yishai Hadas
> From: Qemu-devel  bounces+yishaih=nvidia@nongnu.org> On Behalf Of Juan Quintela
> Sent: Monday, 3 October 2022 6:16
> To: qemu-de...@nongnu.org
> Cc: Alex Williamson ; Eric Blake
> ; Stefan Hajnoczi ; Fam
> Zheng ; qemu-s3...@nongnu.org; Cornelia Huck
> ; Thomas Huth ; Vladimir
> Sementsov-Ogievskiy ; Laurent Vivier
> ; John Snow ; Dr. David Alan
> Gilbert ; Christian Borntraeger
> ; Halil Pasic ; Juan
> Quintela ; Paolo Bonzini ;
> qemu-block@nongnu.org; Eric Farman ; Richard
> Henderson ; David Hildenbrand
> 
> Subject: [RFC 0/7] migration patches for VFIO
> 
> Hi
> 
> VFIO migration has several requirements:
> - the size of the state is only known when the guest is stopped

As was discussed in the conference call, I just sent a patch to the kernel
mailing list that makes it possible to get the state size in each device state.

See:
https://patchwork.kernel.org/project/kvm/patch/20221020132109.112708-1-yish...@nvidia.com/

This can drop the need to stop the guest and ask for that data.

So, I assume that you can drop some complexity and hacks from your RFC once
you send the next series.

Specifically,
there is no need to stop the VM and re-start it in case the SLA can't be met;
just read, while in RUNNING, the estimated data length that will be required
to complete STOP_COPY and use it.

Yishai

> - they need to send possibly large amounts of data.
> 
> This series only addresses the 1st set of problems.
> 
> What they do:
> - res_compatible parameter was not used anywhere, just add that
> information to res_postcopy.
> - Remove QEMUFILE parameter from save_live_pending
> - Split save_live_pending into
>   * save_pending_estimate(): the pending state size without trying too hard
>   * save_pending_exact(): the real pending state size, it is called with the
> guest stopped.
> - Now save_pending_* don't need the threshold parameter
> - HACK a way to stop the guest before moving there.
> 
> ToDo:
> - autoconverge test is broken, no real clue why, but it is possible that the
> test is wrong.
> 
> - Make an artifact to be able to send a massive amount of data in the save
> state stage (probably more multifd channels).
> 
> - Be able to not have to start the guest between checking the state pending
> size and migration_completion().
> 
> Please review.
> 
> Thanks, Juan.
> 
> Juan Quintela (7):
>   migration: Remove res_compatible parameter
>   migration: No save_live_pending() method uses the QEMUFile parameter
>   migration: Block migration comment or code is wrong
>   migration: Split save_live_pending() into state_pending_*
>   migration: Remove unused threshold_size parameter
>   migration: simplify migration_iteration_run()
>   migration: call qemu_savevm_state_pending_exact() with the guest
> stopped
> 
>  docs/devel/migration.rst   | 18 ++--
>  docs/devel/vfio-migration.rst  |  4 +--
>  include/migration/register.h   | 29 ++-
>  migration/savevm.h |  8 +++---
>  hw/s390x/s390-stattrib.c   | 11 ---
>  hw/vfio/migration.c| 17 +--
>  migration/block-dirty-bitmap.c | 14 -
>  migration/block.c  | 17 ++-
>  migration/migration.c  | 52 ++
>  migration/ram.c| 35 ---
>  migration/savevm.c | 37 +---
>  tests/qtest/migration-test.c   |  3 +-
>  hw/vfio/trace-events   |  2 +-
>  migration/trace-events |  7 +++--
>  14 files changed, 148 insertions(+), 106 deletions(-)
> 
> --
> 2.37.2
> 



[PATCH v3 12/26] block/vvfat: Unify the mkdir() call

2022-10-20 Thread Alex Bennée
From: Bin Meng 

There is a difference in the mkdir() call between win32 and non-win32
platforms, and this is currently handled in the code with #ifdefs.

glib provides a portable g_mkdir() API and we can use it to unify
the code without #ifdefs.

Signed-off-by: Bin Meng 
Reviewed-by: Marc-André Lureau 
Signed-off-by: Alex Bennée 
Message-Id: <20221006151927.2079583-6-bmeng...@gmail.com>
---
 block/vvfat.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index d6dd919683..723beef025 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -25,6 +25,7 @@
 
 #include "qemu/osdep.h"
 #include <dirent.h>
+#include <glib/gstdio.h>
 #include "qapi/error.h"
 #include "block/block_int.h"
 #include "block/qdict.h"
@@ -2726,13 +2727,9 @@ static int handle_renames_and_mkdirs(BDRVVVFATState* s)
 mapping_t* mapping;
 int j, parent_path_len;
 
-#ifdef __MINGW32__
-if (mkdir(commit->path))
+if (g_mkdir(commit->path, 0755)) {
 return -5;
-#else
-if (mkdir(commit->path, 0755))
-return -5;
-#endif
+}
 
 mapping = insert_mapping(s, commit->param.mkdir.cluster,
 commit->param.mkdir.cluster + 1);
-- 
2.34.1




Re: [PATCH v3 3/4] hw/nvme: add iothread support

2022-10-20 Thread Klaus Jensen
On Aug 27 17:12, Jinhao Fan wrote:
> Add an option "iothread=x" to do emulation in a separate iothread.
> This improves the performance because QEMU's main loop is responsible
> for a lot of other work while the iothread is dedicated to NVMe emulation.
> Moreover, emulating in an iothread brings the potential of polling on
> SQ/CQ doorbells, which I will bring up in a following patch.
> 
> Iothread can be enabled by:
> -object iothread,id=nvme0 \
> -device nvme,iothread=nvme0 \
> 
> Performance comparisons (KIOPS):
> 
> QD         1    4   16   64
> QEMU      41  136  242  338
> iothread  53  155  245  309
> 
> Signed-off-by: Jinhao Fan 
> ---
>  hw/nvme/ctrl.c | 67 --
>  hw/nvme/ns.c   | 21 +---
>  hw/nvme/nvme.h |  6 -
>  3 files changed, 82 insertions(+), 12 deletions(-)
> 

In hw/nvme/ns.c you need to guard the blk_flush and blk_drain calls with
an aio_context_acquire and aio_context_release pair.

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index eb9141a67b5c..dcf889f6d5ce 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -520,12 +520,21 @@ int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp)

 void nvme_ns_drain(NvmeNamespace *ns)
 {
+AioContext *ctx = blk_get_aio_context(ns->blkconf.blk);
+
+aio_context_acquire(ctx);
 blk_drain(ns->blkconf.blk);
+aio_context_release(ctx);
 }

 void nvme_ns_shutdown(NvmeNamespace *ns)
 {
+AioContext *ctx = blk_get_aio_context(ns->blkconf.blk);
+
+aio_context_acquire(ctx);
 blk_flush(ns->blkconf.blk);
+aio_context_release(ctx);
+
 if (ns->params.zoned) {
 nvme_zoned_ns_shutdown(ns);
 }

Otherwise, it all looks fine. I'm still seeing the weird slowdown when
an iothread is enabled. I have yet to figure out why that is... But it
scales! :)




[PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Klaus Jensen
From: Klaus Jensen 

Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
updates") had the unintended effect of disabling batching of CQEs.

This patch changes the sq/cq timers to bottom halves and instead of
calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
defer the call.

                 | iops
  ---------------+------
   baseline      | 138k
   +cqe batching | 233k

Fixes: 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell updates")
Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c | 26 +++---
 hw/nvme/nvme.h |  4 ++--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba056499..73c870a42996 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1401,13 +1401,7 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 
-if (req->sq->ioeventfd_enabled) {
-/* Post CQE directly since we are in main loop thread */
-nvme_post_cqes(cq);
-} else {
-/* Schedule the timer to post CQE later since we are in vcpu thread */
-timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
-}
+qemu_bh_schedule(cq->bh);
 }
 
 static void nvme_process_aers(void *opaque)
@@ -4252,7 +4246,7 @@ static void nvme_cq_notifier(EventNotifier *e)
 nvme_irq_deassert(n, cq);
 }
 
-nvme_post_cqes(cq);
+qemu_bh_schedule(cq->bh);
 }
 
 static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
@@ -4307,7 +4301,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 uint16_t offset = sq->sqid << 3;
 
 n->sq[sq->sqid] = NULL;
-timer_free(sq->timer);
+qemu_bh_delete(sq->bh);
 if (sq->ioeventfd_enabled) {
 memory_region_del_eventfd(&n->iomem,
   0x1000 + offset, 4, false, 0, &sq->notifier);
@@ -4381,7 +4375,8 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
 sq->io_req[i].sq = sq;
 QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
 }
-sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
+
+sq->bh = qemu_bh_new(nvme_process_sq, sq);
 
 if (n->dbbuf_enabled) {
 sq->db_addr = n->dbbuf_dbs + (sqid << 3);
@@ -4698,7 +4693,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 uint16_t offset = (cq->cqid << 3) + (1 << 2);
 
 n->cq[cq->cqid] = NULL;
-timer_free(cq->timer);
+qemu_bh_delete(cq->bh);
 if (cq->ioeventfd_enabled) {
 memory_region_del_eventfd(&n->iomem,
   0x1000 + offset, 4, false, 0, &cq->notifier);
@@ -4771,7 +4766,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
 }
 }
 n->cq[cqid] = cq;
-cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+cq->bh = qemu_bh_new(nvme_post_cqes, cq);
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
@@ -6913,9 +6908,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 if (start_sqs) {
 NvmeSQueue *sq;
 QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+qemu_bh_schedule(sq->bh);
 }
-timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+qemu_bh_schedule(cq->bh);
 }
 
 if (cq->tail == cq->head) {
@@ -6984,7 +6979,8 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
   sizeof(sq->tail));
 }
-timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+
+qemu_bh_schedule(sq->bh);
 }
 }
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c223..7adf042ec3e4 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -375,7 +375,7 @@ typedef struct NvmeSQueue {
 uint64_tdma_addr;
 uint64_tdb_addr;
 uint64_tei_addr;
-QEMUTimer   *timer;
+QEMUBH  *bh;
 EventNotifier notifier;
 boolioeventfd_enabled;
 NvmeRequest *io_req;
@@ -396,7 +396,7 @@ typedef struct NvmeCQueue {
 uint64_tdma_addr;
 uint64_tdb_addr;
 uint64_tei_addr;
-QEMUTimer   *timer;
+QEMUBH  *bh;
 EventNotifier notifier;
 boolioeventfd_enabled;
 QTAILQ_HEAD(, NvmeSQueue) sq_list;
-- 
2.37.0 (Apple Git-136)




Re: [PATCH v3 4/4] hw/nvme: add polling support

2022-10-20 Thread Klaus Jensen
On Aug 27 17:12, Jinhao Fan wrote:
> Add AioContext polling handlers for NVMe SQ and CQ. By employing polling,
> the latency of NVMe IO emulation is greatly reduced. The SQ polling
> handler checks for updates on the SQ tail shadow doorbell buffer. The CQ
> polling handler is an empty function because we proactively poll the CQ
> head shadow doorbell buffer when we want to post a cqe. Updates on the
> SQ eventidx buffer are stopped during polling to avoid the host doing
> unnecessary doorbell buffer writes.
> 
> Comparison (KIOPS):
> 
> QD         1    4   16   64
> QEMU      53  155  245  309
> polling  123  165  189  191
> 
> Signed-off-by: Jinhao Fan 
> ---
>  hw/nvme/ctrl.c | 74 ++
>  hw/nvme/nvme.h |  1 +
>  2 files changed, 70 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 869565d77b..a7f8a4220e 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -298,6 +298,8 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
>  
>  static void nvme_process_sq(void *opaque);
>  static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq);
> +static void nvme_update_sq_tail(NvmeSQueue *sq);
>  
>  static uint16_t nvme_sqid(NvmeRequest *req)
>  {
> @@ -4447,6 +4449,21 @@ static void nvme_cq_notifier(EventNotifier *e)
>  nvme_post_cqes(cq);
>  }
>  
> +static bool nvme_cq_notifier_aio_poll(void *opaque)
> +{
> +/*
> + * We already "poll" the CQ tail shadow doorbell value in nvme_post_cqes(),
> + * so we do not need to check the value here. However, QEMU's AioContext
> + * polling requires us to provide io_poll and io_poll_ready handlers, so
> + * use dummy functions for CQ.
> + */
> +return false;
> +}
> +
> +static void nvme_cq_notifier_aio_poll_ready(EventNotifier *n)
> +{
> +}
> +
>  static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
>  {
>  NvmeCtrl *n = cq->ctrl;
> @@ -4459,8 +4476,10 @@ static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
>  }
>  
>  if (cq->cqid) {
> -aio_set_event_notifier(n->ctx, &cq->notifier, true, nvme_cq_notifier,
> -   NULL, NULL);
> +aio_set_event_notifier(n->ctx, &cq->notifier, true,
> +   nvme_cq_notifier,
> +   nvme_cq_notifier_aio_poll,
> +   nvme_cq_notifier_aio_poll_ready);

There is no reason to set up these polling handlers (since they don't do
anything).

>  } else {
>  event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
>  }
> @@ -4482,6 +4501,44 @@ static void nvme_sq_notifier(EventNotifier *e)
>  nvme_process_sq(sq);
>  }
>  
> +static void nvme_sq_notifier_aio_poll_begin(EventNotifier *n)
> +{
> +NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +nvme_update_sq_eventidx(sq);
> +
> +/* Stop host doorbell writes by stop updating eventidx */
> +sq->suppress_db = true;

This doesn't do what you expect it to. By not updating the eventidx it
will fall behind the actual head, causing the host to think that the
device is not processing events (but it is!), resulting in doorbell
ringing.
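
For reference, whether the host writes the real (MMIO) doorbell is decided by
comparing its old and new shadow tail values against the device-written
eventidx. A sketch of that check, modelled on the Linux driver's
nvme_dbbuf_need_event() (reproduced from memory, so treat it as approximate;
this is host-side logic, not QEMU code):

/*
 * The MMIO doorbell is written only when the device-written eventidx lies in
 * [old, new), i.e. the host's tail just advanced past it -- so where (and how
 * often) the device updates its eventidx determines how often the host falls
 * back to MMIO doorbell writes.
 */
#include <stdbool.h>
#include <stdint.h>

bool dbbuf_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}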

> +}
> +
> +static bool nvme_sq_notifier_aio_poll(void *opaque)
> +{
> +EventNotifier *n = opaque;
> +NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +uint32_t old_tail = sq->tail;
> +
> +nvme_update_sq_tail(sq);
> +
> +return sq->tail != old_tail;
> +}
> +
> +static void nvme_sq_notifier_aio_poll_ready(EventNotifier *n)
> +{
> +NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +nvme_process_sq(sq);
> +}
> +
> +static void nvme_sq_notifier_aio_poll_end(EventNotifier *n)
> +{
> +NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +nvme_update_sq_eventidx(sq);
> +
> +/* Resume host doorbell writes */
> +sq->suppress_db = false;
> +}
> +
>  static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
>  {
>  NvmeCtrl *n = sq->ctrl;
> @@ -4494,8 +4551,13 @@ static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
>  }
>  
>  if (sq->sqid) {
> -aio_set_event_notifier(n->ctx, &sq->notifier, true, nvme_sq_notifier,
> -   NULL, NULL);
> +aio_set_event_notifier(n->ctx, &sq->notifier, true,
> +   nvme_sq_notifier,
> +   nvme_sq_notifier_aio_poll,
> +   nvme_sq_notifier_aio_poll_ready);
> +aio_set_event_notifier_poll(n->ctx, &sq->notifier,
> +nvme_sq_notifier_aio_poll_begin,
> +nvme_sq_notifier_aio_poll_end);

You can remove the call to aio_set_event_notifier_poll() since the
suppress_db "trick" shouldn't be used anyway.

>  } else {
>  event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
>  }
> @@ -6530,7 +6592,9 @@ static void nvme_process_sq(voi