On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
>> Hi Stefan,
>>
>> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
>>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
>>>> Dear maintainers,
>>>>
>>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
>>>> loop of iothreads and this causes an IO pressure stall value of nearly
>>>> 100 when idle.
>>>>
>>>> The issue was also reported on the kernel mailing list [0]. The
>>>> suggestion from Jens Axboe was to just turn off the iowait accounting
>>>> completely. But since (for block/file-posix.c), there is actual IO
>>>> submitted via the same ring, I wasn't sure if that is the right approach.
>>>>
>>>> So the idea was to keep track of whether the event loop is otherwise
>>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
>>>>
>>>> However, doing so would only help for block/file-posix.c, which submits
>>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
>>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
>>>> submitting that poll SQE in the iothread, we would need to be able to
>>>> know if IO for RBD is currently in-flight or not to be able to decide
>>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
>>>> way to do this (in a general way)?
>>>>
>>>> Or should the flag really always be used (if supported by the kernel)?
>>>> Is there a way to tell io_uring/kernel that we are an event loop and our
>>>> waiting should only be accounted for when there is actual IO in-flight?
>>>>
>>>> Happy to hear your opinions and suggestions!
>>>>
>>>> [0]:
>>>> https://lore.kernel.org/io-uring/[email protected]/T/
>>>
>>> Hi Fiona,
>>> Jens replied yesterday and confirmed your suspicion that the number of
>>> inflight requests is not being tracked correctly.
>>>
>>> Is there still a problem after fixing the kernel's inflight counting?
>>> If not, then no QEMU change is necessary and that seems like the cleanest
>>> solution anyway. The kernel should know whether there is I/O in flight
>>> and so it doesn't seem right that userspace needs to hint this.
>>
>>
>> unfortunately, yes. Even with the kernel fix [2], the real problem with
>> poll SQEs described above remains. I'm still seeing high IO pressure
>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
>> SQEs for the AioHandler node fd, and that does count as pending IO. A
>> small reproducer modeling this is in [3].
>
> Does the kernel account POLL_ADD SQEs as blocking I/O activity?

Apparently yes. See the C program below [3].

> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> syscalls do not count as blocking I/O activity. The kernel io_uring code
> should account them correctly and not rely on a userspace hint.

@Jens Axboe: should there be a separate internal counter for poll/timeout
SQEs and have them not count towards IO wait by default?

>
> Stefan
>
>>
>> So the question from above, how to deal with this for block drivers not
>> going through file-posix.c remains.
>>
>> Best Regards,
>> Fiona
>>
>> [2]:
>> https://lore.kernel.org/io-uring/[email protected]/T/
>>
>> [3]:
>>
>> #include <assert.h>
>> #include <errno.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <liburing.h>
>> #include <sys/eventfd.h>
>>
>> int main(void) {
>>     int fd;
>>     int ret;
>>     struct io_uring ring;
>>     struct io_uring_sqe *sqe;
>>
>>     fd = eventfd(0, 0);
>>     assert(fd >= 0);
>>
>>     ret = io_uring_queue_init(128, &ring, 0);
>>     assert(ret == 0);
>>
>>     sqe = io_uring_get_sqe(&ring);
>>     assert(sqe);
>>
>>     io_uring_prep_poll_add(sqe, fd, 1);
>>
>>     ret = io_uring_submit_and_wait(&ring, 1);
>>     printf("got ret %d\n", ret);
>>
>>     io_uring_queue_exit(&ring);
>>
>>     return 0;
>> }
>>
>>
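
For anyone who wants to watch the stall value while the reproducer [3] is
blocked, here is a rough sketch of a helper that reads the 10-second average
from /proc/pressure/io. It assumes a kernel with PSI enabled and the usual
"some avg10=... avg60=... avg300=... total=..." line format. The function
names psi_parse_avg10() and psi_read_avg10() are made up for illustration
and are not part of QEMU or liburing.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse the avg10 value from one PSI line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * kind is "some" or "full". Returns 0 on success, -1 if the line does
 * not start with the requested kind or has an unexpected format.
 */
static int psi_parse_avg10(const char *line, const char *kind, double *avg10)
{
    size_t n = strlen(kind);

    if (strncmp(line, kind, n) != 0 || line[n] != ' ') {
        return -1;
    }
    return sscanf(line + n, " avg10=%lf", avg10) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: read avg10 for the given kind from a PSI file,
 * typically /proc/pressure/io. Returns 0 on success, -1 if the file
 * cannot be opened or contains no matching line.
 */
static int psi_read_avg10(const char *path, const char *kind, double *avg10)
{
    char line[256];
    int ret = -1;
    FILE *f = fopen(path, "r");

    if (!f) {
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (psi_parse_avg10(line, kind, avg10) == 0) {
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

Calling psi_read_avg10("/proc/pressure/io", "some", &value) once per second
from a small loop, while the reproducer sits in io_uring_submit_and_wait(),
should be enough to see whether the idle wait is accounted as IO pressure.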
