On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote:
> On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> > On 2020/10/19 21:25, Paolo Bonzini wrote:
> > > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > > The kernel backtrace for io_submit in GUEST is:
> > > >
> > > > guest# ./offcputime -K -p `pgrep -nx fio`
> > > >     b'finish_task_switch'
> > > >     b'__schedule'
> > > >     b'schedule'
> > > >     b'io_schedule'
> > > >     b'blk_mq_get_tag'
> > > >     b'blk_mq_get_request'
> > > >     b'blk_mq_make_request'
> > > >     b'generic_make_request'
> > > >     b'submit_bio'
> > > >     b'blkdev_direct_IO'
> > > >     b'generic_file_read_iter'
> > > >     b'aio_read'
> > > >     b'io_submit_one'
> > > >     b'__x64_sys_io_submit'
> > > >     b'do_syscall_64'
> > > >     b'entry_SYSCALL_64_after_hwframe'
> > > >     -                fio (1464)
> > > >         40031912
> > > >
> > > > And Linux io_uring can avoid the latency problem.
>
> Thanks for the info. What this tells us is basically the inflight
> requests are high. It's sad that the linux-aio is in practice
> implemented as a blocking API.
>
> Host side backtrace will be of more help. Can you get that too?

I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went
to sleep when it ran out of blk-mq tags. The easiest solution is to move
to io_uring. Linux AIO is broken - it's not AIO :).

If we know that no other process is writing to the host block device
then maybe we can determine the blk-mq tags limit (the queue depth) and
avoid sending more requests. That way QEMU doesn't block, but I don't
think this approach works when other processes are submitting I/O to the
same host block device :(.

Fam's original suggestion of invoking io_submit(2) from a worker thread
is an option, but I'm afraid it will slow down the uncontended case.

I'm CCing Glauber in case he battled this in the past in ScyllaDB.

Stefan
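
A rough sketch of the queue-depth idea above, not actual QEMU code. It
assumes the queue/nr_requests attribute under /sys/block/<dev>/ is a
usable approximation of the blk-mq tags limit (tags are really per
hardware queue, and the value depends on the I/O scheduler) and that no
other process submits I/O to the device. The names read_nr_requests(),
SubmitThrottle and throttle_*() are made up for illustration, and the
counter is meant to be used from a single thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a queue depth from a sysfs attribute such as
 * /sys/block/<dev>/queue/nr_requests. Returns -1 if the file cannot be
 * read or does not contain a positive number.
 */
static long read_nr_requests(const char *path)
{
    FILE *f = fopen(path, "r");
    long depth = -1;

    if (!f) {
        return -1;
    }
    if (fscanf(f, "%ld", &depth) != 1 || depth <= 0) {
        depth = -1;
    }
    fclose(f);
    return depth;
}

/* Hypothetical in-flight counter that caps submissions at 'limit'. */
typedef struct {
    long limit;
    long in_flight;
} SubmitThrottle;

static void throttle_init(SubmitThrottle *t, long limit)
{
    t->limit = limit;
    t->in_flight = 0;
}

/*
 * Call before io_submit(2). Returns false when the cap is reached, in
 * which case the caller queues the request internally instead of
 * submitting it (and possibly sleeping in blk_mq_get_tag).
 */
static bool throttle_try_acquire(SubmitThrottle *t)
{
    if (t->in_flight >= t->limit) {
        return false;
    }
    t->in_flight++;
    return true;
}

/* Call once per completed request reaped from the completion queue. */
static void throttle_release(SubmitThrottle *t)
{
    assert(t->in_flight > 0);
    t->in_flight--;
}
```

The caller would initialise the throttle once with the value returned by
read_nr_requests(), fall back to unthrottled submission when that
returns -1, and resubmit queued requests after each throttle_release().
As said above, this only helps while QEMU is the sole submitter.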