On Tue, Aug 11, 2020 at 09:54:08PM +0800, Zhenyu Ye wrote:
> Hi Kevin,
>
> On 2020/8/10 23:38, Kevin Wolf wrote:
> > On 10.08.2020 at 16:52, Zhenyu Ye wrote:
> >> Before doing qmp actions, we need to lock the qemu_global_mutex,
> >> so the qmp actions should not take too long time.
> >>
> >> Unfortunately, some qmp actions need to acquire aio context and
> >> this may take a long time. The vm will soft lockup if this time
> >> is too long.
> >
> > Do you have a specific situation in mind where getting the lock of an
> > AioContext can take a long time? I know that the main thread can
> > block for considerable time, but QMP commands run in the main thread, so
> > this patch doesn't change anything for this case. It would be effective
> > if an iothread blocks, but shouldn't everything running in an iothread
> > be asynchronous and therefore keep the AioContext lock only for a short
> > time?
> >
>
> Theoretically, everything running in an iothread is asynchronous. However,
> some 'asynchronous' actions are not non-blocking entirely, such as
> io_submit(). This will block while the iodepth is too big and I/O pressure
> is too high. If we do some qmp actions, such as 'info block', at this time,
> may cause vm soft lockup. This series can make these qmp actions safer.
>
> I constructed the scene as follow:
> 1. create a vm with 4 disks, using iothread.
> 2. add press to the CPU on the host. In my scene, the CPU usage exceeds 95%.
> 3. add press to the 4 disks in the vm at the same time. I used the fio and
> some parameters are:
>
>     fio -rw=randrw -bs=1M -size=1G -iodepth=512 -ioengine=libaio -numjobs=4
>
> 4. do block query actions, for example, by virsh:
>
>     virsh qemu-monitor-command [vm name] --hmp info block
>
> Then the vm will soft lockup, the calltrace is:
>
> [ 192.311393] watchdog: BUG: soft lockup - CPU#1 stuck for 42s! [kworker/1:1:33]

Hi,
Sorry I haven't had time to investigate this myself yet.

Do you also have a QEMU backtrace when the hang occurs?

Let's find out if QEMU is stuck in the io_submit(2) syscall or whether
there's an issue in QEMU itself that causes the softlockup (for example,
aio_poll() with the global mutex held).

Stefan
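
For readers following the thread, the contention pattern under discussion
(one thread holding a lock across a blocking call while another thread
waits for that lock) can be sketched in plain POSIX C. This is only an
illustration and not QEMU code: ctx_lock, holder_thread(), timed_acquire()
and try_command() are hypothetical names, the sleep stands in for a
blocking call such as io_submit(2), and the timed acquire is simply one
way to make the stall observable, not necessarily what the series does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical stand-in for a per-context lock (not a QEMU object). */
static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t holder_ready;

static void sleep_ms(unsigned ms)
{
    struct timespec ts = { ms / 1000, (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Plays the role of a worker thread that blocks while holding the lock. */
static void *holder_thread(void *opaque)
{
    unsigned hold_ms = *(unsigned *)opaque;

    pthread_mutex_lock(&ctx_lock);
    sem_post(&holder_ready);      /* tell the caller the lock is now held */
    sleep_ms(hold_ms);            /* stand-in for a blocking syscall */
    pthread_mutex_unlock(&ctx_lock);
    return NULL;
}

/* Wait for the lock for at most timeout_ms; returns 0 or ETIMEDOUT. */
static int timed_acquire(pthread_mutex_t *lock, unsigned timeout_ms)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_mutex_timedlock(lock, &deadline);
}

/* Plays the role of the thread issuing a command that needs the lock. */
static int try_command(unsigned hold_ms, unsigned timeout_ms)
{
    pthread_t holder;
    int rc;

    rc = sem_init(&holder_ready, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&holder, NULL, holder_thread, &hold_ms);
    assert(rc == 0);
    sem_wait(&holder_ready);      /* make sure the holder owns the lock */

    rc = timed_acquire(&ctx_lock, timeout_ms);
    if (rc == 0) {
        pthread_mutex_unlock(&ctx_lock);
    }
    pthread_join(holder, NULL);
    sem_destroy(&holder_ready);
    return rc;
}
```

With the holder blocked for 300 ms, try_command(300, 20) gives up and
returns ETIMEDOUT instead of stalling for the full hold time. An untimed
pthread_mutex_lock() in the same place would simply wait until the holder
returns, which is the behaviour a backtrace would help confirm or rule out.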