Re: Best practice for issuing blocking calls in response to an event

Miles Glenn Tue, 25 Mar 2025 08:26:44 -0700

On Mon, 2025-03-24 at 14:35 -0400, Stefan Hajnoczi wrote:
> On Fri, Mar 21, 2025 at 11:17 AM Miles Glenn <[email protected]> wrote:
> > On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> > > On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <[email protected]> wrote:
> > > > Hello,
> > > > 
> > > > I am attempting to simulate a system with multiple CPU
> > > > architectures.  To do this I am starting a unique QEMU process for each
> > > > CPU architecture that is needed. I'm also developing some QEMU code
> > > > that aids in transporting MMIO transactions across the process
> > > > boundaries using sockets.
> > > 
> > > I have CCed Phil. He has been working on heterogeneous target emulation
> > > and might be interested.
> > > 
> > > > The design takes MMIO request messages off of a socket, services the
> > > > request by calling address_space_ldq_be(), then sends a response
> > > > message (containing the requested data) over the same
> > > > socket.  Currently, this is all done inside the socket IOReadHandler
> > > > callback function.
> > > 
> > > At a high level this is similar to the vfio-user feature where a PCI
> > > device is emulated in a separate process. This also involves sending
> > > messages describing QEMU's MemoryRegion accesses. See the "remote"
> > > machine type in QEMU to look at the code.
> > > 
> > > > This works as long as the targeted register exists in the same QEMU
> > > > process that received the request.  However, if the register exists in
> > > > another QEMU process, then the call to address_space_ldq_be() results
> > > > in another socket message being sent to that QEMU process, requesting
> > > > the data, and then waiting (blocking) for the response message
> > > > containing the data.  In other words, it ends up blocking inside the
> > > > event handler and even though the QEMU process containing the target
> > > > register was able to receive the request and send the response, the
> > > > originator of the request is unable to receive the response until it
> > > > eventually times out and stops blocking.  Once it times out and stops
> > > > blocking, it does receive the response, but now it is too late.
> > > > 
> > > > Here's a summary of the stack up to where the code blocks:
> > > > 
> > > > IOReadHandler callback
> > > >   calls address_space_ldq_be()
> > > >     resolves to mmio read op of a remote device
> > > >       sends request over socket and waits (blocks) for response
> > > > 
> > > > So, I'm looking for a way to handle the work of calling
> > > > address_space_ldq_be(), which might block when attempting to read a
> > > > register of a remote device, without blocking inside the IOReadHandler
> > > > callback context.
> > > > 
> > > > I've done a lot of searches and reading about how to do this on the web
> > > > and in the QEMU code but it's still not really clear to me how this
> > > > should be done in QEMU.  I've seen a lot about using coroutines to
> > > > handle cases like this. Is that what I should be using here?
> > > 
> > > The fundamental problem is that address_space_ldq_be() is synchronous,
> > > so there is no way to return back to the caller until the response has
> > > been received.
> > > 
> > > vfio-user didn't solve this problem. It simply blocks until the
> > > response is received, but it does drop the Big QEMU Lock during this
> > > time so that other vCPU threads can run. For example, see
> > > hw/remote/proxy.c:send_bar_access_msg() and
> > > mpqemu_msg_send_and_await_reply().
> > > 
> > > QEMU supports nested event loops, but they come with their own set of
> > > gotchas. The way a nested event loop might help here is to send the
> > > request and then call aio_poll() to receive the response in another
> > > IOReadHandler. This way other event loop processing can take place
> > > while waiting in address_space_ldq_be().
> > > 
> > > The second problem is that this approach where QEMU processes send
> > > requests to each other needs to be implemented carefully to avoid
> > > deadlocks. For example, devices that do DMA could load/store memory
> > > belonging to another device handled by another QEMU. Once there is an
> > > A -> B -> A situation it could deadlock.
> > > 
> > > Both vfio-user and vhost-user have similar issues with their
> > > bi-directional communication where a device emulation process can send
> > > a message to QEMU while processing a message from QEMU. Deadlock can
> > > be avoided if the code is structured so that QEMU is able to receive
> > > new requests during the time when it is waiting for a response.
> > > 
> > > Stefan
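
To make the deadlock point above concrete, here is a minimal sketch of a
wait loop that keeps servicing incoming requests until its own response
arrives. It is not QEMU code: the Msg layout, the MSG_* values, local_read(),
and the helper names are all hypothetical, and a single-process AF_UNIX
socketpair stands in for the link between two QEMU processes. The peer's
messages are queued up front so the example runs without threads.

```c
/* Sketch only: message layout and helper names are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum { MSG_REQUEST = 1, MSG_RESPONSE = 2 };

typedef struct {
    uint32_t type;      /* MSG_REQUEST or MSG_RESPONSE */
    uint32_t reserved;
    uint64_t value;     /* address for a request, data for a response */
} Msg;

/* Hypothetical stand-in for reading a register owned by this process. */
static uint64_t local_read(uint64_t addr)
{
    return addr + 1;
}

static int send_msg(int fd, uint32_t type, uint64_t value)
{
    Msg m = { type, 0, value };
    return write(fd, &m, sizeof(m)) == (ssize_t)sizeof(m) ? 0 : -1;
}

static int recv_msg(int fd, Msg *m)
{
    return read(fd, m, sizeof(*m)) == (ssize_t)sizeof(*m) ? 0 : -1;
}

/*
 * Wait for our response, but answer any request that arrives first.
 * Without the MSG_REQUEST branch, an A -> B -> A sequence would leave
 * both sides waiting on each other.
 */
static int wait_for_response(int fd, uint64_t *data)
{
    Msg m;

    while (recv_msg(fd, &m) == 0) {
        if (m.type == MSG_RESPONSE) {
            *data = m.value;
            return 0;
        }
        if (m.type == MSG_REQUEST &&
            send_msg(fd, MSG_RESPONSE, local_read(m.value)) < 0) {
            return -1;
        }
    }
    return -1;
}

/* The peer sends a request of its own before it sends our response. */
int run_reentrancy_demo(uint64_t *served, uint64_t *data)
{
    int sv[2];
    int rc = -1;
    Msg reply;

    /* SOCK_DGRAM keeps message boundaries, so one read is one Msg. */
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
        return -1;
    }
    if (send_msg(sv[1], MSG_REQUEST, 0x40) == 0 &&
        send_msg(sv[1], MSG_RESPONSE, 0xabcd) == 0 &&
        wait_for_response(sv[0], data) == 0 &&
        recv_msg(sv[1], &reply) == 0 &&
        reply.type == MSG_RESPONSE) {
        *served = reply.value;   /* what the peer got back for 0x40 */
        rc = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

A real implementation would also need timeouts and a way to match responses
to outstanding requests, and inside QEMU the waiting would presumably go
through the event loop rather than a bare read().
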
> > 
> > Stefan, Thank you for the quick response and great information!
> > 
> > I'm not sure if this is the best way, but I was able to get things
> > working today using the coroutine approach.
> > 
> > Now, the aforementioned stack looks like this:
> > 
> > IOReadHandler callback receives request
> >   enters coroutine
> >     calls address_space_ldq_be()
> >       resolves to mmio read op of a remote device
> >         sends request over socket
> >         detects coroutine context and
> >         calls qemu_coroutine_yield() instead of blocking
> >   returns to callback
> > 
> > <time passes>
> > 
> > IOReadHandler callback receives response
> >   re-enters coroutine
> >         mmio read op returns data received in response message
> >     address_space_ldq_be() returns
> >   coroutine completes and returns to callback
> > 
> > While this works, I couldn't help but notice that the coroutine concept
> > seems to be like a form of multithreading.  Is there some advantage to
> > using coroutines over doing the work in another thread?  Does QEMU
> > offer an interface that allows for a callback to queue up work that can
> > be handled by another thread or a pool of threads?
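
For readers following along, the yield and re-enter flow in the stack above
can be sketched outside of QEMU. This is only an illustration under stated
assumptions: POSIX ucontext (as provided by glibc on Linux) stands in for
QEMU's coroutine implementation, co_yield() plays the role that
qemu_coroutine_yield() plays in the stack above, and every name in the
block (Co, on_request, on_response, remote_mmio_read) is hypothetical. The
socket traffic is replaced by a flag and a value passed in by the caller.

```c
/* Sketch only: POSIX ucontext stands in for QEMU coroutines. */
#include <assert.h>
#include <stdint.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

typedef struct {
    ucontext_t caller;   /* context of the event handler that entered us */
    ucontext_t self;     /* context of the coroutine */
    uint64_t response;   /* filled in before the coroutine is re-entered */
    uint64_t result;     /* value the "MMIO read" finally returned */
    int request_sent;
    int done;
} Co;

static Co co;
static char co_stack[STACK_SIZE];

/* Switch from the coroutine back to whoever entered it. */
static void co_yield(void)
{
    swapcontext(&co.self, &co.caller);
}

/* Switch from the event handler into the coroutine. */
static void co_enter(void)
{
    swapcontext(&co.caller, &co.self);
}

/* Hypothetical remote read: "send" the request, then yield, not block. */
static uint64_t remote_mmio_read(void)
{
    co.request_sent = 1;   /* stand-in for writing the request to a socket */
    co_yield();            /* control returns to the event handler here */
    return co.response;    /* resumed once the response has arrived */
}

static void co_main(void)
{
    co.result = remote_mmio_read();
    co.done = 1;
    /* Returning resumes co.caller through uc_link. */
}

/* Handler for an incoming request: start the coroutine and return. */
static void on_request(void)
{
    co.request_sent = 0;
    co.done = 0;
    getcontext(&co.self);
    co.self.uc_stack.ss_sp = co_stack;
    co.self.uc_stack.ss_size = sizeof(co_stack);
    co.self.uc_link = &co.caller;
    makecontext(&co.self, co_main, 0);
    co_enter();
}

/* Handler for the matching response: hand over the data and resume. */
static void on_response(uint64_t data)
{
    co.response = data;
    co_enter();
}

uint64_t run_coroutine_demo(void)
{
    on_request();
    assert(co.request_sent && !co.done);   /* yielded, handler returned */
    on_response(0x1122334455667788ULL);
    assert(co.done);                       /* coroutine ran to completion */
    return co.result;
}
```

Both handlers run on the same thread, and the coroutine only makes progress
while one of them has explicitly entered it.
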
> 
> Coroutines make it easier to write concurrent code in an event loop.
> The alternative is to write asynchronous callback functions, which is
> tedious for sequences with multiple steps that need to wait for I/O.
> 
> Coroutines do not offer parallelism, so they are not a replacement for
> multi-threading. QEMU is mostly event-driven rather than
> multi-threaded. Usually only computation in QEMU that really needs its
> own CPU runs in its own thread (vCPUs, compression, blocking syscalls
> when there is no alternative, etc).
> 
> There are advantages to using coroutines: less synchronization is
> necessary than with threads (you can be sure no other coroutine will
> run in the same thread while your code is running) and this eliminates
> most thread-safety issues. Also, event loops are seen as more scalable
> than threads (lots of historical resources, for example
> http://www.kegel.com/c10k.html). One QEMU-specific advantage of
> coroutines: coroutine code has access to all of QEMU's APIs that
> require the event loop whereas threads need to take extra steps to
> interact with the rest of QEMU.
> 
> Stefan


Thanks for the explanation, Stefan!

Glenn
