On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <[email protected]> wrote:
> > Hello,
> > 
> > I am attempting to simulate a system with multiple CPU
> > architectures.  To do this, I am starting a separate QEMU process for
> > each CPU architecture that is needed.  I'm also developing some QEMU
> > code that transports MMIO transactions across the process boundaries
> > using sockets.
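> > 
> > For illustration, the wire format is just a pair of small structs along
> > these lines (the names and fields here are my own, nothing from QEMU):
> > 
> > /* Hypothetical wire format for forwarding MMIO accesses between
> >  * QEMU processes; names and fields are illustrative only. */
> > typedef struct MMIORequest {
> >     uint64_t addr;     /* guest physical address of the register */
> >     uint32_t size;     /* access size in bytes (1, 2, 4 or 8) */
> >     uint32_t is_write; /* 0 = read, 1 = write */
> >     uint64_t data;     /* payload for writes, unused for reads */
> > } MMIORequest;
> > 
> > typedef struct MMIOResponse {
> >     uint64_t data;     /* value read, or echo of the value written */
> >     uint32_t status;   /* 0 on success, non-zero on error */
> > } MMIOResponse;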
> 
> I have CCed Phil. He has been working on heterogeneous target emulation
> and might be interested.
> 
> > The design takes MMIO request messages off a socket, services the
> > request by calling address_space_ldq_be(), then sends a response
> > message (containing the requested data) over the same
> > socket.  Currently, this is all done inside the socket IOReadHandler
> > callback function.
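> > 
> > Roughly, the handler looks like the sketch below (simplified; the
> > transport helpers and request parsing are placeholders for my own code,
> > while address_space_ldq_be() and the memory API types are real QEMU):
> > 
> > /* Simplified sketch of the IOReadHandler that services requests.
> >  * Assumes the usual QEMU includes (qemu/osdep.h, exec/memory.h), the
> >  * hypothetical MMIORequest/MMIOResponse structs above, and one
> >  * complete request per callback. */
> > static void mmio_request_read_handler(void *opaque, const uint8_t *buf,
> >                                       int size)
> > {
> >     MMIORequest req;
> >     MMIOResponse rsp = { 0 };
> >     MemTxResult res;
> > 
> >     memcpy(&req, buf, sizeof(req));
> > 
> >     /* Service the access against this process's address space. */
> >     rsp.data = address_space_ldq_be(&address_space_memory, req.addr,
> >                                     MEMTXATTRS_UNSPECIFIED, &res);
> >     rsp.status = (res == MEMTX_OK) ? 0 : 1;
> > 
> >     /* Send the response back over the same socket (placeholder). */
> >     send_response(opaque, &rsp);
> > }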
> 
> At a high level this is similar to the vfio-user feature where a PCI
> device is emulated in a separate process. This also involves sending
> messages describing QEMU's MemoryRegion accesses. See the "remote"
> machine type in QEMU to look at the code.
> 
> > This works as long as the targeted register exists in the same QEMU
> > process that received the request.  However, if the register exists in
> > another QEMU process, then the call to address_space_ldq_be() results
> > in another socket message being sent to that QEMU process, requesting
> > the data, and then waiting (blocking) for the response message
> > containing the data.  In other words, it ends up blocking inside the
> > event handler, and even though the QEMU process containing the target
> > register is able to receive the request and send the response, the
> > originator of the request cannot receive the response until it
> > eventually times out and stops blocking.  Once it times out, it does
> > receive the response, but by then it is too late.
> > 
> > Here's a summary of the stack up to where the code blocks:
> > 
> > IOReadHandler callback
> >   calls address_space_ldq_be()
> >     resolves to mmio read op of a remote device
> >       sends request over socket and waits (blocks) for response
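> > 
> > The blocking happens in the remote device's MMIO read callback, which
> > conceptually looks like this (the send/wait helpers and the device
> > state are placeholders for my transport code):
> > 
> > /* Sketch of the remote device's read op; RemoteDevState and the
> >  * socket helpers are hypothetical. */
> > static uint64_t remote_dev_read(void *opaque, hwaddr addr, unsigned size)
> > {
> >     RemoteDevState *s = opaque;
> >     MMIORequest req = { .addr = addr, .size = size, .is_write = 0 };
> >     MMIOResponse rsp;
> > 
> >     send_request(s->socket_fd, &req);      /* placeholder transport */
> >     wait_for_response(s->socket_fd, &rsp); /* <-- blocks here */
> > 
> >     return rsp.data;
> > }
> > 
> > static const MemoryRegionOps remote_dev_ops = {
> >     .read = remote_dev_read,
> >     .endianness = DEVICE_BIG_ENDIAN,
> > };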
> > 
> > So, I'm looking for a way to handle the work of calling
> > address_space_ldq_be(), which might block when attempting to read a
> > register of a remote device, without blocking inside the IOReadHandler
> > callback context.
> > 
> > I've done a lot of searching and reading, both on the web and in the
> > QEMU code, but it's still not clear to me how this should be done in
> > QEMU.  I've seen a lot about using coroutines to
> > handle cases like this. Is that what I should be using here?
> 
> The fundamental problem is that address_space_ldq_be() is synchronous,
> so there is no way to return to the caller until the response has
> been received.
> 
> vfio-user didn't solve this problem. It simply blocks until the
> response is received, but it does drop the Big QEMU Lock during this
> time so that other vCPU threads can run. For example, see
> hw/remote/proxy.c:send_bar_access_msg() and
> mpqemu_msg_send_and_await_reply().
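> 
> A rough sketch of that pattern (not the actual hw/remote code, just its
> shape; the transport helpers and device state are placeholders):
> 
> /* Block for the reply, but drop the Big QEMU Lock so that vCPU
>  * threads can keep running in the meantime.  bql_unlock()/bql_lock()
>  * were previously called qemu_mutex_unlock/lock_iothread(). */
> static uint64_t blocking_remote_read(RemoteDevState *s, hwaddr addr,
>                                      unsigned size)
> {
>     MMIORequest req = { .addr = addr, .size = size };
>     MMIOResponse rsp;
> 
>     bql_unlock();
>     send_request(s->socket_fd, &req);      /* placeholder transport */
>     wait_for_response(s->socket_fd, &rsp); /* blocks without the BQL */
>     bql_lock();
> 
>     return rsp.data;
> }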
> 
> QEMU supports nested event loops, but they come with their own set of
> gotchas. The way a nested event loop might help here is to send the
> request and then call aio_poll() to receive the response in another
> IOReadHandler. This way other event loop processing can take place
> while waiting in address_space_ldq_be().
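> 
> Something along these lines (a sketch only; the response_pending flag
> and the transport helpers are hypothetical):
> 
> /* Nested event loop: send the request, then run aio_poll() until some
>  * other IOReadHandler has delivered the response and cleared the flag. */
> static uint64_t nested_loop_remote_read(RemoteDevState *s, hwaddr addr,
>                                         unsigned size)
> {
>     MMIORequest req = { .addr = addr, .size = size };
> 
>     s->response_pending = true;
>     send_request(s->socket_fd, &req);           /* placeholder transport */
> 
>     while (s->response_pending) {
>         aio_poll(qemu_get_aio_context(), true); /* run other handlers */
>     }
> 
>     return s->response_data;
> }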
> 
> The second problem is that this approach, where QEMU processes send
> requests to each other, needs to be implemented carefully to avoid
> deadlocks. For example, devices that do DMA could load/store memory
> belonging to another device handled by another QEMU. Once there is an
> A -> B -> A situation it could deadlock.
> 
> Both vfio-user and vhost-user have similar issues with their
> bi-directional communication where a device emulation process can send
> a message to QEMU while processing a message from QEMU. Deadlock can
> be avoided if the code is structured so that QEMU is able to receive
> new requests during the time when it is waiting for a response.
> 
> Stefan

Stefan, thank you for the quick response and great information!

I'm not sure if this is the best way, but I was able to get things
working today using the coroutine approach.

Now, the aforementioned stack looks like this:

IOReadHandler callback receives request
  enters coroutine
    calls address_space_ldq_be()
      resolves to mmio read op of a remote device
        sends request over socket
        detects coroutine context and
        calls qemu_coroutine_yield() instead of blocking
  returns to callback 

<time passes>

IOReadHandler callback receives response
  re-enters coroutine
      mmio read op returns data received in response message
    address_space_ldq_be() returns
  coroutine completes and returns to callback
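
In code terms, the shape of it is roughly this (simplified; the
per-request state and transport helpers are specific to my code, the
coroutine calls are the real QEMU APIs, and in practice a single
handler dispatches on message type rather than two separate ones):

/* Request path: run the MMIO access inside a coroutine so that the
 * remote read op can yield instead of blocking. */
static void coroutine_fn handle_request_co(void *opaque)
{
    PendingRequest *p = opaque;          /* hypothetical per-request state */
    MemTxResult res;

    p->data = address_space_ldq_be(&address_space_memory, p->addr,
                                   MEMTXATTRS_UNSPECIFIED, &res);
    send_response(p);                    /* placeholder transport */
}

static void request_read_handler(void *opaque, const uint8_t *buf, int size)
{
    PendingRequest *p = parse_request(opaque, buf, size);  /* placeholder */
    Coroutine *co = qemu_coroutine_create(handle_request_co, p);

    qemu_coroutine_enter(co);   /* may yield inside the remote read op */
}

/* Remote device read op: yield instead of blocking for the response. */
static uint64_t remote_dev_read(void *opaque, hwaddr addr, unsigned size)
{
    RemoteDevState *s = opaque;
    MMIORequest req = { .addr = addr, .size = size, .is_write = 0 };

    send_request(s->socket_fd, &req);            /* placeholder transport */
    if (qemu_in_coroutine()) {
        s->waiter = qemu_coroutine_self();
        qemu_coroutine_yield();                  /* suspend until the reply */
    }
    return s->response_data;
}

/* Response path: the socket handler re-enters the suspended coroutine. */
static void response_read_handler(void *opaque, const uint8_t *buf, int size)
{
    RemoteDevState *s = opaque;

    s->response_data = parse_response(buf, size);  /* placeholder */
    aio_co_wake(s->waiter);    /* resumes remote_dev_read() after the yield */
}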

While this works, I couldn't help but notice that the coroutine concept
seems to be a form of multithreading.  Is there some advantage to
using coroutines over doing the work in another thread?  Does QEMU
offer an interface that allows a callback to queue up work that can
be handled by another thread or a pool of threads?

Thanks,

Glenn Miles


