On Mon, 2025-03-24 at 14:35 -0400, Stefan Hajnoczi wrote:
> On Fri, Mar 21, 2025 at 11:17 AM Miles Glenn <[email protected]> wrote:
> > On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> > > On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <[email protected]> wrote:
> > > > Hello,
> > > >
> > > > I am attempting to simulate a system with multiple CPU
> > > > architectures. To do this I am starting a unique QEMU process for each
> > > > CPU architecture that is needed. I'm also developing some QEMU code
> > > > that aids in transporting MMIO transactions across the process
> > > > boundaries using sockets.
> > >
> > > I have CCed Phil. He has been working on heterogeneous target emulation
> > > and might be interested.
> > >
> > > > The design takes MMIO request messages off of a socket, services the
> > > > request by calling address_space_ldq_be(), then sends a response
> > > > message (containing the requested data) over the same
> > > > socket. Currently, this is all done inside the socket IOReadHandler
> > > > callback function.
> > >
> > > At a high level this is similar to the vfio-user feature where a PCI
> > > device is emulated in a separate process. This also involves sending
> > > messages describing QEMU's MemoryRegion accesses. See the "remote"
> > > machine type in QEMU to look at the code.
> > >
> > > > This works as long as the targeted register exists in the same QEMU
> > > > process that received the request. However, if the register exists in
> > > > another QEMU process, then the call to address_space_ldq_be() results
> > > > in another socket message being sent to that QEMU process, requesting
> > > > the data, and then waiting (blocking) for the response message
> > > > containing the data. In other words, it ends up blocking inside the
> > > > event handler and even though the QEMU process containing the target
> > > > register was able to receive the request and send the response, the
> > > > originator of the request is unable to receive the response until it
> > > > eventually times out and stops blocking. Once it times out and stops
> > > > blocking, it does receive the response, but now it is too late.
> > > >
> > > > Here's a summary of the stack up to where the code blocks:
> > > >
> > > > IOReadHandler callback
> > > >   calls address_space_ldq_be()
> > > >     resolves to mmio read op of a remote device
> > > >       sends request over socket and waits (blocks) for response
> > > >
> > > > So, I'm looking for a way to handle the work of calling
> > > > address_space_ldq_be(), which might block when attempting to read a
> > > > register of a remote device, without blocking inside the IOReadHandler
> > > > callback context.
> > > >
> > > > I've done a lot of searches and reading about how to do this on the web
> > > > and in the QEMU code but it's still not really clear to me how this
> > > > should be done in QEMU. I've seen a lot about using coroutines to
> > > > handle cases like this. Is that what I should be using here?
> > >
> > > The fundamental problem is that address_space_ldq_be() is synchronous,
> > > so there is no way to return back to the caller until the response has
> > > been received.
> > >
> > > vfio-user didn't solve this problem. It simply blocks until the
> > > response is received, but it does drop the Big QEMU Lock during this
> > > time so that other vCPU threads can run. For example, see
> > > hw/remote/proxy.c:send_bar_access_msg() and
> > > mpqemu_msg_send_and_await_reply().
> > >
> > > QEMU supports nested event loops, but they come with their own set of
> > > gotchas. The way a nested event loop might help here is to send the
> > > request and then call aio_poll() to receive the response in another
> > > IOReadHandler. This way other event loop processing can take place
> > > while waiting in address_space_ldq_be().
> > >
> > > The second problem is that this approach where QEMU processes send
> > > requests to each other needs to be implemented carefully to avoid
> > > deadlocks. For example, devices that do DMA could load/store memory
> > > belonging to another device handled by another QEMU. Once there is an
> > > A -> B -> A situation it could deadlock.
> > >
> > > Both vfio-user and vhost-user have similar issues with their
> > > bi-directional communication where a device emulation process can send
> > > a message to QEMU while processing a message from QEMU. Deadlock can
> > > be avoided if the code is structured so that QEMU is able to receive
> > > new requests during the time when it is waiting for a response.
> > >
> > > Stefan
> >
> > Stefan, Thank you for the quick response and great information!
> >
> > I'm not sure if this is the best way, but I was able to get things
> > working today using the coroutine approach.
> >
> > Now, the aforementioned stack looks like this:
> >
> > IOReadHandler callback receives request
> >   enters coroutine
> >     calls address_space_ldq_be()
> >       resolves to mmio read op of a remote device
> >         sends request over socket
> >           detects coroutine context and calls qemu_coroutine_yield() instead of blocking
> >   returns to callback
> >
> > <time passes>
> >
> > IOReadHandler callback receives response
> >   re-enters coroutine
> >     mmio read op returns data received in response message
> >     address_space_ldq_be() returns
> >   coroutine completes and returns to callback
> >
> > While this works, I couldn't help but notice that the coroutine concept
> > seems to be like a form of multithreading. Is there some advantage to
> > using coroutines over doing the work in another thread? Does QEMU
> > offer an interface that allows for a callback to queue up work that can
> > be handled by another thread or a pool of threads?
>
> Coroutines make it easier to write concurrent code in an event loop.
> The alternative is to write asynchronous callback functions, which is
> tedious for sequences with multiple steps that need to wait for I/O.
>
> Coroutines do not offer parallelism, so they are not a replacement for
> multi-threading. QEMU is mostly event-driven rather than
> multi-threaded. Usually only computation in QEMU that really needs its
> own CPU runs in its own thread (vCPUs, compression, blocking syscalls
> when there is no alternative, etc).
>
> There are advantages to using coroutines: less synchronization is
> necessary than with threads (you can be sure no other coroutine will
> run in the same thread while your code is running) and this eliminates
> most thread-safety issues. Also, event loops are seen as more scalable
> than threads (lots of historical resources, for example
> http://www.kegel.com/c10k.html). One QEMU-specific advantage of
> coroutines: coroutine code has access to all of QEMU's APIs that
> require the event loop whereas threads need to take extra steps to
> interact with the rest of QEMU.
>
> Stefan

Thanks for the explanation, Stefan!

Glenn
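
The coroutine flow described in the thread (enter on request, yield instead
of blocking, re-enter on response) can be modeled in a few lines. This is
only a toy sketch using Python generators, not QEMU code: the names
mmio_read_coroutine, ReadHandler, outbox, pending and completed are
hypothetical, the register addresses and values are made up, and the bare
"yield" merely stands in for qemu_coroutine_yield().

```python
# Toy model of the request/yield/response flow; none of these names are QEMU APIs.

def mmio_read_coroutine(addr, local_regs, send_request):
    """Service one MMIO read, suspending instead of blocking on a remote register."""
    if addr in local_regs:
        return local_regs[addr]      # register lives in this process
    send_request(addr)               # forward the request to the peer process
    value = yield                    # stand-in for qemu_coroutine_yield()
    return value                     # resumed with the peer's response


class ReadHandler:
    """Stands in for the socket read-handler callback of one process."""

    def __init__(self, local_regs):
        self.local_regs = local_regs
        self.outbox = []             # addresses "sent" to the peer
        self.pending = {}            # addr -> suspended coroutine
        self.completed = {}          # addr -> value read

    def on_request(self, addr):
        co = mmio_read_coroutine(addr, self.local_regs, self.outbox.append)
        try:
            next(co)                 # enter the coroutine
            self.pending[addr] = co  # it yielded, so return to the event loop
        except StopIteration as done:
            self.completed[addr] = done.value

    def on_response(self, addr, value):
        co = self.pending.pop(addr)
        try:
            co.send(value)           # re-enter the coroutine with the response
        except StopIteration as done:
            self.completed[addr] = done.value


handler = ReadHandler({0x10: 0xAA})
handler.on_request(0x10)             # local register: completes immediately
handler.on_request(0x20)             # remote register: coroutine suspends
handler.on_response(0x20, 0xBB)      # later, the response resumes it
```

Because the handler returns to its caller while the read of 0x20 is still
pending, it stays free to process other messages in the meantime, which is
the property Stefan describes as necessary for avoiding deadlock.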
