Peter Maydell <peter.mayd...@linaro.org> writes:
> I've been investigating a race condition where sometimes when my
> guest writes to a device register which triggers a
> qemu_system_reset_request(), it doesn't actually cause a clean reset,
> but instead the guest CPU continues to execute instructions.
> I managed to repro it under 'rr', which let me walk through enough
> of what was going on to determine the following:
>
> When a guest CPU thread calls qemu_system_reset_request(), this
> results in a call to qemu_cpu_stop(current_cpu, true), to
> make the CPU come back out to the main loop. We also set the
> reset_requested flag, to get the IO thread to actually do the
> reset.
>
> The main loop thread runs main_loop_should_exit(). If there is a
> pending reset, it calls pause_all_vcpus(), with the intention
> that this quiesces all the guest CPUs before it starts messing
> with reset actions.
>
> pause_all_vcpus() just waits for every cpu to have cpu->stopped set.
> However, if the running cpu has just called qemu_cpu_stop() on
> itself then it will have set cpu->stopped true but not actually
> made it out to the main loop yet. (In the case I'm looking at,
> what happens is that as soon as the CPU thread unlocks the
> iothread mutex in io_writex() after the device write, the
> main thread runs and does all the reset operations.)
>
> The reset code in the iothread then proceeds to start calling
> various reset functions while the CPU thread is still inside
> the exec loop, running generated code and so on. This doesn't
> seem like what ought to happen. In particular it includes
> calling cpu_common_reset(), which clears all kinds of flags
> relevant to the still-executing CPU...

I would have thought the reset code should be scheduled via safe async
work to run in the vCPU context. Why should the main loop get involved
at all here?

>
> Any suggestions for how we should fix this?
>
> thanks
> -- PMM

--
Alex Bennée
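
As a rough illustration of the window described in the quoted text, here is a
toy, single-threaded model in C. It is only a sketch: struct toy_vcpu and every
toy_* function are hypothetical names made up for this example, not QEMU's
CPUState or any QEMU API, and the "strict" check is there to show what the
flag-only check misses, not to propose a fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a vCPU; NOT QEMU's CPUState. */
struct toy_vcpu {
    bool stopped;       /* set as soon as a stop is requested */
    bool in_exec_loop;  /* true while the thread is still running guest code */
};

/* Models a vCPU thread asking itself to stop from inside a device write. */
static void toy_request_self_stop(struct toy_vcpu *cpu)
{
    cpu->stopped = true;
    /* in_exec_loop is deliberately left alone: the thread has not unwound yet. */
}

/* Models the vCPU thread finally getting back out of the exec loop. */
static void toy_leave_exec_loop(struct toy_vcpu *cpu)
{
    cpu->in_exec_loop = false;
}

/* A waiter that looks only at the stopped flag. */
static bool toy_quiesced_flag_only(const struct toy_vcpu *cpu)
{
    return cpu->stopped;
}

/* A waiter that also requires the thread to have left the exec loop. */
static bool toy_quiesced_strict(const struct toy_vcpu *cpu)
{
    return cpu->stopped && !cpu->in_exec_loop;
}

/* Walks through the sequence; returns 1 if the window was observed. */
int toy_demo(void)
{
    struct toy_vcpu cpu = { .stopped = false, .in_exec_loop = true };
    int window_seen;

    toy_request_self_stop(&cpu);

    /* Flag-only check says "safe to reset" while guest code still runs. */
    window_seen = toy_quiesced_flag_only(&cpu) && !toy_quiesced_strict(&cpu);

    toy_leave_exec_loop(&cpu);

    /* Once the thread has unwound, both checks agree. */
    assert(toy_quiesced_flag_only(&cpu) && toy_quiesced_strict(&cpu));

    return window_seen;
}
```

Calling toy_demo() from a main() walks the same sequence as the report: between
the self-stop and leaving the exec loop, the flag-only check already reports the
vCPU as quiesced, and that is the interval in which the reset code ran.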