Hi Peter,
On Tue, Oct 14, 2025 at 7:36 PM Salil Mehta <[email protected]> wrote:
>
> Hi Peter,
>
> > From: [email protected] <qemu-
> > [email protected]> On Behalf Of Peter
> > Maydell
> > Sent: Tuesday, October 14, 2025 4:44 PM
> > To: Salil Mehta <[email protected]>
> >
> > On Tue, 14 Oct 2025 at 16:33, Salil Mehta <[email protected]> wrote:
> > >
> > > > From: Peter Maydell <[email protected]>
> > > > Sent: Tuesday, October 14, 2025 4:24 PM
> > > > To: Salil Mehta <[email protected]>
> > > >
> > > > On Tue, 14 Oct 2025 at 16:13, Salil Mehta <[email protected]>
> > wrote:
> > > > >
> > > > > > From: Peter Maydell <[email protected]> In what situation
> > > > > > do we ever start running a VCPU before the *GIC* has been
> > > > > > realized? The GIC should get realized as part of creating the
> > > > > > virt board, which must complete before we do anything like running a
> > vcpu.
> > > > >
> > > > >
> > > > > Just after realization of vCPU in the machvirt_init() you can see
> > > > > the default power_state is PSCI CPU_ON, which means
> > > > KVM_MP_STATE_RUNNABLE.
> > > > > Since, the thread is up and not doing IO wait in userspace it gets
> > > > > into
> > > > > cpu_exec() loop and actually run KVM_RUN IOCTL. Inside the KVM it
> > > > > momentarily takes the vCPU mutex but later exit and releases. This
> > > > > keeps going on for all of the vCPU threads realized early.
> > > >
> > > > Yikes. We definitely should fix that : letting the vcpu run before
> > > > we get to
> > > > qemu_machine_creation_done() seems like it would be a massive source
> > > > of race conditions.
> > >
> > > I've already proposed fix for this by parking such threads in
> > > userspace. Please check functions virt_(un)park_cpu_in_userspace().
> > > But need to check if we can use this trick can be used at the very early
> > stages of the VM initialization.
> >
> > I had a look at this on x86, and we correctly don't try to KVM_RUN the vcpus
> > early. What happens there is:
> > * the vcpu thread calls qemu_process_cpu_events()
>
>
> I cannot find this function in the Qemu mainline of 28th September ?
>
>
> > * this causes it to go to sleep on the cpu->halt_cond: so
> > it will not end up doing KVM_RUN yet
> > * later, the main thread completes initialization of the board
> > * qdev_machine_creation_done() calls cpu_synchronize_all_post_init()
> > * for kvm, this causes us to call kvm_cpu_synchronize_post_init()
> > for each vcpu
> > * that will call run_on_cpu() which ends up calling qemu_cpu_kick()
> > * qemu_cpu_kick() does a broadcast on cpu->halt_cond, which
> > wakes up the vcpu thread and lets it go into kvm_cpu_exec()
> > for the first time
> >
> > Why doesn't this mechanism work on Arm ?
>
> It is a combination of things:
>
> void qemu_wait_io_event(CPUState *cpu)
> {
> [...]
> while (cpu_thread_is_idle(cpu)) {
> [...]
> qemu_cond_wait(cpu->halt_cond, &bql);
> }
> [...]
> }
>
> 1. To block we should wait on 'halt_cond' as you rightly pointed.
> 2. but condition to wait is to check of the CPU is IDLE or not.
> 3. Various conditions in which CPU can be termed IDLE are:
> 3.1 STOPPED
> 3.2 HALTED
> 3.3 It does not have any queued work to process.
>
>
I was partly wrong in my earlier analysis; sorry about that. The real
flow is:

1. qdev_realize() -> qemu_vcpu_init() sets CPUState::stopped = true.
   This ensures the threads created during machvirt_init() block in I/O
   wait. That part works as expected.
2. qemu_system_reset() runs via qemu_machine_creation_done(). This
   performs the CPU reset, and cpu_common_reset_hold() initializes
   CPUState::halted from the start-powered-off property.
3. resume_all_vcpus() is called from vm_start(). It sets
   CPUState::stopped = false and kicks each vCPU. However,
   CPUState::halted remains whatever start-powered-off configured
   (typically 1 for secondaries).

Thus, for each secondary vCPU we have halted = 1 and stopped = 0. These
conditions should be sufficient for the vCPU thread to be treated as
idle and to sleep (I/O wait) on halt_cond. However, inside
cpu_thread_is_idle(), the halted = 1 condition is effectively bypassed
because kvm_halt_in_kernel_allowed is unconditionally set to true for
Arm in kvm_arch_init(). As a result, the thread does not take the
userspace sleep path even though the vCPU is halted.
I traced the code and the earlier patch as follows:
bool cpu_thread_is_idle(CPUState *cpu)
{
    if (cpu->stop || !cpu_work_list_empty(cpu)) {
        return false;
    }
    if (cpu_is_stopped(cpu)) {
        return true;
    }
    if (!cpu->halted || cpu_has_work(cpu)) {
        return false;
    }
    /* at this point we expect CPUState::halted == 1 */
    if (cpus_accel->cpu_thread_is_idle) {
        /* flow goes into this for Arm and returns false */
        return cpus_accel->cpu_thread_is_idle(cpu);
    }
    return true;
}
where cpus_accel->cpu_thread_is_idle points to kvm_vcpu_thread_is_idle():

static bool kvm_vcpu_thread_is_idle(CPUState *cpu)
{
    return !kvm_halt_in_kernel();
}
File: kvm.h

/**
 * kvm_halt_in_kernel
 *
 * Returns: true if halted cpus should still get a KVM_RUN ioctl to run
 * inside of kernel space. This only works if MP state is implemented.
 */
#define kvm_halt_in_kernel() (kvm_halt_in_kernel_allowed)
File: target/arm/kvm.c

int kvm_arch_init(MachineState *ms, KVMState *s)
{
    [...]
    /*
     * PSCI wakes up secondary cores, so we always need to
     * have vCPUs waiting in kernel space
     */
    kvm_halt_in_kernel_allowed = true;
    [...]
}
Patch: ARM: KVM: Enable in-kernel timers with user space gic
https://lore.kernel.org/qemu-devel/[email protected]/
Net effect: even when halted = 1 and there is no queued work, KVM on
Arm forces cpu_thread_is_idle() to return false, so the thread enters
KVM_RUN. This causes vCPU lock contention during VM initialization.
This appears to stem from a 2017 patch; perhaps you can shed more light
on this?
Many thanks!
Best regards
Salil.