Hi Peter,

On Tue, Oct 14, 2025 at 7:36 PM Salil Mehta <[email protected]> wrote:
>
> Hi Peter,
>
> > From: [email protected] <qemu-
> > [email protected]> On Behalf Of Peter
> > Maydell
> > Sent: Tuesday, October 14, 2025 4:44 PM
> > To: Salil Mehta <[email protected]>
> >
> > On Tue, 14 Oct 2025 at 16:33, Salil Mehta <[email protected]> wrote:
> > >
> > > > From: Peter Maydell <[email protected]>
> > > > Sent: Tuesday, October 14, 2025 4:24 PM
> > > > To: Salil Mehta <[email protected]>
> > > >
> > > > On Tue, 14 Oct 2025 at 16:13, Salil Mehta <[email protected]>
> > wrote:
> > > > >
> > > > > > From: Peter Maydell <[email protected]> In what situation
> > > > > > do we ever start running a VCPU before the *GIC* has been
> > > > > > realized? The GIC should get realized as part of creating the
> > > > > > virt board, which must complete before we do anything like running a
> > vcpu.
> > > > >
> > > > >
> > > > > Just after realization of the vCPUs in machvirt_init(), you can see
> > > > > that the default power_state is PSCI CPU_ON, which means
> > > > > KVM_MP_STATE_RUNNABLE.
> > > > > Since the thread is up and not doing an I/O wait in userspace, it
> > > > > gets into the cpu_exec() loop and actually runs the KVM_RUN ioctl.
> > > > > Inside KVM it momentarily takes the vCPU mutex but later exits and
> > > > > releases it. This keeps going for all of the vCPU threads realized
> > > > > early.
> > > >
> > > > Yikes. We definitely should fix that : letting the vcpu run before
> > > > we get to
> > > > qemu_machine_creation_done() seems like it would be a massive source
> > > > of race conditions.
> > >
> > > I've already proposed a fix for this by parking such threads in
> > > userspace. Please check the functions virt_(un)park_cpu_in_userspace().
> > > But we need to check whether this trick can be used at the very early
> > > stages of the VM initialization.
> >
> > I had a look at this on x86, and we correctly don't try to KVM_RUN the vcpus
> > early. What happens there is:
> >  * the vcpu thread calls qemu_process_cpu_events()
>
>
> I cannot find this function in the QEMU mainline as of 28th September?
>
>
> >  * this causes it to go to sleep on the cpu->halt_cond: so
> >    it will not end up doing KVM_RUN yet
> >  * later, the main thread completes initialization of the board
> >  * qdev_machine_creation_done() calls cpu_synchronize_all_post_init()
> >  * for kvm, this causes us to call kvm_cpu_synchronize_post_init()
> >    for each vcpu
> >  * that will call run_on_cpu() which ends up calling qemu_cpu_kick()
> >  * qemu_cpu_kick() does a broadcast on cpu->halt_cond, which
> >    wakes up the vcpu thread and lets it go into kvm_cpu_exec()
> >    for the first time
> >
> > Why doesn't this mechanism work on Arm ?
>
> It is a combination of things:
>
> void qemu_wait_io_event(CPUState *cpu)
> {
> [...]
>     while (cpu_thread_is_idle(cpu)) {
>          [...]
>         qemu_cond_wait(cpu->halt_cond, &bql);
>     }
> [...]
> }
>
> 1. To block, we should wait on 'halt_cond', as you rightly pointed out.
> 2. But the condition for waiting is a check of whether the CPU is idle or not.
> 3. The various conditions under which a CPU can be termed idle are:
>      3.1  It is STOPPED.
>      3.2  It is HALTED.
>      3.3  It does not have any queued work to process.
>
>

I was partly wrong in my earlier analysis; sorry about that. The real flow is:

qdev_realize() -> qemu_vcpu_init() sets CPUState::stopped = true.
This ensures that the threads created during machvirt_init() block in an
I/O wait. That part works as expected.

qemu_system_reset() runs via qemu_machine_creation_done(). This performs
the CPU reset, and cpu_common_reset_hold() initializes
CPUState::halted from the start-powered-off property.

resume_all_vcpus() is called from vm_start(). It sets
CPUState::stopped = false and kicks each vCPU. However,
CPUState::halted remains whatever start-powered-off configured
(typically 1 for secondaries).

Thus, for each secondary vCPU we have: halted = 1, stopped = 0.

These conditions should be sufficient for the vCPU thread to be treated as
idle and to sleep (I/O wait) on halt_cond.

However, inside cpu_thread_is_idle(), the halted = 1 condition is
effectively bypassed because kvm_halt_in_kernel_allowed defaults to
true. As a result, the thread does not take the userspace sleep path
even though the vCPU is halted.

I traced the code and the earlier patch as follows:

bool cpu_thread_is_idle(CPUState *cpu)
{
    if (cpu->stop || !cpu_work_list_empty(cpu)) {
        return false;
    }
    if (cpu_is_stopped(cpu)) {
        return true;
    }
    if (!cpu->halted || cpu_has_work(cpu)) {
        return false;
    }

    /* At this point we expect CPUState::halted == 1. */

    if (cpus_accel->cpu_thread_is_idle) {
        /* The flow goes into this for Arm and returns false. */
        return cpus_accel->cpu_thread_is_idle(cpu);
    }
    return true;
}

where cpus_accel->cpu_thread_is_idle is kvm_vcpu_thread_is_idle():

static bool kvm_vcpu_thread_is_idle(CPUState *cpu)
{
    return !kvm_halt_in_kernel();
}


File: kvm.h
/**
 * kvm_halt_in_kernel
 *
 * Returns: true if halted cpus should still get a KVM_RUN ioctl to run
 * inside of kernel space. This only works if MP state is implemented.
 */
#define kvm_halt_in_kernel() (kvm_halt_in_kernel_allowed)


File: target/arm/kvm.c
int kvm_arch_init(MachineState *ms, KVMState *s)
{
[...]

    /*
     * PSCI wakes up secondary cores, so we always need to
     * have vCPUs waiting in kernel space
     */
    kvm_halt_in_kernel_allowed = true;
[...]
}

Patch: ARM: KVM: Enable in-kernel timers with user space gic
https://lore.kernel.org/qemu-devel/[email protected]/

Net effect: even when halted = 1 and there is no work, KVM/Arm forces
cpu_thread_is_idle() to return false, so the thread enters KVM_RUN. This
causes vCPU lock contention during VM initialization.

This looks to be a 2017 patch; maybe you can help shed more light on it?


Many thanks!

Best regards
Salil.
