Hi,

I happened to hit a kdump hang issue in a Linux VM running on a Hyper-V host. Please see the attached log: the kdump kernel always hangs, even if I configure the VM with only 1 virtual CPU.
I first hit the issue with RHEL 8.3's 4.18.x kernel, but later found that the latest upstream v5.10-rc5 has the same issue (at least the symptom is exactly the same), so I dug into v5.10-rc5 and found that the kdump kernel always hangs in kernel_init() -> mark_readonly() -> rcu_barrier() -> wait_for_completion(&rcu_state.barrier_completion).

Let's take the 1-vCPU case as an example (refer to the attached log). In the code below, rcu_segcblist_n_cbs() somehow returns non-zero, so smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1) is called; rcu_barrier_func() then increments the counter by 1, hence the counter is still 1 after the atomic_sub_and_test(), complete() is not called, and wait_for_completion() never returns:

static void rcu_barrier_func(void *cpu_in)
{
        ...
        if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
                atomic_inc(&rcu_state.barrier_cpu_count);
        } else {
                ...
        }
        ...
}

void rcu_barrier(void)
{
        ...
        atomic_set(&rcu_state.barrier_cpu_count, 2);
        ...
        for_each_possible_cpu(cpu) {
                rdp = per_cpu_ptr(&rcu_data, cpu);
                ...
                if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
                        ...
                        smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1);
                        ...
                }
        }
        ...
        if (atomic_sub_and_test(2, &rcu_state.barrier_cpu_count))
                complete(&rcu_state.barrier_completion);
        ...
        wait_for_completion(&rcu_state.barrier_completion);
        ...

Sorry for my ignorance of RCU -- I'm not sure why rcu_segcblist_n_cbs() returns 1 here. In the normal kernel it returns 0, so the normal kernel does not hang.

Note: in the kdump kernel, if I remove the kernel parameter console=ttyS0 OR if I build the kernel with CONFIG_HZ=250, the issue no longer reproduces. Currently my kernel uses CONFIG_HZ=1000 and console=ttyS0, so I'm able to reproduce the issue every time.

Note: with the same kernel binary, the issue does not reproduce when the VM runs on another Hyper-V host. It looks like there is some kind of race condition?

Looking forward to your insights! I'm happy to test any patch or enable more tracing, if necessary. Thanks!

Thanks,
-- Dexuan
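PS: to spell out the counter accounting as I understand it: rcu_barrier() starts the counter at 2, rcu_barrier_func() adds 1 when it entrains the barrier callback, and (if I'm reading kernel/rcu/tree.c correctly) the matching decrement only happens in rcu_barrier_callback(), i.e. only when the entrained callback is eventually invoked. In the hanging kdump kernel that invocation apparently never happens, so after the atomic_sub_and_test(2, ...) the counter is stuck at 1. Below is a minimal, purely illustrative userspace model of that arithmetic -- the variable names are mine and this is of course not the actual kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        atomic_int barrier_cpu_count;
        /* In the hanging kdump kernel the entrained callback never runs. */
        bool barrier_callback_invoked = false;

        /* rcu_barrier(): initial value */
        atomic_init(&barrier_cpu_count, 2);

        /* rcu_barrier_func() on the only vCPU: entrain succeeded, so +1 */
        atomic_fetch_add(&barrier_cpu_count, 1);

        /* rcu_barrier_callback(): -1, but only if the callback is invoked */
        if (barrier_callback_invoked)
                atomic_fetch_sub(&barrier_cpu_count, 1);

        /* rcu_barrier(): atomic_sub_and_test(2, ...) */
        if (atomic_fetch_sub(&barrier_cpu_count, 2) - 2 == 0)
                printf("complete() is called, rcu_barrier() returns\n");
        else
                printf("counter stuck at %d, wait_for_completion() hangs\n",
                       atomic_load(&barrier_cpu_count));

        return 0;
}

With barrier_callback_invoked == false this prints "counter stuck at 1, ..."; flipping it to true lets the counter reach 0, which matches the non-hanging case.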
[Attachment: bad-hz-1000.log]