Hello folks, We observed that Arm VM instance (Centos 8) runs into timeout at boot time when it is launched by Qemu/KVM on an Arm server host that enabled hugepages (2MB/1GB). We also noticed that Will has a workaround that applies to the guest kernel (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0e1645557d19fc6d88d3c40431f63a3c3a4c417b) But, it simply increased the timeout time from 1s to 5s and didn't accepted by the upstream so far for some reasons. Moreover, even it is accepted by the upstream, it should take a long time for the distributions like CentOS to backport it.
We are thinking that if we have a chance to fix this problem at host side. Because most cloud providers need less time than distributions to upgrade their software, then eventually, no need to upgrade guest OS. We have investigated the root cause for a few weeks, and barely understand this problem may related to page locking contention when guest kernel allocates per-cpu memory for each CPU at boot time. However, we are not so sure about the root cause and hopping you could share your expertise on this. Really appreciate if you could share your insights! Reproduce step: ============= - Enable 1GB hugepage on the host kernel side. - Start vm using following command (qemu must support passing cpu topology through acpi&dt): # qemu-system-aarch64 -name guest=12345,debug-threads=on \ -machine virt,accel=kvm,usb=off,dump-guest-core=off,gic-version=3 \ -cpu host \ -m 122880 \ -object iothread,id=iothread1 \ -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=64424509440,host-nodes=0,policy=preferred \ -numa node,nodeid=0,cpus=0-29,memdev=ram-node0 \ -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=64424509440,host-nodes=1,policy=preferred \ -numa node,nodeid=1,cpus=30-59,memdev=ram-node1 \ -smp 60,sockets=2,cores=30,threads=1 \ -bios /usr/share/AAVMF/AAVMF_CODE.fd \ -drive file=centos8.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \ -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -net none -serial telnet::9001,server,nowait -monitor stdio Guest kernel boot timeout log sample: =============================== [ 0.017636] smp: Bringing up secondary CPUs ... [ 1.039073] CPU1: failed to come online [ 1.039486] CPU1: failed in unknown state : 0x0 [ 1.052547] Detected VIPT I-cache on CPU2 [ 1.052563] GICv3: CPU2: found redistributor 2 region 0:0x00000000080e0000 [ 1.052708] GICv3: CPU2: using allocated LPI pending table @0x0000000f006a0000 [ 1.052729] CPU2: Booted secondary processor 0x0000000001 [0x481fd010] [ 2.079846] CPU3: failed to come online [ 2.082698] CPU3: failed in unknown state : 0x0 [ 2.086925] Detected VIPT I-cache on CPU4 [ 2.086947] GICv3: CPU4: found redistributor 4 region 0:0x0000000008120000 [ 2.087232] GICv3: CPU4: using allocated LPI pending table @0x0000000f006c0000 [ 2.087259] CPU4: Booted secondary processor 0x0000000002 [0x481fd010] [ 3.120634] CPU5: failed to come online [ 3.121691] Detected VIPT I-cache on CPU5 [ 3.121754] CPU5: failed in unknown state : 0x0 [ 3.121774] GICv3: CPU5: found redistributor 5 region 0:0x0000000008140000 Linux perf output (host side) while guest kernel booting: ============================================== - 19.75% worker [kernel.kallsyms] [k] queued_spin_lock_slowpath - queued_spin_lock_slowpath - 12.98% kvm_mmu_notifier_invalidate_range_end __mmu_notifier_invalidate_range_end zap_page_range __arm64_sys_madvise el0_svc_common.constprop.3 el0_svc_handler el0_svc __madvise tcmalloc::PageHeap::DecommitSpan tcmalloc::PageHeap::MergeIntoFreeList tcmalloc::PageHeap::Delete tcmalloc::CentralFreeList::ReleaseToSpans tcmalloc::CentralFreeList::ReleaseListToSpans tcmalloc::CentralFreeList::InsertRange tcmalloc::ThreadCache::ReleaseToCentralCache tcmalloc::ThreadCache::Cleanup tcmalloc::ThreadCache::DeleteCache __nptl_deallocate_tsd start_thread thread_start - 6.77% kvm_mmu_notifier_invalidate_range_start __mmu_notifier_invalidate_range_start zap_page_range __arm64_sys_madvise el0_svc_common.constprop.3 el0_svc_handler el0_svc __madvise tcmalloc::PageHeap::DecommitSpan tcmalloc::PageHeap::MergeIntoFreeList tcmalloc::PageHeap::Delete tcmalloc::CentralFreeList::ReleaseToSpans tcmalloc::CentralFreeList::ReleaseListToSpans tcmalloc::CentralFreeList::InsertRange tcmalloc::ThreadCache::ReleaseToCentralCache tcmalloc::ThreadCache::Cleanup tcmalloc::ThreadCache::DeleteCache __nptl_deallocate_tsd start_thread thread_start Thanks, Chengdong