On 17.09.2018 at 11:40, Jack Wang wrote:
> Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on Mon, Sep 17, 2018 at 9:00 AM:
>>
>> Hi,
>>
>> On 17.09.2018 at 08:38, Jack Wang wrote:
>>> Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on Sun, Sep 16, 2018 at 3:31 PM:
>>>>
>>>> Hello,
>>>>
>>>> While overcommitting CPU, I had several situations where all VMs went
>>>> offline while two VMs saturated all cores.
>>>>
>>>> I believed all VMs would stay online but would just not be able to
>>>> use all their cores?
>>>>
>>>> My original idea was to automate live migration on high host load to
>>>> move VMs to another node, but that only makes sense if all VMs stay
>>>> online.
>>>>
>>>> Is this expected? Is anything special needed to achieve this?
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>> Hi Stefan,
>>>
>>> Do you have any logs from when all VMs went offline?
>>> Maybe the OOM killer played a role there?
>>
>> After reviewing, I think this is memory related, but OOM did not play
>> a role. All kvm processes were spinning, trying to get > 100% CPU, and
>> I was not even able to log in via ssh. After 5-10 minutes I was able
>> to log in.
>
> So the VMs are not really offline. What is the result if you run
> query-status via QMP?

I can't, as I can't connect to the host in that state.

>> There were about 150GB of free memory.
>>
>> Relevant settings (no local storage involved):
>> vm.dirty_background_ratio: 3
>> vm.dirty_ratio: 10
>> vm.min_free_kbytes: 10567004
>>
>> # cat /sys/kernel/mm/transparent_hugepage/defrag
>> always defer [defer+madvise] madvise never
>>
>> # cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>>
>> After that I had the following traces on the host node:
>> https://pastebin.com/raw/0VhyQmAv
>
> The call trace looks like a Ceph-related deadlock or hang.

Yes, but I can also show you traces where nothing from Ceph is involved;
the only thing they have in common is that they begin in page_fault.

>> Thanks!
>>
>> Greets,
>> Stefan
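For reference, Jack's suggestion to run query-status via QMP can be sketched as below. This is a minimal example, assuming the VM was started with a QMP unix socket (the socket path here is hypothetical, e.g. via `-qmp unix:/var/run/qemu-vm1.qmp,server,nowait`); it is a sketch, not a hardened client:

```python
import json
import socket

# Hypothetical QMP socket path; adjust to match the -qmp option
# the VM was actually started with.
QMP_SOCKET = "/var/run/qemu-vm1.qmp"

def qmp_command(name):
    """Serialize a QMP command as a newline-terminated JSON line."""
    return json.dumps({"execute": name}) + "\n"

def query_status(path=QMP_SOCKET):
    """Connect to a QMP socket, negotiate capabilities, and ask for VM status."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        f = s.makefile("rw")
        f.readline()                          # QMP greeting banner
        f.write(qmp_command("qmp_capabilities"))
        f.flush()
        f.readline()                          # capabilities acknowledgement
        f.write(qmp_command("query-status"))
        f.flush()
        reply = json.loads(f.readline())
        # A responsive guest answers e.g. {"return": {"status": "running", ...}}
        return reply.get("return", {}).get("status")
```

Of course, as noted above, this only helps if the host is reachable at all while the VMs are spinning.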