@alex I was thinking along the same lines. I actually started with a very highly tuned setup, but for debugging I’m back to the most generic configuration I can get: as many defaults as possible, with no tuning or anything. I even recently redid the whole host from scratch to make sure I didn’t have any weird modprobe configs lying around.
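For what it’s worth, this is roughly how I sanity-check that nothing is left over (these are the standard Ubuntu locations; adjust if yours differ):

```shell
# List any leftover module-option files (vfio, kvm, blacklists, etc.)
find /etc/modprobe.d -name '*.conf' -exec grep -H . {} + 2>/dev/null

# Show the kernel command line the host actually booted with
cat /proc/cmdline
```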
As far as overlapping cores: I have the Skylake i7, so 8 logical cores total for the host. One VM runs 4 virtual cores and the other only 3, so I’m actually under-committed by 1 core. The nohz thing is really interesting… I have tried both the lowlatency and generic Ubuntu kernels; I can’t remember entirely, but I think the lowlatency kernel didn’t crash as much. I’ll have to confirm.

On Wed, May 18, 2016 at 9:35 AM Colin Godsey <[email protected]> wrote:

> I’ve been running as much monitoring as possible these last few crashes;
> thankfully the SSH sessions lock up too, so I can see the last stats.
>
> top: looks totally normal when it crashes, maybe 60% CPU util;
> swap/cache/sys all look normal.
> context switches: seem mostly normal, a total of maybe ~4k voluntary, ~300
> non-voluntary.
> disk usage: crazy up and down constantly… I use ZFS for the VMs, which I’m
> not entirely ruling out yet, but I think if anything it may contribute to
> power fluctuations via the disks (4 magnetic total). The entire VM host is
> on its own regular ext4 drive though, so hoping that helps rule out ZFS
> kernel/software issues.
> interrupts: normal
>
>
> On Wed, May 18, 2016 at 9:24 AM Brett Peckinpaugh <[email protected]>
> wrote:
>
>> Are you monitoring processor utilization? Two systems like you describe
>> could tax a host. Maybe it is CPU starvation?
>>
>> On May 18, 2016 7:47:11 AM PDT, Colin Godsey <[email protected]> wrote:
>>
>>> I’ve been running a dual gaming VM rig (2x dedicated GPU) for a little
>>> bit now, and everything works perfectly except when both VMs are under
>>> load: after an hour or so I get a hard crash and/or reboot. It will
>>> either reboot itself, or hang so badly that the physical ‘reset’ button
>>> on the box doesn’t work.
>>>
>>> There is zero evidence in the Linux logs about the crash. I literally
>>> just see one of a few standard cron jobs in the syslog, then the next
>>> line is the kernel boot/start-up. The only real evidence I get is that,
>>> rarely, I can hear Windows crash first. Or Windows will crash and I’ll
>>> get maybe another second or two of ‘top’ before the whole system goes
>>> down. I find it extremely odd that there’s some sort of (albeit fast)
>>> degradation, but absolutely nothing interesting in the logs.
>>>
>>> So, I’m pretty sure it’s something hardware related: either the PSU, or
>>> my mobo is crap and is underpowered somewhere. During load there are
>>> about 5 drives, 2 GTX GPUs, and GbE (~200 Mbps) all under constant load,
>>> so it seems likely it could be something chipset related.
>>>
>>> *So my question is really: is there ANY kind of kernel/vfio
>>> software-level issue that could cause this crash? Or does this just
>>> sound like hardware?* I’ve tried several different power configurations
>>> at this point; I just want to be as sure as possible it’s hardware
>>> before I start replacing more things =\
>>>
>>> This is an up-to-date Ubuntu Xenial, not really running anything
>>> special. I’ve gotten away with running my VMs almost as pure as
>>> possible, no funny workarounds or anything. OVMF, Windows 10, Hyper-V
>>> flags. Skylake i7 on a Z170M.
>>>
>>> ------------------------------
>>>
>>> vfio-users mailing list
>>> [email protected]
>>> https://www.redhat.com/mailman/listinfo/vfio-users
>>>
>>>
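P.S. For reference, the core allocation I mentioned can be sanity-checked with something like the following ("win1"/"win2" are placeholder libvirt domain names, not my actual ones):

```shell
# Count host logical CPUs: should be 8 on a 4c/8t Skylake i7
nproc
grep -c '^processor' /proc/cpuinfo

# With libvirt, show which physical CPUs each guest vCPU may run on
if command -v virsh >/dev/null; then
    virsh vcpupin win1 || true   # placeholder domain names
    virsh vcpupin win2 || true
fi
```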
_______________________________________________
vfio-users mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/vfio-users
