Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
[Adding George, as scheduler maintainer, and Juergen, as he commented later in this thread]
[Adding xen-users back, as the thread originated from there... sorry for cross-posting]

On Mon, 2020-02-17 at 11:58 -0800, Sarah Newman wrote:
> If there are no merged or proposed fixes soon, it may be worth
> considering making the credit scheduler the default again until
> problems with the credit2 scheduler are resolved.
>
Just as a heads up, I finally --thanks to Jim Fehlig-- found a machine where I could reproduce (something like) this.

I've been able to do some analysis of the situation. Basically, on the server I'm using, I do not see stalls severe enough to cause NMI/watchdogs to fire, but I do see, during boot, some preliminary signs of that.

I checked what was happening in Xen at that point in time ('r' debug-key, which dumps the scheduler's data structures), and I found out that there is a vCPU kind of stuck in a runqueue. In fact, the vCPU is in there, i.e., it is ready to run *but* not running, despite there being plenty of idle pCPUs that could possibly run it.

The reason why it's not being picked up is that its credits are lower than those of the idle vCPU.

I have a theory about how it got into such a situation and, if I'm right, a draft of an idea of how to fix this.

We're using this bug, that Glen kindly created, to track this issue:

https://bugzilla.opensuse.org/show_bug.cgi?id=1165206#c3

But of course I'll keep upstream MLs updated as well.

Stay tuned. :-)

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
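The symptom described in this message (a runnable vCPU that never gets picked because its credits are below those of the idle vCPU) can be illustrated with a toy model. This is only a sketch and not Xen's actual Credit2 code: the `Vcpu` class, the `pick_next` helper and all credit values are invented for the example, and the selection rule is deliberately simplified to "highest credit wins".

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    name: str
    credit: int            # hypothetical credit value, not a real Xen number
    is_idle: bool = False


def pick_next(runqueue, idle):
    """Toy rule: run the queued vCPU with the most credit, but only if it
    has more credit than the idle vCPU; otherwise the pCPU stays idle."""
    best = max(runqueue, key=lambda v: v.credit, default=None)
    if best is not None and best.credit > idle.credit:
        return best
    return idle


# Illustrative values only.
idle = Vcpu("idle", credit=-100, is_idle=True)
stuck = Vcpu("d1v0", credit=-500)     # runnable, but credit below idle's
healthy = Vcpu("d1v1", credit=200)

print(pick_next([stuck], idle).name)    # the idle vCPU wins; d1v0 keeps waiting
print(pick_next([healthy], idle).name)  # a vCPU with enough credit is picked
```

In this model the queued vCPU stays in the runqueue for as long as nothing raises its credit above the idle vCPU's, which matches the "ready to run but not running" state seen in the 'r' debug-key dump.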
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
On Mon, 2020-02-17 at 11:58 -0800, Sarah Newman wrote:
> On 1/7/20 6:25 AM, Alastair Browne wrote:
> >
> > After the tests, we decided to stick with 4.9.0.9 kernel and 4.12
> > Xen for production use running credit1 as the default scheduler.
>
> One person CC'ed appears to be having the same experience, where the
> credit2 scheduler leads to lockups (in this case in the domU, not the
> dom0) under relatively heavy load. It seems possible they may have
> the same root cause.
>
Yeah, well, if booting with `sched=credit` makes the problem disappear, then whatever the root cause really is, it seems related to Credit2.

> I don't think there are, but have there been any patches since the
> 4.13.0 release which might have fixed problems with the credit2
> scheduler? If not, what would the next step be to isolate the
> problem - a debug build of Xen or something else?
>
Yes, having a debug build of Xen running and providing, for instance, the info that Juergen is asking for later in this thread, i.e.:

 xl vcpu-list
 /usr/lib/xen/bin/xenctx -C -S -s

And I'd add myself:

 xl debug-keys r ; xl dmesg

And, in general, hypervisor logs from when the problem occurs (I've gone through the threads, and I don't think I have seen any, but maybe I missed something?).

xentop is another way to have a look, from Dom0, at whether the vCPUs are busy (and, if so, which ones and how much).

> If there are no merged or proposed fixes soon, it may be worth
> considering making the credit scheduler the default again until
> problems with the credit2 scheduler are resolved.
>
Nothing similar to what is being described has happened in our testing (or we wouldn't have switched to Credit2, of course! :-D).

I will see about trying to reproduce this myself, but this may take a little while.

In the meantime, if you help us by sending more logs, we're happy to try diagnosing and fixing things.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
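For anyone wanting to capture the requested information in one go, the commands listed above could be wrapped roughly as follows. This is a minimal sketch under assumptions: `collect_diagnostics` and `run_command` are hypothetical helper names, the script is expected to run as root in dom0, and the xenctx invocation is left out on purpose because its arguments are elided in the archived message. Note that `xl debug-keys r` itself prints nothing useful; the scheduler dump lands in the hypervisor log, which is why `xl dmesg` is run right after it.

```python
import subprocess

# Commands quoted in the message above, in the order they should run.
# xenctx is omitted: its arguments are not recoverable from the archive.
DIAG_COMMANDS = [
    ["xl", "vcpu-list"],
    ["xl", "debug-keys", "r"],   # ask Xen to dump scheduler state to its log
    ["xl", "dmesg"],             # read the hypervisor log, including that dump
]


def run_command(argv):
    """Run one command and return its standard output as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect_diagnostics(run=run_command):
    """Run every diagnostic command in order; return {command line: output}."""
    report = {}
    for argv in DIAG_COMMANDS:
        report[" ".join(argv)] = run(argv)
    return report
```

Calling `collect_diagnostics()` on an affected host while the guest is hanging, and attaching the resulting text to a reply, would cover the first round of logs asked for here. The `run` parameter exists only so the function can be exercised without a Xen host.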
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
Juergen -

On Mon, Feb 17, 2020 at 10:51 PM Jürgen Groß wrote:
> > Any thoughts, insights or guidance would be greatly appreciated!
> Can you check whether all vcpus of a hanging guest are consuming time
> (via xl vcpu-list) ?
> It would be interesting to see where the vcpus are running around. Can
> you please copy the domU's /boot/System.map- to dom0
> and then issue:
> /usr/lib/xen/bin/xenctx -C -S -s
> This should give a backtrace for all vcpus of . To recognize a
> loop you should issue that multiple times.
> Juergen

I've applied the sched=credit boot option to all my production servers at this point, in preparation for a client cutover this weekend. Once that's done, I'm happy next week to reboot the old crashing server to credit2, and test. I'll save these directions and advise.

Thank you,
Glen
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
On 18.02.20 01:39, Glen wrote:
> [...]
> I believe this is why Sarah asked about patches to 4.13... it is
> looking to us just on the user level like this is possibly
> kernel-independent, but at least Xen-version-dependent, and likely
> credit-scheduler-dependent.
>
> I apologize if I should be doing something different here, but it is
> looking like a few more of us are having what we believe to be the
> same problem and, based only on what I've seen, I've already changed
> over all of my production hosts (I run about 20) to sched=credit as a
> precautionary measure.
>
> Any thoughts, insights or guidance would be greatly appreciated!

Can you check whether all vcpus of a hanging guest are consuming time (via xl vcpu-list) ?

It would be interesting to see where the vcpus are running around. Can you please copy the domU's /boot/System.map- to dom0 and then issue:

/usr/lib/xen/bin/xenctx -C -S -s

This should give a backtrace for all vcpus of . To recognize a loop you should issue that multiple times.

Juergen
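Checking whether the vCPUs of a guest are consuming time comes down to taking two `xl vcpu-list` snapshots a few seconds apart and comparing the Time(s) column. The sketch below assumes the usual whitespace-separated layout (Name, ID, VCPU, CPU, State, Time(s), Affinity) and domain names without spaces; `parse_vcpu_times` and `time_deltas` are hypothetical helper names, not part of any Xen tool.

```python
def parse_vcpu_times(text):
    """Map (domain name, vcpu number) -> cumulative CPU time in seconds,
    given the text printed by `xl vcpu-list`."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        name, _domid, vcpu, _cpu, _state, secs = fields[:6]
        try:
            times[(name, int(vcpu))] = float(secs)
        except ValueError:
            # Header row or an unexpected line: skip it.
            continue
    return times


def time_deltas(before, after):
    """CPU time consumed between two snapshots, per (domain, vcpu)."""
    return {
        key: round(after[key] - before[key], 3)
        for key in before
        if key in after
    }
```

A delta that stays at zero for a vCPU of the hanging guest means it is not being run at all during the interval, while a delta that grows at full speed means the vCPU is spinning; that is the case where repeated xenctx backtraces help to find the loop. A vCPU that is simply blocked and idle also shows a zero delta, so the State column still needs to be read alongside.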
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
Hello Sander -

If I might chime in, I'm also experiencing what we believe is the same problem, and hope I'm not breaking any protocol by sharing a few quick details...

On Mon, Feb 17, 2020 at 3:46 PM Sander Eikelenboom wrote:
> On 17/02/2020 20:58, Sarah Newman wrote:
> > On 1/7/20 6:25 AM, Alastair Browne wrote:
> >> So in conclusion, the tests indicate that credit2 might be unstable.
> >> For the time being, we are using credit as the chosen scheduler. We
> > I don't think there are, but have there been any patches since the 4.13.0
> > release which might have fixed problems with the credit2 scheduler? If not,
> > what would the next step be to isolate the problem - a debug build of Xen
> > or something else?
> > If there are no merged or proposed fixes soon, it may be worth considering
> > making the credit scheduler the default again until problems with the
> > credit2 scheduler are resolved.
> I did take a look at Alastair Browne's report you replied to
> (https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
> and I do see some differences:
> - Alastair's machine has multiple sockets, my machines don't.
> - It seems Alastair's config is using ballooning ?
>   (dom0_mem=4096M,max:16384M), for me that has been a source of trouble in the
>   past, so my configs don't.

My configuration has ballooning disabled, we do not use it, and we still have the problem.

> - kernels tested are quite old (4.19.67 (latest upstream is 4.19.104),
>   4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested
>   (5.4 is available in Debian backport for buster).
> - Alastair, are you using pv, hvm or pvh guests? The report seems to miss
>   the Guest configs (I'm primarily using PVH, and few HVM's, no PV except for
>   dom0) ?

The problem appears to occur for both HVM and PV guests.

A report by Tomas
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
provides his config for his HVM setup.

My initial report
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
contains my PV guest config.

> Anyhow, could be worthwhile to test without ballooning, and test a recent
> kernel to rule out an issue with (missing) kernel backports.

Thanks to guidance from Sarah, we've had lots of discussion on the users lists about this, especially this past week (pasting in https://lists.xenproject.org/archives/html/xen-users/2020-02/ just for your clicking convenience since I'm there as I type this) and it seems like we've been able to narrow things down a bit:

* Alastair's config is on very large machines. Tomas can duplicate this on a much smaller scale, and I can duplicate it on a single DomU running as the only guest on a Dom0 host. So overall host size/capacity doesn't seem to be very important, nor does number of guests on the host.

* I'm using the Linux 4.12.14 kernel on both host and guest with Xen 4.12.1. For me, just going back to a previous version of Xen (in my case to Xen 4.10) eliminates the problem. Tomas is on 4.14.159, and he reports that even moving back just to Xen 4.11 resolves his issue, whereas the issue seems to still exist in Xen 4.13. So changing Xen versions without changing kernel versions seems to resolve this.

* We've had another user mention that "When I switched to openSUSE Xen 4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests of all 'flavors' became *much* better behaved.", so we think maybe something in very recent Xen 4.13 might have helped (or possibly that latest kernel, although from our limited point of view the fact that changing Xen versions back to pre-4.12 solves this without any kernel changes seems compelling.)

* Tomas has already tested, and I am still testing, Xen 4.12 with just the sched=credit change. For him that has eliminated the problem as well. I am still stress-testing my guest under Xen 4.12 sched=credit, so I cannot report yet, but I am hopeful.

I believe this is why Sarah asked about patches to 4.13... it is looking to us just on the user level like this is possibly kernel-independent, but at least Xen-version-dependent, and likely credit-scheduler-dependent.

I apologize if I should be doing something different here, but it is looking like a few more of us are having what we believe to be the same problem and, based only on what I've seen, I've already changed over all of my production hosts (I run about 20) to sched=credit as a precautionary measure.

Any thoughts, insights or guidance would be greatly appreciated!

Respectfully,
Glen
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
On 17/02/2020 20:58, Sarah Newman wrote:
> On 1/7/20 6:25 AM, Alastair Browne wrote:
>>
>> CONCLUSION
>>
>> So in conclusion, the tests indicate that credit2 might be unstable.
>>
>> For the time being, we are using credit as the chosen scheduler. We
>> are booting the kernel with a parameter "sched=credit" to ensure that
>> the correct scheduler is used.
>>
>> After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
>> for production use running credit1 as the default scheduler.
>
> One person CC'ed appears to be having the same experience, where the credit2
> scheduler leads to lockups (in this case in the domU, not the dom0) under
> relatively heavy load. It seems possible they may have the same root cause.
>
> I don't think there are, but have there been any patches since the 4.13.0
> release which might have fixed problems with the credit2 scheduler? If not,
> what would the next step be to isolate the problem - a debug build of Xen
> or something else?
>
> If there are no merged or proposed fixes soon, it may be worth considering
> making the credit scheduler the default again until problems with the
> credit2 scheduler are resolved.
>
> Thanks, Sarah
>

Hi Sarah / Alastair,

I can only provide my n=1 (OK, I'm running a bunch of boxes, some of which are pretty over-committed CPU-wise), but I haven't seen any issues (lately) with credit2.

I did take a look at Alastair Browne's report you replied to (https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html) and I do see some differences:
- Alastair's machine has multiple sockets, my machines don't.
- It seems Alastair's config is using ballooning ? (dom0_mem=4096M,max:16384M), for me that has been a source of trouble in the past, so my configs don't.
- kernels tested are quite old (4.19.67 (latest upstream is 4.19.104), 4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested (5.4 is available in Debian backport for buster).
- Alastair, are you using pv, hvm or pvh guests? The report seems to miss the Guest configs (I'm primarily using PVH, and few HVM's, no PV except for dom0) ?

Anyhow, could be worthwhile to test without ballooning, and test a recent kernel to rule out an issue with (missing) kernel backports.

--
Sander
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
On 1/7/20 6:25 AM, Alastair Browne wrote:
>
> CONCLUSION
>
> So in conclusion, the tests indicate that credit2 might be unstable.
>
> For the time being, we are using credit as the chosen scheduler. We
> are booting the kernel with a parameter "sched=credit" to ensure that
> the correct scheduler is used.
>
> After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
> for production use running credit1 as the default scheduler.

One person CC'ed appears to be having the same experience, where the credit2 scheduler leads to lockups (in this case in the domU, not the dom0) under relatively heavy load. It seems possible they may have the same root cause.

I don't think there are, but have there been any patches since the 4.13.0 release which might have fixed problems with the credit2 scheduler? If not, what would the next step be to isolate the problem - a debug build of Xen or something else?

If there are no merged or proposed fixes soon, it may be worth considering making the credit scheduler the default again until problems with the credit2 scheduler are resolved.

Thanks, Sarah