Re: [Xen-devel] [PATCH] xen: add steal_clock support on x86
On Wed, May 18, 2016 at 11:59 AM, Tony S <suokuns...@gmail.com> wrote:
> On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
> <boris.ostrov...@oracle.com> wrote:
>> On 05/18/2016 12:10 PM, Dario Faggioli wrote:
>>> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>
>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>> steal_account_process_tick()->steal_clock() and a second time from
>>>>> do_stolen_accounting()?
>>>> Uuh, yes.
>>>>
>>>> I guess I should rip do_stolen_accounting() out, too? It is a
>>>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>>>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>>>
>>> So, config options aside, if I understand this correctly, it looks like
>>> we were actually already doing steal time accounting, although in a
>>> non-standard way.
>>>
>>> And yet, people seem to have issues relating to lack of (proper?) steal
>>> time accounting (Cc-ing Tony).
>>>
>>> I guess this means that either:
>>> - the issue being reported is actually not caused by the lack of
>>>   steal time accounting,
>>> - our current (Xen-specific) steal time accounting solution is flawed,
>>> - the issue is caused by the lack of the bit of steal time accounting
>>>   that we do not support yet,
>>
>> I believe it's this one.
>>
>> Tony narrowed the problem down to update_curr(), where vruntime is
>> calculated based on the runqueue's clock_task value. That value is
>> computed in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
>>
>
> Hi Boris,
>
> You are right.
>
> The real problem is that steal_clock in pv_time_ops is implemented in KVM
> but not in Xen.
>
> arch/x86/include/asm/paravirt_types.h
>     struct pv_time_ops {
>         unsigned long long (*sched_clock)(void);
>         unsigned long long (*steal_clock)(int cpu);
>         unsigned long (*get_tsc_khz)(void);
>     };
>
> (1) KVM implements both sched_clock and steal_clock.
>
> arch/x86/kernel/kvmclock.c
>     pv_time_ops.sched_clock = kvm_clock_read;
>
> arch/x86/kernel/kvm.c
>     pv_time_ops.steal_clock = kvm_steal_clock;
>
> (2) However, Xen only implements sched_clock, while steal_clock is
> still native_steal_clock(). The function native_steal_clock() simply
> returns 0.
>
> arch/x86/xen/time.c
>     .sched_clock = xen_clocksource_read;
>
> arch/x86/kernel/paravirt.c
>     static u64 native_steal_clock(int cpu)
>     {
>         return 0;
>     }
>
> Therefore, even though update_rq_clock_task() calculates the value and
> the paravirt_steal_rq_enabled option is enabled, the steal value just
> returns 0. This causes the problem I mentioned.
>
> update_rq_clock_task
>   --> paravirt_steal_clock
>     --> pv_time_ops.steal_clock
>       --> native_steal_clock (if in Xen)
>         --> 0
>
> The fundamental solution is to implement a steal_clock in Xen (learning
> from the KVM implementation) instead of using the native one.
>
> Tony

Also, I tried the latest long-term version, Linux 4.4, and this issue still
exists there. Hoping the next version can add this patch.

Tony

>> -boris
>>
>>> - other ideas? Tony?
>>>
>>> Dario
>>
>> ___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
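The call chain Tony traces can be mimicked in a standalone userspace sketch. This is illustrative only, not the real kernel or Xen code: `fake_xen_steal_clock` and its numbers are invented stand-ins, showing how `update_rq_clock_task()`-style accounting discounts stolen time, and why a `steal_clock` hook that always returns 0 leaves `clock_task` overstating guest runtime.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* The fallback Xen is left with today: stolen time is always reported as 0. */
static u64 native_steal_clock(int cpu) { (void)cpu; return 0; }

/* Hypothetical replacement: pretend the hypervisor reports 30 ns stolen
 * since the last update (a real implementation would read per-vCPU
 * runstate data provided by the hypervisor). */
static u64 fake_xen_steal_clock(int cpu) { (void)cpu; return 30; }

/* Sketch of the accounting in update_rq_clock_task(): the task clock
 * only advances by the wall-clock delta minus the stolen part. */
static u64 advance_clock_task(u64 clock_task, u64 delta,
                              u64 (*steal_clock)(int cpu))
{
    u64 steal = steal_clock(0);
    if (steal > delta)  /* clamp: cannot steal more than elapsed */
        steal = delta;
    return clock_task + (delta - steal);
}
```

With the native hook, 100 ns of wall time advances `clock_task` by the full 100 ns even if the vCPU was preempted for part of it; with a working `steal_clock`, the stolen portion is discounted, which is exactly what vruntime needs.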
Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen
On Wed, May 18, 2016 at 8:57 AM, Dario Faggioli <dario.faggi...@citrix.com> wrote:
> On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
>> On 17/05/16 11:33, George Dunlap wrote:
>> > > Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
>> > > process times. KVM uses it but Xen doesn't.
>> > Is someone on the Linux side going to put this on their to-do list
>> > then? :-)
>>
>> Patch sent.
>>
> Yep, seen it, thanks.
>
>> Support was already existing for arm.
>>
> Yes!! I remember Stefano talking about introducing it, and that was
> also why I thought we had it already since a long time on x86.
>
> Well, anyway... :-)
>
>> What is missing is support for paravirt_steal_rq_enabled, which
>> requires being able to read the stolen time of another cpu. This
>> can't work today, as accessing another cpu's vcpu_runstate_info isn't
>> possible without risking inconsistent data. I plan to add support for
>> this, too, but this will require adding another hypercall to map a
>> modified vcpu_runstate_info containing an indicator for an ongoing
>> update of the data.
>>
> Understood.
>
> So, Tony, up for trying your workload again with this patch applied to
> Linux?
>
> Most likely, it _won't_ fix all the problems you're seeing, but I'm
> curious to see if it helps.

Hi Dario,

I did not see the patch. Can you please send me the patch, and I will
try to test it later.

Best,
Tony

> Thanks again and Regards,
> Dario
> --
> <> (Raistlin Majere)
> -
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--
Tony. S
Ph.D student, University of Colorado, Colorado Springs
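The consistency problem Juergen mentions (reading another cpu's runstate while the hypervisor may be updating it) is the kind of thing usually solved with a sequence counter. A purely illustrative sketch follows; the struct and field names are invented, not the actual hypercall interface: the writer makes the counter odd while an update is in flight, and the reader retries until it sees a stable, even value.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Invented stand-in for a runstate area shared with the hypervisor;
 * "seq" is the update-in-progress indicator discussed in the thread. */
struct runstate_area {
    volatile u64 seq;   /* odd while an update is in flight */
    u64 stolen_ns;
};

/* Writer side (the hypervisor, in the real design). */
static void runstate_update(struct runstate_area *r, u64 stolen_ns)
{
    r->seq++;               /* now odd: update in progress */
    r->stolen_ns = stolen_ns;
    r->seq++;               /* even again: data is consistent */
}

/* Reader side: retry until the sequence is even and unchanged across
 * the read, guaranteeing a consistent snapshot. */
static u64 runstate_read_stolen(const struct runstate_area *r)
{
    u64 seq, stolen;
    do {
        seq = r->seq;
        stolen = r->stolen_ns;
    } while ((seq & 1) || seq != r->seq);
    return stolen;
}
```

A real implementation would additionally need memory barriers between the counter and data accesses; they are omitted here for brevity.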
Re: [Xen-devel] [BUG] Bugs existing in Xen's credit scheduler cause long tail latency issues
On Tue, May 17, 2016 at 3:27 AM, George Dunlap <dunl...@umich.edu> wrote:
> On Sun, May 15, 2016 at 5:11 AM, Tony S <suokuns...@gmail.com> wrote:
>> Hi all,
>>
>> When I was running latency-sensitive applications in VMs on Xen, I
>> found some bugs in the credit scheduler which cause long tail
>> latency in I/O-intensive VMs.
>>
>> (1) Problem description
>>
>> Description
>> My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0 (Linux
>> 3.18.21), Dom U (Linux 3.18.21).
>>
>> Environment setup:
>> We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
>> physical CPU core. One VM (denoted as I/O-VM) ran the Sockperf server
>> program; the other VM (denoted as CPU-VM) ran a compute-bound task,
>> e.g., SPEC CPU 2006 or simply a loop. A client on another physical
>> machine sent UDP requests to the I/O-VM.
>>
>> Here are my tail latency results (microseconds):
>>
>> Case    Avg     90%     99%     99.9%   99.99%
>> #1      108     114     128     129     130
>> #2      7811    13892   14874   15315   16383
>> #3      943     131     21755   26453   26553
>> #4      116     96      105     8217    13472
>> #5      116     117     129     131     132
>>
>> Bugs 1, 2, and 3 will be discussed below.
>>
>> Case #1:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> idling (no processes running).
>>
>> Case #2:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0.
>>
>> Case #3:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 fixed.
>>
>> Case #4:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 fixed.
>>
>> Case #5:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 & 3 fixed.
>>
>> ---
>>
>> (2) Problem analysis
>
> Hey Tony,
>
> Thanks for looking at this. These issues in the credit1 algorithm are
> essentially exactly the reason that I started work on the credit2
> scheduler several years ago. We meant credit2 to have replaced
> credit1 by now, but we ran out of time to test it properly; we're in
> the process of doing that right now, and are hoping it will be the
> default scheduler for the 4.8 release.
>
> So if I could make two suggestions that would help your effort be more
> helpful to us:
>
> 1. Use cpupools for testing rather than pinning. A lot of the
> algorithms are designed with the assumption that they have all the
> cpus to run on, and the credit allocation / priority algorithms fail
> to work properly when they are only pinned. Cpupools was specifically
> designed to allow the scheduler algorithms to work as designed with a
> smaller number of cpus than the system has.
>
> 2. Test credit2. :-)

Hi George,

Thank you for your reply. I will try cpupools and credit2 later. :-)

> One comment about your analysis here...
>
>> [Bug 2]: In csched_acct() (by default every 30ms), a VCPU stops earning
>> credits and is removed from the active CPU list (in
>> __csched_vcpu_acct_stop_locked) if its credit is larger than the upper
>> bound. Because the domain has only one VCPU, the VM will also be
>> removed from the active domain list.
>>
>> Every 10ms, csched_tick() --> csched_vcpu_acct() -->
>> __csched_vcpu_acct_start() will be executed and tries to put inactive
>> VCPUs back on the active list. However, __csched_vcpu_acct_start()
>> will only put the current VCPU back on the active list. If an
>> I/O-bound VCPU is not the current VCPU at the csched_tick(), it will
>> not be put back on the active VCPU list. If so, the I/O-bound VCPU
>> will likely miss the next credit refill in csched_acct() and can
>> easily enter the OVER state. As such, the I/O-bound VM will be unable
>> to be boosted and will have very long latency. It takes at least one
>> time slice (e.g., 30ms) before the I/O VM is activated and starts to
>> receive credits.
>>
>> [Possible Solution] Try to activate any inactive VCPUs back to active
>> before the next credit refill, instead of just the current VCPU.
Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen
On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli <dario.faggi...@citrix.com> wrote:
> [Adding George again, and a few Linux/Xen folks]
>
> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>> In virtualized environments, sometimes we need to limit the CPU
>> resources of a virtual machine (VM). For example in Xen, we use
>> $ xl sched-credit -d 1 -c 50
>> to limit the CPU resource of dom 1 to half of one physical CPU core.
>> If the VM's CPU resource is capped, the processes inside the VM will
>> have a vruntime accounting problem. Here, I report my findings about
>> the Linux process scheduler under the above scenario.
>>
> Thanks for this other report as well. :-)
>
> All you say makes sense to me, and I will think about it. I'm not sure
> about one thing, though...
>

Hi Dario,

Thank you for your reply.

>> Description
>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>> The variable delta_exec is the difference between when a process
>> starts and stops running on a CPU. This works well on a physical
>> machine. However, in a virtual machine under capped resources, some
>> processes might be accounted an inaccurate vruntime.
>>
>> For example, suppose we have a VM which has one vCPU and is capped to
>> have at most 50% of a physical CPU. When process A inside the VM
>> starts running and the CPU resource of that VM runs out, the VM will
>> be paused. The next round, when the VM is allocated new CPU resource
>> and starts running again, process A stops running and is put back on
>> the runqueue. The delta_exec of process A is accounted as its "real
>> execution time" plus the paused time of its VM. That will make the
>> vruntime of process A much larger than it should be, and process A
>> will not be scheduled again for a long time, until the vruntimes of
>> other processes catch up with it.
>> ---
>>
>> Analysis
>> When a process stops running and is about to be put back on the
>> runqueue, update_curr() will be executed.
>> [src/kernel/sched/fair.c]
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>>     ... ...
>>     delta_exec = now - curr->exec_start;
>>     ... ...
>>     curr->exec_start = now;
>>     ... ...
>>     curr->sum_exec_runtime += delta_exec;
>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>     update_min_vruntime(cfs_rq);
>>     ... ...
>> }
>>
>> "now" --> the current time
>> "exec_start" --> the time when the current process was put on the CPU
>> "delta_exec" --> the time difference between when a process starts
>> and stops running on the CPU
>>
>> When a process starts running before its VM is paused and the process
>> stops running after its VM is unpaused, delta_exec will include the
>> VM suspend time, which is pretty large compared to the real execution
>> time of a process.
>>
> ... but would that also apply to a VM that is not scheduled --just
> because of pCPU contention, not because it was paused-- for some time?
>

Thanks for your suggestion. I have tried to see whether this issue
exists with pCPU sharing today. Unfortunately, I found this issue was
there, not only for the capping case, but also for the pCPU sharing case.

In both of the above cases, the process vruntime accounting in the guest
OS has a "vruntime jump", which might cause the victim process to have
poor and unpredictable performance.

In the cloud, from my point of view, a VM exists in three scenarios:
1. dedicated hardware (in this case, VM = physical machine);
2. part of dedicated hardware (using capping, like an Amazon EC2 t2.small
instance);
3. sharing with other VMs on the same hardware.

Both case #2 and case #3 will be influenced by the issue I mentioned.

> Isn't there anything in place in Xen or Linux (the latter being better
> suited for something like this, IMHO) to compensate for that?
>

No, I do not think so. I think this is a bug in the Linux kernel under
virtualization (the VMM platform is Xen).

> I have to admit I haven't really ever checked myself, maybe either
> George or our Linux people know more?

The issue behind it is that the process execution time calculation (e.g.,
delta_exec) in a virtualized environment should not be done the way it is
done in a physical environment. Here are two solutions to fix it:

1) Based on the vcpu->runstate.time (running/runnable/blocked/offline)
changes, determine how much time the process on this VCPU is actually
running, instead of just "delta_exec = now - curr->exec_start";
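Tony's first proposed fix can be sketched in miniature. This is a hypothetical illustration, not the real patch: the snapshot struct merely mirrors the spirit of Xen's per-vCPU runstate times, and the point is that delta_exec is derived only from the growth of the vCPU's *running* time, so time the vCPU spent runnable, blocked, or offline never inflates a process's vruntime.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Invented snapshot of per-vCPU runstate time, in the spirit of Xen's
 * vcpu_runstate_info: time[0] = running, time[1] = runnable,
 * time[2] = blocked, time[3] = offline. */
struct runstate_snap {
    u64 time[4];
};

/* Instead of "delta_exec = now - curr->exec_start", charge the process
 * only with how much the vCPU's running time grew while the process
 * was on the (virtual) CPU. */
static u64 delta_exec_from_runstate(const struct runstate_snap *at_start,
                                    const struct runstate_snap *at_stop)
{
    return at_stop->time[0] - at_start->time[0];
}
```

If the vCPU ran for 40 units but sat runnable for another 500 while its VM was descheduled, the wall-clock delta would be 540, whereas the runstate-based delta charges only the 40 units actually executed.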
Re: [Xen-devel] [BUG] Bugs existing in Xen's credit scheduler cause long tail latency issues
On Mon, May 16, 2016 at 5:30 AM, Dario Faggioli <dario.faggi...@citrix.com> wrote:
> [Adding George, and avoiding trimming, for his benefit]
>
> On Sat, 2016-05-14 at 22:11 -0600, Tony S wrote:
>> Hi all,
>>
> Hi Tony,
>
>> When I was running latency-sensitive applications in VMs on Xen, I
>> found some bugs in the credit scheduler which cause long tail
>> latency in I/O-intensive VMs.
>>
> Ok, first of all, thanks for looking into and reporting this.
>
> This is certainly something we need to think about... For now, just a
> couple of questions.

Hi Dario,

Thank you for your reply. :-)

>> (1) Problem description
>>
>> Description
>> My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0 (Linux
>> 3.18.21), Dom U (Linux 3.18.21).
>>
>> Environment setup:
>> We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
>> physical CPU core. One VM (denoted as I/O-VM) ran the Sockperf server
>> program; the other VM (denoted as CPU-VM) ran a compute-bound task,
>> e.g., SPEC CPU 2006 or simply a loop. A client on another physical
>> machine sent UDP requests to the I/O-VM.
>>
> So, just to be sure I've understood: you have 2 VMs, each with 1 vCPU,
> *both* pinned on the *same* pCPU, is this the case?
>

Yes.

>> Here are my tail latency results (microseconds):
>>
>> Case    Avg     90%     99%     99.9%   99.99%
>> #1      108     114     128     129     130
>> #2      7811    13892   14874   15315   16383
>> #3      943     131     21755   26453   26553
>> #4      116     96      105     8217    13472
>> #5      116     117     129     131     132
>>
>> Bugs 1, 2, and 3 will be discussed below.
>>
>> Case #1:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> idling (no processes running).
>>
>> Case #2:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0.
>>
>> Case #3:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 fixed.
>>
>> Case #4:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 fixed.
>>
>> Case #5:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 & 3 fixed.
>>
>> ---
>>
>> (2) Problem analysis
>>
>> Analysis
>>
>> [Bug 1]: The VCPU that ran a CPU-intensive workload could be mistakenly
>> boosted due to CPU affinity.
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html
>>
>> We have already discussed this bug and a potential patch in the above
>> link. Although the discussed patch improved the tail latency, i.e.,
>> reducing the 90th percentile latency, the long tail latency is still
>> not bounded. Next, we discuss two new bugs that inflict latency hikes
>> at the very far end of the tail.
>>
> Right, and there is a fix upstream for this. It's not the patch you
> proposed in the thread linked above, but it should have had the same
> effect.
>
> Can you perhaps try something more recent than 4.5 (4.7-rc would be
> great) and confirm that the numbers still look similar?

I have tried the latest stable version, Xen 4.6, today. Here are my
results:

Case    Avg     90%     99%      99.9%   99.99%
#1      91      93      101      105     110
#2      22506   43011   231946   259501  265561
#3      917     95      25257    30048   30756
#4      110     95      102      12448   13255
#5      114     118     130      134     136

It seems that case #2 is much worse. The other cases are similar. My raw
latency data is pasted below. For Xen 4.7-rc, I have some installation
issues on my machine, so I have not tried that yet.

Raw data is as follows. Hope this could help you understand the issues
better. :-)

# case 1:
sockperf: > avg-lat= 91.688 (std-dev=2.950)
sockperf: ---> observation = 110.647
sockperf: ---> percentile 99.99 = 110.647
sockperf: ---> percentile 99.90 = 105.242
so
[Xen-devel] [BUG] Linux process vruntime accounting in Xen
In virtualized environments, sometimes we need to limit the CPU resources
of a virtual machine (VM). For example in Xen, we use

$ xl sched-credit -d 1 -c 50

to limit the CPU resource of dom 1 to half of one physical CPU core. If
the VM's CPU resource is capped, the processes inside the VM will have a
vruntime accounting problem. Here, I report my findings about the Linux
process scheduler under the above scenario.

Description
Linux CFS relies on delta_exec to charge the vruntime of processes. The
variable delta_exec is the difference between when a process starts and
stops running on a CPU. This works well on a physical machine. However,
in a virtual machine under capped resources, some processes might be
accounted an inaccurate vruntime.

For example, suppose we have a VM which has one vCPU and is capped to
have at most 50% of a physical CPU. When process A inside the VM starts
running and the CPU resource of that VM runs out, the VM will be paused.
The next round, when the VM is allocated new CPU resource and starts
running again, process A stops running and is put back on the runqueue.
The delta_exec of process A is accounted as its "real execution time"
plus the paused time of its VM. That will make the vruntime of process A
much larger than it should be, and process A will not be scheduled again
for a long time, until the vruntimes of other processes catch up with it.

---

Analysis
When a process stops running and is about to be put back on the runqueue,
update_curr() will be executed.
[src/kernel/sched/fair.c]

static void update_curr(struct cfs_rq *cfs_rq)
{
    ... ...
    delta_exec = now - curr->exec_start;
    ... ...
    curr->exec_start = now;
    ... ...
    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_min_vruntime(cfs_rq);
    ... ...
}

"now" --> the current time
"exec_start" --> the time when the current process was put on the CPU
"delta_exec" --> the time difference between when a process starts and
stops running on the CPU

When a process starts running before its VM is paused and the process
stops running after its VM is unpaused, delta_exec will include the VM
suspend time, which is pretty large compared to the real execution time
of a process.

This issue will do great performance harm to the victim process. If the
process is an I/O-bound workload, its throughput and latency will be
affected. If the process is a CPU-bound workload, this issue will make
its vruntime "unfair" compared to other processes under CFS. Because the
CPU resources of some VM types in the cloud are limited as described
above (like the Amazon EC2 t2.small instance), I suspect this will also
harm the performance of public cloud instances.

---

My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0 (Linux
3.18.21), Dom U (Linux 3.18.21). I also tested the longterm version Linux
3.18.30 and the latest longterm version, Linux 4.4.7. Those kernels all
have this issue.

Please confirm this bug. Thanks.

--
Tony. S
Ph.D student, University of Colorado, Colorado Springs
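The "vruntime jump" described above can be reproduced with a toy calculation. This sketch is illustrative only, with invented numbers: at nice-0 weight, calc_delta_fair() is effectively the identity, so charging a delta_exec that includes a cap-induced pause sends process A's vruntime far past what its real execution warrants.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Toy model of the vruntime charge in update_curr(); at nice-0 weight,
 * calc_delta_fair() maps delta_exec to itself. */
static u64 charge(u64 vruntime, u64 delta_exec)
{
    return vruntime + delta_exec;
}

/* Process A really runs 5 ms, but its VM is paused for 500 ms in the
 * middle of the slice; "delta_exec = now - exec_start" sees 505 ms. */
static u64 buggy_vruntime(void)
{
    return charge(0, 5 + 500);
}

/* What should be charged: only the real execution time. */
static u64 fair_vruntime(void)
{
    return charge(0, 5);
}
```

Under CFS, process A would then have to wait until its runqueue siblings accumulate roughly 500 ms of extra vruntime before being picked again, matching the long delay the report describes.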
[Xen-devel] [BUG] Bugs existing in Xen's credit scheduler cause long tail latency issues
Hi all,

When I was running latency-sensitive applications in VMs on Xen, I found
some bugs in the credit scheduler which cause long tail latency in
I/O-intensive VMs.

(1) Problem description

Description
My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0 (Linux
3.18.21), Dom U (Linux 3.18.21).

Environment setup:
We created two 1-vCPU, 4GB-memory VMs and pinned them onto one physical
CPU core. One VM (denoted as I/O-VM) ran the Sockperf server program; the
other VM (denoted as CPU-VM) ran a compute-bound task, e.g., SPEC CPU
2006 or simply a loop. A client on another physical machine sent UDP
requests to the I/O-VM.

Here are my tail latency results (microseconds):

Case    Avg     90%     99%     99.9%   99.99%
#1      108     114     128     129     130
#2      7811    13892   14874   15315   16383
#3      943     131     21755   26453   26553
#4      116     96      105     8217    13472
#5      116     117     129     131     132

Bugs 1, 2, and 3 will be discussed below.

Case #1:
I/O-VM was processing Sockperf requests from clients; CPU-VM was idling
(no processes running).

Case #2:
I/O-VM was processing Sockperf requests from clients; CPU-VM was running
a compute-bound task.
Hypervisor is the native Xen 4.5.0.

Case #3:
I/O-VM was processing Sockperf requests from clients; CPU-VM was running
a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bug 1 fixed.

Case #4:
I/O-VM was processing Sockperf requests from clients; CPU-VM was running
a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 fixed.

Case #5:
I/O-VM was processing Sockperf requests from clients; CPU-VM was running
a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 & 3 fixed.

---

(2) Problem analysis

Analysis

[Bug 1]: The VCPU that ran a CPU-intensive workload could be mistakenly
boosted due to CPU affinity.

http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html

We have already discussed this bug and a potential patch in the above
link. Although the discussed patch improved the tail latency, i.e.,
reducing the 90th percentile latency, the long tail latency is still not
bounded. Next, we discuss two new bugs that inflict latency hikes at the
very far end of the tail.

[Bug 2]: In csched_acct() (by default every 30ms), a VCPU stops earning
credits and is removed from the active CPU list (in
__csched_vcpu_acct_stop_locked) if its credit is larger than the upper
bound. Because the domain has only one VCPU, the VM will also be removed
from the active domain list.

Every 10ms, csched_tick() --> csched_vcpu_acct() -->
__csched_vcpu_acct_start() will be executed and tries to put inactive
VCPUs back on the active list. However, __csched_vcpu_acct_start() will
only put the current VCPU back on the active list. If an I/O-bound VCPU
is not the current VCPU at the csched_tick(), it will not be put back on
the active VCPU list. If so, the I/O-bound VCPU will likely miss the next
credit refill in csched_acct() and can easily enter the OVER state. As
such, the I/O-bound VM will be unable to be boosted and will have very
long latency. It takes at least one time slice (e.g., 30ms) before the
I/O VM is activated and starts to receive credits.

[Possible Solution] Try to activate any inactive VCPUs back to active
before the next credit refill, instead of just the current VCPU.

[Bug 3]: The BOOST priority might be changed to UNDER before the boosted
VCPU preempts the currently running VCPU. If so, VCPU boosting cannot
take effect.

If a VCPU is in the UNDER state and wakes up from sleep, it will be
boosted in csched_vcpu_wake(). However, the boosting is successful only
when __runq_tickle() preempts the current VCPU. It is possible that
csched_acct() runs between csched_vcpu_wake() and __runq_tickle(), which
will sometimes change the BOOST state back to UNDER if credit > 0. If so,
__runq_tickle() can fail, as a VCPU in UNDER cannot preempt another UNDER
VCPU. This also contributes to the far end of the long tail latency.

[Possible Solution]
1. Add a lock to prevent csched_acct() from interleaving with
csched_vcpu_wake();
2. Separate the BOOST state from the UNDER and OVER states.

---

Please confirm these bugs. Thanks.

--
Tony. S
Ph.D student, University of Colorado, Colorado Springs
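Bug 2 is easy to model in miniature. The sketch below is an invented toy, not the real csched code: the tick reactivates only the vCPU that happens to be current, so an inactive I/O vCPU that is never on-CPU at tick time stays off the active list and earns nothing at each accounting step.

```c
#include <assert.h>
#include <stdbool.h>

#define NVCPUS 2

/* Toy vCPU: "active" stands in for membership on the active list. */
struct toy_vcpu {
    bool active;
    int credit;
};

/* Models __csched_vcpu_acct_start() being reached only for the vCPU
 * that is current at the tick -- the heart of bug 2. */
static void tick(struct toy_vcpu *v, int current)
{
    if (!v[current].active)
        v[current].active = true;
}

/* Models the refill in csched_acct(): only active vCPUs earn credit. */
static void acct(struct toy_vcpu *v, int refill)
{
    for (int i = 0; i < NVCPUS; i++)
        if (v[i].active)
            v[i].credit += refill;
}
```

If the CPU-bound vCPU 0 is current at every tick, the inactive I/O vCPU 1 is never reactivated and its credit never moves, matching the "at least one time slice" delay described above; the proposed fix amounts to having acct() (or the tick) reactivate all inactive vCPUs, not just the current one.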