Re: [Xen-devel] [PATCH] xen: add steal_clock support on x86

2016-05-18 Thread Tony S
On Wed, May 18, 2016 at 11:59 AM, Tony S <suokuns...@gmail.com> wrote:
> On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
> <boris.ostrov...@oracle.com> wrote:
>> On 05/18/2016 12:10 PM, Dario Faggioli wrote:
>>> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>>
>>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>>> steal_account_process_tick()->steal_clock() and second time from
>>>>> do_stolen_accounting()?
>>>> Uuh, yes.
>>>>
>>>> I guess I should rip do_stolen_accounting() out, too? It is a
>>>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>>>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>>>
>>> So, config options aside, if I understand this correctly, it looks like
>>> we were actually already doing steal time accounting, although in a
>>> non-standard way.
>>>
>>> And yet, people seem to have issues relating to lack of (proper?) steal
>>> time accounting (Cc-ing Tony).
>>>
>>> I guess this means that, either:
>>>  - the issue being reported is actually not caused by the lack of
>>>steal time accounting,
>>>  - our current (Xen specific) steal time accounting solution is flawed,
>>>  - the issue is caused by the lack of the bit of steal time accounting
>>>that we do not support yet,
>>
>> I believe it's this one.
>>
>> Tony narrowed the problem down to update_curr() where vruntime is
>> calculated, based on runqueue's clock_task value. That value is computed
>> in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
>>
>
> Hi Boris,
>
> You are right.
>
> The real problem is that steal_clock in pv_time_ops is implemented in
> KVM but not in Xen.
>
> arch/x86/include/asm/paravirt_types.h
> struct pv_time_ops {
>         unsigned long long (*sched_clock)(void);
>         unsigned long long (*steal_clock)(int cpu);
>         unsigned long (*get_tsc_khz)(void);
> };
>
>
> (1) KVM implemented both sched_clock and steal_clock.
>
> arch/x86/kernel/kvmclock.c
> pv_time_ops.sched_clock = kvm_clock_read;
>
> arch/x86/kernel/kvm.c
> pv_time_ops.steal_clock = kvm_steal_clock;
>
>
> (2) However, Xen only implemented sched_clock, while steal_clock is
> still native_steal_clock(). The function native_steal_clock() simply
> returns 0.
>
> arch/x86/xen/time.c
> .sched_clock = xen_clocksource_read;
>
> arch/x86/kernel/paravirt.c
> static u64 native_steal_clock(int cpu)
> {
>  return 0;
> }
>
>
> Therefore, even though update_rq_clock_task() performs the calculation
> and the paravirt_steal_rq_enabled option is enabled, the steal value it
> reads is always 0. This causes the problem I mentioned.
>
> update_rq_clock_task
> --> paravirt_steal_clock
> --> pv_time_ops.steal_clock
> --> native_steal_clock (if in Xen)
> --> 0
>
> The fundamental solution is to implement steal_clock in Xen (following
> the KVM implementation) instead of using the native one.
>
> Tony
>

Also, I tried the latest long-term version of Linux 4.4, and this issue
still exists there. I hope the next release can include this patch.

Tony


>> -boris
>>
>>>  - other ideas? Tony?
>>>
>>> Dario
>>
>>



Re: [Xen-devel] [PATCH] xen: add steal_clock support on x86

2016-05-18 Thread Tony S
On Wed, May 18, 2016 at 11:20 AM, Boris Ostrovsky
<boris.ostrov...@oracle.com> wrote:
> On 05/18/2016 12:10 PM, Dario Faggioli wrote:
>> On Wed, 2016-05-18 at 16:53 +0200, Juergen Gross wrote:
>>> On 18/05/16 16:46, Boris Ostrovsky wrote:
>>>>
>>>> Won't we be accounting for stolen cycles twice now --- once from
>>>> steal_account_process_tick()->steal_clock() and second time from
>>>> do_stolen_accounting()?
>>> Uuh, yes.
>>>
>>> I guess I should rip do_stolen_accounting() out, too? It is a
>>> Xen-specific hack, so I guess nobody will cry. Maybe it would be a
>>> good idea to select CONFIG_PARAVIRT_TIME_ACCOUNTING for XEN then?
>>>
>> So, config options aside, if I understand this correctly, it looks like
>> we were actually already doing steal time accounting, although in a
>> non-standard way.
>>
>> And yet, people seem to have issues relating to lack of (proper?) steal
>> time accounting (Cc-ing Tony).
>>
>> I guess this means that, either:
>>  - the issue being reported is actually not caused by the lack of
>>steal time accounting,
>>  - our current (Xen specific) steal time accounting solution is flawed,
>>  - the issue is caused by the lack of the bit of steal time accounting
>>that we do not support yet,
>
> I believe it's this one.
>
> Tony narrowed the problem down to update_curr() where vruntime is
> calculated, based on runqueue's clock_task value. That value is computed
> in update_rq_clock_task(), which needs paravirt_steal_rq_enabled.
>

Hi Boris,

You are right.

The real problem is that steal_clock in pv_time_ops is implemented in
KVM but not in Xen.

arch/x86/include/asm/paravirt_types.h
struct pv_time_ops {
        unsigned long long (*sched_clock)(void);
        unsigned long long (*steal_clock)(int cpu);
        unsigned long (*get_tsc_khz)(void);
};


(1) KVM implemented both sched_clock and steal_clock.

arch/x86/kernel/kvmclock.c
pv_time_ops.sched_clock = kvm_clock_read;

arch/x86/kernel/kvm.c
pv_time_ops.steal_clock = kvm_steal_clock;


(2) However, Xen only implemented sched_clock, while steal_clock is
still native_steal_clock(). The function native_steal_clock() simply
returns 0.

arch/x86/xen/time.c
.sched_clock = xen_clocksource_read;

arch/x86/kernel/paravirt.c
static u64 native_steal_clock(int cpu)
{
 return 0;
}


Therefore, even though update_rq_clock_task() performs the calculation
and the paravirt_steal_rq_enabled option is enabled, the steal value it
reads is always 0. This causes the problem I mentioned.

update_rq_clock_task
--> paravirt_steal_clock
--> pv_time_ops.steal_clock
--> native_steal_clock (if in Xen)
--> 0
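
For context, the consumer side in the scheduler looks roughly like the
sketch below (condensed from kernel/sched/core.c of that era; treat it
as an illustration rather than an exact quote). With steal_clock left
as native_steal_clock(), "steal" is always 0 and clock_task keeps
advancing across time stolen from the vCPU:

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
        if (static_key_false(&paravirt_steal_rq_enabled)) {
                s64 steal = paravirt_steal_clock(cpu_of(rq));

                steal -= rq->prev_steal_time_rq;
                if (unlikely(steal > delta))
                        steal = delta;

                rq->prev_steal_time_rq += steal;
                /* stolen time no longer counts toward clock_task */
                delta -= steal;
        }
#endif
        rq->clock_task += delta;
}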

The fundamental solution is to implement steal_clock in Xen (following
the KVM implementation) instead of using the native one.
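
As a rough illustration of that direction (this is a hedged sketch,
not the actual patch), a Xen steal_clock hook could be built on the
per-vCPU runstate information. The sketch assumes each CPU registers a
copy of its vcpu_runstate_info via VCPUOP_register_runstate_memory_area
and counts runnable plus offline time as stolen:

/* Sketch only: a real implementation must also read the runstate
 * consistently (see the discussion about reading another CPU's
 * runstate elsewhere in this thread). */
static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);

static u64 xen_steal_clock(int cpu)
{
        struct vcpu_runstate_info *s = &per_cpu(xen_runstate, cpu);

        /* time stolen from the vCPU = time spent runnable or offline */
        return s->time[RUNSTATE_runnable] + s->time[RUNSTATE_offline];
}

/* and, at init time, instead of keeping native_steal_clock(): */
pv_time_ops.steal_clock = xen_steal_clock;
static_key_slow_inc(&paravirt_steal_enabled);
static_key_slow_inc(&paravirt_steal_rq_enabled);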

Tony

> -boris
>
>>  - other ideas? Tony?
>>
>> Dario
>
>



Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

2016-05-18 Thread Tony S
On Wed, May 18, 2016 at 8:57 AM, Dario Faggioli
<dario.faggi...@citrix.com> wrote:
> On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
>> On 17/05/16 11:33, George Dunlap wrote:
>> > > Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
>> > > process
>> > > times. KVM uses it but Xen doesn't.
>> > Is someone on the Linux side going to put this on their to-do list
>> > then? :-)
>>
>> Patch sent.
>>
> Yep, seen it, thanks.
>
>> Support already exists for arm.
>>
> Yes!! I remember Stefano talking about introducing it, and that was
> also why I thought we had already had it on x86 for a long time.
>
> Well, anyway... :-)
>
>> What is missing is support for paravirt_steal_rq_enabled, which
>> requires being able to read the stolen time of another cpu. This
>> can't work today as accessing another cpu's
>> vcpu_runstate_info isn't possible without risking inconsistent data.
>> I plan to add support for this, too, but this will require adding
>> another hypercall to map a modified vcpu_runstate_info containing an
>> indicator for an ongoing update of the data.
>>
> Understood.
>
> So, Tony, up for trying again your workload with this patch applied to
> Linux?
>
> Most likely, it _won't_ fix all the problems you're seeing, but I'm
> curious to see if it helps.
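
The update-indicator scheme Juergen describes above would essentially
be a seqlock-style retry loop on the guest side. A hedged sketch is
below; the XEN_RUNSTATE_UPDATE flag in state_entry_time is an
assumption here, since no such interface existed at the time of this
thread:

/* Sketch only: retry until a stable snapshot of another CPU's
 * vcpu_runstate_info has been read.  The update flag and its location
 * are assumptions, not an existing ABI. */
static void xen_read_runstate_snapshot(struct vcpu_runstate_info *area,
                                       struct vcpu_runstate_info *out)
{
        u64 entry;

        do {
                entry = READ_ONCE(area->state_entry_time);
                smp_rmb();              /* read flag before the payload */
                *out = *area;
                smp_rmb();              /* read payload before re-check */
        } while (READ_ONCE(area->state_entry_time) != entry ||
                 (entry & XEN_RUNSTATE_UPDATE));   /* assumed update bit */
}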

Hi Dario,

I did not see the patch. Could you please send it to me? I will try
to test it later.

Best
Tony

>
> Thanks again and Regards,
> Dario
> --
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

-- 
Tony S.
Ph.D. student at the University of Colorado, Colorado Springs



Re: [Xen-devel] [BUG] Bugs existing Xen's credit scheduler cause long tail latency issues

2016-05-17 Thread Tony S
On Tue, May 17, 2016 at 3:27 AM, George Dunlap <dunl...@umich.edu> wrote:
> On Sun, May 15, 2016 at 5:11 AM, Tony S <suokuns...@gmail.com> wrote:
>> Hi all,
>>
>> When I was running latency-sensitive applications in VMs on Xen, I
>> found some bugs in the credit scheduler which will cause long tail
>> latency in I/O-intensive VMs.
>>
>>
>> (1) Problem description
>>
>> Description
>> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
>> 3.18.21), Dom U(Linux 3.18.21).
>>
>> Environment setup:
>> We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
>> physical CPU core. One VM(denoted as I/O-VM) ran Sockperf server
>> program; the other VM ran a compute-bound task, e.g., SPECCPU 2006 or
>> simply a loop(denoted as CPU-VM). A client on another physical machine
>> sent UDP requests to the I/O-VM.
>>
>> Here are my tail latency results (in microseconds):
>> Case      Avg      90%      99%      99.9%    99.99%
>> #1        108      114      128      129      130
>> #2        7811     13892    14874    15315    16383
>> #3        943      131      21755    26453    26553
>> #4        116      96       105      8217     13472
>> #5        116      117      129      131      132
>>
>> Bug 1, 2, and 3 will be discussed below.
>>
>> Case #1:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> idling (no processes running).
>>
>> Case #2:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0
>>
>> Case #3:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 fixed
>>
>> Case #4:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 & 2 fixed
>>
>> Case #5:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 & 2 & 3 fixed
>>
>> ---
>>
>>
>> (2) Problem analysis
>
> Hey Tony,
>
> Thanks for looking at this.  These issues in the credit1 algorithm are
> essentially exactly the reason that I started work on the credit2
> scheduler several years ago.  We meant credit2 to have replaced
> credit1 by now, but we ran out of time to test it properly; we're in
> the process of doing that right now, and are hoping it will be the
> default scheduler for the 4.8 release.
>
> So if I could make two suggestions that would help your effort be more
> helpful to us:
>
> 1. Use cpupools for testing rather than pinning. A lot of the
> algorithms are designed with the assumption that they have all the
> cpus to run on, and the credit allocation / priority algorithms fail
> to work properly when they are only pinned.  Cpupools was specifically
> designed to allow the scheduler algorithms to work as designed with a
> smaller number of cpus than the system had.
>
> 2. Test credit2. :-)
>

Hi George,

Thank you for your reply. I will try cpupools and credit2 later. :-)


> One comment about your analysis here...
>
>> [Bug 2]: In csched_acct() (by default every 30ms), a VCPU stops earning
>> credits and is removed from the active CPU list (in
>> __csched_vcpu_acct_stop_locked) if its credit is larger than the upper
>> bound. Because the domain has only one VCPU, the VM will also be
>> removed from the active domain list.
>>
>> Every 10ms, csched_tick() --> csched_vcpu_acct() -->
>> __csched_vcpu_acct_start() will be executed and tries to put inactive
>> VCPUs back to the active list. However, __csched_vcpu_acct_start()
>> will only put the current VCPU back to the active list. If an
>> I/O-bound VCPU is not the current VCPU at the csched_tick(), it will
>> not be put back to the active VCPU list. If so, the I/O-bound VCPU
>> will likely miss the next credit refill in csched_acct() and can
>> easily enter the OVER state. As such, the I/O-bound VM will be unable
>> to be boosted and have very long latency. It takes at least one time
>> slice (e.g., 30ms) before the I/O VM is activated and starts to
>> receive credits.
>>
>> [Possible Solution] Try to activate any inactive VCPUs back to active
>> before the next credit refill, instead of just the current VCPU.

Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

2016-05-16 Thread Tony S
On Mon, May 16, 2016 at 3:38 PM, Tony S <suokuns...@gmail.com> wrote:
> On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
> <dario.faggi...@citrix.com> wrote:
>> [Adding George again, and a few Linux/Xen folks]
>>
>> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>>> In virtualized environments, sometimes we need to limit the CPU
>>> resources to a virtual machine(VM). For example in Xen, we use
>>> $ xl sched-credit -d 1 -c 50
>>>
>>> to limit the CPU resource of dom 1 as half of
>>> one physical CPU core. If the VM CPU resource is capped, the process
>>> inside the VM will have a vruntime accounting problem. Here, I report
>>> my findings about Linux process scheduler under the above scenario.
>>>
>> Thanks for this other report as well. :-)
>>
>> All you say makes sense to me, and I will think about it. I'm not sure
>> about one thing, though...
>>
>
> Hi Dario,
>
> Thank you for your reply.
>
>
>>> Description
>>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>>> The variable delta_exec is the time between when a process starts and
>>> when it stops running on a CPU. This works well on a physical machine.
>>> However, in a virtual machine under capped resources, some processes
>>> might be accounted with an inaccurate vruntime.
>>>
>>> For example, suppose we have a VM which has one vCPU and is capped to
>>> have as much as 50% of a physical CPU. When process A inside the VM
>>> starts running and the CPU resource of that VM runs out, the VM will
>>> be paused. Next round when the VM is allocated new CPU resource and
>>> starts running again, process A stops running and is put back to the
>>> runqueue. The delta_exec of process A is accounted as its "real
>>> execution time" plus the paused time of its VM. That will make the
>>> vruntime of process A much larger than it should be and process A
>>> would not be scheduled again for a long time until the vruntimes of
>>> other processes catch up to it.
>>> ---
>>>
>>>
>>> Analysis
>>> When a process stops running and is about to be put back on the
>>> runqueue, update_curr() will be executed.
>>> [src/kernel/sched/fair.c]
>>>
>>> static void update_curr(struct cfs_rq *cfs_rq)
>>> {
>>> ... ...
>>> delta_exec = now - curr->exec_start;
>>> ... ...
>>> curr->exec_start = now;
>>> ... ...
>>> curr->sum_exec_runtime += delta_exec;
>>> schedstat_add(cfs_rq, exec_clock, delta_exec);
>>> curr->vruntime += calc_delta_fair(delta_exec, curr);
>>> update_min_vruntime(cfs_rq);
>>> ... ...
>>> }
>>>
>>> "now" --> the right now time
>>> "exec_start" --> the time when the current process is put on the CPU
>>> "delta_exec" --> the time difference of a process between it starts
>>> and stops running on the CPU
>>>
>>> When a process starts running before its VM is paused and the process
>>> stops running after its VM is unpaused, the delta_exec will include
>>> the VM suspend time which is pretty large compared to the real
>>> execution time of a process.
>>>
>> ... but would that also apply to a VM that is not scheduled --just
>> because of pCPU contention, not because it was paused-- for some time?
>>
>
> Thanks for your suggestion. I tried today to see whether this issue
> also exists with pCPU sharing. Unfortunately, I found that it does:
> the issue is there not only in the capping case but also in the
> pCPU-sharing case.
>
> In both of the above cases, the process vruntime accounting in the
> guest OS shows a "vruntime jump", which can cause the victim process
> to have poor and unpredictable performance.
>
> In the cloud, from my point of view, a VM exists in one of three
> scenarios:
> 1. dedicated hardware (in this case, VM = physical machine);
> 2. part of dedicated hardware (using capping, like an Amazon EC2
> t2.small instance);
> 3. sharing the hardware with other VMs.
>
> Both case #2 and case #3 are affected by the issue I mentioned.
>
>
>> Isn't there anything in place in Xen or Linux (the latter being better
>> suitable for something like this, IMHO) to compensate for that?
>>
>
> No, I do not think so. I think this is a bug in the Linux kernel under
> virtualization (the VMM platform is Xen).

Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

2016-05-16 Thread Tony S
On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
<dario.faggi...@citrix.com> wrote:
> [Adding George again, and a few Linux/Xen folks]
>
> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>> In virtualized environments, sometimes we need to limit the CPU
>> resources to a virtual machine(VM). For example in Xen, we use
>> $ xl sched-credit -d 1 -c 50
>>
>> to limit the CPU resource of dom 1 as half of
>> one physical CPU core. If the VM CPU resource is capped, the process
>> inside the VM will have a vruntime accounting problem. Here, I report
>> my findings about Linux process scheduler under the above scenario.
>>
> Thanks for this other report as well. :-)
>
> All you say makes sense to me, and I will think about it. I'm not sure
> about one thing, though...
>

Hi Dario,

Thank you for your reply.


>> Description
>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>> The variable delta_exec is the time between when a process starts and
>> when it stops running on a CPU. This works well on a physical machine.
>> However, in a virtual machine under capped resources, some processes
>> might be accounted with an inaccurate vruntime.
>>
>> For example, suppose we have a VM which has one vCPU and is capped to
>> have as much as 50% of a physical CPU. When process A inside the VM
>> starts running and the CPU resource of that VM runs out, the VM will
>> be paused. Next round when the VM is allocated new CPU resource and
>> starts running again, process A stops running and is put back to the
>> runqueue. The delta_exec of process A is accounted as its "real
>> execution time" plus the paused time of its VM. That will make the
>> vruntime of process A much larger than it should be and process A
>> would not be scheduled again for a long time until the vruntimes of
>> other processes catch up to it.
>> ---
>>
>>
>> Analysis
>> When a process stops running and is about to be put back on the
>> runqueue, update_curr() will be executed.
>> [src/kernel/sched/fair.c]
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>> ... ...
>> delta_exec = now - curr->exec_start;
>> ... ...
>> curr->exec_start = now;
>> ... ...
>> curr->sum_exec_runtime += delta_exec;
>> schedstat_add(cfs_rq, exec_clock, delta_exec);
>> curr->vruntime += calc_delta_fair(delta_exec, curr);
>> update_min_vruntime(cfs_rq);
>> ... ...
>> }
>>
>> "now" --> the right now time
>> "exec_start" --> the time when the current process is put on the CPU
>> "delta_exec" --> the time difference of a process between it starts
>> and stops running on the CPU
>>
>> When a process starts running before its VM is paused and the process
>> stops running after its VM is unpaused, the delta_exec will include
>> the VM suspend time which is pretty large compared to the real
>> execution time of a process.
>>
> ... but would that also apply to a VM that is not scheduled --just
> because of pCPU contention, not because it was paused-- for some time?
>

Thanks for your suggestion. I tried today to see whether this issue
also exists with pCPU sharing. Unfortunately, I found that it does:
the issue is there not only in the capping case but also in the
pCPU-sharing case.

In both of the above cases, the process vruntime accounting in the
guest OS shows a "vruntime jump", which can cause the victim process
to have poor and unpredictable performance.

In the cloud, from my point of view, a VM exists in one of three scenarios:
1. dedicated hardware (in this case, VM = physical machine);
2. part of dedicated hardware (using capping, like an Amazon EC2 t2.small instance);
3. sharing the hardware with other VMs.

Both case #2 and case #3 are affected by the issue I mentioned.


> Isn't there anything in place in Xen or Linux (the latter being better
> suitable for something like this, IMHO) to compensate for that?
>

No, I do not think so. I think this is a bug in the Linux kernel under
virtualization (the VMM platform is Xen).

> I have to admit I haven't really ever checked myself, maybe either
> George or our Linux people do know more?

The issue behind it is that process execution time (e.g., delta_exec)
in a virtualized environment should not be calculated the same way as
it is in a physical environment.

Here are two solutions to fix it:

1) Based on the changes of vcpu->runstate.time
(running/runnable/blocked/offline), determine how much time the
process on this VCPU is running, instead of just "de

Re: [Xen-devel] [BUG] Bugs existing Xen's credit scheduler cause long tail latency issues

2016-05-16 Thread Tony S
On Mon, May 16, 2016 at 5:30 AM, Dario Faggioli
<dario.faggi...@citrix.com> wrote:
> [Adding George, and avoiding trimming, for his benefit]
>
> On Sat, 2016-05-14 at 22:11 -0600, Tony S wrote:
>> Hi all,
>>
> Hi Tony,
>
>> When I was running latency-sensitive applications in VMs on Xen, I
>> found some bugs in the credit scheduler which will cause long tail
>> latency in I/O-intensive VMs.
>>
> Ok, first of all, thanks for looking into and reporting this.
>
> This is certainly something we need to think about... For now, just a
> couple of questions.

Hi Dario,

Thank you for your reply. :-)

>
>> (1) Problem description
>>
>> Description
>> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
>> 3.18.21), Dom U(Linux 3.18.21).
>>
>> Environment setup:
>> We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
>> physical CPU core. One VM(denoted as I/O-VM) ran Sockperf server
>> program; the other VM ran a compute-bound task, e.g., SPECCPU 2006 or
>> simply a loop(denoted as CPU-VM). A client on another physical
>> machine
>> sent UDP requests to the I/O-VM.
>>
> So, just to be sure I've understood, you have 2 VMs, each with 1 vCPU,
> *both* pinned on the *same* pCPU, is this the case?
>

Yes.

>> Here are my tail latency results (in microseconds):
>> Case      Avg      90%      99%      99.9%    99.99%
>> #1        108      114      128      129      130
>> #2        7811     13892    14874    15315    16383
>> #3        943      131      21755    26453    26553
>> #4        116      96       105      8217     13472
>> #5        116      117      129      131      132
>>
>> Bug 1, 2, and 3 will be discussed below.
>>
>> Case #1:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> idling (no processes running).
>>
>> Case #2:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0
>>
>> Case #3:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 fixed
>>
>> Case #4:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 & 2 fixed
>>
>> Case #5:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 & 2 & 3 fixed
>>
>> ---
>>
>>
>> (2) Problem analysis
>>
>> Analysis
>>
>> [Bug1]: The VCPU that ran CPU-intensive workload could be mistakenly
>> boosted due to CPU affinity.
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.
>> html
>>
>> We have already discussed this bug and a potential patch in the above
>> link. Although the discussed patch improved the tail latency, i.e.,
>> reducing the 90th percentile latency, the long tail latency is still
>> not bounded. Next, we discuss two new bugs that inflict a latency
>> hike at the very far end of the tail.
>>
> Right, and there is a fix upstream for this. It's not the patch you
> proposed in the thread linked above, but it should have had the same
> effect.
>
> Can you perhaps try something more recent than 4.5 (4.7-rc would be
> great) and confirm that the numbers still look similar?

I tried the latest stable version, Xen 4.6, today. Here are my results:

Case      Avg      90%      99%       99.9%     99.99%
#1        91       93       101       105       110
#2        22506    43011    231946    259501    265561
#3        917      95       25257     30048     30756
#4        110      95       102       12448     13255
#5        114      118      130       134       136

It seems that case #2 is much worse. The other cases are similar. My
raw latency data is pasted below.

For Xen 4.7-rc, I have some installation issues on my machine, so I
have not tried it yet.


Raw data is as follows. Hope this could help you understand the issues
better. :-)
# case 1:
sockperf: > avg-lat= 91.688 (std-dev=2.950)
sockperf: --->  observation =  110.647
sockperf: ---> percentile  99.99 =  110.647
sockperf: ---> percentile  99.90 =  105.242
so

[Xen-devel] [BUG] Linux process vruntime accounting in Xen

2016-05-14 Thread Tony S
In virtualized environments, sometimes we need to limit the CPU
resources of a virtual machine (VM). For example, in Xen we use

$ xl sched-credit -d 1 -c 50

to limit the CPU resource of dom 1 to half of one physical CPU core.
If the VM's CPU resource is capped, processes inside the VM can have a
vruntime accounting problem. Here, I report my findings about the
Linux process scheduler under the above scenario.


Description
Linux CFS relies on delta_exec to charge the vruntime of processes.
The variable delta_exec is the time between when a process starts and
when it stops running on a CPU. This works well on a physical machine.
However, in a virtual machine under capped resources, some processes
might be accounted with an inaccurate vruntime.

For example, suppose we have a VM which has one vCPU and is capped to
have at most 50% of a physical CPU. When process A inside the VM
starts running and the CPU resource of that VM runs out, the VM will
be paused. In the next round, when the VM is allocated new CPU
resource and starts running again, process A stops running and is put
back on the runqueue. The delta_exec of process A is accounted as its
"real execution time" plus the paused time of its VM. That will make
the vruntime of process A much larger than it should be, and process A
will not be scheduled again for a long time until the vruntimes of
other processes catch up to it.
---


Analysis
When a process stops running and is about to be put back on the
runqueue, update_curr() will be executed.
[src/kernel/sched/fair.c]

static void update_curr(struct cfs_rq *cfs_rq)
{
... ...
delta_exec = now - curr->exec_start;
... ...
curr->exec_start = now;
... ...
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq, exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
update_min_vruntime(cfs_rq);
... ...
}

"now" --> the right now time
"exec_start" --> the time when the current process is put on the CPU
"delta_exec" --> the time difference of a process between it starts
and stops running on the CPU

When a process starts running before its VM is paused and stops
running after its VM is unpaused, delta_exec will include the time the
VM was paused, which is very large compared to the real execution time
of the process.
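
To make the size of the error concrete with made-up numbers: if the
50% cap causes the VM to be descheduled for, say, 15ms at a time, and
process A had run only 0.5ms of real work when the pause hit, then
after the VM resumes update_curr() computes a delta_exec of roughly
0.5ms + 15ms = 15.5ms, about 30 times the CPU time process A actually
consumed, and its vruntime is inflated accordingly.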

This issue does great harm to the performance of the victim process.
If the process runs an I/O-bound workload, its throughput and latency
will suffer. If the process runs a CPU-bound workload, this issue
makes its vruntime "unfair" compared to other processes under CFS.

Because the CPU resources of some VM types in the cloud are limited as
described above (like the Amazon EC2 t2.small instance), I suspect
this also harms the performance of public cloud instances.
---


My test environment is as follows: hypervisor (Xen 4.5.0), Dom 0
(Linux 3.18.21), Dom U (Linux 3.18.21). I also tested the long-term
version Linux 3.18.30 and the latest long-term version, Linux 4.4.7.
All of those kernels have this issue.

Please confirm this bug. Thanks.


-- 
Tony S.
Ph.D. student at the University of Colorado, Colorado Springs



[Xen-devel] [BUG] Bugs existing Xen's credit scheduler cause long tail latency issues

2016-05-14 Thread Tony S
Hi all,

When I was running latency-sensitive applications in VMs on Xen, I
found some bugs in the credit scheduler which will cause long tail
latency in I/O-intensive VMs.


(1) Problem description

Description
My test environment is as follows: hypervisor (Xen 4.5.0), Dom 0
(Linux 3.18.21), Dom U (Linux 3.18.21).

Environment setup:
We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
physical CPU core. One VM (denoted as the I/O-VM) ran the Sockperf
server program; the other VM (denoted as the CPU-VM) ran a
compute-bound task, e.g., SPEC CPU 2006 or simply a loop. A client on
another physical machine sent UDP requests to the I/O-VM.

Here are my tail latency results (in microseconds):
Case      Avg      90%      99%      99.9%    99.99%
#1        108      114      128      129      130
#2        7811     13892    14874    15315    16383
#3        943      131      21755    26453    26553
#4        116      96       105      8217     13472
#5        116      117      129      131      132

Bugs 1, 2, and 3 are discussed below.

Case #1:
I/O-VM was processing Sockperf requests from clients; CPU-VM was
idling (no processes running).

Case #2:
I/O-VM was processing Sockperf requests from clients; CPU-VM was
running a compute-bound task.
Hypervisor is the native Xen 4.5.0

Case #3:
I/O-VM was processing Sockperf requests from clients; CPU-VM was
running a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bug 1 fixed

Case #4:
I/O-VM was processing Sockperf requests from clients; CPU-VM was
running a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bug 1 & 2 fixed

Case #5:
I/O-VM was processing Sockperf requests from clients; CPU-VM was
running a compute-bound task.
Hypervisor is the native Xen 4.5.0 with bug 1 & 2 & 3 fixed

---


(2) Problem analysis

Analysis

[Bug 1]: The VCPU that ran a CPU-intensive workload could be
mistakenly boosted due to CPU affinity.

http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html

We have already discussed this bug and a potential patch in the above
link. Although the discussed patch improved the tail latency, i.e.,
reducing the 90th percentile latency, the long tail latency is still
not bounded. Next, we discuss two new bugs that inflict a latency hike
at the very far end of the tail.



[Bug 2]: In csched_acct() (by default every 30ms), a VCPU stops
earning credits and is removed from the active CPU list (in
__csched_vcpu_acct_stop_locked) if its credit is larger than the upper
bound. Because the domain has only one VCPU, the VM will also be
removed from the active domain list.

Every 10ms, csched_tick() --> csched_vcpu_acct() -->
__csched_vcpu_acct_start() will be executed and tries to put inactive
VCPUs back to the active list. However, __csched_vcpu_acct_start()
will only put the current VCPU back to the active list. If an
I/O-bound VCPU is not the current VCPU at the csched_tick(), it will
not be put back to the active VCPU list. If so, the I/O-bound VCPU
will likely miss the next credit refill in csched_acct() and can
easily enter the OVER state. As such, the I/O-bound VM will be unable
to be boosted and have very long latency. It takes at least one time
slice (e.g., 30ms) before the I/O VM is activated and starts to
receive credits.

[Possible Solution] Try to activate any inactive VCPUs back to active
before the next credit refill, instead of just the current VCPU.
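
A rough sketch of the shape such a fix could take is shown below;
apart from csched_acct() and __csched_vcpu_acct_stop_locked(), which
are named above, the helper names are hypothetical and not actual Xen
4.5 symbols:

/* Illustrative only: before csched_acct() refills credits, walk the
 * vCPUs that were parked by __csched_vcpu_acct_stop_locked() and put
 * them back on the active list, instead of relying on each pCPU's
 * current vCPU at tick time.  for_each_parked_vcpu() and
 * csched_vcpu_acct_start() are hypothetical names. */
static void csched_reactivate_parked(struct csched_private *prv)
{
    struct csched_vcpu *svc;

    for_each_parked_vcpu ( prv, svc )
        csched_vcpu_acct_start(prv, svc);
}

/* called at the top of csched_acct(), before credits are recomputed */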



[Bug 3]: The BOOST priority might be changed to UNDER before the
boosted VCPU preempts the currently running VCPU. If so, VCPU boosting
cannot take effect.

If a VCPU is in UNDER state and wakes up from sleep, it will be
boosted in csched_vcpu_wake(). However, the boosting is successful
only when __runq_tickle() preempts the current VCPU. It is possible
that csched_acct() can run between csched_vcpu_wake() and
__runq_tickle(), which will sometimes change the BOOST state back to
UNDER if credit > 0. If so, __runq_tickle() can fail, as an UNDER VCPU
cannot preempt another UNDER VCPU. This also contributes to the far
end of the long tail latency.

[Possible Solution]
1. add a lock to prevent csched_acct() from interleaving with
csched_vcpu_wake();
2. separate the BOOST state from UNDER and OVER states.
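
For solution 2 above, one way to picture it (illustrative only, not
Xen code) is to keep the boost as a flag orthogonal to the credit
state, so that csched_acct() resetting the credit state cannot
silently drop a pending boost:

/* Hypothetical sketch: credit state and boost tracked separately. */
struct csched_vcpu_state {
        int  credit_pri;     /* CSCHED_PRI_TS_UNDER or CSCHED_PRI_TS_OVER */
        bool boost_pending;  /* set on wakeup, cleared once the vCPU runs */
};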
---


Please confirm these bugs.
Thanks.

--
Tony S.
Ph.D. student at the University of Colorado, Colorado Springs
