Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-12-07 Thread Michael Wolf

On 12/05/2012 06:46 AM, Glauber Costa wrote:

I am deeply sorry.

I was busy the first time I read this, so I postponed answering and ended
up forgetting.

Sorry

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.

So I looked at something like this in the past.  To make sure things
haven't changed I set up a cgroup on my test server running a kernel
built from the latest tip tree.

[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a
script that will hog the cpu I see the following in top on the guest:
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 49.8%st

So the steal time here is in line with the bandwidth control settings.

Ok. So I was wrong in my hunch that it would be outside the runqueue,
therefore work automatically. Still, the host kernel has all the
information in cgroups.


So then the steal time did not show on the guest.  You have no value
that needs to be passed around.  What I did not like about this approach was
* only works for cfs bandwidth control.  If another type of hard limit
  was added to the kernel the code would potentially need to change.

This is true for almost everything we have in the kernel!
It is *very* unlikely for another bandwidth control mechanism to ever
appear. If it ever does, it's *their* burden to make sure it works for
steal time (provided it is merged). Code in tree gets precedence.


Ok, I will work on a patch that uses the cgroup information for
bandwidth control to separate out the time.




* This approach doesn't help if the limits are set by overcommitting the
  cpus.  It is my understanding that this is a common approach.


I can't say anything about commonality, but common or not, it is a
*crazy* approach.

When you simply overcommit, you have no way to differentiate between
intended steal time and non-intended steal time. Moreover, when you
overcommit, your cpu usage will vary over time. If two guests use the
cpu to their full power, you will have 50 % each. But if one of them
slows down, the other gets more. What is your entitlement value? How do
you define this?

And then after you define it, you end up using more than this, what is
your cpu usage? 130 %?


Yes, exactly: you would ideally show a boosted amount of cpu.  However, to
do that you would need to either create a new tool or modify the current
accounting tools such as top.

My understanding is that you are not capping in this case as much as you
are guaranteeing a minimum level of performance.




The only sane way to do it, is to communicate this value to the kernel
somehow. The bandwidth controller is the interface we have for that. So
everybody that wants to *intentionally* overcommit needs to communicate
this to the controller. IOW: Any sane configuration should be explicit
about your capping.


 Add an ioctl to communicate the consign limit to the host.

This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.

I'm not understanding that comment.  If you are capping by simply
controlling the amount of overcommit on the host, then wouldn't you
still need some value to indicate the desired amount?

No, that is just crazy, and I don't like it a single bit.

So in the light of it: Whatever capping mechanism we have, we need to be
explicit about the expected entitlement. At this point, the kernel
already knows what it is, and needs no extra ioctls or anything like that.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-12-05 Thread Glauber Costa
I am deeply sorry.

I was busy the first time I read this, so I postponed answering and ended
up forgetting.

Sorry

 include/linux/sched.h:
 unsigned long long run_delay; /* time spent waiting on a runqueue */

 So if you are out of the runqueue, you won't get steal time accounted,
 and then I truly fail to understand what you are doing.
 So I looked at something like this in the past.  To make sure things
 haven't changed I set up a cgroup on my test server running a kernel
 built from the latest tip tree.
 
 [root]# cat cpu.cfs_quota_us
 5
 [root]# cat cpu.cfs_period_us
 10
 [root]# cat cpuset.cpus
 1
 [root]# cat cpuset.mems
 0
 
 Next I put the PID from the cpu thread into tasks.  When I start a
 script that will hog the cpu I see the following in top on the guest:
 Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 49.8%st
 
 So the steal time here is in line with the bandwidth control settings.

Ok. So I was wrong in my hunch that it would be outside the runqueue,
therefore work automatically. Still, the host kernel has all the
information in cgroups.

 So then the steal time did not show on the guest.  You have no value
 that needs to be passed around.  What I did not like about this approach was
 * only works for cfs bandwidth control.  If another type of hard limit
   was added to the kernel the code would potentially need to change.

This is true for almost everything we have in the kernel!
It is *very* unlikely for another bandwidth control mechanism to ever
appear. If it ever does, it's *their* burden to make sure it works for
steal time (provided it is merged). Code in tree gets precedence.

 * This approach doesn't help if the limits are set by overcommitting the
   cpus.  It is my understanding that this is a common approach.
 

I can't say anything about commonality, but common or not, it is a
*crazy* approach.

When you simply overcommit, you have no way to differentiate between
intended steal time and non-intended steal time. Moreover, when you
overcommit, your cpu usage will vary over time. If two guests use the
cpu to their full power, you will have 50 % each. But if one of them
slows down, the other gets more. What is your entitlement value? How do
you define this?

And then after you define it, you end up using more than this, what is
your cpu usage? 130 %?


The only sane way to do it, is to communicate this value to the kernel
somehow. The bandwidth controller is the interface we have for that. So
everybody that wants to *intentionally* overcommit needs to communicate
this to the controller. IOW: Any sane configuration should be explicit
about your capping.

 Add an ioctl to communicate the consign limit to the host.
 This definitely should go away.

 More specifically, *whatever* way we use to cap the processor, the host
 system will have all the information at all times.
 I'm not understanding that comment.  If you are capping by simply
 controlling the amount of overcommit on the host, then wouldn't you
 still need some value to indicate the desired amount?
No, that is just crazy, and I don't like it a single bit.

So in the light of it: Whatever capping mechanism we have, we need to be
explicit about the expected entitlement. At this point, the kernel
already knows what it is, and needs no extra ioctls or anything like that.





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-29 Thread Michael Wolf

On 11/28/2012 02:55 PM, Glauber Costa wrote:

On 11/28/2012 10:43 PM, Michael Wolf wrote:

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable but not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

In the capped case, the time that the guest spends waiting due to it
having used its full allotment of time shows up as steal time.  The way
my patchset currently stands is that you would set up the bandwidth
control and you would have to pass it a matching value from qemu.  In
the future, it would be possible to have something parse the bandwidth
setting and automatically adjust the setting in the host used for steal
time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Yes, in the context above I'm referring to the cfs bandwidth control.


Some time has passed since I last looked into it, but IIRC, after you
run out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.
So I looked at something like this in the past.  To make sure things
haven't changed I set up a cgroup on my test server running a kernel
built from the latest tip tree.


[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a
script that will hog the cpu I see the following in top on the guest:
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 49.8%st


So the steal time here is in line with the bandwidth control settings.
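
The arithmetic behind that observation can be sketched in C as below. This is only an illustration, not code from the patch set: expected_throttle_pct() is a hypothetical helper, and it assumes a single CPU-bound task pinned to one CPU, so that all time beyond the quota in each period is spent throttled.

```c
/*
 * Hypothetical helper, for illustration only.
 *
 * Returns the percentage of each CFS period during which a CPU-bound
 * task confined to one CPU is expected to be throttled, given the
 * values read from cpu.cfs_quota_us and cpu.cfs_period_us.
 */
static double expected_throttle_pct(long quota_us, long period_us)
{
	/* A negative quota means "no limit"; guard against a bad period too. */
	if (period_us <= 0 || quota_us < 0)
		return 0.0;

	/* A quota at or above the period never throttles a single CPU. */
	if (quota_us >= period_us)
		return 0.0;

	return 100.0 * (double)(period_us - quota_us) / (double)period_us;
}
```

With the quota-to-period ratio shown above (5 to 10), the helper yields 50 percent, which is close to the 49.8%st that top reported on the guest.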


In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:
About a year ago I was playing with this patch.  It is out of date now
but will give you an idea of what I was looking at.

 kernel/sched_fair.c  |4 ++--
 kernel/sched_stats.h |7 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)

 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 }

 /* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->throttle_count;
 }
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void sched_info_queued(struct task_struct *t)
  * sched_info_queued() to mark that it has now again started waiting on
  * the runqueue.
  */
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static inline void sched_info_depart(struct task_struct *t)
 {
+	struct task_group *tg = task_group(t);
+	struct cfs_rq *cfs_rq;
 	unsigned long long delta = task_rq(t)->clock -
 					t->sched_info.last_arrival;

+	cfs_rq = tg->cfs_rq[smp_processor_id()];
 	rq_sched_info_depart(task_rq(t), delta);

-	if (t->state == TASK_RUNNING)
+
+	if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
 		sched_info_queued(t);
 }


So then the steal time did not show on the guest.  You have no value
that needs to be passed around.  What I did not like about this approach was
* only works for cfs bandwidth control.  If another type of hard limit
  was added to the kernel the code would potentially need to change.
* This approach doesn't help if the limits are set by 

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Glauber Costa
On 11/27/2012 07:10 PM, Michael Wolf wrote:
 On 11/27/2012 02:48 AM, Glauber Costa wrote:
 Hi,

 On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will
 separate
 the consigned time from the steal time.  The consignment limit passed
 to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
 If you submit this again, please include a version number in your series.
 Will do.  The patchset was sent twice yesterday by mistake.  Got an
 error the first time and didn't
 think the patches went out.  This has been corrected.

 It would also be helpful to include a small changelog about what changed
 between last version and this version, so we could focus on that.
 yes, will do that.  When I took the RFC off the patches I was looking at
 it as a new patchset which was
 a mistake.  I will make sure to add a changelog when I submit again.

 As for the rest, I answered your previous two submissions saying I don't
 agree with the concept. If you hadn't changed anything, resending it
 won't change my mind.

 I could of course, be mistaken or misguided. But I had also not seen any
 wave of support in favor of this previously, so basically I have no new
 data to make me believe I should see it any differently.

 Let's try this again:

 * Rik asked you in your last submission how does ppc handle this. You
 said, and I quote: In the case of lpar on POWER systems they simply
 report steal time and do not alter it in any way.
 They do however report how much processor is assigned to the partition
 and that information is in /proc/ppc64/lparcfg.
 Yes, but we still get questions from users asking what is steal time?
 why am I seeing this?

 Now, that is a *way* more sensible thing to do. Much more. Confusing
 users is something extremely subjective. This is especially true of
 concepts that have been known for quite some time, like steal time. If
 you all of a sudden change the meaning of this, it is sure to confuse a
 lot more users than it would clarify.
 Something like this could certainly be done.  But when I was submitting
 the patch set as an RFC, qemu was passing a cpu percentage that would
 be used by the guest kernel to adjust the steal time. This percentage
 was being stored on the guest as a sysctl value.
 Avi stated he didn't like that kind of coupling, and that the value
 could get out of sync.  Anthony stated "The guest shouldn't need to know
 its entitlement. Or at least, it's up to a management tool to report
 that in a way that's meaningful for the guest."
 
 So perhaps I misunderstood what they were suggesting, but I took it to
 mean that they did not want the guest to know what the entitlement was.
 That the host should take care of it and just report the already
 adjusted data to the guest.  So in this version of the code the host
 would use a set period for a timer and be passed essentially a number
 of ticks of expected steal time.  The host would then use the timer to
 break out the steal time into consigned and steal buckets which would
 be reported to the guest.

 Both the consigned and the steal would be reported via /proc/stat. So
 anyone needing to see total time away could add the two fields
 together.  The user, however, when using tools like top or vmstat
 would see the usage based on what the guest is entitled to.
 
 Do you have suggestions for how I can build consensus around one of the
 two approaches?
 

Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Michael Wolf

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable but not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?
In the capped case, the time that the guest spends waiting due to it
having used its full allotment of time shows up as steal time.  The way
my patchset currently stands is that you would set up the bandwidth
control and you would have to pass it a matching value from qemu.  In
the future, it would be possible to have something parse the bandwidth
setting and automatically adjust the setting in the host used for steal
time reporting.



  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.


  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)

--
Signature



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Michael Wolf

On 11/28/2012 02:45 AM, Glauber Costa wrote:

On 11/27/2012 07:10 PM, Michael Wolf wrote:

On 11/27/2012 02:48 AM, Glauber Costa wrote:

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will
separate
the consigned time from the steal time.  The consignment limit passed
to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

If you submit this again, please include a version number in your series.

Will do.  The patchset was sent twice yesterday by mistake.  Got an
error the first time and didn't
think the patches went out.  This has been corrected.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

yes, will do that.  When I took the RFC off the patches I was looking at
it as a new patchset which was
a mistake.  I will make sure to add a changelog when I submit again.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg.

Yes, but we still get questions from users asking what is steal time?
why am I seeing this?

Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is especially true of
concepts that have been known for quite some time, like steal time. If
you all of a sudden change the meaning of this, it is sure to confuse a
lot more users than it would clarify.

Something like this could certainly be done.  But when I was submitting
the patch set as an RFC, qemu was passing a cpu percentage that would
be used by the guest kernel to adjust the steal time. This percentage
was being stored on the guest as a sysctl value.
Avi stated he didn't like that kind of coupling, and that the value
could get out of sync.  Anthony stated "The guest shouldn't need to know
its entitlement. Or at least, it's up to a management tool to report
that in a way that's meaningful for the guest."

So perhaps I misunderstood what they were suggesting, but I took it to
mean that they did not want the guest to know what the entitlement was.
That the host should take care of it and just report the already
adjusted data to the guest.  So in this version of the code the host
would use a set period for a timer and be passed essentially a number
of ticks of expected steal time.  The host would then use the timer to
break out the steal time into consigned and steal buckets which would
be reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So
anyone needing to see total time away could add the two fields
together.  The user, however, when using tools like top or vmstat
would see the usage based on what the guest is entitled to.
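
One way to picture the split described above is the following sketch. It is an illustration under stated assumptions rather than the patch set's implementation: the struct, its field names and both functions are hypothetical, the limit is taken to be a count of ticks per timer period, and steal is charged to the consigned bucket first until the limit for the current period is used up.

```c
#include <stdint.h>

/* Hypothetical per-vCPU state; none of these names come from the patches. */
struct consign_state {
	uint64_t limit;      /* expected steal (ticks) allowed per period */
	uint64_t used;       /* ticks already consigned in this period */
	uint64_t consigned;  /* running total of expected steal */
	uint64_t steal;      /* running total of unexpected steal */
};

/* Called when the period timer fires: start a fresh budget. */
static void consign_period_reset(struct consign_state *cs)
{
	cs->used = 0;
}

/*
 * Charge 'delta' ticks of time away from the CPU.  Anything that still
 * fits under the per-period limit counts as consigned; the remainder
 * is reported as traditional steal.
 */
static void consign_account(struct consign_state *cs, uint64_t delta)
{
	uint64_t room = cs->limit - cs->used;   /* used never exceeds limit */
	uint64_t to_consigned = delta < room ? delta : room;

	cs->used += to_consigned;
	cs->consigned += to_consigned;
	cs->steal += delta - to_consigned;
}
```

The sum of the two totals always equals the total time away, which matches the point that anyone who wants the combined figure can add the two fields together.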

Do you have suggestions for how I can build consensus around one of the
two approaches?


Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?
It is set up using cpu overcommit.  But the request was for something
that would work in both the overcommit environment as well as when hard
capping is being used.





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Anthony Liguori
Glauber Costa glom...@parallels.com writes:

 Hi,

 On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.

 If you submit this again, please include a version number in your series.

 It would also be helpful to include a small changelog about what changed
 between last version and this version, so we could focus on that.

 As for the rest, I answered your previous two submissions saying I don't
 agree with the concept. If you hadn't changed anything, resending it
 won't change my mind.

 I could of course, be mistaken or misguided. But I had also not seen any
 wave of support in favor of this previously, so basically I have no new
 data to make me believe I should see it any differently.

 Let's try this again:

 * Rik asked you in your last submission how does ppc handle this. You
 said, and I quote: In the case of lpar on POWER systems they simply
 report steal time and do not alter it in any way.
 They do however report how much processor is assigned to the partition
 and that information is in /proc/ppc64/lparcfg.

This only is helpful for static entitlements.

But if we allow dynamic entitlements--which is a very useful feature,
think buying an online upgrade in a cloud environment--then you need
to account for entitlement loss at the same place where you do the rest
of the accounting: in /proc/stat.

 Now, that is a *way* more sensible thing to do. Much more. Confusing
 users is something extremely subjective. This is especially true of
 concepts that have been known for quite some time, like steal time. If
 you all of a sudden change the meaning of this, it is sure to confuse a
 lot more users than it would clarify.

I'll bring you a nice bottle of scotch at the next KVM Forum if you can
find me one user that can accurately describe what steal time is.

The semantics are so incredibly subtle that I have a hard time believing
anyone actually understands what it means today.

Regards,

Anthony Liguori





 
 ---
 
 Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.
 
 
  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)
 



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Glauber Costa
On 11/28/2012 10:43 PM, Michael Wolf wrote:
 On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
 On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.
 The definition of stolen time is 'time during which the virtual CPU is
 runnable but not running'. Overcommit is the main scenario which steal
 time helps to detect.

 Can you describe the 'capped' case?
 In the capped case, the time that the guest spends waiting due to it
 having used its full allotment of time shows up as steal time.  The way
 my patchset currently stands is that you would set up the bandwidth
 control and you would have to pass it a matching value from qemu.  In
 the future, it would be possible to have something parse the bandwidth
 setting and automatically adjust the setting in the host used for steal
 time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Some time has passed since I last looked into it, but IIRC, after you
run out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.
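
A minimal user-space sketch of that bookkeeping follows, assuming run_delay is a monotonically increasing counter. The struct and function names are hypothetical stand-ins for the vcpu arch.st fields quoted above, not the actual KVM code; the point is only that steal is derived purely from how much run_delay grew between two updates.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU steal bookkeeping. */
struct steal_state {
	uint64_t last_steal;   /* run_delay seen at the previous update */
	uint64_t accum_steal;  /* growth of run_delay since that update */
};

/*
 * Mirror of the three quoted statements: take the growth of run_delay
 * since the last call, remember the new baseline, and store the delta.
 * Time that never shows up in run_delay can never show up as steal.
 */
static void accumulate_steal(struct steal_state *st, uint64_t run_delay)
{
	uint64_t delta = run_delay - st->last_steal;

	st->last_steal = run_delay;
	st->accum_steal = delta;
}
```

If throttled time is not counted in run_delay, the delta stays at zero while the task is off the runqueue, which is the case the paragraph above is worried about.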

In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:


Alter the amount of steal time reported by the guest.
Maybe this should go away.

Expand the steal time msr to also contain the consigned time.
Maybe this should go away

Add the code to send the consigned time from the host to the guest
This definitely should be heavily modified

Add a timer to allow the separation of consigned from steal time.
Maybe this should go away

Add an ioctl to communicate the consign limit to the host.
This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Glauber Costa
Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.

If you submit this again, please include a version number in your series.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how ppc handles this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is especially true about
concepts that have been known for quite some time, like steal time. If you
all of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.





 
 ---
 
 Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.
 
 
  arch/x86/include/asm/kvm_host.h       |   11 +++
  arch/x86/include/asm/kvm_para.h       |    3 +-
  arch/x86/include/asm/paravirt.h       |    4 +--
  arch/x86/include/asm/paravirt_types.h |    2 +
  arch/x86/kernel/kvm.c                 |    8 ++---
  arch/x86/kernel/paravirt.c            |    4 +--
  arch/x86/kvm/x86.c                    |   50 -
  fs/proc/stat.c                        |    9 +-
  include/linux/kernel_stat.h           |    2 +
  include/linux/kvm_host.h              |    2 +
  include/uapi/linux/kvm.h              |    2 +
  kernel/sched/core.c                   |   10 ++-
  kernel/sched/cputime.c                |   21 +-
  kernel/sched/sched.h                  |    2 +
  virt/kvm/kvm_main.c                   |    7 +
  15 files changed, 120 insertions(+), 17 deletions(-)
 



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Michael Wolf

On 11/27/2012 02:48 AM, Glauber Costa wrote:

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

If you submit this again, please include a version number in your series.

Will do.  The patchset was sent twice yesterday by mistake.  Got an
error the first time and didn't think the patches went out.  This has
been corrected.


It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

Yes, will do that.  When I took the RFC off the patches I was looking at
it as a new patchset, which was a mistake.  I will make sure to add a
changelog when I submit again.


As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how ppc handles this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Yes, but we still get questions from users asking "What is steal time?
Why am I seeing this?"


Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is especially true about
concepts that have been known for quite some time, like steal time. If you
all of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.

Something like this could certainly be done.  But when I was submitting
the patch set as an RFC, qemu was passing a cpu percentage that would be
used by the guest kernel to adjust the steal time.  This percentage was
being stored on the guest as a sysctl value.

Avi stated he didn't like that kind of coupling, and that the value
could get out of sync.  Anthony stated: "The guest shouldn't need to know
its entitlement. Or at least, it's up to a management tool to report
that in a way that's meaningful for the guest."


So perhaps I misunderstood what they were suggesting, but I took it to
mean that they did not want the guest to know what the entitlement was.
That the host should take care of it and just report the already
adjusted data to the guest.  So in this version of the code the host
would use a set period for a timer and be passed essentially a number of
ticks of expected steal time.  The host would then use the timer to break
out the steal time into consigned and steal buckets which would be
reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So
anyone needing to see total time away could add the two fields together.
The user, however, when using tools like top or vmstat would see the
usage based on what the guest is entitled to.
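
The split described above might look roughly like the following sketch. It is not the code from the patch set: the struct, field, and function names are hypothetical, and it assumes that within each timer period steal time fills the consigned bucket up to the limit and anything beyond that is counted as traditional steal.

```c
#include <stdint.h>

/* Hypothetical per-vcpu state for splitting steal into two buckets. */
struct consign_state {
	uint64_t limit;		/* expected steal (ticks) per period */
	uint64_t used;		/* consigned ticks charged in this period */
	uint64_t consigned;	/* running total of consigned time */
	uint64_t steal;		/* running total of traditional steal time */
};

/* Called when the period timer fires: start a fresh allowance. */
void consign_period_reset(struct consign_state *s)
{
	s->used = 0;
}

/* Charge a steal delta: up to the remaining allowance goes to consigned. */
void consign_account(struct consign_state *s, uint64_t delta)
{
	uint64_t room = s->limit - s->used;	/* used never exceeds limit */
	uint64_t c = delta < room ? delta : room;

	s->used += c;
	s->consigned += c;
	s->steal += delta - c;
}
```

Adding the two totals always gives back the full time away, which matches the point about summing the two /proc/stat fields.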

Do you have suggestions for how I can build consensus around one of the 
two approaches?










Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Marcelo Tosatti
On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
 


Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Marcelo Tosatti
On Tue, Nov 27, 2012 at 09:24:42PM -0200, Marcelo Tosatti wrote:
 On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
  In the case of where you have a system that is running in a
  capped or overcommitted environment the user may see steal time
  being reported in accounting tools such as top or vmstat.
 
 The definition of stolen time is 'time during which the virtual CPU is
 runnable to not running'. Overcommit is the main scenario which steal
 time helps to detect.

Meant 'runnable but not running'.


 Can you describe the 'capped' case?
