from:"Michael Wolf"

Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-03-05 Thread Michael Wolf

Sorry for the delay in the response.  I did not see your question.

On Mon, 2013-02-18 at 20:57 -0300, Marcelo Tosatti wrote:
> On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:
> > Add a helper routine to scheduler/core.c to allow the kvm module
> > to retrieve the cpu hardlimit settings.  The values will be used
> > to set up a timer that is used to separate the consigned from the
> > steal time.
> 
> 1) Can you please describe, in english, the mechanics of subtracting cpu
> hardlimit values from steal time reported via run_delay supposed to
> work?
> 
> "The period and the quota used to separate the consigned time 
> (expected steal) from the steal time are taken
> from the cfs bandwidth control settings. Any other steal time
> accruing during that period will show as the traditional steal time."
> 
> There is no "expected steal time" over a fixed period of real time.
There is expected steal time in the sense that the administrator of the
system sets up guests on the host so that there will be cpu
overcommitment.  The end user who is using the guest does not know this,
they only know they have been guaranteed a certain level of performance.
So if steal time shows up the end user typically thinks they are not
getting their guaranteed performance. So this patchset is meant to allow
top to show 100% utilization and ONLY show steal time if it is over the
level of steal time that the host administrator setup.  So take a simple
example of a host with 1 cpu and two guest on it.  If each guest is
fully utilized a user will see 50% utilization and 50% steal in either
of the guests.  In this case the amount of steal time that the host 
administrator would expect to see is 50%.  As long as the steal in the
guest does not exceed 50% the guest is running as expected.  If for some
reason the steal increases to 60%, now something is wrong and the steal
time needs to be reported and the end user will make inquiries?

> 
> 2) From the description of patch 1: "In the case of where you have
> a system that is running in a capped or overcommitted environment 
> the user may see steal time being reported in accounting tools 
> such as top or vmstat." 
> 
> This is outdated, right? Because overcommitted environment is exactly
> what steal time should report.

I hope I'm not missing your point here.  But again this comes down to
the point of view.  The end user is guaranteed a capability/level of
performance that may not be a whole cpu.  So only show steal time if the
amount of steal time exceeds what the host admin expected when the guest
was set up.
> 
> 
> Thanks

thanks
Mike Wolf

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-05 Thread Michael Wolf

Sorry for the delay in the response.  I did not see the email
right away.

On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > 2013/2/5 Michael Wolf :
> > > In the case of where you have a system that is running in a
> > > capped or overcommitted environment the user may see steal time
> > > being reported in accounting tools such as top or vmstat.  This can
> > > cause confusion for the end user.
> > 
> > Sorry, I'm no expert in this area. But I don't really understand what
> > is confusing for the end user here.
> 
> I suppose that what is wanted is to subtract stolen time due to 'known
> reasons' from steal time reporting. 'Known reasons' being, for example,
> hard caps. So a vcpu executing instructions with no halt, but limited to
> 80% of available bandwidth, would not have 20% of stolen time reported.

Yes exactly and the end user many times did not set up the guest and is
not aware of the capping.  The end user is only aware of the performance
level that they were told they would get with the guest.  

> 
> But yes, a description of the scenario that is being dealt with, with
> details, is important.

I will add more detail to the description next time I submit the
patches.  How about something like,"In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it.  The end user is only aware of a level of
performance that they should see."   or does that just muddy the picture
more??
> 
> > > To ease the confusion this patch set
> > > adds the idea of consigned (expected steal) time.  The host will separate
> > > the consigned time from the steal time.  Tthe steal time will only be 
> > > altered
> > > if hard limits (cfs bandwidth control) is used.  The period and the quota 
> > > used
> > > to separate the consigned time (expected steal) from the steal time are 
> > > taken
> > > from the cfs bandwidth control settings. Any other steal time accruing 
> > > during
> > > that period will show as the traditional steal time.
> > 
> > I'm also a bit confused here. steal time will then only account the
> > cpu time lost due to quotas from cfs bandwidth control? Also what do
> > you exactly mean by "expected steal time"? Is it steal time due to
> > overcommitting minus scheduler quotas?
> > 
> > Thanks.
> 

Thanks
Mike Wolf

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Michael Wolf

On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > Sorry for the delay in the response.  I did not see the email
> > right away.
> > 
> > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > 2013/2/5 Michael Wolf :
> > > > > In the case of where you have a system that is running in a
> > > > > capped or overcommitted environment the user may see steal time
> > > > > being reported in accounting tools such as top or vmstat.  This can
> > > > > cause confusion for the end user.
> > > > 
> > > > Sorry, I'm no expert in this area. But I don't really understand what
> > > > is confusing for the end user here.
> > > 
> > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > reasons' from steal time reporting. 'Known reasons' being, for example,
> > > hard caps. So a vcpu executing instructions with no halt, but limited to
> > > 80% of available bandwidth, would not have 20% of stolen time reported.
> > 
> > Yes exactly and the end user many times did not set up the guest and is
> > not aware of the capping.  The end user is only aware of the performance
> > level that they were told they would get with the guest.  
> > > But yes, a description of the scenario that is being dealt with, with
> > > details, is important.
> > 
> > I will add more detail to the description next time I submit the
> > patches.  How about something like,"In a cloud environment the user of a
> > kvm guest is not aware of the underlying hardware or how many other
> > guests are running on it.  The end user is only aware of a level of
> > performance that they should see."   or does that just muddy the picture
> > more??
> 
> So the feature aims for is to report stolen time relative to hard
> capping. That is: stolen time should be counted as time stolen from
> the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
> 
> Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest?  If I don't do that
then I'm not reporting all of the time and changing /proc/stat in a
bigger way than adding another catagory.  Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Michael Wolf

On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote:
> On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
> > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> >> Sorry for the delay in the response.  I did not see the email
> >> right away.
> >>
> >> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> >>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> >>>> 2013/2/5 Michael Wolf :
> >>>>> In the case of where you have a system that is running in a
> >>>>> capped or overcommitted environment the user may see steal time
> >>>>> being reported in accounting tools such as top or vmstat.  This can
> >>>>> cause confusion for the end user.
> >>>>
> >>>> Sorry, I'm no expert in this area. But I don't really understand what
> >>>> is confusing for the end user here.
> >>>
> >>> I suppose that what is wanted is to subtract stolen time due to 'known
> >>> reasons' from steal time reporting. 'Known reasons' being, for example,
> >>> hard caps. So a vcpu executing instructions with no halt, but limited to
> >>> 80% of available bandwidth, would not have 20% of stolen time reported.
> >>
> >> Yes exactly and the end user many times did not set up the guest and is
> >> not aware of the capping.  The end user is only aware of the performance
> >> level that they were told they would get with the guest.  
> >>> But yes, a description of the scenario that is being dealt with, with
> >>> details, is important.
> >>
> >> I will add more detail to the description next time I submit the
> >> patches.  How about something like,"In a cloud environment the user of a
> >> kvm guest is not aware of the underlying hardware or how many other
> >> guests are running on it.  The end user is only aware of a level of
> >> performance that they should see."   or does that just muddy the picture
> >> more??
> > 
> > So the feature aims for is to report stolen time relative to hard
> > capping. That is: stolen time should be counted as time stolen from
> > the guest _beyond_ hard capping. Yes?
> > 
> > Probably don't need to report new data to the guest for that.
> > 
> If we take into account that 1 second always have one second, I believe
> that you can just subtract the consigned time from the steal time the
> host passes to the guest.
> 
> During each second, the numbers won't sum up to 100. The delta to 100 is
> the consigned time, if anyone cares.
> 
> Adopting this would simplify this a lot. All you need to do, really, is
> to get your calculation right from the bandwidth given by the cpu
> controller. Subtract it in the host, and voila.

I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time.  I agree it would make things simpler.
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Michael Wolf

On Wed, 2013-03-06 at 14:34 +0100, Frederic Weisbecker wrote:
> 2013/3/5 Michael Wolf :
> > Sorry for the delay in the response.  I did not see the email
> > right away.
> >
> > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> >> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> >> > 2013/2/5 Michael Wolf :
> >> > > In the case of where you have a system that is running in a
> >> > > capped or overcommitted environment the user may see steal time
> >> > > being reported in accounting tools such as top or vmstat.  This can
> >> > > cause confusion for the end user.
> >> >
> >> > Sorry, I'm no expert in this area. But I don't really understand what
> >> > is confusing for the end user here.
> >>
> >> I suppose that what is wanted is to subtract stolen time due to 'known
> >> reasons' from steal time reporting. 'Known reasons' being, for example,
> >> hard caps. So a vcpu executing instructions with no halt, but limited to
> >> 80% of available bandwidth, would not have 20% of stolen time reported.
> >
> > Yes exactly and the end user many times did not set up the guest and is
> > not aware of the capping.  The end user is only aware of the performance
> > level that they were told they would get with the guest.
> >
> >>
> >> But yes, a description of the scenario that is being dealt with, with
> >> details, is important.
> >
> > I will add more detail to the description next time I submit the
> > patches.  How about something like,"In a cloud environment the user of a
> > kvm guest is not aware of the underlying hardware or how many other
> > guests are running on it.  The end user is only aware of a level of
> > performance that they should see."   or does that just muddy the picture
> > more??
> 
> That alone is probably not enough. But yeah, make sure you clearly
> state the difference between expected (caps, sched bandwidth...) and
> unexpected (overcommitting, competing load...) stolen time. Then add a
> practical example as you made above that explains why it matters to
> make that distinction and why you want to report it.
> 

Ok, I understand what you are requesting.  I will make sure to add it to
the description the next time I submit the patches.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-07 Thread Michael Wolf

On Thu, 2013-03-07 at 14:11 +1100, Paul Mackerras wrote:
> On Wed, Mar 06, 2013 at 09:52:16PM -0300, Marcelo Tosatti wrote:
> > On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
> > > I looked at doing that once but was told that I was changing the
> > > interface in an unacceptable way, because now I was not reporting all of
> > > the elapsed time.  I agree it would make things simpler.
> > 
> > Pointer to that claim, please?
> 
> Back in about 2004 or 2005 or so I was looking at changing how user
> and system times were calculated (in the context of trying to find a
> better way to report resources used by a thread in an SMT processor).
> I found that utilities such as top expected the deltas in the
> /proc/stat numbers to add up to elapsed time, and would report strange
> and inconsistent results if that wasn't the case.  Unfortunately at
> this distance I don't recall the exact details.  I don't know whether
> the expectation that the deltas in the /proc/stat numbers over a
> period of time add up to the elapsed real time is documented anywhere,
> but I wouldn't be at all surprised if some programs depend on it, so
> it's better to maintain that property.

I will have to look at this again.  When looking at the cpu data where
steal time is reported there isn't a problem today.  I will have to run
it and see if there is anything incorrect with the time being reported
for the individual processes.

My real concern here was that in changing the /proc/stat interface am I
going to mess private tools that look at that information.  When I've
looked at vmstat and top they report the cpu information fine, but I may
end up creating problems for home grown scripts and tools.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-07 Thread Michael Wolf

On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote:
> On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
> > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > > > Sorry for the delay in the response.  I did not see the email
> > > > right away.
> > > > 
> > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > > > 2013/2/5 Michael Wolf :
> > > > > > > In the case of where you have a system that is running in a
> > > > > > > capped or overcommitted environment the user may see steal time
> > > > > > > being reported in accounting tools such as top or vmstat.  This 
> > > > > > > can
> > > > > > > cause confusion for the end user.
> > > > > > 
> > > > > > Sorry, I'm no expert in this area. But I don't really understand 
> > > > > > what
> > > > > > is confusing for the end user here.
> > > > > 
> > > > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > > > reasons' from steal time reporting. 'Known reasons' being, for 
> > > > > example,
> > > > > hard caps. So a vcpu executing instructions with no halt, but limited 
> > > > > to
> > > > > 80% of available bandwidth, would not have 20% of stolen time 
> > > > > reported.
> > > > 
> > > > Yes exactly and the end user many times did not set up the guest and is
> > > > not aware of the capping.  The end user is only aware of the performance
> > > > level that they were told they would get with the guest.  
> > > > > But yes, a description of the scenario that is being dealt with, with
> > > > > details, is important.
> > > > 
> > > > I will add more detail to the description next time I submit the
> > > > patches.  How about something like,"In a cloud environment the user of a
> > > > kvm guest is not aware of the underlying hardware or how many other
> > > > guests are running on it.  The end user is only aware of a level of
> > > > performance that they should see."   or does that just muddy the picture
> > > > more??
> > > 
> > > So the feature aims for is to report stolen time relative to hard
> > > capping. That is: stolen time should be counted as time stolen from
> > > the guest _beyond_ hard capping. Yes?
> > Yes, that is the goal.
> > > 
> > > Probably don't need to report new data to the guest for that.
> > Not sure I understand what you are saying here. Do you mean that I don't
> > need to report the expected steal from the guest?  If I don't do that
> > then I'm not reporting all of the time and changing /proc/stat in a
> > bigger way than adding another catagory.  Also I thought I would need to
> > provide the consigned time and the steal time for debugging purposes.
> > Maybe I'm missing your point.
> 
> OK so the usefulness of steal time comes from the ability to measure 
> CPU cycles that the guest is being deprived of, relative to some unit
> (implicitly the CPU frequency presented to the VM). That way, it becomes
> easier to properly allocate resources.
> 
> From top man page:
> st : time stolen from this vm by the hypervisor
> 
> Not only its a problem for the lender, it is also confusing for the user
> (who has to subtract from the reported value himself), the hardcapping 
> from reported steal time.
> 
> 
> The problem with the algorithm in the patchset is the following
> (practical example):
> 
> - Hard capping set to 80% of available CPU.
> - vcpu does not exceed its threshold, say workload with 40%
> CPU utilization.
> - Under this scenario it is possible for vcpu to be deprived
> of cycles (because out of the 40% that workload uses, only 30% of
> actual CPU time are being provided).
> - The algorithm in this patchset will not report any stolen time
> because it assumes 20% of stolen time reported via 'run_delay'
> is fixed at all times (which is false), therefore any valid 
> stolen time below 20% will not be reported.
> 
> Makes sense?

I understand the scenerio.  I will have to go back and look at the 
CFS bandwidth code and run some tests. The question I have to look at is
how is everything reported in your

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-07 Thread Michael Wolf

On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote:
> On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
> > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > > > Sorry for the delay in the response.  I did not see the email
> > > > right away.
> > > > 
> > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > > > 2013/2/5 Michael Wolf :
> > > > > > > In the case of where you have a system that is running in a
> > > > > > > capped or overcommitted environment the user may see steal time
> > > > > > > being reported in accounting tools such as top or vmstat.  This 
> > > > > > > can
> > > > > > > cause confusion for the end user.
> > > > > > 
> > > > > > Sorry, I'm no expert in this area. But I don't really understand 
> > > > > > what
> > > > > > is confusing for the end user here.
> > > > > 
> > > > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > > > reasons' from steal time reporting. 'Known reasons' being, for 
> > > > > example,
> > > > > hard caps. So a vcpu executing instructions with no halt, but limited 
> > > > > to
> > > > > 80% of available bandwidth, would not have 20% of stolen time 
> > > > > reported.
> > > > 
> > > > Yes exactly and the end user many times did not set up the guest and is
> > > > not aware of the capping.  The end user is only aware of the performance
> > > > level that they were told they would get with the guest.  
> > > > > But yes, a description of the scenario that is being dealt with, with
> > > > > details, is important.
> > > > 
> > > > I will add more detail to the description next time I submit the
> > > > patches.  How about something like,"In a cloud environment the user of a
> > > > kvm guest is not aware of the underlying hardware or how many other
> > > > guests are running on it.  The end user is only aware of a level of
> > > > performance that they should see."   or does that just muddy the picture
> > > > more??
> > > 
> > > So the feature aims for is to report stolen time relative to hard
> > > capping. That is: stolen time should be counted as time stolen from
> > > the guest _beyond_ hard capping. Yes?
> > Yes, that is the goal.
> > > 
> > > Probably don't need to report new data to the guest for that.
> > Not sure I understand what you are saying here. Do you mean that I don't
> > need to report the expected steal from the guest?  If I don't do that
> > then I'm not reporting all of the time and changing /proc/stat in a
> > bigger way than adding another catagory.  Also I thought I would need to
> > provide the consigned time and the steal time for debugging purposes.
> > Maybe I'm missing your point.
> 
> OK so the usefulness of steal time comes from the ability to measure 
> CPU cycles that the guest is being deprived of, relative to some unit
> (implicitly the CPU frequency presented to the VM). That way, it becomes
> easier to properly allocate resources.
> 
> From top man page:
> st : time stolen from this vm by the hypervisor
> 
> Not only its a problem for the lender, it is also confusing for the user
> (who has to subtract from the reported value himself), the hardcapping 
> from reported steal time.
> 
> 
> The problem with the algorithm in the patchset is the following
> (practical example):
> 
> - Hard capping set to 80% of available CPU.
> - vcpu does not exceed its threshold, say workload with 40%
> CPU utilization.
> - Under this scenario it is possible for vcpu to be deprived
> of cycles (because out of the 40% that workload uses, only 30% of
> actual CPU time are being provided).
> - The algorithm in this patchset will not report any stolen time
> because it assumes 20% of stolen time reported via 'run_delay'
> is fixed at all times (which is false), therefore any valid 
> stolen time below 20% will not be reported.
> 
> Makes sense?
> 
> Not sure what the concrete way to report stolen time relative to hard
> capping is (probably easier inside the scheduler, where run_delay is
> calculated).
> 
> Reporting the hard capping to the guest is a good idea (which saves the
> user from having to measure it themselves), but better done separately
> via new field.

didnt respond to this in the previous response.  I'm not sure I'm
following you here.  I thought this is what I was doing by having a
consigned (expected steal) field add to the /proc/stat output.  Are you
looking for something else or a better naming convention?

> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-07 Thread Michael Wolf

On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote:
> On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> > > 
> > > Makes sense?
> > > 
> > > Not sure what the concrete way to report stolen time relative to hard
> > > capping is (probably easier inside the scheduler, where run_delay is
> > > calculated).
> > > 
> > > Reporting the hard capping to the guest is a good idea (which saves the
> > > user from having to measure it themselves), but better done separately
> > > via new field.
> > 
> > didnt respond to this in the previous response.  I'm not sure I'm
> > following you here.  I thought this is what I was doing by having a
> > consigned (expected steal) field add to the /proc/stat output.  Are you
> > looking for something else or a better naming convention?
> 
> Expected steal is not a good measure to use (because as mentioned in the
> previous email there is no expected steal over a fixed period of time).
> 
> It is fine to report 'maximum percentage of underlying physical CPU'
> (what percentage of the physical CPU time guest VM is allowed to make
> use of).
> 
> And then steal time is relative to maximum percentage of underlying
> physical CPU time allowed.
> 

So last August I had sent out an RFC set of patches to do this.  That
patchset was meant to handle the general overcommit case as well as the
capping case by having qemu pass a percentage to the host that would
then be passed onto the guest and used to adjust the steal time.
Here is the link to the discussion
http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html

As you will see there Avi didn't like the idea of a percentage down in
the guest, among other reasons he was concerned about migration.  Also
in the email thread you will see that Anthony Liguori was opposed to the
idea of just changing the steal time, he wanted it split out.

What Glauber has suggested and I am working on implementing is taking
out the timer and adding a last read field in the host.  So in the host
I can determine the total time that has passed and compute a percentage
and apply that percentage to the steal time while the info is still on
the host.  Then pass the steal and consigned time to the guest.

Does that address your concerns?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-12-07 Thread Michael Wolf


On 12/05/2012 06:46 AM, Glauber Costa wrote:

I am deeply sorry.

I was busy first time I read this, so I postponed answering and ended up
forgetting.

Sorry

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.

So I looked at something like this in the past.  To make sure things
haven't changed
I set up a cgroup on my test server running a kernel built from the
latest tip tree.

[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a
script that will hog the cpu I see the
following in top on the guest
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si,
49.8%st

So the steal time here is in line with the bandwidth control settings.

Ok. So I was wrong in my hunch that it would be outside the runqueue,
therefore work automatically. Still, the host kernel has all the
information in cgroups.


So then the steal time did not show on the guest.  You have no value
that needs to be passed
around.  What I did not like about this approach was
* only works for cfs bandwidth control.  If another type of hard limit
was added to the kernel
the code would potentially need to change.

This is true for almost everything we have in the kernel!
It is *very* unlikely for other bandwidth control mechanism to ever
appear. If it ever does, it's *their* burden to make sure it works for
steal time (provided it is merged). Code in tree gets precedence.


Ok,  I will work on a patch that uses the cgroup information for 
bandwidth control

to separate out the time.




* This approach doesn't help if the limits are set by overcommitting the
cpus.  It is my understanding
that this is a common approach.


I can't say anything about commonality, but common or not, it is a
*crazy* approach.

When you simply overcommit, you have no way to differentiate between
intended steal time and non-intended steal time. Moreover, when you
overcommit, your cpu usage will vary over time. If two guests use the
cpu to their full power, you will have 50 % each. But if one of them
slows down, the other gets more. What is your entitlement value? How do
you define this?

And then after you define it, you end up using more than this, what is
your cpu usage? 130 %?


yes exactly you would ideally show a boosted amount of cpu.  However to 
do that
you would need to either create a new tool or modify the current 
accounting tools

such as top.

My understanding is that you are not capping in this case as much as you 
are

guaranteeing a minimum level of performance.




The only sane way to do it, is to communicate this value to the kernel
somehow. The bandwidth controller is the interface we have for that. So
everybody that wants to *intentionally* overcommit needs to communicate
this to the controller. IOW: Any sane configuration should be explicit
about your capping.


 Add an ioctl to communicate the consign limit to the host.

This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.

I'm not understanding that comment.  If you are capping by simply
controlling the amount of
overcommit on the host then wouldn't you still need some value to
indicate the desired amount.

No, that is just crazy, and I don't like it a single bit.

So in the light of it: Whatever capping mechanism we have, we need to be
explicit about the expected entitlement. At this point, the kernel
already knows what it is, and needs no extra ioctls or anything like that.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] Expand the steal time msr to also contain the consigned time.

2013-02-05 Thread Michael Wolf

Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index fe75a28..89e5468 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
-   steal = src->steal;
+   *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..efc2652 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+   paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
steal -= rq->prev_steal_time_rq;
 
if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 825a956..0b4f1ec 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(¶virt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] Add the code to send the consigned time from the host to the guest

2013-02-05 Thread Michael Wolf

Change the paravirt calls that retrieve the steal-time information
from the host.  Add to it getting the consigned value as well as
the steal time.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h  |1 +
 arch/x86/include/asm/paravirt.h  |4 ++--
 arch/x86/include/uapi/asm/kvm_para.h |3 ++-
 arch/x86/kernel/kvm.c|3 ++-
 arch/x86/kernel/paravirt.c   |4 ++--
 arch/x86/kvm/x86.c   |2 ++
 include/linux/kernel_stat.h  |1 +
 kernel/sched/cputime.c   |   21 +++--
 kernel/sched/sched.h |2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..fe5a37b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -428,6 +428,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9b753ea..77f05e7 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline void paravirt_steal_clock(int cpu, u64 *steal)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 89e5468..fb52f8a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+   *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   return 0;
+   *steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..51b63d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+   vcpu->arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e352052..f58ed0f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0b4f1ec..2a2d4be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_

[PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-05 Thread Michael Wolf

Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings.  The values will be used
to set up a timer that is used to separate the consigned from the
steal time.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |9 ++
 arch/x86/kvm/x86.c  |   62 ++-
 kernel/sched/core.c |   20 +
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fe5a37b..9518613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,6 +355,15 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   s64 consigned_quota;
+   s64 consigned_period;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51b63d1..79d144d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
-   vcpu->arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+   vcpu->arch.current_consigned += delta;
+   if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+   steal_delta = vcpu->arch.current_consigned
+   -  vcpu->arch.consigned_quota;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu->arch.st.accum_steal = steal_delta;
+   vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
!(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
 }
 
+extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota,
+   s64 *period);
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   struct kvm *kvm;
+   int num_vcpus;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   kvm = vcpu->kvm;
+   num_vcpus = atomic_read(&kvm->online_vcpus);
+   sched_use_hard_capping(vcpu->cpu, num_vcpus,
+   &vcpu->arch.consigned_quota,
+   &vcpu->arch.consigned_period);
+   vcpu->arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(&vcpu->arch.steal_timer, now,
+   ktime_set(0, vcpu->arch.consigned_period));
+
+   return HRTIMER_RESTART;
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   struct kvm *kvm;
+   int num_vcpus;
+   ktime_t ktime;
+
/* Address WBINVD may be executed by guest */
if (need_emulate_wbinvd(vcpu)) {
if (kvm_x86_ops->has_wbinvd_exit())
@@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
}
+   /* Initialize and start a timer to capture steal and consigned time */
+   kvm = vcpu->kvm;
+   num_vcpus = atomic_read(&kvm->online_vcpus);
+   num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus;
+   sched_use_hard_capping(vcpu->cpu, num_vcpus,
+   &vcpu->arch.consigned_quota,
+   &vcpu->arch.consigned_period);
+   hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu->arch.steal_timer.function = &steal_timer_fn;
+   ktime = ktime_set(0, vcpu->arch.consigned_period);
+   hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
accumulate_steal_time(vcpu);
kvm_make_request(KV

[PATCH 0/4] Alter steal-time reporting in the guest

2013-02-05 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  Tthe steal time will only be altered
if hard limits (cfs bandwidth control) is used.  The period and the quota used
to separate the consigned time (expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time accruing during
that period will show as the traditional steal time.

Changes from V2:
* Dropped the ioctl that allowed qemu to send the entitlement value to
  the guest.
* Added code to get the entitlement period and quota from cfs bandwidth.

Changes from V1:
* Removed the steal time allowed percentage from the guest
* Moved the separation of consigned (expected steal) and steal time to the
  host.
* No longer include a sysctl interface.

---

Michael Wolf (4):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.


 arch/x86/include/asm/kvm_host.h   |   10 +
 arch/x86/include/asm/paravirt.h   |4 +-
 arch/x86/include/asm/paravirt_types.h |2 +
 arch/x86/include/uapi/asm/kvm_para.h  |3 +-
 arch/x86/kernel/kvm.c |8 ++--
 arch/x86/kernel/paravirt.c|4 +-
 arch/x86/kvm/x86.c|   64 -
 fs/proc/stat.c|9 -
 include/linux/kernel_stat.h   |2 +
 kernel/sched/core.c   |   30 +++
 kernel/sched/cputime.c|   21 ++-
 kernel/sched/sched.h  |2 +
 12 files changed, 142 insertions(+), 17 deletions(-)

-- 
Signature

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] Alter the amount of steal time reported by the guest.

2013-02-05 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu hard capping.

Signed-off-by: Michael Wolf 
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..e352052 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-06 Thread Michael Wolf


On 02/06/2013 08:36 AM, Glauber Costa wrote:

On 02/06/2013 01:49 AM, Michael Wolf wrote:

Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings.  The values will be used
to set up a timer that is used to separate the consigned from the
steal time.

Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.
I may be missing something, but how do we know how much time has 
passed?  That is why
I had the timer in there.  I will go look again at the code but I 
thought the data was collected
as ticks and passed at random times.  The ticks are also accumulating so 
we are looking at the

difference in the count between reads.



Everything bigger the entitlement is steal time.
I agree provided I know the amount of total time that the steal time was 
accumulated.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] Expand the steal time msr to also contain the consigned time.

2013-02-07 Thread Michael Wolf


On 02/06/2013 03:14 PM, Rik van Riel wrote:

On 02/05/2013 04:49 PM, Michael Wolf wrote:

Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf 
---
  arch/x86/include/asm/paravirt.h   |4 ++--
  arch/x86/include/asm/paravirt_types.h |2 +-
  arch/x86/kernel/kvm.c |7 ++-
  kernel/sched/core.c   |   10 +-
  kernel/sched/cputime.c|2 +-
  5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h 
b/arch/x86/include/asm/paravirt.h

index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
  extern struct static_key paravirt_steal_enabled;
  extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
  {
-return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
  }


This may be a stupid question, but what happens if a KVM
guest with this change, runs on a kernel that still has
the old steal time interface?

What happens if the host has the new steal time interface,
but the guest uses the old interface?

Will both cases continue to work as expected with your
patch series?

If so, could you document (in the source code) why things
continue to work?


I will test the scenarios you suggest and will report back the results.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] Add the code to send the consigned time from the host to the guest

2013-02-07 Thread Michael Wolf


On 02/06/2013 03:18 PM, Rik van Riel wrote:

On 02/05/2013 04:49 PM, Michael Wolf wrote:

Change the paravirt calls that retrieve the steal-time information
from the host.  Add to it getting the consigned value as well as
the steal time.

Signed-off-by: Michael Wolf 


diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h

index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@

  struct kvm_steal_time {
  __u64 steal;
+__u64 consigned;
  __u32 version;
  __u32 flags;
-__u32 pad[12];
+__u32 pad[10];
  };


The function kvm_register_steal_time passes the address of such
a structure to the host kernel, which then does something with
it.

Could running a guest with the above patch, on top of a host
with the old code, result in the values for "version" and
"flags" being written into "consigned"?

yes, good point.


Could that result in confusing the guest kernel to no end,
and generally breaking things?


Ok I will move the consigned field to be after the flags.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-07 Thread Michael Wolf


On 02/07/2013 02:46 AM, Glauber Costa wrote:

On 02/06/2013 10:07 PM, Michael Wolf wrote:

On 02/06/2013 08:36 AM, Glauber Costa wrote:

On 02/06/2013 01:49 AM, Michael Wolf wrote:

Add a helper routine to scheduler/core.c to allow the kvm module
to retrieve the cpu hardlimit settings.  The values will be used
to set up a timer that is used to separate the consigned from the
steal time.

Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.

I may be missing something, but how do we know how much time has
passed?  That is why
I had the timer in there.  I will go look again at the code but I
thought the data was collected
as ticks and passed at random times.  The ticks are also accumulating so
we are looking at the
difference in the count between reads.

They can be collected at random times, but you can of course record the
time in which it happened.


ok.  Let me add a previous_read field and take out the timer.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-23 Thread Michael Wolf

This is an RFC regarding the reporting of stealtime.  In the case of
where you have a system that is running with partial processors such as
KVM the user may see steal time being reported in accounting tools such
as top or vmstat.  This can cause confusion for the end user.  To
ease the confusion this patch set adds a sysctl interface to set the
cpu entitlement.  This is the percentage of cpu that the guest system is
 expected to receive.  As long as the steal time is within its expected
range it will show up as 0 in /proc/stat.  The user will then see in the
accounting tools that they are getting a full utilization of the cpu
resources assigned to them.

This patchset is changing the contents/output of /proc/stat and could affect 
user tools.  However the default setting is that the cpu is entitled to 100% 
so the code will act as before.  Also another field could be added to the 
/proc/stat output and show the unaltered steal time. Since this additional 
field could cause more confusion than it would clear up I have left it out 
for now.

Michael Wolf (3):
  Add a sysctl interface to control the cpu entitlement setting.
  Add a hypercall to retrieve the cpu entitlement value from the host.
  Modify the amount of stealtime that the kernel reports via the /proc

---
Documentation/sysctl/kernel.txt |   14 +++
 arch/x86/kvm/x86.c  |8 
 fs/proc/stat.c  |   80 ++-
 include/linux/kernel_stat.h |2 +
 include/linux/kvm.h |3 ++
 include/linux/kvm_host.h|3 ++
 include/linux/kvm_para.h|1 +
 kernel/sysctl.c |   10 +
 virt/kvm/kvm_main.c |7 
 9 files changed, 126 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC 1/3] Add a sysctl interface to control and report the cpu entitlement setting.

2012-08-23 Thread Michael Wolf

This setting will later be used to compute an expected steal time.

Signed-off-by: Michael Wolf 
---
 Documentation/sysctl/kernel.txt |   14 ++
 fs/proc/stat.c  |1 +
 kernel/sysctl.c |   10 ++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 6d78841..0f617dc 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -28,6 +28,7 @@ show up in /proc/sys/kernel:
 - core_pattern
 - core_pipe_limit
 - core_uses_pid
+- cpu_entitlement
 - ctrl-alt-del
 - dmesg_restrict
 - domainname
@@ -226,6 +227,19 @@ the filename.
 
 ==
 
+cpu_entitlement:
+
+The cpu_entitlement is the percentage of cpu utilization that
+the system expects to receive.  By default this is set to 100,
+in a guest system this could be set to a value between 0 and 100.
+This value is used to adjust the amount of steal time that
+process accounting code in the guest will display. The end effect
+will be is that steal time will only be reported if the
+percentage of steal time is greater than 100 - cpu_entitlement
+value.
+
+==
+
 ctrl-alt-del:
 
 When the value in this file is 0, ctrl-alt-del is trapped and
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 64c3b31..14e26c8 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 
+int cpu_entitlement = 100;
 #ifndef arch_irq_stat_cpu
 #define arch_irq_stat_cpu(cpu) 0
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..85efbc2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -109,6 +109,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int latencytop_enabled;
 extern int sysctl_nr_open_min, sysctl_nr_open_max;
+extern int cpu_entitlement;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
@@ -673,6 +674,15 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "cpu_entitlement",
+   .data   = &cpu_entitlement,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &one_hundred,
+   },
 #if defined CONFIG_PRINTK
{
.procname   = "printk",

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC 3/3] Modify the amount of stealtime that the kernel reports via the /proc interface.

2012-08-23 Thread Michael Wolf

Stealtime will be adjusted based on the cpu entitlement setting.  The user
will supply the cpu_entitlement which is the percentage of cpu the guest can
expect to receive.  The expected steal time is based on the expected steal
percentage which is 100 - cpu_entitlement.  If steal_time is less than the
expected steal time that is reported steal_time is changed to 0 no other fields
are changed.  If the steal_time is greater than the expected_steal then the
difference is reported.  By default the cpu_entitlement will be 100% and the
steal time will be reported without any modification.

Signed-off-by: Michael Wolf 
---
 fs/proc/stat.c  |   70 ++-
 include/linux/kernel_stat.h |2 +
 2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index cf5..efbaa03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -73,6 +73,68 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
+/*
+ * This function will alter the steal time value that is written out
+ * to /proc/stat.  The cpu_entitlement is set by the user/admin and is
+ * meant to reflect the percentage of the processor that is expected to
+ * be used.  So as long as the amount of steal time is less than the
+ * expected steal time (based on cpu_entitlement) then report steal time
+ * as zero.
+ */
+static void kstat_adjust_steal_time(int currcpu)
+{
+   int j;
+   u64 cpustat_delta[NR_STATS];
+   u64 total_elapsed_time;
+   int expected_steal_pct;
+   u64 expected_steal;
+   u64 *currstat, *prevstat;
+
+   /*
+* if cpu_entitlement = 100% then the expected steal time is 0
+* so we don't need to do any adjustments to the fields.
+*/
+   if (cpu_entitlement == 100) {
+   kcpustat_cpu(currcpu).cpustat[CPUTIME_ADJ_STEAL] =
+   kcpustat_cpu(currcpu).cpustat[CPUTIME_STEAL];
+   return;
+   }
+   /*
+* For the user it is more intuitive to think in terms of
+* cpu entitlement.  To do the calculations it is easier to
+* think in terms of allowed steal time.  So convert the percentage
+* from cpu_entitlement to expected_steal_percent.
+*/
+   expected_steal_pct = 100 - cpu_entitlement;
+
+   total_elapsed_time = 0;
+   /* determine the total time elapsed between calls */
+   currstat = kcpustat_cpu(currcpu).cpustat;
+   prevstat = kcpustat_cpu(currcpu).prev_cpustat;
+   for (j = CPUTIME_USER; j < CPUTIME_GUEST; j++) {
+   cpustat_delta[j] = currstat[j] - prevstat[j];
+   prevstat[j] = currstat[j];
+   total_elapsed_time = total_elapsed_time + cpustat_delta[j];
+   }
+
+   /*
+* calculate the amount of expected steal time.  Add 5 as a
+* rounding factor.
+*/
+
+   expected_steal = (total_elapsed_time * expected_steal_pct + 5) / 100;
+   if (cpustat_delta[CPUTIME_STEAL] < expected_steal)
+   cpustat_delta[CPUTIME_STEAL] = 0;
+   else
+   cpustat_delta[CPUTIME_STEAL] -= expected_steal;
+
+   /* Adjust the steal time accordingly */
+   currstat[CPUTIME_ADJ_STEAL] = prevstat[CPUTIME_ADJ_STEAL]
+   + cpustat_delta[CPUTIME_STEAL];
+   prevstat[CPUTIME_ADJ_STEAL] = currstat[CPUTIME_ADJ_STEAL];
+}
+
+
 static int show_stat(struct seq_file *p, void *v)
 {
int i, j;
@@ -90,7 +152,11 @@ static int show_stat(struct seq_file *p, void *v)
getboottime(&boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
+   /* adjust the steal time based on the processor entitlement */
+   kstat_adjust_steal_time(i);
+
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
@@ -98,7 +164,7 @@ static int show_stat(struct seq_file *p, void *v)
iowait += get_iowait_time(i);
irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
-   steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+   steal += kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
sum += kstat_cpu_irqs_sum(i);
@@ -135,7 +201,7 @@ static int show_stat(struct seq_file *p, void *v)
iowait = get_iowait_time(i);
irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
-   steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+   steal = kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];

[PATCH RFC 2/3] Add a hypercall to retrieve the cpu entitlement value from the host.

2012-08-23 Thread Michael Wolf

If the hypercall is not implemented on the host a default value of
100 will be used. This value will be stored in 
/proc/sys/kernel/cpu_entitlements.

Signed-off-by: Michael Wolf 
---
 arch/x86/kvm/x86.c   |8 
 fs/proc/stat.c   |9 +
 include/linux/kvm.h  |3 +++
 include/linux/kvm_host.h |3 +++
 include/linux/kvm_para.h |1 +
 virt/kvm/kvm_main.c  |7 +++
 6 files changed, 31 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42bce48..734bc3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5052,6 +5052,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_VAPIC_POLL_IRQ:
ret = 0;
break;
+   case KVM_HC_CPU_ENTITLEMENT:
+   ret = vcpu->kvm->vcpu_entitlement;
+   break;
default:
ret = -KVM_ENOSYS;
break;
@@ -5897,6 +5900,11 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm, long entitlement)
+{
+   kvm->vcpu_entitlement = (int) entitlement;
+   return 0;
+}
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
struct i387_fxsave_struct *fxsave =
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 14e26c8..cf5 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int cpu_entitlement = 100;
 #ifndef arch_irq_stat_cpu
@@ -213,6 +214,14 @@ static const struct file_operations proc_stat_operations = 
{
 
 static int __init proc_stat_init(void)
 {
+   long ret;
+
+   if (kvm_para_available()) {
+   ret = kvm_hypercall0(KVM_HC_CPU_ENTITLEMENT);
+   if (ret > 0)
+   cpu_entitlement = (int) ret;
+   }
+
proc_create("stat", 0, NULL, &proc_stat_operations);
return 0;
 }
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..fccd08e 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -904,6 +904,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL_IO(KVMIO,   0xad)
+/* Set the cpu entitlement this will be used to adjust stealtime reporting */
+#define KVM_SET_ENTITLEMENT  _IOW(KVMIO, 0xae, unsigned long)
+
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..71e3d73 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -276,6 +276,7 @@ struct kvm {
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
int last_boosted_vcpu;
+   int vcpu_entitlement;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -538,6 +539,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm,
+   long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index ff476dd..95f3387 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -19,6 +19,7 @@
 #define KVM_HC_MMU_OP  2
 #define KVM_HC_FEATURES3
 #define KVM_HC_PPC_MAP_MAGIC_PAGE  4
+#define KVM_HC_CPU_ENTITLEMENT  5
 
 /*
  * hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2468523..a0a4939 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2093,6 +2093,13 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
 #endif
+   case KVM_SET_ENTITLEMENT: {
+   r = kvm_arch_vcpu_ioctl_set_entitlement(kvm, arg);
+   if (r)
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-24 Thread Michael Wolf

On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
> On 08/24/2012 03:14 AM, Michael Wolf wrote:
> > This is an RFC regarding the reporting of stealtime.  In the case of
> > where you have a system that is running with partial processors such as
> > KVM the user may see steal time being reported in accounting tools such
> > as top or vmstat.  This can cause confusion for the end user.  To
> > ease the confusion this patch set adds a sysctl interface to set the
> > cpu entitlement.  This is the percentage of cpu that the guest system is
> >  expected to receive.  As long as the steal time is within its expected
> > range it will show up as 0 in /proc/stat.  The user will then see in the
> > accounting tools that they are getting a full utilization of the cpu
> > resources assigned to them.
> > 
> 
> And how is such a knob not confusing?
> 
> Steal time is pretty well defined in meaning and is shown in top for
> ages. I really don't see the point for this.

Currently you can see the steal time but you have no way of knowing if
the cpu utilization you are seeing on the guest is the expected amount.
I decided on making it a knob because a guest could be migrated to
another system and it's entitlement could change because of hardware or 
load differences.  It could simply be a /proc file and report the
current entitlement if needed.   As things are currently implemented I 
don't see how someone knows if the guest is running as expected or
whether there is a problem.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Michael Wolf

On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
> On 08/24/2012 11:11 AM, Michael Wolf wrote:
> > On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
> >> On 08/24/2012 03:14 AM, Michael Wolf wrote:
> >>> This is an RFC regarding the reporting of stealtime.  In the case of
> >>> where you have a system that is running with partial processors such as
> >>> KVM the user may see steal time being reported in accounting tools such
> >>> as top or vmstat.  This can cause confusion for the end user.  To
> >>> ease the confusion this patch set adds a sysctl interface to set the
> >>> cpu entitlement.  This is the percentage of cpu that the guest system is
> >>>  expected to receive.  As long as the steal time is within its expected
> >>> range it will show up as 0 in /proc/stat.  The user will then see in the
> >>> accounting tools that they are getting a full utilization of the cpu
> >>> resources assigned to them.
> >>>
> >>
> >> And how is such a knob not confusing?
> >>
> >> Steal time is pretty well defined in meaning and is shown in top for
> >> ages. I really don't see the point for this.
> > 
> > Currently you can see the steal time but you have no way of knowing if
> > the cpu utilization you are seeing on the guest is the expected amount.
> > I decided on making it a knob because a guest could be migrated to
> > another system and it's entitlement could change because of hardware or 
> > load differences.  It could simply be a /proc file and report the
> > current entitlement if needed.   As things are currently implemented I 
> > don't see how someone knows if the guest is running as expected or
> > whether there is a problem.
> > 
> 
> Turning off steal time display won't get even close to displaying the
> information you want. What you probably want is a guest-visible way to
> say how many miliseconds you are expected to run each second. Right?

It is not clear to me how knowing how many milliseconds you are
expecting to run will help the user.  Currently the users will run top
to see how well the guest is running.  If they see _any_ steal time some
users think they are not getting the full use of their processor
entitlement.

Maybe I'm missing what you are proposing, but even if you knew the
milliseconds that you were expecting for your processor you would have
to adjust the top output in your head so to speak.  You would see the
utilization and then say 'ok that matches the number of milliseconds I
expected to run..."   If we take away the steal time (as long as it is
equal to or less than the expected amount of steal time) then the user
running top will see the 100% utilization.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Michael Wolf

On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote:
> On 08/27/2012 08:50 AM, Michael Wolf wrote:
> > On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
> >> On 08/24/2012 11:11 AM, Michael Wolf wrote:
> >>> On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
> >>>> On 08/24/2012 03:14 AM, Michael Wolf wrote:
> >>>>> This is an RFC regarding the reporting of stealtime.  In the case of
> >>>>> where you have a system that is running with partial processors such as
> >>>>> KVM the user may see steal time being reported in accounting tools such
> >>>>> as top or vmstat.  This can cause confusion for the end user.  To
> >>>>> ease the confusion this patch set adds a sysctl interface to set the
> >>>>> cpu entitlement.  This is the percentage of cpu that the guest system is
> >>>>>  expected to receive.  As long as the steal time is within its expected
> >>>>> range it will show up as 0 in /proc/stat.  The user will then see in the
> >>>>> accounting tools that they are getting a full utilization of the cpu
> >>>>> resources assigned to them.
> >>>>>
> >>>>
> >>>> And how is such a knob not confusing?
> >>>>
> >>>> Steal time is pretty well defined in meaning and is shown in top for
> >>>> ages. I really don't see the point for this.
> >>>
> >>> Currently you can see the steal time but you have no way of knowing if
> >>> the cpu utilization you are seeing on the guest is the expected amount.
> >>> I decided on making it a knob because a guest could be migrated to
> >>> another system and it's entitlement could change because of hardware or 
> >>> load differences.  It could simply be a /proc file and report the
> >>> current entitlement if needed.   As things are currently implemented I 
> >>> don't see how someone knows if the guest is running as expected or
> >>> whether there is a problem.
> >>>
> >>
> >> Turning off steal time display won't get even close to displaying the
> >> information you want. What you probably want is a guest-visible way to
> >> say how many miliseconds you are expected to run each second. Right?
> > 
> > It is not clear to me how knowing how many milliseconds you are
> > expecting to run will help the user.  Currently the users will run top
> > to see how well the guest is running.  If they see _any_ steal time some
> > users think they are not getting the full use of their processor
> > entitlement.
> >
> 
> And your plan is just to selectively lie about it, but disabling it with
> a knob?

It is about making it very obvious to the end user whether they are
receiving their cpu entitlement.  If there is more steal time than
expected that will still show up.  I have experimented, and it seems to
work, to put the raw stealtime at the end of each cpu line
in /proc/stat.  That way the raw data is there as well.   

Do you have another suggestion to communicate to the user whether they
are receiving their full entitlement?  At the very least shouldn't the
entitlement reside in a /proc file somewhere so that the user could look
up the value and "do the math"?

> 
> > Maybe I'm missing what you are proposing, but even if you knew the
> > milliseconds that you were expecting for your processor you would have
> > to adjust the top output in your head so to speak.  You would see the
> > utilization and then say 'ok that matches the number of milliseconds I
> > expected to run..."   If we take away the steal time (as long as it is
> > equal to or less than the expected amount of steal time) then the user
> > running top will see the 100% utilization.
> > 
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Michael Wolf

On Mon, 2012-08-27 at 11:55 -0700, Avi Kivity wrote:
> On 08/23/2012 04:14 PM, Michael Wolf wrote:
> > This is an RFC regarding the reporting of stealtime.  In the case of
> > where you have a system that is running with partial processors such as
> > KVM the user may see steal time being reported in accounting tools such
> > as top or vmstat.  This can cause confusion for the end user.  To
> > ease the confusion this patch set adds a sysctl interface to set the
> > cpu entitlement.  This is the percentage of cpu that the guest system is
> >  expected to receive.  As long as the steal time is within its expected
> > range it will show up as 0 in /proc/stat.  The user will then see in the
> > accounting tools that they are getting a full utilization of the cpu
> > resources assigned to them.
> >
> > This patchset is changing the contents/output of /proc/stat and could 
> > affect 
> > user tools.  However the default setting is that the cpu is entitled to 
> > 100% 
> > so the code will act as before.  Also another field could be added to the 
> > /proc/stat output and show the unaltered steal time. Since this additional 
> > field could cause more confusion than it would clear up I have left it out 
> > for now.
> > 
> 
> How would a guest know what its entitlement is?
> 
> 

Currently the Admin/management tool setting up the guests will put it on
the qemu commandline.  From this it is passed via an ioctl to the host.
The guest will get the value from the host via a hypercall.

In the future the host could try and do some of it automatically in some
cases. 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Michael Wolf

On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
> On 08/27/2012 01:23 PM, Michael Wolf wrote:
> > > 
> > > How would a guest know what its entitlement is?
> > > 
> > > 
> >
> > Currently the Admin/management tool setting up the guests will put it on
> > the qemu commandline.  From this it is passed via an ioctl to the host.
> > The guest will get the value from the host via a hypercall.
> >
> > In the future the host could try and do some of it automatically in some
> > cases. 
> 
> Seems to me it's a meaningless value for the guest.  Suppose it is
> migrated to a host that is more powerful, and as a result its relative
> entitlement is reduced.  The value needs to be adjusted.

This is why I chose to manage the value from the sysctl interface rather
than just have it stored as a value in /proc.  Whatever tool was used to
migrate the vm could hopefully adjust the sysctl value on the guest.
> 
> This is best taken care of from the host side.

Not sure what you are getting at here.  If you are running in a cloud
environment, you purchase a VM with the understanding that you are
getting certain resources.  As this type of user I don't believe you
have any access to the host to see this type of information.  So the
user still wouldnt have a way to confirm that they are receiving what
they should be in the way of processor resources.

Would you please elaborate a little more on this?
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Michael Wolf

On Mon, 2012-08-27 at 14:41 -0700, Glauber Costa wrote:
> On 08/27/2012 02:27 PM, Michael Wolf wrote:
> > On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
> >> On 08/27/2012 01:23 PM, Michael Wolf wrote:
> >>>>
> >>>> How would a guest know what its entitlement is?
> >>>>
> >>>>
> >>>
> >>> Currently the Admin/management tool setting up the guests will put it on
> >>> the qemu commandline.  From this it is passed via an ioctl to the host.
> >>> The guest will get the value from the host via a hypercall.
> >>>
> >>> In the future the host could try and do some of it automatically in some
> >>> cases. 
> >>
> >> Seems to me it's a meaningless value for the guest.  Suppose it is
> >> migrated to a host that is more powerful, and as a result its relative
> >> entitlement is reduced.  The value needs to be adjusted.
> > 
> > This is why I chose to manage the value from the sysctl interface rather
> > than just have it stored as a value in /proc.  Whatever tool was used to
> > migrate the vm could hopefully adjust the sysctl value on the guest.
> >>
> >> This is best taken care of from the host side.
> > 
> > Not sure what you are getting at here.  If you are running in a cloud
> > environment, you purchase a VM with the understanding that you are
> > getting certain resources.  As this type of user I don't believe you
> > have any access to the host to see this type of information.  So the
> > user still wouldnt have a way to confirm that they are receiving what
> > they should be in the way of processor resources.
> > 
> > Would you please elaborate a little more on this?
> 
> What do you mean they have no access to the host?
> They have access to all sorts of tools that display information from the
> host. Speaking of a view-only resource, those are strictly equivalent.
> 
> 
> 

ok.  I will go look at those resources. 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 1/5] Alter the amount of steal time reported by the guest.

2012-10-16 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The amount
consigned_time will be passed in using an ioctl.  The time will be expressed in
the number of nanoseconds to be lost in during the fixed period.  The fixed 
period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf 
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 36d12f0..c0b0095 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,6 +27,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 3/5] Add the code to send the consigned time from the host to the guest

2012-10-16 Thread Michael Wolf

Add the code to send the consigned time from the host to the guest

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/kvm_para.h |3 ++-
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |3 ++-
 arch/x86/kvm/x86.c  |2 ++
 include/linux/kernel_stat.h |1 +
 kernel/sched/cputime.c  |   21 +++--
 kernel/sched/sched.h|2 ++
 8 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1eaa6b0..bd4e412 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -409,6 +409,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2f7712e..debe72e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 91b3b2a..4e5582a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -368,7 +368,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu, u64 *steal)
+static u64 kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -378,6 +378,7 @@ static u64 kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+   *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f09552..801cfa8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1554,8 +1554,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+   vcpu->arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index c0b0095..253fdce 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -125,6 +125,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index dd3fd46..bf2025a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time that the cpu was
+ * expected to be somewhere else.
+ */
+void account_consigned_time(cputime_t cputime)
+{
+   u64 *cpustat =  kcpustat_this_cpu->cpustat;
+
+   cpustat[CPUTIME_CONSIGN] += (__force u64) cputime;
+}
+
+/*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
  */
@@ -274,15 +286,20 @@ static __always_inline bool 
steal_account_process_tick(void)
 #ifdef CONFIG_PARAVIRT
if (static_key_false(¶virt_steal_enabled)) {

[PATCH RFC V2 5/5] Add an ioctl to communicate the consign limit to the host.

2012-10-16 Thread Michael Wolf

Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf 
---
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm.h  |2 ++
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |7 +++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 469e748..5a4b8db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5928,6 +5928,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long 
entitlement)
+{
+   vcpu->arch.consigned_limit = entitlement;
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..6b211fb 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -904,6 +904,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL_IO(KVMIO,   0xad)
+/* Set the consignment limit which will be used to separete steal time */
+#define KVM_SET_ENTITLEMENT  _IOW(KVMIO, 0xae, unsigned long)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8a59e0a..9a0a791 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -538,6 +538,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+   long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..ccf0aec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1936,6 +1936,13 @@ out_free2:
r = 0;
break;
}
+   case KVM_SET_ENTITLEMENT: {
+   r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+   if (r)
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 4/5] Add a timer to allow the separation of consigned from steal time.

2012-10-16 Thread Michael Wolf

Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |   10 +
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/kvm/x86.c  |   42 ++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bd4e412..d700850 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 1UL
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -339,6 +341,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   u64 consigned_limit;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 801cfa8..469e748 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1535,13 +1535,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
-   vcpu->arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+   vcpu->arch.current_consigned += delta;
+   if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+   steal_delta = vcpu->arch.current_consigned
+   -  vcpu->arch.consigned_limit;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu->arch.st.accum_steal = steal_delta;
+   vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6187,11 +6206,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
return irqchip_in_kernel(vcpu->kvm) == (vcpu->arch.apic != NULL);
 }
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   vcpu->arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(&vcpu->arch.steal_timer, now,
+   ktime_set(0, KVM_STEAL_TIMER_DELAY));
+   return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
int r;
+   ktime_t ktime;
 
BUG_ON(vcpu->kvm == NULL);
kvm = vcpu->kvm;
@@ -6234,6 +6267,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+   /* Initialize and start a timer to capture steal and consigned time */
+   hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu->arch.steal_timer.function = &steal_timer_fn;
+   ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+   hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
return 0;
 fail_free_mce_banks:
@@ -6252,6 +6291,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)

[PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-10-16 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

TODO:
* Change native_clock to take params and not return a value
* Change update_rq_clock_task

Changes from V1:
* Removed the steal time allowed percentage from the guest
* Moved the separation of consigned (expected steal) and steal time to the
  host.
* No longer include a sysctl interface.

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 arch/x86/include/asm/kvm_host.h |   11 +
 arch/x86/include/asm/kvm_para.h |3 ++
 arch/x86/include/asm/paravirt.h |4 ++-
 arch/x86/kernel/kvm.c   |8 ++
 arch/x86/kvm/x86.c  |   50 ++-
 fs/proc/stat.c  |9 +--
 include/linux/kernel_stat.h |2 ++
 include/linux/kvm.h |2 ++
 include/linux/kvm_host.h|2 ++
 kernel/sched/cputime.c  |   21 +++-
 kernel/sched/sched.h|2 ++
 virt/kvm/kvm_main.c |7 +
 12 files changed, 108 insertions(+), 13 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC V2 2/5] Expand the steal time msr to also contain the consigned time.

2012-10-16 Thread Michael Wolf

Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |7 ++-
 kernel/sched/cputime.c  |2 +-
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..91b3b2a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -368,9 +368,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static u64 kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -378,11 +377,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
-   steal = src->steal;
+   *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 81b763b..dd3fd46 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(¶virt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-10-17 Thread Michael Wolf

On Wed, 2012-10-17 at 21:14 +0400, Glauber Costa wrote:
> On 10/17/2012 06:23 AM, Michael Wolf wrote:
> > In the case of where you have a system that is running in a
> > capped or overcommitted environment the user may see steal time
> > being reported in accounting tools such as top or vmstat.  This can
> > cause confusion for the end user.  To ease the confusion this patch set
> > adds the idea of consigned (expected steal) time.  The host will separate
> > the consigned time from the steal time.  The consignment limit passed to the
> > host will be the amount of steal time expected within a fixed period of
> > time.  Any other steal time accruing during that period will show as the
> > traditional steal time.
> > 
> > TODO:
> > * Change native_clock to take params and not return a value
> > * Change update_rq_clock_task
> > 
> > Changes from V1:
> > * Removed the steal time allowed percentage from the guest
> > * Moved the separation of consigned (expected steal) and steal time to the
> >   host.
> > * No longer include a sysctl interface.
> > 
> 
> You are showing this in the guest somewhere, but tools like top will
> still not show it. So for quite a while, it achieves nothing.
> 
> Of course this is a barrier that any new statistic has to go through. So
> while annoying, this is per-se ultimately not a blocker.
> 
> What I still fail to see, is how this is useful information to be shown
> in the guest. Honestly, if I'm in a guest VM or container, any time
> during which I am not running is time I lost. It doesn't matter if this
> was expected or not. This still seems to me as a host-side problem, to
> be solved entirely by tooling.
> 

What tools like top and vmstat will show is altered.  When I put time in
the consign bucket it does not show up in steal.  So now as long as the
system is performing as expected the user will see 100% and 0% steal.  I
added the consign field to /proc/stat so that all time accrued in the
period is accounted for and also for debugging purposes.  The user wont
care about consign and will not see it.  

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-11-26 Thread Michael Wolf


On 10/22/2012 10:33 AM, Rik van Riel wrote:

On 10/16/2012 10:23 PM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.


How do s390 and Power systems deal with reporting that kind
of information?

IMHO it would be good to see what those do, so we do not end
up re-inventing the wheel, and confusing admins with yet another
way of reporting the information...

Sorry for the delay in the response.  I'm assuming you are asking about 
s390 and Power lpars.
In the case of lpar on POWER systems they simply report steal time and 
do not alter it in any way.
They do however report how much processor is assigned to the partition 
and that information is

in /proc/ppc64/lparcfg.

Mike

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/5] Alter steal time reporting in KVM

2012-11-26 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 arch/x86/include/asm/kvm_host.h   |   11 +++
 arch/x86/include/asm/kvm_para.h   |3 +-
 arch/x86/include/asm/paravirt.h   |4 +--
 arch/x86/include/asm/paravirt_types.h |2 +
 arch/x86/kernel/kvm.c |8 ++---
 arch/x86/kernel/paravirt.c|4 +--
 arch/x86/kvm/x86.c|   50 -
 fs/proc/stat.c|9 +-
 include/linux/kernel_stat.h   |2 +
 include/linux/kvm_host.h  |2 +
 include/uapi/linux/kvm.h  |2 +
 kernel/sched/core.c   |   10 ++-
 kernel/sched/cputime.c|   21 +-
 kernel/sched/sched.h  |2 +
 virt/kvm/kvm_main.c   |7 +
 15 files changed, 120 insertions(+), 17 deletions(-)

-- 
Signature

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/5] Alter the amount of steal time reported by the guest.

2012-11-26 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The amount
consigned_time will be passed in using an ioctl.  The time will be expressed in
the number of nanoseconds to be lost in during the fixed period.  The fixed 
period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf 
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-26 Thread Michael Wolf

Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
-   steal = src->steal;
+   *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+   paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
steal -= rq->prev_steal_time_rq;
 
if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(¶virt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/5] Add the code to send the consigned time from the host to the guest

2012-11-26 Thread Michael Wolf

Add the code to send the consigned time from the host to the guest.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/kvm_para.h |3 ++-
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |3 ++-
 arch/x86/kernel/paravirt.c  |4 ++--
 arch/x86/kvm/x86.c  |2 ++
 include/linux/kernel_stat.h |1 +
 kernel/sched/cputime.c  |   21 +++--
 kernel/sched/sched.h|2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..434d378 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -426,6 +426,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index eb3e9d8..1763369 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ac357b3..4439a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+   *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   return 0;
+   *steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..683b531 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+   vcpu->arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e5978b0..91afaa3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593b647..53bd0be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ *

[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.

2012-11-26 Thread Michael Wolf

Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |   10 +
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/kvm/x86.c  |   42 ++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 434d378..4794c95 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 1UL
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -353,6 +355,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   u64 consigned_limit;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 683b531..c91f4c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
-   vcpu->arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+   vcpu->arch.current_consigned += delta;
+   if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+   steal_delta = vcpu->arch.current_consigned
+   -  vcpu->arch.consigned_limit;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu->arch.st.accum_steal = steal_delta;
+   vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
 
 struct static_key kvm_no_apic_vcpu __read_mostly;
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   vcpu->arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(&vcpu->arch.steal_timer, now,
+   ktime_set(0, KVM_STEAL_TIMER_DELAY));
+   return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
int r;
+   ktime_t ktime;
 
BUG_ON(vcpu->kvm == NULL);
kvm = vcpu->kvm;
@@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+   /* Initialize and start a timer to capture steal and consigned time */
+   hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu->arch.steal_timer.function = &steal_timer_fn;
+   ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+   hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
return 0;
 fail_free_mce_banks:
@@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
int idx;
 
+

[PATCH 5/5] Add an ioctl to communicate the consign limit to the host.

2012-11-26 Thread Michael Wolf

Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf 
---
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm_host.h |2 ++
 include/uapi/linux/kvm.h |2 ++
 virt/kvm/kvm_main.c  |7 +++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c91f4c9..5d57469 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long 
entitlement)
+{
+   vcpu->arch.consigned_limit = entitlement;
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e2212f..de13648 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+   long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0a6d6ba..86f24bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL_IO(KVMIO,   0xad)
+/* Set the consignment limit which will be used to separete steal time */
+#define KVM_SET_ENTITLEMENT  _IOW(KVMIO, 0xae, unsigned long)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..c712fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2062,6 +2062,13 @@ out_free2:
r = 0;
break;
}
+   case KVM_SET_ENTITLEMENT: {
+   r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+   if (r)
+   goto out;
+   r = 0;
+   break;
+   }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/5] Add the code to send the consigned time from the host to the guest

2012-11-26 Thread Michael Wolf

Add the code to send the consigned time from the host to the guest.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/kvm_para.h |3 ++-
 arch/x86/include/asm/paravirt.h |4 ++--
 arch/x86/kernel/kvm.c   |3 ++-
 arch/x86/kernel/paravirt.c  |4 ++--
 arch/x86/kvm/x86.c  |2 ++
 include/linux/kernel_stat.h |1 +
 kernel/sched/cputime.c  |   21 +++--
 kernel/sched/sched.h|2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..434d378 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -426,6 +426,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+   u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index eb3e9d8..1763369 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+   PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ac357b3..4439a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
struct kvm_steal_time *src;
int version;
@@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+   *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-   return 0;
+   *steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..683b531 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;
 
vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+   vcpu->arch.st.accum_consigned = 0;
 
kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e5978b0..91afaa3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct 
task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593b647..53bd0be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ *

[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.

2012-11-26 Thread Michael Wolf

Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/kvm_host.h |   10 +
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/kvm/x86.c  |   42 ++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 434d378..4794c95 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 1UL
+
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -353,6 +355,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
 
/*
+* timer used to determine if the time should be counted as
+* steal time or consigned time.
+*/
+   struct hrtimer steal_timer;
+   u64 current_consigned;
+   u64 consigned_limit;
+
+   /*
 * Paging state of the vcpu
 *
 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 683b531..c91f4c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
u64 delta;
+   u64 steal_delta;
+   u64 consigned_delta;
 
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
-   vcpu->arch.st.accum_steal = delta;
+
+   /* split the delta into steal and consigned */
+   if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+   vcpu->arch.current_consigned += delta;
+   if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+   steal_delta = vcpu->arch.current_consigned
+   -  vcpu->arch.consigned_limit;
+   consigned_delta = delta - steal_delta;
+   } else {
+   consigned_delta = delta;
+   steal_delta = 0;
+   }
+   } else {
+   consigned_delta = 0;
+   steal_delta = delta;
+   }
+   vcpu->arch.st.accum_steal = steal_delta;
+   vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
 
 struct static_key kvm_no_apic_vcpu __read_mostly;
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+   struct kvm_vcpu *vcpu;
+   ktime_t now;
+
+   vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+   vcpu->arch.current_consigned = 0;
+   now = ktime_get();
+   hrtimer_forward(&vcpu->arch.steal_timer, now,
+   ktime_set(0, KVM_STEAL_TIMER_DELAY));
+   return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
int r;
+   ktime_t ktime;
 
BUG_ON(vcpu->kvm == NULL);
kvm = vcpu->kvm;
@@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+   /* Initialize and start a timer to capture steal and consigned time */
+   hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+   HRTIMER_MODE_REL);
+   vcpu->arch.steal_timer.function = &steal_timer_fn;
+   ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+   hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
return 0;
 fail_free_mce_banks:
@@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
int idx;
 
+

[PATCH 0/5] Alter stealtime reporting in KVM

2012-11-26 Thread Michael Wolf

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.


 CREDITS|5 
 Documentation/arm64/memory.txt |   12 
 Documentation/cgroups/memory.txt   |4 
 .../devicetree/bindings/net/mdio-gpio.txt  |9 
 Documentation/filesystems/proc.txt |   16 
 Documentation/hwmon/fam15h_power   |2 
 Documentation/kernel-parameters.txt|   20 
 Documentation/networking/netdev-features.txt   |2 
 Documentation/scheduler/numa-problem.txt   |   20 
 MAINTAINERS|   87 +
 Makefile   |2 
 arch/alpha/kernel/osf_sys.c|6 
 arch/arm/boot/Makefile |   10 
 arch/arm/boot/dts/tegra30.dtsi |4 
 arch/arm/include/asm/io.h  |4 
 arch/arm/include/asm/sched_clock.h |2 
 arch/arm/include/asm/vfpmacros.h   |   12 
 arch/arm/include/uapi/asm/hwcap.h  |3 
 arch/arm/kernel/sched_clock.c  |   18 
 arch/arm/mach-at91/at91rm9200_devices.c|2 
 arch/arm/mach-at91/at91sam9260_devices.c   |2 
 arch/arm/mach-at91/at91sam9261_devices.c   |2 
 arch/arm/mach-at91/at91sam9263_devices.c   |2 
 arch/arm/mach-at91/at91sam9g45_devices.c   |   12 
 arch/arm/mach-davinci/dm644x.c |3 
 arch/arm/mach-highbank/system.c|3 
 arch/arm/mach-imx/clk-gate2.c  |2 
 arch/arm/mach-imx/ehci-imx25.c |2 
 arch/arm/mach-imx/ehci-imx35.c |2 
 arch/arm/mach-omap2/board-igep0020.c   |5 
 arch/arm/mach-omap2/clockdomains44xx_data.c|2 
 arch/arm/mach-omap2/devices.c  |   79 +
 arch/arm/mach-omap2/omap_hwmod.c   |   63 +
 arch/arm/mach-omap2/omap_hwmod_44xx_data.c |   36 
 arch/arm/mach-omap2/twl-common.c   |3 
 arch/arm/mach-omap2/vc.c   |2 
 arch/arm/mach-pxa/hx4700.c |8 
 arch/arm/mach-pxa/spitz_pm.c   |8 
 arch/arm/mm/alignment.c|2 
 arch/arm/plat-omap/include/plat/omap_hwmod.h   |6 
 arch/arm/tools/Makefile|2 
 arch/arm/vfp/vfpmodule.c   |9 
 arch/arm/xen/enlighten.c   |   11 
 arch/arm/xen/hypercall.S   |   14 
 arch/arm64/Kconfig |1 
 arch/arm64/include/asm/elf.h   |5 
 arch/arm64/include/asm/fpsimd.h|5 
 arch/arm64/include/asm/io.h|   10 
 arch/arm64/include/asm/pgtable-hwdef.h |6 
 arch/arm64/include/asm/pgtable.h   |   40 -
 arch/arm64/include/asm/processor.h |2 
 arch/arm64/include/asm/unistd.h|1 
 arch/arm64/kernel/perf_event.c |   10 
 arch/arm64/kernel/process.c|   18 
 arch/arm64/kernel/smp.c|3 
 arch/arm64/mm/init.c   |2 
 arch/frv/Kconfig   |1 
 arch/frv/boot/Makefile |   10 
 arch/frv/include/asm/unistd.h  |1 
 arch/frv/kernel/entry.S|   28 
 arch/frv/kernel/process.c  |5 
 arch/frv/mb93090-mb00/pci-dma-nommu.c  |1 
 arch/h8300/include/asm/cache.h |3 
 arch/ia64/mm/init.c|1 
 arch/m68k/include/asm/signal.h |6 
 arch/mips/cavium-octeon/executive/cvmx-l2c.c   |  900 
 arch/unicore32/include/asm/byteorder.h |   24 
 arch

[PATCH 1/5] Alter the amount of steal time reported by the guest.

2012-11-26 Thread Michael Wolf

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The amount
consigned_time will be passed in using an ioctl.  The time will be expressed in
the number of nanoseconds to be lost in during the fixed period.  The fixed 
period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf 
---
 fs/proc/stat.c  |9 +++--
 include/linux/kernel_stat.h |1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
-   u64 guest, guest_nice;
+   u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
user = nice = system = idle = iowait =
irq = softirq = steal = 0;
-   guest = guest_nice = 0;
+   guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;
 
+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
 
for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+   consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+   seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+   CPUTIME_CONSIGN,
NR_STATS,
 };
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-26 Thread Michael Wolf

Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf 
---
 arch/x86/include/asm/paravirt.h   |4 ++--
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/kvm.c |7 ++-
 kernel/sched/core.c   |   10 +-
 kernel/sched/cputime.c|2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-   u64 steal;
struct kvm_steal_time *src;
int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
-   steal = src->steal;
+   *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
-   return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || 
defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+   u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
u64 st;
+   u64 cs;
 
-   steal = paravirt_steal_clock(cpu_of(rq));
+   paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+   /*
+* since we are not assigning the steal time to cpustats
+* here, just combine the steal and consigned times to
+* do the rest of the calculations.
+*/
+   steal += consigned;
steal -= rq->prev_steal_time_rq;
 
if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(¶virt_steal_enabled)) {
u64 steal, st = 0;
 
-   steal = paravirt_steal_clock(smp_processor_id());
+   paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;
 
st = steal_ticks(steal);

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Michael Wolf


On 11/27/2012 02:48 AM, Glauber Costa wrote:

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

If you submit this again, please include a version number in your series.
Will do.  The patchset was sent twice yesterday by mistake.  Got an 
error the first time and didn't

think the patches went out.  This has been corrected.


It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.
yes, will do that.  When I took the RFC off the patches I was looking at 
it as a new patchset which was

a mistake.  I will make sure to add a changelog when I submit again.


As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."
Yes, but we still get questions from users asking what is steal time? 
why am I seeing this?


Now, that is a *way* more sensible thing to do. Much more. "Confusing
users" is something extremely subjective. This is specially true about
concepts that are know for quite some time, like steal time. If you out
of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.
Something like this could certainly be done.  But when I was submitting 
the patch set as
an RFC then qemu was passing a cpu percentage that would be used by the 
guest kernel
to adjust the steal time. This percentage was being stored on the guest 
as a sysctl value.

Avi stated he didn't like that kind of coupling, and that the value
could get out of sync.  Anthony stated "The guest shouldn't need to know 
it's entitlement. Or at least, it's up to a management tool to report 
that in a way that's meaningful for the guest."


So perhaps I misunderstood what they were suggesting, but I took it to 
mean that they did not
want the guest to know what the entitlement was.  That the host should 
take care of it and just
report the already adjusted data to the guest.  So in this version of 
the code the host would use a set
period for a timer and be passed essentially a number of ticks of 
expected steal time.  The host
would then use the timer to break out the steal time into consigned and 
steal buckets which would be

reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So 
anyone needing to see total
time away could add the two fields together.  The user, however, when 
using tools like top or vmstat

would see the usage based on what the guest is entitled to.

Do you have suggestions for how I can build consensus around one of the 
two approaches?









---

Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.


  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c

Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

2012-11-28 Thread Michael Wolf


On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote:

On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:

Add a consigned field.  This field will hold the time lost due to capping or 
overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf 
---
  arch/x86/include/asm/paravirt.h   |4 ++--
  arch/x86/include/asm/paravirt_types.h |2 +-
  arch/x86/kernel/kvm.c |7 ++-
  kernel/sched/core.c   |   10 +-
  kernel/sched/cputime.c|2 +-
  5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
  extern struct static_key paravirt_steal_enabled;
  extern struct static_key paravirt_steal_rq_enabled;
  
-static inline u64 paravirt_steal_clock(int cpu)

+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)

So its u64 here.

  {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
  }
  
  static inline unsigned long long paravirt_read_pmc(int counter)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
  
  struct pv_time_ops {

unsigned long long (*sched_clock)(void);
-   unsigned long long (*steal_clock)(int cpu);
+   void (*steal_clock)(int cpu, unsigned long long *steal);

But not u64 here? Any particular reason?

It should be void everywhere, thanks for catching that I will change the 
code.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Michael Wolf


On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?
In the capped case, the time that the guest spends waiting due to it 
having used its full allottment of time shows up as steal time.  The way 
my patchset currently stands is that you would set up the
bandwidth control and you would have to pass it a  matching value from 
qemu.  In the future, it would
be possible to have something parse the bandwidth setting and 
automatically adjust the setting in the

host used for steal time reporting.



  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.


  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)

--
Signature

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Michael Wolf


On 11/28/2012 02:45 AM, Glauber Costa wrote:

On 11/27/2012 07:10 PM, Michael Wolf wrote:

On 11/27/2012 02:48 AM, Glauber Costa wrote:

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will
separate
the consigned time from the steal time.  The consignment limit passed
to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

If you submit this again, please include a version number in your series.

Will do.  The patchset was sent twice yesterday by mistake.  Got an
error the first time and didn't
think the patches went out.  This has been corrected.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

yes, will do that.  When I took the RFC off the patches I was looking at
it as a new patchset which was
a mistake.  I will make sure to add a changelog when I submit again.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Yes, but we still get questions from users asking what is steal time?
why am I seeing this?

Now, that is a *way* more sensible thing to do. Much more. "Confusing
users" is something extremely subjective. This is specially true about
concepts that are know for quite some time, like steal time. If you out
of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.

Something like this could certainly be done.  But when I was submitting
the patch set as
an RFC then qemu was passing a cpu percentage that would be used by the
guest kernel
to adjust the steal time. This percentage was being stored on the guest
as a sysctl value.
Avi stated he didn't like that kind of coupling, and that the value
could get out of sync.  Anthony stated "The guest shouldn't need to know
it's entitlement. Or at least, it's up to a management tool to report
that in a way that's meaningful for the guest."

So perhaps I misunderstood what they were suggesting, but I took it to
mean that they did not
want the guest to know what the entitlement was.  That the host should
take care of it and just
report the already adjusted data to the guest.  So in this version of
the code the host would use a set
period for a timer and be passed essentially a number of ticks of
expected steal time.  The host
would then use the timer to break out the steal time into consigned and
steal buckets which would be
reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So
anyone needing to see total
time away could add the two fields together.  The user, however, when
using tools like top or vmstat
would see the usage based on what the guest is entitled to.

Do you have suggestions for how I can build consensus around one of the
two approaches?


Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?
It is setup using cpu overcommit.  But the request was for something 
that would work in both

the overcommit environment as well as when hard capping is being used.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-29 Thread Michael Wolf


On 11/28/2012 02:55 PM, Glauber Costa wrote:

On 11/28/2012 10:43 PM, Michael Wolf wrote:

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

In the capped case, the time that the guest spends waiting due to it
having used its full allottment of time shows up as steal time.  The way
my patchset currently stands is that you would set up the
bandwidth control and you would have to pass it a  matching value from
qemu.  In the future, it would
be possible to have something parse the bandwidth setting and
automatically adjust the setting in the
host used for steal time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Yes, in the context above I'm referring to the cfs bandwidth control.


Some time has passed since I last looked into it, but IIRC, after you
get are out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.
So I looked at something like this in the past.  To make sure things 
haven't changed
I set up a cgroup on my test server running a kernel built from the 
latest tip tree.


[root]# cat cpu.cfs_quota_us
5
[root]# cat cpu.cfs_period_us
10
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a 
script that will hog the cpu I see the

following in top on the guest
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 
49.8%st


So the steal time here is in line with the bandwidth control settings.


In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:
About a year ago I was playing with this patch.  It is out of date now 
but will give you

an idea of what I was looking at.

 kernel/sched_fair.c  |4 ++--
 kernel/sched_stats.h |7 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
struct sched_entity *se)

 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct
cfs_rq *cfs_rq)
 }

 /* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 return cfs_rq->throttle_count;
 }
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void sched_info_queued(struct
task_struct *t)
  * sched_info_queued() to mark that it has now again started waiting on
  * the runqueue.
  */
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static inline void sched_info_depart(struct task_struct *t)
 {
+struct task_group *tg = task_group(t);
+struct cfs_rq *cfs_rq;
 unsigned long long delta = task_rq(t)->clock -
 t->sched_info.last_arrival;

+cfs_rq = tg->cfs_rq[smp_processor_id()];
 rq_sched_info_depart(task_rq(t), delta);

-if (t->state == TASK_RUNNING)
+
+if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
 sched_info_queued(t);
 }


So then the steal time did not show on the guest.  You have no value 
that needs to be passed

around.  What I did not like about this approach was
* only works for cfs bandwidth control.  If another type of hard limit 
was added to the kernel

   the code wou

53 matches

Mail list logo