Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-05 Thread Michael S. Tsirkin
On Wed, Oct 04, 2017 at 11:31:43AM -0700, Jacob Pan wrote:
> On Wed, 4 Oct 2017 20:12:28 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> > > On Wed, 4 Oct 2017 05:09:09 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:  
> > > > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > > > "Rafael J. Wysocki"  wrote:
> > > > > 
> > > > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > > > >  wrote:
> > > > > > > intel idle driver does not DTRT when running within a VM:
> > > > > > > when going into a deep power state, the right thing to
> > > > > > > do is to exit to hypervisor rather than to keep polling
> > > > > > > within guest using mwait.
> > > > > > >
> > > > > > > Currently the solution is just to exit to hypervisor each
> > > > > > > time we go idle - this is why kvm does not expose the mwait
> > > > > > > leaf to guests even when it allows guests to do mwait.
> > > > > > >
> > > > > > > But that's not ideal - it seems better to use the idle
> > > > > > > driver to guess when the next interrupt will arrive.
> > > > > > 
> > > > > > The idle driver alone is not sufficient for that, though.
> > > > > > 
> > > > > I second that. Why try to solve this problem at vendor specific
> > > > > driver level?
> > > > 
> > > > Well we still want to e.g. mwait if possible - saves power.
> > > >   
> > > > > Perhaps just a pv idle driver that decides whether to vmexit
> > > > > based on something like local per vCPU timer expiration? I
> > > > > guess we can't predict other wake events such as interrupts.
> > > > > e.g.
> > > > > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > > > >   vmexit
> > > > > else
> > > > >   poll
> > > > > 
> > > > > Jacob
> > > > 
> > > > It's not always a poll; on x86, putting the CPU in a low power
> > > > state is possible within a VM.
> > > >   
> > > Are you talking about using mwait/monitor in user space, which
> > > are available on some Intel CPUs such as Xeon Phi? I guess if the
> > > guest can identify the host CPU id, it is doable.  
> > 
> > Not really.
> > 
> > Please take a look at the patch in question - it does mwait in the guest
> > kernel, with no need to identify the host CPU id.
> > 
> I may be missing something, but in your patch I only see HLT being used in
> the guest OS; that would cause a VM exit, right? If you do mwait in the
> guest kernel, it will also exit.


No, mwait won't exit if running on kvm.
See 668fffa3f838edfcb1679f842f7ef1afa61c3e9a


> So I don't see how you can enter a low
> power state within a VM guest.
> 
> +static int intel_halt(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int index)
> +{
> + printk_once(KERN_ERR "safe_halt started\n");
> + safe_halt();
> + printk_once(KERN_ERR "safe_halt done\n");
> + return index;
> +}
> > 
> > > > Does not seem possible on other CPUs; that's why it's vendor
> > > > specific. 
> > > 
> > > [Jacob Pan]  
> 
> [Jacob Pan]
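
To make the point concrete: with the commit cited above, a guest MWAIT
parks the physical CPU without leaving the guest, while HLT still traps
to the host. A minimal guest-side sketch using the same helpers the
patch itself calls (the guest_idle_once() wrapper is illustrative, not
part of the patch):

#include <linux/types.h>
#include <asm/mwait.h>     /* mwait_idle_with_hints() */
#include <asm/irqflags.h>  /* safe_halt() */

static void guest_idle_once(bool mwait_in_guest)
{
        if (mwait_in_guest)
                /* Runs natively under KVM: the pCPU enters C1 and wakes
                 * on the next interrupt, with no vmexit. */
                mwait_idle_with_hints(0x00, 1);
        else
                /* HLT is intercepted: vmexit, and the host may schedule
                 * another vCPU here in the meantime. */
                safe_halt();
}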


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-05 Thread Paolo Bonzini
On 04/10/2017 20:31, Jacob Pan wrote:
> On Wed, 4 Oct 2017 20:12:28 +0300
> "Michael S. Tsirkin"  wrote:
> 
>> On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
>>> On Wed, 4 Oct 2017 05:09:09 +0300
>>> "Michael S. Tsirkin"  wrote:
>>>
>>>> On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
>>>>> On Sat, 30 Sep 2017 01:21:43 +0200
>>>>> "Rafael J. Wysocki"  wrote:
>>>>>
>>>>>> On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
>>>>>>  wrote:
>>>>>>> intel idle driver does not DTRT when running within a VM:
>>>>>>> when going into a deep power state, the right thing to
>>>>>>> do is to exit to hypervisor rather than to keep polling
>>>>>>> within guest using mwait.
>>>>>>>
>>>>>>> Currently the solution is just to exit to hypervisor each
>>>>>>> time we go idle - this is why kvm does not expose the mwait
>>>>>>> leaf to guests even when it allows guests to do mwait.
>>>>>>>
>>>>>>> But that's not ideal - it seems better to use the idle
>>>>>>> driver to guess when the next interrupt will arrive.
>>>>>>
>>>>>> The idle driver alone is not sufficient for that, though.
>>>>>>
>>>>> I second that. Why try to solve this problem at vendor specific
>>>>> driver level?
>>>>
>>>> Well we still want to e.g. mwait if possible - saves power.
>>>>
>>>>> Perhaps just a pv idle driver that decides whether to vmexit
>>>>> based on something like local per vCPU timer expiration? I
>>>>> guess we can't predict other wake events such as interrupts.
>>>>> e.g.
>>>>> if (get_next_timer_interrupt() > kvm_halt_target_residency)
>>>>>   vmexit
>>>>> else
>>>>>   poll
>>>>>
>>>>> Jacob
>>>>
>>>> It's not always a poll; on x86, putting the CPU in a low power
>>>> state is possible within a VM.
>>>>
>>> Are you talking about using mwait/monitor in user space, which
>>> are available on some Intel CPUs such as Xeon Phi? I guess if the
>>> guest can identify the host CPU id, it is doable.
>>
>> Not really.
>>
>> Please take a look at the patch in question - it does mwait in the guest
>> kernel, with no need to identify the host CPU id.
>>
> I may be missing something, but in your patch I only see HLT being used in
> the guest OS; that would cause a VM exit, right? If you do mwait in the
> guest kernel, it will also exit. So I don't see how you can enter a low
> power state within a VM guest.

KVM does not exit on MWAIT (though it doesn't show it in CPUID by
default); see commit 668fffa3f838edfcb1679f842f7ef1afa61c3e9a.

Paolo

> 
> +static int intel_halt(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int index)
> +{
> + printk_once(KERN_ERR "safe_halt started\n");
> + safe_halt();
> + printk_once(KERN_ERR "safe_halt done\n");
> + return index;
> +}
>>
>>>> Does not seem possible on other CPUs; that's why it's vendor
>>>> specific.
>>>
>>> [Jacob Pan]
> 
> [Jacob Pan]



Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-04 Thread Rafael J. Wysocki
On Wed, Oct 4, 2017 at 9:56 AM, Thomas Gleixner  wrote:
> On Wed, 4 Oct 2017, Michael S. Tsirkin wrote:
>> On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
>> > There is the series from Aubrey which makes use of the various idle
>> > prediction mechanisms, scheduler, irq timings, idle governor to get an idea
>> > about the estimated idle time. Exactly this information can be fed to the
>> > kvmidle driver which can act accordingly.
>> >
>> > Hacking a random hardware specific idle driver is definitely the wrong
>> > approach. It might be useful to chain the kvmidle driver and hardware
>> > specific drivers at some point, i.e. if the kvmdriver decides not to exit
>> > it delegates the mwait decision to the proper hardware driver in order not
>> > to reimplement all the required logic again.
>>
>> By making changes to idle core to allow that chaining?
>> Does this sound like something reasonable?
>
> At least for me it makes sense to avoid code duplication.

Well, I agree.

Thanks,
Rafael


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-04 Thread Jacob Pan
On Wed, 4 Oct 2017 20:12:28 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> > On Wed, 4 Oct 2017 05:09:09 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:  
> > > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > > "Rafael J. Wysocki"  wrote:
> > > > 
> > > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > > >  wrote:
> > > > > > intel idle driver does not DTRT when running within a VM:
> > > > > > when going into a deep power state, the right thing to
> > > > > > do is to exit to hypervisor rather than to keep polling
> > > > > > within guest using mwait.
> > > > > >
> > > > > > Currently the solution is just to exit to hypervisor each
> > > > > > time we go idle - this is why kvm does not expose the mwait
> > > > > > leaf to guests even when it allows guests to do mwait.
> > > > > >
> > > > > > But that's not ideal - it seems better to use the idle
> > > > > > driver to guess when the next interrupt will arrive.
> > > > > 
> > > > > The idle driver alone is not sufficient for that, though.
> > > > > 
> > > > I second that. Why try to solve this problem at vendor specific
> > > > driver level?
> > > 
> > > Well we still want to e.g. mwait if possible - saves power.
> > >   
> > > > Perhaps just a pv idle driver that decides whether to vmexit
> > > > based on something like local per vCPU timer expiration? I
> > > > guess we can't predict other wake events such as interrupts.
> > > > e.g.
> > > > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > > >   vmexit
> > > > else
> > > >   poll
> > > > 
> > > > Jacob
> > > 
> > > It's not always a poll; on x86, putting the CPU in a low power
> > > state is possible within a VM.
> > >   
> > Are you talking about using mwait/monitor in user space, which
> > are available on some Intel CPUs such as Xeon Phi? I guess if the
> > guest can identify the host CPU id, it is doable.  
> 
> Not really.
> 
> Please take a look at the patch in question - it does mwait in the guest
> kernel, with no need to identify the host CPU id.
> 
I may be missing something, but in your patch I only see HLT being used in
the guest OS; that would cause a VM exit, right? If you do mwait in the
guest kernel, it will also exit. So I don't see how you can enter a low
power state within a VM guest.

+static int intel_halt(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
+{
+   printk_once(KERN_ERR "safe_halt started\n");
+   safe_halt();
+   printk_once(KERN_ERR "safe_halt done\n");
+   return index;
+}
> 
> > > Does not seem possible on other CPUs; that's why it's vendor
> > > specific. 
> > 
> > [Jacob Pan]  

[Jacob Pan]


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-04 Thread Michael S. Tsirkin
On Wed, Oct 04, 2017 at 10:09:39AM -0700, Jacob Pan wrote:
> On Wed, 4 Oct 2017 05:09:09 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > > On Sat, 30 Sep 2017 01:21:43 +0200
> > > "Rafael J. Wysocki"  wrote:
> > >   
> > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > > >  wrote:  
> > > > > intel idle driver does not DTRT when running within a VM:
> > > > > when going into a deep power state, the right thing to
> > > > > do is to exit to hypervisor rather than to keep polling
> > > > > within guest using mwait.
> > > > >
> > > > > Currently the solution is just to exit to hypervisor each time
> > > > > we go idle - this is why kvm does not expose the mwait leaf to
> > > > > guests even when it allows guests to do mwait.
> > > > >
> > > > > But that's not ideal - it seems better to use the idle driver to
> > > > > guess when the next interrupt will arrive.
> > > > 
> > > > The idle driver alone is not sufficient for that, though.
> > > >   
> > > I second that. Why try to solve this problem at vendor specific
> > > driver level?  
> > 
> > Well we still want to e.g. mwait if possible - saves power.
> > 
> > > Perhaps just a pv idle driver that decides whether to vmexit
> > > based on something like local per vCPU timer expiration? I guess we
> > > can't predict other wake events such as interrupts.
> > > e.g.
> > > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> > >   vmexit
> > > else
> > >   poll
> > > 
> > > Jacob  
> > 
> > It's not always a poll; on x86, putting the CPU in a low power state
> > is possible within a VM.
> > 
> Are you talking about using mwait/monitor in user space, which are
> available on some Intel CPUs such as Xeon Phi? I guess if the guest
> can identify the host CPU id, it is doable.

Not really.

Please take a look at the patch in question - it does mwait in the guest
kernel, with no need to identify the host CPU id.


> > Does not seem possible on other CPUs; that's why it's vendor specific.
> > 
> 
> [Jacob Pan]


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-04 Thread Jacob Pan
On Wed, 4 Oct 2017 05:09:09 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> > On Sat, 30 Sep 2017 01:21:43 +0200
> > "Rafael J. Wysocki"  wrote:
> >   
> > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin
> > >  wrote:  
> > > > intel idle driver does not DTRT when running within a VM:
> > > > when going into a deep power state, the right thing to
> > > > do is to exit to hypervisor rather than to keep polling
> > > > within guest using mwait.
> > > >
> > > > Currently the solution is just to exit to hypervisor each time
> > > > we go idle - this is why kvm does not expose the mwait leaf to
> > > > guests even when it allows guests to do mwait.
> > > >
> > > > But that's not ideal - it seems better to use the idle driver to
> > > > guess when the next interrupt will arrive.
> > > 
> > > The idle driver alone is not sufficient for that, though.
> > >   
> > I second that. Why try to solve this problem at vendor specific
> > driver level?  
> 
> Well we still want to e.g. mwait if possible - saves power.
> 
> > Perhaps just a pv idle driver that decides whether to vmexit
> > based on something like local per vCPU timer expiration? I guess we
> > can't predict other wake events such as interrupts.
> > e.g.
> > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> >   vmexit
> > else
> >   poll
> > 
> > Jacob  
> 
> It's not always a poll; on x86, putting the CPU in a low power state
> is possible within a VM.
> 
Are you talking about using mwait/monitor in user space, which are
available on some Intel CPUs such as Xeon Phi? I guess if the guest
can identify the host CPU id, it is doable.

> Does not seem possible on other CPUs; that's why it's vendor specific.
> 

[Jacob Pan]


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-04 Thread Thomas Gleixner
On Wed, 4 Oct 2017, Michael S. Tsirkin wrote:
> On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
> > There is the series from Aubrey which makes use of the various idle
> > prediction mechanisms, scheduler, irq timings, idle governor to get an idea
> > about the estimated idle time. Exactly this information can be fed to the
> > kvmidle driver which can act accordingly.
> > 
> > Hacking a random hardware specific idle driver is definitely the wrong
> > approach. It might be useful to chain the kvmidle driver and hardware
> > specific drivers at some point, i.e. if the kvmdriver decides not to exit
> > it delegates the mwait decision to the proper hardware driver in order not
> > to reimplement all the required logic again.
> 
> By making changes to idle core to allow that chaining?
> Does this sound like something reasonable?

At least for me it makes sense to avoid code duplication. But that's up to
the cpuidle maintainers to decide at the end.

Thanks,

tglx
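
The chaining described above could look roughly like the sketch below.
This is hypothetical: predicted_idle_ns() and vmexit_cost_ns are
invented placeholders for the idle-time estimate and the exit cost,
native_drv stands for whatever hardware driver (e.g. intel_idle) the
core would hand over to, and no such chaining hooks exist in the
cpuidle core today.

static struct cpuidle_driver *native_drv; /* assumed: resolved at probe time */
static u64 vmexit_cost_ns = 50000;        /* assumed break-even cost of an exit */

static int kvm_chained_enter(struct cpuidle_device *dev,
                             struct cpuidle_driver *drv, int index)
{
        /* predicted_idle_ns(): placeholder for the scheduler/irq-timing/
         * governor estimate mentioned above. */
        if (predicted_idle_ns(dev) > vmexit_cost_ns) {
                safe_halt();    /* long idle: exit so the host can run other vCPUs */
                return index;
        }
        /* Short idle: delegate to the hardware driver's state rather than
         * reimplementing its mwait logic. */
        return native_drv->states[index].enter(dev, native_drv, index);
}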


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-03 Thread Michael S. Tsirkin
On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Jacob Pan wrote:
> > On Sat, 30 Sep 2017 01:21:43 +0200
> > "Rafael J. Wysocki"  wrote:
> > 
> > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> > > wrote:
> > > > intel idle driver does not DTRT when running within a VM:
> > > > when going into a deep power state, the right thing to
> > > > do is to exit to hypervisor rather than to keep polling
> > > > within guest using mwait.
> > > >
> > > > Currently the solution is just to exit to hypervisor each time we go
> > > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > > when it allows guests to do mwait.
> > > >
> > > > But that's not ideal - it seems better to use the idle driver to
> > > > guess when the next interrupt will arrive.
> > > 
> > > The idle driver alone is not sufficient for that, though.
> > > 
> > I second that. Why try to solve this problem at vendor specific driver
> > level? Perhaps just a pv idle driver that decides whether to vmexit
> > based on something like local per vCPU timer expiration? I guess we
> > can't predict other wake events such as interrupts.
> > e.g.
> > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> 
> Bah. no. get_next_timer_interrupt() is not available for abuse in random
> cpuidle driver code. It has state and it's tied to the nohz code.
> 
> There is the series from Aubrey which makes use of the various idle
> prediction mechanisms, scheduler, irq timings, idle governor to get an idea
> about the estimated idle time. Exactly this information can be fed to the
> kvmidle driver which can act accordingly.
> 
> Hacking a random hardware specific idle driver is definitely the wrong
> approach. It might be useful to chain the kvmidle driver and hardware
> specific drivers at some point, i.e. if the kvmdriver decides not to exit
> it delegates the mwait decision to the proper hardware driver in order not
> to reimplement all the required logic again.

By making changes to idle core to allow that chaining?
Does this sound like something reasonable?

> But that's a different story.
> 
> See 
> http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey...@intel.com

Will read that, thanks a lot.

> Thanks,
> 
>   tglx
> 
> 
> 


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-03 Thread Michael S. Tsirkin
On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> On Sat, 30 Sep 2017 01:21:43 +0200
> "Rafael J. Wysocki"  wrote:
> 
> > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> > wrote:
> > > intel idle driver does not DTRT when running within a VM:
> > > when going into a deep power state, the right thing to
> > > do is to exit to hypervisor rather than to keep polling
> > > within guest using mwait.
> > >
> > > Currently the solution is just to exit to hypervisor each time we go
> > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > when it allows guests to do mwait.
> > >
> > > But that's not ideal - it seems better to use the idle driver to
> > guess when the next interrupt will arrive.
> > 
> > The idle driver alone is not sufficient for that, though.
> > 
> I second that. Why try to solve this problem at vendor specific driver
> level?

Well we still want to e.g. mwait if possible - saves power.

> Perhaps just a pv idle driver that decides whether to vmexit
> based on something like local per vCPU timer expiration? I guess we
> can't predict other wake events such as interrupts.
> e.g.
> if (get_next_timer_interrupt() > kvm_halt_target_residency)
>   vmexit
> else
>   poll
> 
> Jacob

It's not always a poll; on x86, putting the CPU in a low power state
is possible within a VM.

Does not seem possible on other CPUs; that's why it's vendor specific.

-- 
MST


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-03 Thread Thomas Gleixner
On Mon, 2 Oct 2017, Jacob Pan wrote:
> On Sat, 30 Sep 2017 01:21:43 +0200
> "Rafael J. Wysocki"  wrote:
> 
> > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> > wrote:
> > > intel idle driver does not DTRT when running within a VM:
> > > when going into a deep power state, the right thing to
> > > do is to exit to hypervisor rather than to keep polling
> > > within guest using mwait.
> > >
> > > Currently the solution is just to exit to hypervisor each time we go
> > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > when it allows guests to do mwait.
> > >
> > > But that's not ideal - it seems better to use the idle driver to
> > guess when the next interrupt will arrive.
> > 
> > The idle driver alone is not sufficient for that, though.
> > 
> I second that. Why try to solve this problem at vendor specific driver
> level? Perhaps just a pv idle driver that decides whether to vmexit
> based on something like local per vCPU timer expiration? I guess we
> can't predict other wake events such as interrupts.
> e.g.
> if (get_next_timer_interrupt() > kvm_halt_target_residency)

Bah. no. get_next_timer_interrupt() is not available for abuse in random
cpuidle driver code. It has state and it's tied to the nohz code.

There is the series from Aubrey which makes use of the various idle
prediction mechanisms, scheduler, irq timings, idle governor to get an idea
about the estimated idle time. Exactly this information can be fed to the
kvmidle driver which can act accordingly.

Hacking a random hardware specific idle driver is definitely the wrong
approach. It might be useful to chain the kvmidle driver and hardware
specific drivers at some point, i.e. if the kvmdriver decides not to exit
it delegates the mwait decision to the proper hardware driver in order not
to reimplement all the required logic again. But that's a different story.

See 
http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey...@intel.com

Thanks,

tglx


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-02 Thread Jacob Pan
On Sat, 30 Sep 2017 01:21:43 +0200
"Rafael J. Wysocki"  wrote:

> On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> wrote:
> > intel idle driver does not DTRT when running within a VM:
> > when going into a deep power state, the right thing to
> > do is to exit to hypervisor rather than to keep polling
> > within guest using mwait.
> >
> > Currently the solution is just to exit to hypervisor each time we go
> > idle - this is why kvm does not expose the mwait leaf to guests even
> > when it allows guests to do mwait.
> >
> > But that's not ideal - it seems better to use the idle driver to
> > guess when the next interrupt will arrive.
> 
> The idle driver alone is not sufficient for that, though.
> 
I second that. Why try to solve this problem at vendor specific driver
level? Perhaps just a pv idle driver that decides whether to vmexit
based on something like local per vCPU timer expiration? I guess we
can't predict other wake events such as interrupts.
e.g.
if (get_next_timer_interrupt() > kvm_halt_target_residency)
  vmexit
else
  poll

Jacob
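
Rendered as a cpuidle ->enter() callback, the proposal would look
something like the sketch below. Purely illustrative: as Thomas notes
elsewhere in the thread, get_next_timer_interrupt() is not usable from
cpuidle driver code, so next_event_us() is a stand-in for whatever
idle-length estimate the core can provide, and kvm_halt_target_residency
mirrors the knob in the patch at the end of this thread.

static int kvm_pv_idle_enter(struct cpuidle_device *dev,
                             struct cpuidle_driver *drv, int index)
{
        if (next_event_us() > kvm_halt_target_residency)
                safe_halt();                    /* "vmexit": HLT traps to the host */
        else
                mwait_idle_with_hints(0x00, 1); /* "poll": stay inside the guest */
        return index;
}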


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-09-29 Thread Rafael J. Wysocki
On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin  wrote:
> intel idle driver does not DTRT when running within a VM:
> when going into a deep power state, the right thing to
> do is to exit to hypervisor rather than to keep polling
> within guest using mwait.
>
> Currently the solution is just to exit to hypervisor each time we go
> idle - this is why kvm does not expose the mwait leaf to guests even
> when it allows guests to do mwait.
>
> But that's not ideal - it seems better to use the idle driver to guess
> when the next interrupt will arrive.

The idle driver alone is not sufficient for that, though.

Thanks,
Rafael


[PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-09-29 Thread Michael S. Tsirkin
intel idle driver does not DTRT when running within a VM:
when going into a deep power state, the right thing to
do is to exit to hypervisor rather than to keep polling
within guest using mwait.

Currently the solution is just to exit to hypervisor each time we go
idle - this is why kvm does not expose the mwait leaf to guests even
when it allows guests to do mwait.

But that's not ideal - it seems better to use the idle driver to guess
when the next interrupt will arrive. If that will happen soon, we are
better off doing mwait within guest.

How soon will have to be determined by e.g. how much an
exit costs.  I plan to pass some flags from host to guest to
address the above issues, but for performance experiments
it's enough to just add some command line flags.
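
For reference, the command line flags in question are the read-only
module parameters defined in the patch below. With intel_idle built in,
they would go on the guest kernel command line, for example (values
purely illustrative):

    intel_idle.kvm_pv_mwait=1 intel_idle.kvm_halt_target_residency=600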

The patch below is very gross; I've thrown it together quickly so we
can have a discussion about this approach as compared to
changing the scheduler. If someone has the cycles to try some
tuning/performance comparisons, that would be very much appreciated.

Signed-off-by: Michael S. Tsirkin 

---

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c2ae819..6fa58ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,8 +65,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #define INTEL_IDLE_VERSION "0.4.1"
+#define PREFIX "intel_idle: "
 
 static struct cpuidle_driver intel_idle_driver = {
.name = "intel_idle",
@@ -94,6 +96,7 @@ struct idle_cpu {
 };
 
 static const struct idle_cpu *icpu;
+static struct idle_cpu icpus;
 static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
 static int intel_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index);
@@ -119,6 +122,49 @@ static struct cpuidle_state *cpuidle_state_table;
 #define flg2MWAIT(flags) (((flags) >> 24) & 0xFF)
 #define MWAIT2flg(eax) ((eax & 0xFF) << 24)
 
+static int intel_halt(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
+{
+   printk_once(KERN_ERR "safe_halt started\n");
+   safe_halt();
+   printk_once(KERN_ERR "safe_halt done\n");
+   return index;
+}
+
+static int kvm_halt_target_residency = 400; /* Halt above this target residency */
+module_param(kvm_halt_target_residency, int, 0444);
+static int kvm_halt_native = 0; /* Use native mwait substates */
+module_param(kvm_halt_native, int, 0444);
+static int kvm_pv_mwait = 0; /* Whether to do mwait within KVM */
+module_param(kvm_pv_mwait, int, 0444);
+
+static struct cpuidle_state kvm_halt_cstate = {
+   .name = "HALT-KVM",
+   .desc = "HALT",
+   .flags = MWAIT2flg(0x10),
+   .exit_latency = 0,
+   .target_residency = 0,
+   .enter = &intel_halt,
+};
+
+static struct cpuidle_state kvm_cstates[] = {
+   {
+   .name = "C1-NHM",
+   .desc = "MWAIT 0x00",
+   .flags = MWAIT2flg(0x00),
+   .exit_latency = 3,
+   .target_residency = 6,
+   .enter = &intel_idle,
+   .enter_freeze = intel_idle_freeze, },
+   {
+   .name = "HALT-KVM",
+   .desc = "HALT",
+   .flags = MWAIT2flg(0x10),
+   .exit_latency = 30,
+   .target_residency = 399,
+   .enter = &intel_halt, }
+};
+
 /*
  * States are indexed by the cstate number,
  * which is also the index into the MWAIT hint array.
@@ -927,8 +973,11 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
if (!(lapic_timer_reliable_states & (1 << (cstate
tick_broadcast_enter();
 
+   printk_once(KERN_ERR "mwait_idle_with_hints started\n");
mwait_idle_with_hints(eax, ecx);
 
+   printk_once(KERN_ERR "mwait_idle_with_hints done\n");
+
if (!(lapic_timer_reliable_states & (1 << (cstate
tick_broadcast_exit();
 
@@ -989,6 +1038,11 @@ static const struct idle_cpu idle_cpu_tangier = {
.state_table = tangier_cstates,
 };
 
+static const struct idle_cpu idle_cpu_kvm = {
+   .state_table = kvm_cstates,
+};
+
+
 static const struct idle_cpu idle_cpu_lincroft = {
.state_table = atom_cstates,
.auto_demotion_disable_flags = ATM_LNC_C6_AUTO_DEMOTE,
@@ -1061,7 +1115,7 @@ static const struct idle_cpu idle_cpu_dnv = {
 };
 
 #define ICPU(model, cpu) \
-   { X86_VENDOR_INTEL, 6, model, X86_FEATURE_MWAIT, (unsigned long)&cpu }
+   { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
 
 static const struct x86_cpu_id intel_idle_ids[] __initconst = {
ICPU(INTEL_FAM6_NEHALEM_EP, idle_cpu_nehalem),
@@ -1115,6 +1169,7 @@ static int __init intel_idle_probe(void)
pr_debug("disabled\n");
return -EPERM;
}
+   pr_err(PREFIX "enabled\n");
 
id = x86_match_cpu(intel_idle_ids);
if (!id) {
@@ -1125,19 +1180,39 @@ static int __init intel_idle_probe(void)
return -ENODEV;