Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
No, it is a natural result of an implementation which treats setting the A bit as
an abnormal flow (e.g. in microcode as opposed to hardware).

On September 29, 2015 7:11:59 PM PDT, ebied...@xmission.com wrote:
>"H. Peter Anvin"  writes:
>
>> On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
>>> Linus Torvalds  writes:
>>> 
 On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski
> wrote:
>
> Does anyone know what happens if you stick a non-accessed segment in
> the GDT, map the GDT RO, and access it?

 You should get a #PF, as you guess, but go ahead and test it if you
 want to make sure.
>>> 
>>> I tested this by accident once when working on what has become
>>> known as coreboot.  Early in boot, with your GDT in an EEPROM,
>>> switching from real mode to 32bit protected mode causes a write and
>>> locks up the machine when the hardware declines the write to the
>>> GDT to set the accessed bit.  As I recall the write kept being
>>> retried and retried and retried...
>>> 
>>> Setting the access bit in the GDT cleared up the problem and I did
>>> not look back.
>>> 
>>> Way up in 64bit mode something might be different, but I don't know
>>> why cpu designers would waste the silicon.
>>> 
>>
>> This is totally different from a TLB violation.  In your case, the
>> write goes through as far as the CPU is concerned, but when the data
>> is fetched back, it hasn't changed.  A write to a TLB-protected
>> location will #PF.
>
>The key point is that a write is generated when the cpu needs to set
>the access bit.  I agree the failure points are different: a TLB fault
>vs a case where the hardware did not accept the write.
>
>The idea of a cpu reading back data (and not trusting its cache
>coherency controls) to verify the access bit gets set seems mind
>boggling.  That is slow, stupid, racy and incorrect.  Incorrect, as
>the cpu should set the access bit only once per segment register load.
>
>In my case I am pretty certain it was something very weird with the
>hardware not accepting the write and either not acknowledging the bus
>transaction or cancelling it.  In which case the cpu knew the write
>had not made it to the ``memory'' and was trying to cope.
>
>Eric

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Eric W. Biederman
"H. Peter Anvin"  writes:

> On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
>> Linus Torvalds  writes:
>> 
>>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  
>>> wrote:

 Does anyone know what happens if you stick a non-accessed segment in
 the GDT, map the GDT RO, and access it?
>>>
>>> You should get a #PF, as you guess, but go ahead and test it if you
>>> want to make sure.
>> 
>> I tested this by accident once when working on what has become known
>> as coreboot.  Early in boot, with your GDT in an EEPROM, switching from
>> real mode to 32bit protected mode causes a write and locks up the
>> machine when the hardware declines the write to the GDT to set the
>> accessed bit.  As I recall the write kept being retried and retried and
>> retried...
>> 
>> Setting the access bit in the GDT cleared up the problem and I did not
>> look back.
>> 
>> Way up in 64bit mode something might be different, but I don't know why
>> cpu designers would waste the silicon.
>> 
>
> This is totally different from a TLB violation.  In your case, the write
> goes through as far as the CPU is concerned, but when the data is
> fetched back, it hasn't changed.  A write to a TLB-protected location
> will #PF.

The key point is that a write is generated when the cpu needs to set the
access bit.  I agree the failure points are different: a TLB fault vs a
case where the hardware did not accept the write.

The idea of a cpu reading back data (and not trusting its cache
coherency controls) to verify the access bit gets set seems mind
boggling.  That is slow, stupid, racy and incorrect.  Incorrect, as the
cpu should set the access bit only once per segment register load.

In my case I am pretty certain it was something very weird with the
hardware not accepting the write and either not acknowledging the bus
transaction or cancelling it.  In which case the cpu knew the write had
not made it to the ``memory'' and was trying to cope.

Eric





Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
> Linus Torvalds  writes:
> 
>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>>
>>> Does anyone know what happens if you stick a non-accessed segment in
>>> the GDT, map the GDT RO, and access it?
>>
>> You should get a #PF, as you guess, but go ahead and test it if you
>> want to make sure.
> 
> I tested this by accident once when working on what has become known
> as coreboot.  Early in boot, with your GDT in an EEPROM, switching from
> real mode to 32bit protected mode causes a write and locks up the
> machine when the hardware declines the write to the GDT to set the
> accessed bit.  As I recall the write kept being retried and retried and
> retried...
> 
> Setting the access bit in the GDT cleared up the problem and I did not
> look back.
> 
> Way up in 64bit mode something might be different, but I don't know why
> cpu designers would waste the silicon.
> 

This is totally different from a TLB violation.  In your case, the write
goes through as far as the CPU is concerned, but when the data is
fetched back, it hasn't changed.  A write to a TLB-protected location
will #PF.

-hpa




Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>
>> Does anyone know what happens if you stick a non-accessed segment in
>> the GDT, map the GDT RO, and access it?
>
> You should get a #PF, as you guess, but go ahead and test it if you
> want to make sure.

I tested this by accident once when working on what has become known
as coreboot.  Early in boot, with your GDT in an EEPROM, switching from
real mode to 32bit protected mode causes a write and locks up the
machine when the hardware declines the write to the GDT to set the
accessed bit.  As I recall the write kept being retried and retried and
retried...

Setting the access bit in the GDT cleared up the problem and I did not
look back.

Way up in 64bit mode something might be different, but I don't know why
cpu designers would waste the silicon.

Eric


Re: [Qemu-devel] [PATCH 1/3] target-i386: add a subsection of vcpu's TSC rate in vmstate_x86_cpu

2015-09-29 Thread Haozhong Zhang
On Tue, Sep 29, 2015 at 08:00:13PM +0100, Dr. David Alan Gilbert wrote:
> * Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> > The newly added subsection 'vmstate_tsc_khz' in this patch results in
> > vcpu's TSC rate being saved on the source machine and loaded on the
> > target machine during the migration.
> > 
> > Signed-off-by: Haozhong Zhang 
> 
> Hi,
>   I'd appreciate it if you could tie this to only do it on newer
> machine types; that way it won't break back migration.
>

Does "back migration" mean migrating from QEMU w/ vmstate_tsc_khz
subsection to QEMU w/o that subsection?

- Haozhong

> Dave
> 
> > ---
> >  target-i386/machine.c | 20 
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/target-i386/machine.c b/target-i386/machine.c
> > index 9fa0563..80108a3 100644
> > --- a/target-i386/machine.c
> > +++ b/target-i386/machine.c
> > @@ -752,6 +752,25 @@ static const VMStateDescription vmstate_xss = {
> >  }
> >  };
> >  
> > +static bool tsc_khz_needed(void *opaque)
> > +{
> > +X86CPU *cpu = opaque;
> > +CPUX86State *env = &cpu->env;
> > +
> > +return env->tsc_khz != 0;
> > +}
> > +
> > +static const VMStateDescription vmstate_tsc_khz = {
> > +.name = "cpu/tsc_khz",
> > +.version_id = 1,
> > +.minimum_version_id = 1,
> > +.needed = tsc_khz_needed,
> > +.fields = (VMStateField[]) {
> > +VMSTATE_INT64(env.tsc_khz, X86CPU),
> > +VMSTATE_END_OF_LIST()
> > +}
> > +};
> > +
> >  VMStateDescription vmstate_x86_cpu = {
> >  .name = "cpu",
> >  .version_id = 12,
> > @@ -871,6 +890,7 @@ VMStateDescription vmstate_x86_cpu = {
> >  &vmstate_msr_hyperv_crash,
> >  &vmstate_avx512,
> >  &vmstate_xss,
> > +&vmstate_tsc_khz,
> >  NULL
> >  }
> >  };
> > -- 
> > 2.4.8
> > 
> > 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH 3/3] kvm-all: notice KVM of vcpu's TSC rate after migration

2015-09-29 Thread Haozhong Zhang
On Tue, Sep 29, 2015 at 03:02:07PM -0300, Eduardo Habkost wrote:
> On Tue, Sep 29, 2015 at 11:43:34AM +0800, Haozhong Zhang wrote:
> > On Mon, Sep 28, 2015 at 01:37:34PM -0300, Eduardo Habkost wrote:
> > > On Mon, Sep 28, 2015 at 01:38:31PM +0800, Haozhong Zhang wrote:
> [...]
> > > >  static void do_kvm_cpu_synchronize_post_init(void *arg)
> > > >  {
> > > >  CPUState *cpu = arg;
> > > > +CPUX86State *env = &X86_CPU(cpu)->env;
> > > > +int r;
> > > > +
> > > > +/*
> > > > + * XXX: KVM_SET_TSC_KHZ must be done before kvm_arch_put_registers().
> > > 
> > > Could you explain where this requirement comes from?
> > >
> > 
> > kvm_arch_put_registers() below will set up vcpu's MSR_IA32_TSC through
> > KVM ioctl KVM_SET_MSRS. KVM needs to know vcpu's TSC rate so as to
> > correctly scale the TSC value given by QEMU, especially when vcpu's
> > TSC rate is different than the host TSC rate (e.g. it's migrated from
> > another machine w/ different host TSC rate than the current one).
> 
> Thanks. The comment above could contain a short version of this
> explanation, e.g.: "KVM needs KVM_SET_TSC_KHZ to be done before
> KVM_SET_MSRS sets MSR_IA32_TSC (done by kvm_arch_put_registers())".
>

will include this in the comment

> > 
> [...]
> > > Or maybe we shouldn't treat this as VM state, but as configuration, and
> > > let management configure the TSC frequency explicitly if the user really
> > > needs it to stay the same during migration.
> > >
> > > (CCing libvir-list to see if they have feedback)
> > >
> > 
> > Thanks for the CC. I'll consider adding a command line option to
> > control the configuration of the guest TSC frequency.
> 
> It already exists, -cpu has a "tsc-freq" option.
>

What I'm considering is to add a "-keep-tsc-freq" option to control
whether the TSC frequency should be migrated when "tsc-freq" is not
present.  Compared to "tsc-freq", "-keep-tsc-freq" frees users from
having to figure out the guest TSC frequency manually for migration.

- Haozhong

> -- 
> Eduardo


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
On 09/29/2015 11:02 AM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds
>  wrote:
>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>>
>>> Does anyone know what happens if you stick a non-accessed segment in
>>> the GDT, map the GDT RO, and access it?
>>
>> You should get a #PF, as you guess, but go ahead and test it if you
>> want to make sure.
> 
> Then I think that, if we do this, the patch series should first make
> it percpu and fixmapped but RW and then flip it RO as a separate patch
> in case we need to revert the actual RO bit.  I don't want to break
> Wine or The Witcher 2 because of this, and we might need various
> fixups.  I really hope that no one uses get_thread_area to check
> whether TLS has been accessed.
> 

Of course.

-hpa




Re: [Qemu-devel] [PATCH 1/3] target-i386: add a subsection of vcpu's TSC rate in vmstate_x86_cpu

2015-09-29 Thread Dr. David Alan Gilbert
* Haozhong Zhang (haozhong.zh...@intel.com) wrote:
> The newly added subsection 'vmstate_tsc_khz' in this patch results in
> vcpu's TSC rate being saved on the source machine and loaded on the
> target machine during the migration.
> 
> Signed-off-by: Haozhong Zhang 

Hi,
  I'd appreciate it if you could tie this to only do it on newer
machine types; that way it won't break back migration.
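
One way to tie it to machine types, sketched here purely for illustration
under the assumption of a hypothetical compat flag 'migrate_tsc_khz' that
older machine types would clear, is to gate the subsection's .needed
callback:

static bool tsc_khz_needed(void *opaque)
{
    X86CPU *cpu = opaque;

    /* 'migrate_tsc_khz' is an assumed compat property, not part of the
     * posted patch; machine-type compat code would clear it on older
     * machine types so the subsection is never sent and backward
     * migration keeps working. */
    return cpu->env.tsc_khz != 0 && cpu->migrate_tsc_khz;
}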

Dave

> ---
>  target-i386/machine.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/target-i386/machine.c b/target-i386/machine.c
> index 9fa0563..80108a3 100644
> --- a/target-i386/machine.c
> +++ b/target-i386/machine.c
> @@ -752,6 +752,25 @@ static const VMStateDescription vmstate_xss = {
>  }
>  };
>  
> +static bool tsc_khz_needed(void *opaque)
> +{
> +X86CPU *cpu = opaque;
> +CPUX86State *env = &cpu->env;
> +
> +return env->tsc_khz != 0;
> +}
> +
> +static const VMStateDescription vmstate_tsc_khz = {
> +.name = "cpu/tsc_khz",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.needed = tsc_khz_needed,
> +.fields = (VMStateField[]) {
> +VMSTATE_INT64(env.tsc_khz, X86CPU),
> +VMSTATE_END_OF_LIST()
> +}
> +};
> +
>  VMStateDescription vmstate_x86_cpu = {
>  .name = "cpu",
>  .version_id = 12,
> @@ -871,6 +890,7 @@ VMStateDescription vmstate_x86_cpu = {
>  &vmstate_msr_hyperv_crash,
>  &vmstate_avx512,
>  &vmstate_xss,
> +&vmstate_tsc_khz,
>  NULL
>  }
>  };
> -- 
> 2.4.8
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
Ugh.  Didn't realize that.

On September 29, 2015 11:22:04 AM PDT, Andy Lutomirski wrote:
>On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin  wrote:
>> SGDT would be easy to use, and it is logical that it is faster since
>> it reads an internal register.  SIDT does too but unlike the GDT has
>> a secondary limit (it can never be larger than 4096 bytes) and so
>> all limits in the range 4095-65535 are exactly equivalent.
>>
>
>Using the IDT limit would have been a great idea if Intel hadn't
>decided to clobber it on every VM exit.
>
>--Andy

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin  wrote:
> SGDT would be easy to use, and it is logical that it is faster since it reads 
> an internal register.  SIDT does too but unlike the GDT has a secondary limit 
> (it can never be larger than 4096 bytes) and so all limits in the range 
> 4095-65535 are exactly equivalent.
>

Using the IDT limit would have been a great idea if Intel hadn't
decided to clobber it on every VM exit.

--Andy


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
SGDT would be easy to use, and it is logical that it is faster since it reads 
an internal register.  SIDT does too but unlike the GDT has a secondary limit 
(it can never be larger than 4096 bytes) and so all limits in the range 
4095-65535 are exactly equivalent.

Anything that causes a write to the GDT will #PF if it is read-only.  So yes,
we need to force the accessed bit on.  This shouldn't be a problem and in
fact ought to be a performance improvement.
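
For reference, a sketch of what forcing the bit amounts to (my
illustration, not code from this series): the accessed flag is bit 0 of
the descriptor type field, i.e. bit 40 of the raw 64-bit GDT entry, so
descriptors can simply be installed with it already set:

#include <stdint.h>

#define GDT_ACCESSED_BIT (1ULL << 40)	/* bit 0 of the type field */

/* Pre-set the accessed bit so the CPU never has to write the descriptor
 * back on a segment register load. */
static inline uint64_t gdt_entry_force_accessed(uint64_t desc)
{
	return desc | GDT_ACCESSED_BIT;
}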

On September 29, 2015 10:35:38 AM PDT, Andy Lutomirski wrote:
>On Sep 29, 2015 2:01 AM, "Ingo Molnar"  wrote:
>>
>>
>> * Denys Vlasenko  wrote:
>>
>> > On 09/28/2015 09:58 AM, Ingo Molnar wrote:
>> > >
>> > > * Denys Vlasenko  wrote:
>> > >
>> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
>> > >>> NAK.  We really should map the GDT read-only on all 64 bit
>> > >>> systems, since we can't hide the address from SLDT.  Same with
>> > >>> the IDT.
>> > >>
>> > >> Sorry, I don't understand your point.
>> > >
>> > > So the problem is that right now the SGDT instruction (which is
>> > > unprivileged) leaks the real address of the kernel image:
>> > >
>> > >  fomalhaut:~> ./sgdt
>> > >  SGDT: 88303fd89000 / 007f
>> > >
>> > > that '88303fd89000' is a kernel address.
>> >
>> > Thank you.
>> > I do know that SGDT and friends are unprivileged on x86
>> > and thus they allow userspace (and guest kernels in paravirt)
>> > learn things they don't need to know.
>> >
>> > I don't see how making GDT page-aligned and page-sized
>> > changes anything in this regard. SGDT will still work,
>> > and still leak GDT address.
>>
>> Well, as I try to explain it in the other part of my mail, doing so
>> enables us to remap the GDT to a less security sensitive virtual
>> address that does not leak the kernel's randomized address:
>>
>> > > Your observation in the changelog and your patch:
>> > >
>> >  It is page-sized because of paravirt. [...]
>> > >
>> > > ... conflicts with the intention to mark (remap) the primary GDT
>> > > address read-only on native kernels as well.
>> > >
>> > > So what we should do instead is to use the page alignment
>> > > properly and remap the GDT to a read-only location, and load
>> > > that one.
>> >
>> > If we'd have a small GDT (i.e. what my patch does), we still can
>> > remap the entire page which contains small GDT, and simply don't
>> > care that some other data is also visible through that RO page.
>>
>> That's generally considered fragile: suppose an attacker has a
>> limited information leak that can read absolute addresses with
>> system privilege but he doesn't know the kernel's randomized base
>> offset. With a 'partial page' mapping there could be function
>> pointers near the GDT, part of the page the GDT happens to be on,
>> that leak this information.
>>
>> (Same goes for crypto keys or other critical information (like
>> canary information, salts, etc.) accidentally ending up nearby.)
>>
>> Arguably it's a bit tenuous, but when playing remapping games it's
>> generally considered good to be page aligned and page sized, with
>> zero padding.
>>
>> > > This would have a couple of advantages:
>> > >
>> > >  - This would give kernel address randomization more teeth on
>> > >x86.
>> > >
>> > >  - An additional advantage would be that rootkits overwriting
>> > >the GDT would have a bit more work to do.
>> > >
>> > >  - A third advantage would be that for NUMA systems we could
>> > >'mirror' the GDT into node-local memory and load those. This
>> > >makes GDT load cache-misses a bit less expensive.
>> >
>> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local?
>>
>> Indeed it is:
>>
>> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
>> SGDT: 88103fa09000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa69000 / 007f
>>
>> I confused it with the IDT, which is still global.
>>
>> This also means that the GDT in itself does not leak kernel
>> addresses at the moment, except it leaks the layout of the percpu
>> area.
>>
>> So my suggestion would be to:
>>
>>  - make the GDT unconditionally page aligned and sized, then remap
>>it to a read-only address unconditionally as well, like we do it
>>for the IDT.
>
>Does anyone know what happens if you stick a non-accessed segment in
>the GDT, map the GDT RO, and access it?  The docs are extremely vague
>on the interplay between segmentation and paging on the segmentation
>structures themselves.  My guess is that it causes #PF.  This might
>break set_thread_area users unless we change set_thread_area to force
>the accessed bit on.
>
>There's a possible worse failure mode: if someone pokes an un-accessed
>segment into SS or CS using sigreturn, then it's within the realm of
>possibility that IRET would generate #PF (hey Intel and AMD, please
>document this!).  I don't think th

Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds
 wrote:
> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>
>> Does anyone know what happens if you stick a non-accessed segment in
>> the GDT, map the GDT RO, and access it?
>
> You should get a #PF, as you guess, but go ahead and test it if you
> want to make sure.
>

Then I think that, if we do this, the patch series should first make
it percpu and fixmapped but RW and then flip it RO as a separate patch
in case we need to revert the actual RO bit.  I don't want to break
Wine or The Witcher 2 because of this, and we might need various
fixups.  I really hope that no one uses get_thread_area to check
whether TLS has been accessed.

--Andy


Re: [PATCH 3/3] kvm-all: notice KVM of vcpu's TSC rate after migration

2015-09-29 Thread Eduardo Habkost
On Tue, Sep 29, 2015 at 11:43:34AM +0800, Haozhong Zhang wrote:
> On Mon, Sep 28, 2015 at 01:37:34PM -0300, Eduardo Habkost wrote:
> > On Mon, Sep 28, 2015 at 01:38:31PM +0800, Haozhong Zhang wrote:
[...]
> > >  static void do_kvm_cpu_synchronize_post_init(void *arg)
> > >  {
> > >  CPUState *cpu = arg;
> > > +CPUX86State *env = &X86_CPU(cpu)->env;
> > > +int r;
> > > +
> > > +/*
> > > + * XXX: KVM_SET_TSC_KHZ must be done before kvm_arch_put_registers().
> > 
> > Could you explain where this requirement comes from?
> >
> 
> kvm_arch_put_registers() below will set up vcpu's MSR_IA32_TSC through
> KVM ioctl KVM_SET_MSRS. KVM needs to know vcpu's TSC rate so as to
> correctly scale the TSC value given by QEMU, especially when vcpu's
> TSC rate is different than the host TSC rate (e.g. it's migrated from
> another machine w/ different host TSC rate than the current one).

Thanks. The comment above could contain a short version of this
explanation, e.g.: "KVM needs KVM_SET_TSC_KHZ to be done before
KVM_SET_MSRS sets MSR_IA32_TSC (done by kvm_arch_put_registers())".
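
A rough reconstruction of that ordering, condensed for illustration (this
is a sketch, not the posted patch, and error handling is abbreviated):

static void do_kvm_cpu_synchronize_post_init(void *arg)
{
    CPUState *cpu = arg;
    CPUX86State *env = &X86_CPU(cpu)->env;

    /* KVM needs KVM_SET_TSC_KHZ before KVM_SET_MSRS sets MSR_IA32_TSC,
     * which kvm_arch_put_registers() below will do. */
    if (env->tsc_khz &&
        kvm_vcpu_ioctl(cpu, KVM_SET_TSC_KHZ, env->tsc_khz) < 0) {
        fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
    }

    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->kvm_vcpu_dirty = false;
}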

> 
[...]
> > Or maybe we shouldn't treat this as VM state, but as configuration, and
> > let management configure the TSC frequency explicitly if the user really
> > needs it to stay the same during migration.
> >
> > (CCing libvir-list to see if they have feedback)
> >
> 
> Thanks for the CC. I'll consider adding a command line option to
> control the configuration of the guest TSC frequency.

It already exists, -cpu has a "tsc-freq" option.
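
(For example, something along the lines of "-cpu host,tsc-freq=2400000000";
the frequency value here is purely illustrative.)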

-- 
Eduardo


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Linus Torvalds
On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>
> Does anyone know what happens if you stick a non-accessed segment in
> the GDT, map the GDT RO, and access it?

You should get a #PF, as you guess, but go ahead and test it if you
want to make sure.

We do something very similar for the old Pentium F0 0F bug - we mark
the IDT read-only, which causes the (bogus) locked read of the IDT
entry that the F00F bug resulted in to be caught as a page fault
instead.

  Linus


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Sep 29, 2015 2:01 AM, "Ingo Molnar"  wrote:
>
>
> * Denys Vlasenko  wrote:
>
> > On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> > >
> > > * Denys Vlasenko  wrote:
> > >
> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> > >>> NAK.  We really should map the GDT read-only on all 64 bit systems,
> > >>> since we can't hide the address from SLDT.  Same with the IDT.
> > >>
> > >> Sorry, I don't understand your point.
> > >
> > > So the problem is that right now the SGDT instruction (which is 
> > > unprivileged)
> > > leaks the real address of the kernel image:
> > >
> > >  fomalhaut:~> ./sgdt
> > >  SGDT: 88303fd89000 / 007f
> > >
> > > that '88303fd89000' is a kernel address.
> >
> > Thank you.
> > I do know that SGDT and friends are unprivileged on x86
> > and thus they allow userspace (and guest kernels in paravirt)
> > learn things they don't need to know.
> >
> > I don't see how making GDT page-aligned and page-sized
> > changes anything in this regard. SGDT will still work,
> > and still leak GDT address.
>
> Well, as I try to explain it in the other part of my mail, doing so enables 
> us to
> remap the GDT to a less security sensitive virtual address that does not leak 
> the
> kernel's randomized address:
>
> > > Your observation in the changelog and your patch:
> > >
> >  It is page-sized because of paravirt. [...]
> > >
> > > ... conflicts with the intention to mark (remap) the primary GDT address 
> > > read-only
> > > on native kernels as well.
> > >
> > > So what we should do instead is to use the page alignment properly and 
> > > remap the
> > > GDT to a read-only location, and load that one.
> >
> > If we'd have a small GDT (i.e. what my patch does), we still can remap the
> > entire page which contains small GDT, and simply don't care that some other 
> > data
> > is also visible through that RO page.
>
> That's generally considered fragile: suppose an attacker has a limited 
> information
> leak that can read absolute addresses with system privilege but he doesn't 
> know
> the kernel's randomized base offset. With a 'partial page' mapping there 
> could be
> function pointers near the GDT, part of the page the GDT happens to be on, 
> that
> leak this information.
>
> (Same goes for crypto keys or other critical information (like canary 
> information,
> salts, etc.) accidentally ending up nearby.)
>
> Arguably it's a bit tenuous, but when playing remapping games it's generally
> considered good to be page aligned and page sized, with zero padding.
>
> > > This would have a couple of advantages:
> > >
> > >  - This would give kernel address randomization more teeth on x86.
> > >
> > >  - An additional advantage would be that rootkits overwriting the GDT 
> > > would have
> > >a bit more work to do.
> > >
> > >  - A third advantage would be that for NUMA systems we could 'mirror' the 
> > > GDT into
> > >node-local memory and load those. This makes GDT load cache-misses a 
> > > bit less
> > >expensive.
> >
> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local?
>
> Indeed it is:
>
> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
> SGDT: 88103fa09000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa69000 / 007f
>
> I confused it with the IDT, which is still global.
>
> This also means that the GDT in itself does not leak kernel addresses at the
> moment, except it leaks the layout of the percpu area.
>
> So my suggestion would be to:
>
>  - make the GDT unconditionally page aligned and sized, then remap it to a
>read-only address unconditionally as well, like we do it for the IDT.

Does anyone know what happens if you stick a non-accessed segment in
the GDT, map the GDT RO, and access it?  The docs are extremely vague
on the interplay between segmentation and paging on the segmentation
structures themselves.  My guess is that it causes #PF.  This might
break set_thread_area users unless we change set_thread_area to force
the accessed bit on.

There's a possible worse failure mode: if someone pokes an un-accessed
segment into SS or CS using sigreturn, then it's within the realm of
possibility that IRET would generate #PF (hey Intel and AMD, please
document this!).  I don't think that would be rootable, but at the
very least we'd want to make sure it doesn't OOPS by either making it
impossible or adding an explicit test to sigreturn.c.

hpa pointed out in another thread that the GDT *must* be writable on
32-bit kernels because we use a task gate for NMI and jumping through
a task gate writes to the GDT.

On another note, SGDT is considerably faster than LSL, at least on
Sandy Bridge.  The vdso might be able to take advantage of that for
getcpu.
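
For anyone wanting to reproduce the numbers, a minimal userspace version
of the sgdt test quoted earlier might look like this (a sketch; the
original ./sgdt program is not part of this thread):

#include <stdint.h>
#include <stdio.h>

struct __attribute__((packed)) dtr {
	uint16_t limit;
	uint64_t base;
};

int main(void)
{
	struct dtr gdtr;

	/* SGDT is unprivileged (pre-UMIP) and stores the GDTR base/limit
	 * pair to memory; it reads the internal register, so it never
	 * faults on a read-only GDT mapping. */
	__asm__ __volatile__("sgdt %0" : "=m"(gdtr));
	printf("SGDT: %016llx / %04x\n",
	       (unsigned long long)gdtr.base, (unsigned)gdtr.limit);
	return 0;
}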

--Andy

[PATCH kvmtool] kvmtool: expose the TSC Deadline Timer feature to the guest

2015-09-29 Thread Dimitri John Ledkov
From: Arjan van de Ven 

With the TSC deadline timer feature, we don't need to calibrate the APIC
timers anymore, which saves more than 100 milliseconds of boot time.

Signed-off-by: Arjan van de Ven 
Signed-off-by: Dimitri John Ledkov 
---
 x86/cpuid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/x86/cpuid.c b/x86/cpuid.c
index c3b67d9..1d8bd23 100644
--- a/x86/cpuid.c
+++ b/x86/cpuid.c
@@ -31,6 +31,9 @@ static void filter_cpuid(struct kvm_cpuid2 *kvm_cpuid)
 		/* Set X86_FEATURE_HYPERVISOR */
 		if (entry->index == 0)
 			entry->ecx |= (1 << 31);
+		/* Set CPUID_EXT_TSC_DEADLINE_TIMER */
+		if (entry->index == 0)
+			entry->ecx |= (1 << 24);
 		break;
 	case 6:
 		/* Clear X86_FEATURE_EPB */
-- 
2.1.4
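
A guest can verify the result by checking CPUID leaf 1, ECX bit 24, the
bit set above (an illustrative check using GCC's cpuid.h helper):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("TSC deadline timer: %s\n",
	       (ecx & (1u << 24)) ? "present" : "absent");
	return 0;
}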



[PATCH kvmtool] Skip a few messages by default: command line args; flat binary; earlyprintk.

2015-09-29 Thread Dimitri John Ledkov
The partial command line args & earlyprintk=serial are still enabled
in debug mode.  The warning that a flat binary kernel image is being
loaded is dropped completely.  These messages are not that informative
once one is past initial debugging, and only pollute the console.

Signed-off-by: Dimitri John Ledkov 
---
 builtin-run.c | 10 ++
 kvm.c |  1 -
 x86/kvm.c |  8 ++--
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin-run.c b/builtin-run.c
index e0c8732..8edbf88 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -613,10 +613,12 @@ static struct kvm *kvm_cmd_run_init(int argc, const char **argv)
 
 	kvm->cfg.real_cmdline = real_cmdline;
 
-	printf("  # %s run -k %s -m %Lu -c %d --name %s\n", KVM_BINARY_NAME,
-		kvm->cfg.kernel_filename,
-		(unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
-		kvm->cfg.nrcpus, kvm->cfg.guest_name);
+	if (do_debug_print) {
+		printf("  # %s run -k %s -m %Lu -c %d --name %s\n", KVM_BINARY_NAME,
+		       kvm->cfg.kernel_filename,
+		       (unsigned long long)kvm->cfg.ram_size / 1024 / 1024,
+		       kvm->cfg.nrcpus, kvm->cfg.guest_name);
+	}
 
 	if (init_list__init(kvm) < 0)
 		die ("Initialisation failed");
diff --git a/kvm.c b/kvm.c
index 10ed230..1081072 100644
--- a/kvm.c
+++ b/kvm.c
@@ -378,7 +378,6 @@ bool kvm__load_kernel(struct kvm *kvm, const char *kernel_filename,
 	if (ret)
 		goto found_kernel;
 
-	pr_warning("%s is not a bzImage. Trying to load it as a flat binary...", kernel_filename);
 #endif
 
ret = load_elf_binary(kvm, fd_kernel, fd_initrd, kernel_cmdline);
diff --git a/x86/kvm.c b/x86/kvm.c
index 512ad67..4a5fa41 100644
--- a/x86/kvm.c
+++ b/x86/kvm.c
@@ -124,8 +124,12 @@ void kvm__arch_set_cmdline(char *cmdline, bool video)
 		"i8042.dumbkbd=1 i8042.nopnp=1");
 	if (video)
 		strcat(cmdline, " video=vesafb console=tty0");
-	else
-		strcat(cmdline, " console=ttyS0 earlyprintk=serial i8042.noaux=1");
+	else {
+		strcat(cmdline, " console=ttyS0 i8042.noaux=1");
+		if (do_debug_print) {
+			strcat(cmdline, " earlyprintk=serial");
+		}
+	}
 }
 
 /* Architecture-specific KVM init */
-- 
2.1.4



[PATCH v3 3/8] arm/arm64: KVM: vgic: Factor out level irq processing on guest exit

2015-09-29 Thread Christoffer Dall
Currently vgic_process_maintenance() handles a completed level-triggered
interrupt directly, but we are soon going to reuse this
logic for level-triggered mapped interrupts with the HW bit set, so
move this logic into a separate static function.

Probably the most scary part of this commit is convincing yourself that
the current flow is safe compared to the old one.  In the following I
try to list the changes and why they are harmless:

  Move vgic_irq_clear_queued after kvm_notify_acked_irq:
Harmless because the only potential effect of clearing the queued
flag wrt.  kvm_set_irq is that vgic_update_irq_pending does not set
the pending bit on the emulated CPU interface or in the
pending_on_cpu bitmask if the function is called with level=1.
However, the point of kvm_notify_acked_irq is to call kvm_set_irq
with level=0, and we set the queued flag again in
__kvm_vgic_sync_hwstate later on if the level is still high.

  Move vgic_set_lr before kvm_notify_acked_irq:
Also harmless because the LRs are cpu-local operations and
kvm_notify_acked only affects the distributor

  Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
Also harmless because it's just a bit which is cleared and altering
the line state does not affect this bit.

Reviewed-by: Eric Auger 
Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 88 ++---
 1 file changed, 50 insertions(+), 38 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 6bd1c9b..fe0e5db 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1322,12 +1322,56 @@ epilog:
}
 }
 
+static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
+{
+   int level_pending = 0;
+
+   vlr.state = 0;
+   vlr.hwirq = 0;
+   vgic_set_lr(vcpu, lr, vlr);
+
+   /*
+* If the IRQ was EOIed (called from vgic_process_maintenance) or it
+* went from active to non-active (called from vgic_sync_hwirq) it was
+* also ACKed and we therefore assume we can clear the soft pending
+* state (should it have been set) for this interrupt.
+*
+* Note: if the IRQ soft pending state was set after the IRQ was
+* acked, it actually shouldn't be cleared, but we have no way of
+* knowing that unless we start trapping ACKs when the soft-pending
+* state is set.
+*/
+   vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
+
+   /*
+* Tell the gic to start sampling the line of this interrupt again.
+*/
+   vgic_irq_clear_queued(vcpu, vlr.irq);
+
+   /* Any additional pending interrupt? */
+   if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
+   vgic_cpu_irq_set(vcpu, vlr.irq);
+   level_pending = 1;
+   } else {
+   vgic_dist_irq_clear_pending(vcpu, vlr.irq);
+   vgic_cpu_irq_clear(vcpu, vlr.irq);
+   }
+
+   /*
+* Despite being EOIed, the LR may not have
+* been marked as empty.
+*/
+   vgic_sync_lr_elrsr(vcpu, lr, vlr);
+
+   return level_pending;
+}
+
 static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 {
u32 status = vgic_get_interrupt_status(vcpu);
struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
-   bool level_pending = false;
struct kvm *kvm = vcpu->kvm;
+   int level_pending = 0;
 
kvm_debug("STATUS = %08x\n", status);
 
@@ -1342,54 +1386,22 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 
for_each_set_bit(lr, eisr_ptr, vgic->nr_lr) {
struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
-   WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
 
-   spin_lock(&dist->lock);
-   vgic_irq_clear_queued(vcpu, vlr.irq);
+   WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
WARN_ON(vlr.state & LR_STATE_MASK);
-   vlr.state = 0;
-   vgic_set_lr(vcpu, lr, vlr);
 
-   /*
-* If the IRQ was EOIed it was also ACKed and we we
-* therefore assume we can clear the soft pending
-* state (should it had been set) for this interrupt.
-*
-* Note: if the IRQ soft pending state was set after
-* the IRQ was acked, it actually shouldn't be
-* cleared, but we have no way of knowing that unless
-* we start trapping ACKs when the soft-pending state
-* is set.
-*/
-   vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
 
/*
 * kvm_notify_acked_irq calls kvm_set_irq()
-* to 

[PATCH v3 4/8] arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIs

2015-09-29 Thread Christoffer Dall
The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
We currently simulate this behavior by writing a hardcoded value to the
register for the SGIs and PPIs on every write of these bits to the
register (ignoring what the guest actually wrote), and by writing the
same value as the reset value to the register.

This is a bit counter-intuitive, as the register is RO for these bits,
and we can just implement it that way, allowing us to control the value
of the bits purely in the reset code.

Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index fe0e5db..e606f78 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -655,7 +655,7 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio *mmio,
ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
if (mmio->is_write) {
if (offset < 8) {
-   *reg = ~0U; /* Force PPIs/SGIs to 1 */
+   /* Ignore writes to read-only SGI and PPI bits */
return false;
}
 
-- 
2.1.2.330.g565301e.dirty



[PATCH v3 8/8] arm/arm64: KVM: Support edge-triggered forwarded interrupts

2015-09-29 Thread Christoffer Dall
We mark edge-triggered interrupts with the HW bit set as queued to
prevent the VGIC code from injecting LRs with both the Active and
Pending bits set at the same time while also setting the HW bit,
because the hardware does not support this.

However, this means that we must also clear the queued flag when we sync
back a LR where the state on the physical distributor went from active
to inactive because the guest deactivated the interrupt.  At this point
we must also check if the interrupt is pending on the distributor, and
tell the VGIC to queue it again if it is.

Since these actions on the sync path are extremely close to those for
level-triggered interrupts, rename process_level_irq to
process_queued_irq, allowing it to cater for both cases.

Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 53548f1..f3e76e5 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1322,13 +1322,10 @@ epilog:
}
 }
 
-static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
+static int process_queued_irq(struct kvm_vcpu *vcpu,
+  int lr, struct vgic_lr vlr)
 {
-   int level_pending = 0;
-
-   vlr.state = 0;
-   vlr.hwirq = 0;
-   vgic_set_lr(vcpu, lr, vlr);
+   int pending = 0;
 
/*
 * If the IRQ was EOIed (called from vgic_process_maintenance) or it
@@ -1344,26 +1341,35 @@ static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
 
/*
-* Tell the gic to start sampling the line of this interrupt again.
+* Tell the gic to start sampling this interrupt again.
 */
vgic_irq_clear_queued(vcpu, vlr.irq);
 
/* Any additional pending interrupt? */
-   if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
-   vgic_cpu_irq_set(vcpu, vlr.irq);
-   level_pending = 1;
+   if (vgic_irq_is_edge(vcpu, vlr.irq)) {
+   BUG_ON(!(vlr.state & LR_HW));
+   pending = vgic_dist_irq_is_pending(vcpu, vlr.irq);
} else {
-   vgic_dist_irq_clear_pending(vcpu, vlr.irq);
-   vgic_cpu_irq_clear(vcpu, vlr.irq);
+   if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
+   vgic_cpu_irq_set(vcpu, vlr.irq);
+   pending = 1;
+   } else {
+   vgic_dist_irq_clear_pending(vcpu, vlr.irq);
+   vgic_cpu_irq_clear(vcpu, vlr.irq);
+   }
}
 
/*
 * Despite being EOIed, the LR may not have
 * been marked as empty.
 */
+   vlr.state = 0;
+   vlr.hwirq = 0;
+   vgic_set_lr(vcpu, lr, vlr);
+
vgic_sync_lr_elrsr(vcpu, lr, vlr);
 
-   return level_pending;
+   return pending;
 }
 
 static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
@@ -1400,7 +1406,7 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 vlr.irq - VGIC_NR_PRIVATE_IRQS);
 
spin_lock(&dist->lock);
-   level_pending |= process_level_irq(vcpu, lr, vlr);
+   level_pending |= process_queued_irq(vcpu, lr, vlr);
spin_unlock(&dist->lock);
}
}
@@ -1422,7 +1428,7 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
 /*
  * Save the physical active state, and reset it to inactive.
  *
- * Return true if there's a pending level triggered interrupt line to queue.
+ * Return true if there's a pending forwarded interrupt to queue.
  */
 static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
 {
@@ -1458,10 +1464,8 @@ static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
return false;
}
 
-   /* Mapped edge-triggered interrupts not yet supported. */
-   WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
spin_lock(&dist->lock);
-   level_pending = process_level_irq(vcpu, lr, vlr);
+   level_pending = process_queued_irq(vcpu, lr, vlr);
spin_unlock(&dist->lock);
return level_pending;
 }
-- 
2.1.2.330.g565301e.dirty



[PATCH v3 1/8] KVM: Add kvm_arch_vcpu_{un}blocking callbacks

2015-09-29 Thread Christoffer Dall
Sometimes it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).

Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.

Reviewed-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_host.h | 3 +++
 arch/arm64/include/asm/kvm_host.h   | 3 +++
 arch/mips/include/asm/kvm_host.h| 2 ++
 arch/powerpc/include/asm/kvm_host.h | 2 ++
 arch/s390/include/asm/kvm_host.h| 2 ++
 arch/x86/include/asm/kvm_host.h | 3 +++
 include/linux/kvm_host.h| 2 ++
 virt/kvm/kvm_main.c | 3 +++
 8 files changed, 20 insertions(+)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 3df1e97..ca41d94 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -233,4 +233,7 @@ static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arm_clear_debug(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu) {}
 
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 4562459..5dae2d8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -254,4 +254,7 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
 
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 3a54dbc..b0fa534 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -846,5 +846,7 @@ static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
 #endif /* __MIPS_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 195886a..c60e361 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -717,5 +717,7 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslot
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_exit(void) {}
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 6ce4a0b..5219b4e 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -643,5 +643,7 @@ static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslot
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot) {}
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
 #endif
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 349f80a..b926137 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1232,4 +1232,7 @@ int x86_set_memory_region(struct kvm *kvm,
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 
+static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
+static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e2..4a86f5f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -625,6 +625,8 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
 void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 int kvm_vcpu_yield_to(struct kvm_vcpu *target);
 void kvm_vcpu_on_spin(struct kvm_vcpu
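
The virt/kvm/kvm_main.c hunk is cut off in this archive; going by the
diffstat (3 insertions) and the commit message, it presumably brackets the
blocking wait with the new callbacks, roughly like this reconstruction
(not the actual hunk):

void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
	kvm_arch_vcpu_blocking(vcpu);

	/* ... existing prepare-to-wait/schedule() loop unchanged ... */

	kvm_arch_vcpu_unblocking(vcpu);
}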

[PATCH v3 6/8] arm/arm64: KVM: Add forwarded physical interrupts documentation

2015-09-29 Thread Christoffer Dall
Forwarded physical interrupts on arm/arm64 are a tricky concept, and the
way we deal with them is not easy to understand just by reading the
various specs.

Therefore, add a proper documentation file explaining the flow and
rationale of the behavior of the vgic.

Some of this text was contributed by Marc Zyngier and edited by me.
Omissions and errors are all mine.

Signed-off-by: Christoffer Dall 
---
 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 186 +
 1 file changed, 186 insertions(+)
 create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt

diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
new file mode 100644
index 000..213d650
--- /dev/null
+++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt
@@ -0,0 +1,186 @@
+KVM/ARM VGIC Forwarded Physical Interrupts
+==
+
+The KVM/ARM code implements software support for the ARM Generic
+Interrupt Controller's (GIC's) hardware support for virtualization by
+allowing software to inject virtual interrupts to a VM, which the guest
+OS sees as regular interrupts.  The code is famously known as the VGIC.
+
+Some of these virtual interrupts, however, correspond to physical
+interrupts from real physical devices.  One example could be the
+architected timer, which itself supports virtualization, and therefore
+lets a guest OS program the hardware device directly to raise an
+interrupt at some point in time.  When such an interrupt is raised, the
+host OS initially handles the interrupt and must somehow signal this
+event as a virtual interrupt to the guest.  Another example could be a
+passthrough device, where the physical interrupts are initially handled
+by the host, but the device driver for the device lives in the guest OS
+and KVM must therefore somehow inject a virtual interrupt on behalf of
+the physical one to the guest OS.
+
+These virtual interrupts corresponding to a physical interrupt on the
+host are called forwarded physical interrupts, but are also sometimes
+referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
+
+Forwarded physical interrupts are handled slightly differently compared
+to virtual interrupts generated purely by a software emulated device.
+
+
+The HW bit
+--
+Virtual interrupts are signalled to the guest by programming the List
+Registers (LRs) on the GIC before running a VCPU.  The LR is programmed
+with the virtual IRQ number and the state of the interrupt (Pending,
+Active, or Pending+Active).  When the guest ACKs and EOIs a virtual
+interrupt, the LR state moves from Pending to Active, and finally to
+inactive.
+
+The LRs include an extra bit, called the HW bit.  When this bit is set,
+KVM must also program an additional field in the LR, the physical IRQ
+number, to link the virtual with the physical IRQ.
+
+When the HW bit is set, KVM must EITHER set the Pending OR the Active
+bit, never both at the same time.
+
+Setting the HW bit causes the hardware to deactivate the physical
+interrupt on the physical distributor when the guest deactivates the
+corresponding virtual interrupt.
+
+
+Forwarded Physical Interrupts Life Cycle
+
+
+The state of forwarded physical interrupts is managed in the following way:
+
+  - The physical interrupt is acked by the host, and becomes active on
+the physical distributor (*).
+  - KVM sets the LR.Pending bit, because this is the only way the GICV
+interface is going to present it to the guest.
+  - LR.Pending will stay set as long as the guest has not acked the interrupt.
+  - LR.Pending transitions to LR.Active on the guest read of the IAR, as
+expected.
+  - On guest EOI, the *physical distributor* active bit gets cleared,
+but the LR.Active is left untouched (set).
+  - KVM clears the LR on VM exits when the physical distributor
+active state has been cleared.
+
+(*): The host handling is slightly more complicated.  For some forwarded
+interrupts (shared), KVM directly sets the active state on the physical
+distributor before entering the guest, because the interrupt is never actually
+handled on the host (see details on the timer as an example below).  For other
+forwarded interrupts (non-shared) the host does not deactivate the interrupt
+when the host ISR completes, but leaves the interrupt active until the guest
+deactivates it.  Leaving the interrupt active is allowed, because the GIC is
+configured with EOIMode=1, which causes EOI operations to perform a priority
+drop allowing the GIC to receive other interrupts of the default priority.
+
+
+Forwarded Edge and Level Triggered PPIs and SPIs
+
+Forwarded physical interrupts injected should always be active on the
+physical distributor when injected to a guest.
+
+Level-triggered interrupts will keep the interrupt line to the GIC
+asserted, typically until 

[PATCH v3 0/8] Rework architected timer and forwarded IRQs handling

2015-09-29 Thread Christoffer Dall
The architected timer integration with the VGIC had some shortcomings in
that certain guest operations weren't fully supported.

This series tries to address these problems in providing level-triggered
semantics for the arch timer and VGIC integration and seeks to clarify
the behavior when setting/clearing the active state on the physical
distributor.

It also fixes a few other bugs in the VGIC code and finally adds support
for edge-triggered forwarded interrupts.

The edge-triggered forwarded interrupts code is untested, but that is
probably better than clearly doing something wrong and raising a warning.

Changes since v2:
 - Various spelling and comment fixes
 - Clarified timer example in Documentation and fixed EOIMode=1 explanation
 - Added spin lock around process_queued_irq as pointed out by Andre.

Changes since v1:
 - Sent out bug fixes for active state and UEFI reset as separate
   patches.
 - Fixed various spelling nits
 - Rewrote proposed documentation file trying to address Eric's and
   Marc's comments
 - Rewrote kvm_timer_update_irq and kvm_timer_update_state according to
   Marc's suggestion (thanks!)
 - Added additional patch to support edge-triggered forwarded
   interrupts.

Christoffer Dall (8):
  KVM: Add kvm_arch_vcpu_{un}blocking callbacks
  arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block
  arm/arm64: KVM: vgic: Factor out level irq processing on guest exit
  arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIs
  arm/arm64: KVM: Use appropriate define in VGIC reset code
  arm/arm64: KVM: Add forwarded physical interrupts documentation
  arm/arm64: KVM: Rework the arch timer to use level-triggered semantics
  arm/arm64: KVM: Support edge-triggered forwarded interrupts

 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 186 +
 arch/arm/kvm/arm.c |  21 ++-
 arch/mips/include/asm/kvm_host.h   |   2 +
 arch/powerpc/include/asm/kvm_host.h|   2 +
 arch/s390/include/asm/kvm_host.h   |   2 +
 arch/x86/include/asm/kvm_host.h|   3 +
 include/kvm/arm_arch_timer.h   |   4 +-
 include/kvm/arm_vgic.h |   3 -
 include/linux/kvm_host.h   |   2 +
 virt/kvm/arm/arch_timer.c  | 149 +++--
 virt/kvm/arm/vgic.c| 172 +--
 virt/kvm/kvm_main.c|   3 +
 12 files changed, 408 insertions(+), 141 deletions(-)
 create mode 100644 Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt

-- 
2.1.2.330.g565301e.dirty



[PATCH v3 2/8] arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block

2015-09-29 Thread Christoffer Dall
We currently schedule a soft timer every time we exit the guest if the
timer did not expire while running the guest.  This is really not
necessary, because the only work we do in the timer work function is to
kick the vcpu.

Kicking the vcpu does two things:
(1) If the vcpu thread is on a waitqueue, make it runnable and remove it
from the waitqueue.
(2) If the vcpu is running on a different physical CPU from the one
doing the kick, it sends a reschedule IPI.

The second case cannot happen, because the soft timer is only ever
scheduled when the vcpu is not running.  The first case is only relevant
when the vcpu thread is on a waitqueue, which is only the case when the
vcpu thread has called kvm_vcpu_block().

Therefore, we only need to make sure a timer is scheduled for
kvm_vcpu_block(), which we do by encapsulating all calls to
kvm_vcpu_block() with kvm_timer_{un}schedule calls.
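
(A minimal sketch of what the generic side looks like after patch 1/8; the
wait loop itself is elided and the exact hook placement in kvm_main.c may
differ slightly:)

void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
	kvm_arch_vcpu_blocking(vcpu);	/* arm/arm64: kvm_timer_schedule() */

	/* ... existing prepare_to_wait()/schedule() loop ... */

	kvm_arch_vcpu_unblocking(vcpu);	/* arm/arm64: kvm_timer_unschedule() */
}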

Additionally, we only schedule a soft timer if the timer is enabled and
unmasked, since it is useless otherwise.

Note that theoretically userspace can use the SET_ONE_REG interface to
change registers that should cause the timer to fire, even if the vcpu
is blocked without a scheduled timer, but this case was not supported
before this patch and we leave it for future work for now.

Signed-off-by: Christoffer Dall 
---
 arch/arm/include/asm/kvm_host.h   |  3 --
 arch/arm/kvm/arm.c| 10 +
 arch/arm64/include/asm/kvm_host.h |  3 --
 include/kvm/arm_arch_timer.h  |  2 +
 virt/kvm/arm/arch_timer.c | 92 ---
 5 files changed, 70 insertions(+), 40 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index ca41d94..3df1e97 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -233,7 +233,4 @@ static inline void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arm_clear_debug(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu) {}
 
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
-
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index dc017ad..134863f 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -271,6 +271,16 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
return kvm_timer_should_fire(vcpu);
 }
 
+void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
+{
+   kvm_timer_schedule(vcpu);
+}
+
+void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
+{
+   kvm_timer_unschedule(vcpu);
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
/* Force users to call KVM_ARM_VCPU_INIT */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5dae2d8..4562459 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -254,7 +254,4 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
 
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
-
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index e1e4d7c..ef14cc1 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -71,5 +71,7 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
 int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
 
 bool kvm_timer_should_fire(struct kvm_vcpu *vcpu);
+void kvm_timer_schedule(struct kvm_vcpu *vcpu);
+void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
 
 #endif
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 48c6e1a..756c62d 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -111,14 +111,21 @@ static enum hrtimer_restart kvm_timer_expire(struct hrtimer *hrt)
return HRTIMER_NORESTART;
 }
 
+static bool kvm_timer_irq_can_fire(struct kvm_vcpu *vcpu)
+{
+   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+
+   return !(timer->cntv_ctl & ARCH_TIMER_CTRL_IT_MASK) &&
+   (timer->cntv_ctl & ARCH_TIMER_CTRL_ENABLE) &&
+   !kvm_vgic_get_phys_irq_active(timer->map);
+}
+
 bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
 {
struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
cycle_t cval, now;
 
-   if ((timer->cntv_ctl & ARCH_TIMER_CTRL_IT_MASK) ||
-   !(timer->cntv_ctl & ARCH_TIMER_CTRL_ENABLE) ||
-   kvm_vgic_get_phys_irq_active(timer->map))
+   if (!kvm_timer_irq_can_fire(vcpu))
return false;
 
cval = timer->cntv_cval;
@@ -127,28 +134,60 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
return cval <= now;
 }
 
-/**
- * kvm_timer_flush_hwstate - prepare to move the virt timer to the cpu
- * 

[PATCH v3 5/8] arm/arm64: KVM: Use appropriate define in VGIC reset code

2015-09-29 Thread Christoffer Dall
We currently initialize the SGIs to be enabled in the VGIC code, but we
use the VGIC_NR_PPIS define for this purpose, instead of the more
natural VGIC_NR_SGIS.  Change this slightly confusing use of the
defines.

Note: This should have no functional change, as both names are defined
to the number 16.

Acked-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index e606f78..9ed8d53 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -2109,7 +2109,7 @@ int vgic_init(struct kvm *kvm)
}
 
for (i = 0; i < dist->nr_irqs; i++) {
-   if (i < VGIC_NR_PPIS)
+   if (i < VGIC_NR_SGIS)
vgic_bitmap_set_irq_val(&dist->irq_enabled,
vcpu->vcpu_id, i, 1);
if (i < VGIC_NR_PRIVATE_IRQS)
-- 
2.1.2.330.g565301e.dirty



[PATCH v3 7/8] arm/arm64: KVM: Rework the arch timer to use level-triggered semantics

2015-09-29 Thread Christoffer Dall
The arch timer currently uses edge-triggered semantics in the sense that
the line is never sampled by the vgic and lowering the line from the
timer to the vgic doesn't have any effect on the pending state of
virtual interrupts in the vgic.  This means that we do not support a
guest with the otherwise valid behavior of (1) disable interrupts (2)
enable the timer (3) disable the timer (4) enable interrupts.  Such a
guest would validly not expect to see any interrupts on real hardware,
but will see interrupts on KVM.

This patch fixes this shortcoming through the following series of
changes.

First, we change the flow of the timer/vgic sync/flush operations.  Now
the timer is always flushed/synced before the vgic, because the vgic
samples the state of the timer output.  This has the implication that we
move the timer operations into non-preemptible sections, but that is
fine after the previous commit getting rid of hrtimer schedules on every
entry/exit.

Second, we change the internal behavior of the timer, letting the timer
keep track of its previous output state, and only lower/raise the line
to the vgic when the state changes.  Note that in theory this could have
been accomplished more simply by signalling the vgic every time the
state *potentially* changed, but we don't want to be hitting the vgic
more often than necessary.
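
(A minimal sketch of the "only signal on change" idea; the names follow
this series, but the exact split between helpers in patch 7/8 may differ:)

static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level)
{
	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
	int ret;

	if (new_level == timer->irq.level)
		return;		/* output unchanged: don't hit the vgic */

	timer->irq.level = new_level;
	ret = kvm_vgic_inject_mapped_irq(vcpu->kvm, vcpu->vcpu_id,
					 timer->map, timer->irq.level);
	WARN_ON(ret);
}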

Third, we get rid of the use of the map->active field in the vgic and
instead simply set the interrupt as active on the physical distributor
whenever we signal a mapped interrupt to the guest, and we reset the
active state when we sync back the HW state from the vgic.
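
(In practice the third point boils down to a call like the following at
queue time, mirroring the irq_set_irqchip_state() usage quoted later in
this thread; the precise call site is an assumption:)

	ret = irq_set_irqchip_state(map->irq, IRQCHIP_STATE_ACTIVE, true);
	WARN_ON(ret);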

Fourth, and finally, we now initialize the timer PPIs (and all the other
unused PPIs for now), to be level-triggered, and modify the sync code to
sample the line state on HW sync and re-inject a new interrupt if it is
still pending at that time.

Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/arm.c   | 11 +--
 include/kvm/arm_arch_timer.h |  2 +-
 include/kvm/arm_vgic.h   |  3 --
 virt/kvm/arm/arch_timer.c| 65 +---
 virt/kvm/arm/vgic.c  | 78 ++--
 5 files changed, 87 insertions(+), 72 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 134863f..a9491ef 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -561,9 +561,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
local_irq_enable();
+   kvm_timer_sync_hwstate(vcpu);
kvm_vgic_sync_hwstate(vcpu);
preempt_enable();
-   kvm_timer_sync_hwstate(vcpu);
continue;
}
 
@@ -608,12 +608,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
kvm_guest_exit();
trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 
+   /*
+* We must sync the timer state before the vgic state so that
+* the vgic can properly sample the updated state of the
+* interrupt line.
+*/
+   kvm_timer_sync_hwstate(vcpu);
+
kvm_vgic_sync_hwstate(vcpu);
 
preempt_enable();
 
-   kvm_timer_sync_hwstate(vcpu);
-
ret = handle_exit(vcpu, run, ret);
}
 
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index ef14cc1..1800227 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -51,7 +51,7 @@ struct arch_timer_cpu {
boolarmed;
 
/* Timer IRQ */
-   const struct kvm_irq_level  *irq;
+   struct kvm_irq_levelirq;
 
/* VGIC mapping */
struct irq_phys_map *map;
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4e14dac..7bc5d02 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -159,7 +159,6 @@ struct irq_phys_map {
u32 virt_irq;
u32 phys_irq;
u32 irq;
-   boolactive;
 };
 
 struct irq_phys_map_entry {
@@ -354,8 +353,6 @@ int kvm_vgic_vcpu_active_irq(struct kvm_vcpu *vcpu);
 struct irq_phys_map *kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu,
   int virt_irq, int irq);
 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
-bool kvm_vgic_get_phys_irq_active(struct irq_phys_map *map);
-void kvm_vgic_set_phys_irq_active(struct irq_phys_map *map, bool active);
 
 #define irqchip_in_kernel(k)   (!!((k)->arch.vgic.in_kernel))
 #define vgic_initialized(k)(!!((k)->arch.vgic.nr_cpus))
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 756c62d..f653ae6 100644
--- a/virt/kvm

Re: [PATCH v2 7/8] arm/arm64: KVM: Rework the arch timer to use level-triggered semantics

2015-09-29 Thread Christoffer Dall
On Wed, Sep 23, 2015 at 06:44:21PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 9ed8d53..f4ea950 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1422,34 +1422,43 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
> >  /*
> >   * Save the physical active state, and reset it to inactive.
> >   *
> > - * Return 1 if HW interrupt went from active to inactive, and 0 otherwise.
> > + * Return true if there's a pending level triggered interrupt line to queue.
> >   */
> > -static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, struct vgic_lr vlr)
> > +static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr vlr)
> >  {
> > struct irq_phys_map *map;
> > +   bool phys_active;
> > int ret;
> >  
> > if (!(vlr.state & LR_HW))
> > return 0;
> >  
> > map = vgic_irq_map_search(vcpu, vlr.irq);
> > -   BUG_ON(!map || !map->active);
> > +   BUG_ON(!map);
> >  
> > ret = irq_get_irqchip_state(map->irq,
> > IRQCHIP_STATE_ACTIVE,
> > -   &map->active);
> > +   &phys_active);
> >  
> > WARN_ON(ret);
> >  
> > -   if (map->active) {
> > +   if (phys_active) {
> > +   /*
> > +* Interrupt still marked as active on the physical
> > +* distributor, so guest did not EOI it yet.  Reset to
> > +* non-active so that other VMs can see interrupts from this
> > +* device.
> > +*/
> > ret = irq_set_irqchip_state(map->irq,
> > IRQCHIP_STATE_ACTIVE,
> > false);
> > WARN_ON(ret);
> > -   return 0;
> > +   return false;
> > }
> >  
> > -   return 1;
> > +   /* Mapped edge-triggered interrupts not yet supported. */
> > +   WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
> > +   return process_level_irq(vcpu, lr, vlr);
> 
> Don't you miss the dist->lock here? The other call to
> process_level_irq() certainly does it, and Eric recently removed the
> coarse grained lock around the whole __kvm_vgic_sync_hwstate() function.
> So we don't hold the lock here, but we change quite some common VGIC
> state in there.
> 

Indeed I think we should.
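
(Per the v3 changelog, the fix is to take the distributor lock around the
call; a minimal sketch of the tail of vgic_sync_hwirq(), assuming the usual
vgic locking pattern:)

	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
	bool level_pending;

	spin_lock(&dist->lock);
	level_pending = process_level_irq(vcpu, lr, vlr);
	spin_unlock(&dist->lock);
	return level_pending;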

I'll fix that for the next revision.

Thanks,
-Christoffer


Re: [PATCH v4 07/11] KVM: arm/arm64: vgic: Allow HW interrupts to be queued to a guest

2015-09-29 Thread Christoffer Dall
On Wed, Sep 23, 2015 at 06:55:04PM +0100, Andre Przywara wrote:
> Salut Marc,
> 
> I know that this patch is already merged, but 
> 
> On 07/08/15 16:45, Marc Zyngier wrote:
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 51c9900..9d009d2 100644
> ...
> > @@ -1364,6 +1397,39 @@ static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
> > return level_pending;
> >  }
> >  
> > +/*
> > + * Save the physical active state, and reset it to inactive.
> > + *
> > + * Return 1 if HW interrupt went from active to inactive, and 0 otherwise.
> > + */
> > +static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, struct vgic_lr vlr)
> > +{
> > +   struct irq_phys_map *map;
> > +   int ret;
> > +
> > +   if (!(vlr.state & LR_HW))
> > +   return 0;
> > +
> > +   map = vgic_irq_map_search(vcpu, vlr.irq);
> > +   BUG_ON(!map || !map->active);
> > +
> > +   ret = irq_get_irqchip_state(map->irq,
> > +   IRQCHIP_STATE_ACTIVE,
> > +   &map->active);
> > +
> > +   WARN_ON(ret);
> > +
> > +   if (map->active) {
> > +   ret = irq_set_irqchip_state(map->irq,
> > +   IRQCHIP_STATE_ACTIVE,
> > +   false);
> > +   WARN_ON(ret);
> > +   return 0;
> > +   }
> > +
> > +   return 1;
> > +}
> > +
> >  /* Sync back the VGIC state after a guest run */
> >  static void __kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
> >  {
> > @@ -1378,14 +1444,31 @@ static void __kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
> > elrsr = vgic_get_elrsr(vcpu);
> > elrsr_ptr = u64_to_bitmask(&elrsr);
> >  
> > -   /* Clear mappings for empty LRs */
> > -   for_each_set_bit(lr, elrsr_ptr, vgic->nr_lr) {
> > +   /* Deal with HW interrupts, and clear mappings for empty LRs */
> > +   for (lr = 0; lr < vgic->nr_lr; lr++) {
> > struct vgic_lr vlr;
> >  
> > -   if (!test_and_clear_bit(lr, vgic_cpu->lr_used))
> > +   if (!test_bit(lr, vgic_cpu->lr_used))
> > continue;
> >  
> > vlr = vgic_get_lr(vcpu, lr);
> > +   if (vgic_sync_hwirq(vcpu, vlr)) {
> > +   /*
> > +* So this is a HW interrupt that the guest
> > +* EOI-ed. Clean the LR state and allow the
> > +* interrupt to be sampled again.
> > +*/
> > +   vlr.state = 0;
> > +   vlr.hwirq = 0;
> > +   vgic_set_lr(vcpu, lr, vlr);
> > +   vgic_irq_clear_queued(vcpu, vlr.irq);
> 
> Isn't this line altering common VGIC state without holding the lock?
> Eric removed the coarse dist->lock around the whole
> __kvm_vgic_sync_hwstate() function, we take it now in
> vgic_process_maintenance(), but don't hold it here AFAICT.
> As long as we are only dealing with private timer IRQs this is probably
> not a problem, but the IRQ number could be a SPI as well, right?
> 
I don't see a problematic race with this though, as all we're doing is
to clear a bit in a bitmap, which is always checked atomically, so
adding a lock around this really doesn't change anything as far as I can
tell.

Nevertheless, my rework series also addresses this.

-Christoffer


Re: [PATCH] iommu/s390: add iommu api for s390 pci devices

2015-09-29 Thread Joerg Roedel
Hi Gerald,

thanks for your patch. It looks pretty good and addresses my previous
review comments. I have a few questions; the first one is how this
operates with the DMA-API on s390. Is there a separate DMA-API
implementation besides the IOMMU-API one for PCI devices?

My other question is inline:

On Thu, Aug 27, 2015 at 03:33:03PM +0200, Gerald Schaefer wrote:
> +struct s390_domain_device {
> + struct list_headlist;
> + struct zpci_dev *zdev;
> +};

Instead of using your own struct here, have you considered using the
struct iommu_group instead? The struct device contains a pointer to an
iommu_group and the struct itself contains pointers to the domain it is
currently bound to.
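
(A hedged sketch of the suggestion; 'dev' stands for whatever struct device
the s390 code attaches here and is an assumption on my part:)

	struct iommu_group *group = iommu_group_get(dev);	/* takes a ref */

	if (group) {
		/* the group already provides per-device linkage, so it can
		 * stand in for the private s390_domain_device list entry */
		iommu_group_put(group);
	}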


Regards,

Joerg



[PATCH][v2] KVM: PPC: e6500: allow odd powers of 2K TLB1 sizes

2015-09-29 Thread Laurentiu Tudor
Book-E MMUv2, present in e6500 cores, supports powers of 2K page sizes,
while the older MMUv1 cores support only powers of 4K page sizes; in
other words, the LSB of TSIZE on MMUv1 is always 0. On MMUv2, allow
power-of-2K pages by not stripping the LSB of the TSIZE field. This
gives better HW TLB1 usage for odd powers of 2K, which no longer need
two power-of-4K HW TLB1 entries to back them up. For example, a 32M
mapping (TSIZE = 15, since the page size is 1K << TSIZE) fits in a
single TLB1 entry on MMUv2, whereas MMUv1 must round down to 16M and
use two entries.

Signed-off-by: Mihai Caraman 
[laurentiu.tu...@freescale.com: addressed review
 feedback, split in distinct patch]
Signed-off-by: Laurentiu Tudor 
---
Was "KVM: PPC: e6500: support powers of 2K TLB1 sizes"
v2:
 - reworded commit msg & comments
 - got rid of an if() in favor of bitwise op

 arch/powerpc/kvm/e500_mmu_host.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 4d33e19..008ab84 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -371,6 +371,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 
unsigned long start, end;
unsigned long slot_start, slot_end;
+   int tsize_inc;
 
pfnmap = 1;
 
@@ -392,10 +393,20 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
MAS1_TSIZE_SHIFT;
 
/*
-* e500 doesn't implement the lowest tsize bit,
-* or 1K pages.
+* Calculate TSIZE increment. MMUv2 supports
+* power of 2K translations while MMUv1 is limited
+* to power of 4K sizes.
 */
-   tsize = max(BOOK3E_PAGESZ_4K, tsize & ~1);
+   tsize_inc = has_feature(&vcpu_e500->vcpu,
+   VCPU_FTR_MMU_V2) ? 1 : 2;
+
+   /*
+* MMUv1 doesn't implement the lowest tsize bit,
+* meaning that only power of 4K translation sizes
+* are supported so strip the LSB on MMUv1.
+*/
+   tsize &= ~(tsize_inc - 1);
+   tsize = max(BOOK3E_PAGESZ_4K, tsize);
 
/*
 * Now find the largest tsize (up to what the guest
@@ -404,7 +415,8 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 * aligned.
 */
 
-   for (; tsize > BOOK3E_PAGESZ_4K; tsize -= 2) {
+   for (; tsize > BOOK3E_PAGESZ_4K;
+tsize -= tsize_inc) {
unsigned long gfn_start, gfn_end;
tsize_pages = 1 << (tsize - 2);
 
@@ -437,10 +449,13 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
tsize = min(__ilog2(psize) - 10, tsize);
 
/*
-* e500 doesn't implement the lowest tsize bit,
-* or 1K pages.
+* MMUv1 doesn't implement the lowest tsize bit,
+* meaning that only power of 4K translation sizes
+* are supported.
 */
-   tsize = max(BOOK3E_PAGESZ_4K, tsize & ~1);
+   if (!has_feature(&vcpu_e500->vcpu, VCPU_FTR_MMU_V2))
+   tsize &= ~1;
+   tsize = max(BOOK3E_PAGESZ_4K, tsize);
}
 
	up_read(&current->mm->mmap_sem);
-- 
1.8.3.1


qemu 2.4 machine + kernel 4.2 + kvm : freeze & cpu 100% at start on some hardware (maybe KVM_CAP_X86_SMM related)

2015-09-29 Thread Alexandre DERUMIER
Hi,

I'm currently integrating qemu 2.4 for Proxmox hypervisors,
and a lot of users have reported qemu freezing with the cpu at 100% when starting.
Connecting with the VNC display shows: "qemu guest has not initialized the display yet"

Similar bug report here : 
https://lacyc3.eu/qemu-guest-has-not-initialized-the-display


This does not occur on all hardware; for example, it freezes on a Dell
PowerEdge R710 (Xeon E5540), but not on a Dell R630 (CPU E5-2687W v3 @
3.10GHz) or a very old Dell PowerEdge 2950 (Xeon 5110 @ 1.60GHz).

This is only with qemu 2.4 + kernel 4.2 (kernel 4.1 works fine) + kvm




not working command line
------------------------
/usr/bin/kvm chardev 
socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc unix:/var/run/qemu-server/100.vnc,x509,password 
-pidfile /var/run/qemu-server/100.pid -name test -cpu kvm64 -m 4096 -machine 
pc-i440fx-2.4


working command line
--------------------
qemu 2.4 + kvm + compat 2.3 profile:

/usr/bin/kvm chardev 
socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc unix:/var/run/qemu-server/100.vnc,x509,password 
-pidfile /var/run/qemu-server/100.pid -name test -cpu kvm64 -m 4096 -machine 
pc-i440fx-2.3

qemu 2.4 without kvm:

/usr/bin/kvm chardev 
socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc unix:/var/run/qemu-server/100.vnc,x509,password 
-pidfile /var/run/qemu-server/100.pid -name test -cpu kvm64 -m 4096 -machine 
accel=tcg,type=pc-i440fx-2.4



So it's working with qemu 2.4 + machine 2.3 compat profile.



Looking at the code:

static void pc_compat_2_3(MachineState *machine)
{
PCMachineState *pcms = PC_MACHINE(machine);
savevm_skip_section_footers();
if (kvm_enabled()) {
pcms->smm = ON_OFF_AUTO_OFF;
}
global_state_set_optional();
savevm_skip_configuration();
}


If I comment out
    pcms->smm = ON_OFF_AUTO_OFF;

I get the same freeze there too.
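
(A quicker way to test the SMM theory without patching the source, assuming
this build exposes the 'smm' machine property that came in with the 2.4 SMM
support, would be:)

/usr/bin/kvm ... -machine pc-i440fx-2.4,smm=off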



So, it seems to come from somewhere in


bool pc_machine_is_smm_enabled(PCMachineState *pcms)
{
bool smm_available = false;

if (pcms->smm == ON_OFF_AUTO_OFF) {
return false;
}

if (tcg_enabled() || qtest_enabled()) {
smm_available = true;
} else if (kvm_enabled()) {
smm_available = kvm_has_smm();   /* <-- maybe here? */
}

if (smm_available) {
return true;
}

if (pcms->smm == ON_OFF_AUTO_ON) {
error_report("System Management Mode not supported by this hypervisor.");
exit(1);
}
return false;
}


bool kvm_has_smm(void)
{
return kvm_check_extension(kvm_state, KVM_CAP_X86_SMM);
}



I'm not sure if it's a qemu bug or kernel/kvm bug.

Help is welcome.


Regards,

Alexandre Derumier



Re: [PATCH v2] KVM: nVMX: expose VPID capability to L1

2015-09-29 Thread Wanpeng Li

On 9/29/15 6:39 PM, Paolo Bonzini wrote:

On 29/09/2015 04:55, Wanpeng Li wrote:

Expose VPID capability to L1.

Signed-off-by: Wanpeng Li 
---
v1 -> v2:
  * set only VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT

Thanks.  I've checked more thoroughly your implementation against the
SDM now, and there are a few missing things between this patch and the
one that emulates INVVPID:

- you're not setting bit 32 of the VMX_EPT_VPID_CAP MSR

- you were not checking against the supported types in the
implementation of the INVVPID instruction

- the memory operand must always be read even if it isn't needed (e.g.,
for type==global), similar to INVEPT

- for single-context invalidation you're not checking that VPID != 0,
though in practice that doesn't matter because we don't want to support
single-context invalidation

- you're always setting the MSR's bits to 1 even if !enable_vpid

At this point it's better if you resend the whole nested VPID
implementation, i.e. the following five patches:

KVM: VMX: adjust interface to allocate/free_vpid
KVM: VMX: introduce __vmx_flush_tlb to handle specific vpid
KVM: nVMX: emulate the INVVPID instruction
KVM: nVMX: nested VPID emulation
KVM: nVMX: expose VPID capability to L1

with the above issues fixed.  Please also send kvm-unit-tests patches
that tests for the error cases.


Ok, I will do this after the vacation (until 10.7) in my country. :-)

Regards,
Wanpeng Li


Re: [PATCH v2] KVM: nVMX: expose VPID capability to L1

2015-09-29 Thread Paolo Bonzini
On 29/09/2015 04:55, Wanpeng Li wrote:
> Expose VPID capability to L1.
> 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * set only VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT

Thanks.  I've checked more thoroughly your implementation against the
SDM now, and there are a few missing things between this patch and the
one that emulates INVVPID:

- you're not setting bit 32 of the VMX_EPT_VPID_CAP MSR

- you were not checking against the supported types in the
implementation of the INVVPID instruction

- the memory operand must always be read even if it isn't needed (e.g.,
for type==global), similar to INVEPT

- for single-context invalidation you're not checking that VPID != 0,
though in practice that doesn't matter because we don't want to support
single-context invalidation

- you're always setting the MSR's bits to 1 even if !enable_vpid
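
(For the "supported types" point, something along these lines would do; a
hedged sketch mirroring handle_invept(), not necessarily what the resent
series will contain; the VPID extent bits live at bits 40+ of
IA32_VMX_EPT_VPID_CAP, hence the shift:)

	types = (vmx->nested.nested_vmx_ept_vpid_caps >> 40) & 0x7;

	if (!(types & (1UL << type))) {
		nested_vmx_failValid(vcpu,
			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
		skip_emulated_instruction(vcpu);
		return 1;
	}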

At this point it's better if you resend the whole nested VPID
implementation, i.e. the following five patches:

KVM: VMX: adjust interface to allocate/free_vpid
KVM: VMX: introduce __vmx_flush_tlb to handle specific vpid
KVM: nVMX: emulate the INVVPID instruction
KVM: nVMX: nested VPID emulation
KVM: nVMX: expose VPID capability to L1

with the above issues fixed.  Please also send kvm-unit-tests patches
that tests for the error cases.

Thanks,

Paolo

>  arch/x86/kvm/vmx.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 75f3ee0..c4ea890 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -442,7 +442,7 @@ struct nested_vmx {
>   u32 nested_vmx_true_entry_ctls_low;
>   u32 nested_vmx_misc_low;
>   u32 nested_vmx_misc_high;
> - u32 nested_vmx_ept_caps;
> + u64 nested_vmx_ept_vpid_caps;
>  };
>  
>  #define POSTED_INTR_ON  0
> @@ -2485,22 +2485,23 @@ static void nested_vmx_setup_ctls_msrs(struct 
> vcpu_vmx *vmx)
>   SECONDARY_EXEC_WBINVD_EXITING |
>   SECONDARY_EXEC_XSAVES;
>  
> - if (enable_ept) {
> + if (enable_ept | enable_vpid) {
>   /* nested EPT: emulate EPT also to L1 */
>   vmx->nested.nested_vmx_secondary_ctls_high |=
>   SECONDARY_EXEC_ENABLE_EPT;
> - vmx->nested.nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT |
> + vmx->nested.nested_vmx_ept_vpid_caps = VMX_EPT_PAGE_WALK_4_BIT |
>VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT |
>VMX_EPT_INVEPT_BIT;
> - vmx->nested.nested_vmx_ept_caps &= vmx_capability.ept;
> + vmx->nested.nested_vmx_ept_vpid_caps &= vmx_capability.ept;
>   /*
>* For nested guests, we don't do anything specific
>* for single context invalidation. Hence, only advertise
>* support for global context invalidation.
>*/
> - vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
> + vmx->nested.nested_vmx_ept_vpid_caps |= 
> VMX_EPT_EXTENT_GLOBAL_BIT;
> + vmx->nested.nested_vmx_ept_vpid_caps |= 
> VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT << 32;
>   } else
> - vmx->nested.nested_vmx_ept_caps = 0;
> + vmx->nested.nested_vmx_ept_vpid_caps = 0;
>  
>   if (enable_unrestricted_guest)
>   vmx->nested.nested_vmx_secondary_ctls_high |=
> @@ -2616,8 +2617,7 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
> msr_index, u64 *pdata)
>   vmx->nested.nested_vmx_secondary_ctls_high);
>   break;
>   case MSR_IA32_VMX_EPT_VPID_CAP:
> - /* Currently, no nested vpid support */
> - *pdata = vmx->nested.nested_vmx_ept_caps;
> + *pdata = vmx->nested.nested_vmx_ept_vpid_caps;
>   break;
>   default:
>   return 1;
> @@ -7142,7 +7142,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
>  
>   if (!(vmx->nested.nested_vmx_secondary_ctls_high &
> SECONDARY_EXEC_ENABLE_EPT) ||
> - !(vmx->nested.nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
> + !(vmx->nested.nested_vmx_ept_vpid_caps & VMX_EPT_INVEPT_BIT)) {
>   kvm_queue_exception(vcpu, UD_VECTOR);
>   return 1;
>   }
> @@ -7158,7 +7158,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
>   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>   type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
>  
> - types = (vmx->nested.nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
> > + types = (vmx->nested.nested_vmx_ept_vpid_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
>  
>   if (!(types & (1UL << type))) {
>   nested_vmx_failValid(vcpu,
> @@ -8763,7 +8763,7 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu 
> *vcpu)
>  {
>   WARN_ON(mmu_is_nested(vcpu));
>   kvm_init_shadow_ept_mmu(vcpu,
> - to_vmx(vcpu)->nested.nested_vmx_ept_caps &
> + 

[PATCH v2] KVM: nVMX: expose VPID capability to L1

2015-09-29 Thread Wanpeng Li
Expose VPID capability to L1.

Signed-off-by: Wanpeng Li 
---
v1 -> v2:
 * set only VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT

 arch/x86/kvm/vmx.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 75f3ee0..c4ea890 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -442,7 +442,7 @@ struct nested_vmx {
u32 nested_vmx_true_entry_ctls_low;
u32 nested_vmx_misc_low;
u32 nested_vmx_misc_high;
-   u32 nested_vmx_ept_caps;
+   u64 nested_vmx_ept_vpid_caps;
 };
 
 #define POSTED_INTR_ON  0
@@ -2485,22 +2485,23 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
SECONDARY_EXEC_WBINVD_EXITING |
SECONDARY_EXEC_XSAVES;
 
-   if (enable_ept) {
+   if (enable_ept | enable_vpid) {
/* nested EPT: emulate EPT also to L1 */
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_EPT;
-   vmx->nested.nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT |
+   vmx->nested.nested_vmx_ept_vpid_caps = VMX_EPT_PAGE_WALK_4_BIT |
 VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT |
 VMX_EPT_INVEPT_BIT;
-   vmx->nested.nested_vmx_ept_caps &= vmx_capability.ept;
+   vmx->nested.nested_vmx_ept_vpid_caps &= vmx_capability.ept;
/*
 * For nested guests, we don't do anything specific
 * for single context invalidation. Hence, only advertise
 * support for global context invalidation.
 */
-   vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
+   vmx->nested.nested_vmx_ept_vpid_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
+   vmx->nested.nested_vmx_ept_vpid_caps |= VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT << 32;
} else
-   vmx->nested.nested_vmx_ept_caps = 0;
+   vmx->nested.nested_vmx_ept_vpid_caps = 0;
 
if (enable_unrestricted_guest)
vmx->nested.nested_vmx_secondary_ctls_high |=
@@ -2616,8 +2617,7 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
vmx->nested.nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested vpid support */
-   *pdata = vmx->nested.nested_vmx_ept_caps;
+   *pdata = vmx->nested.nested_vmx_ept_vpid_caps;
break;
default:
return 1;
@@ -7142,7 +7142,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 
if (!(vmx->nested.nested_vmx_secondary_ctls_high &
  SECONDARY_EXEC_ENABLE_EPT) ||
-   !(vmx->nested.nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   !(vmx->nested.nested_vmx_ept_vpid_caps & VMX_EPT_INVEPT_BIT)) {
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}
@@ -7158,7 +7158,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
 
-   types = (vmx->nested.nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
+   types = (vmx->nested.nested_vmx_ept_vpid_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
 
if (!(types & (1UL << type))) {
nested_vmx_failValid(vcpu,
@@ -8763,7 +8763,7 @@ static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
 {
WARN_ON(mmu_is_nested(vcpu));
kvm_init_shadow_ept_mmu(vcpu,
-   to_vmx(vcpu)->nested.nested_vmx_ept_caps &
+   to_vmx(vcpu)->nested.nested_vmx_ept_vpid_caps &
VMX_EPT_EXECUTE_ONLY_BIT);
vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
-- 
1.9.1



[Bug 104271] Boot crashes on Lenovo W520 with KVM and Discrete Graphic enabled in BIOS

2015-09-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=104271

Hans Streibel  changed:

           What    |Removed |Added
----------------------------------------------------
                CC  |        |streibel.2...@gmx.net

--- Comment #6 from Hans Streibel  ---
As Alex suggested I moved this bug to Drivers/PCI to get more attention.
However, now I have the impression that this bug is not getting any attention
at all any more.

Did I miss any necessary action besides moving this to Drivers/PCI?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> > 
> > * Denys Vlasenko  wrote:
> > 
> >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> >>> NAK.  We really should map the GDT read-only on all 64 bit systems,
> >>> since we can't hide the address from SLDT.  Same with the IDT.
> >>
> >> Sorry, I don't understand your point.
> > 
> > So the problem is that right now the SGDT instruction (which is 
> > unprivileged) 
> > leaks the real address of the kernel image:
> > 
> >  fomalhaut:~> ./sgdt 
> >  SGDT: 88303fd89000 / 007f
> > 
> > that '88303fd89000' is a kernel address.
> 
> Thank you.
> I do know that SGDT and friends are unprivileged on x86
> and thus they allow userspace (and guest kernels in paravirt) to
> learn things they don't need to know.
> 
> I don't see how making GDT page-aligned and page-sized
> changes anything in this regard. SGDT will still work,
> and still leak GDT address.

Well, as I try to explain it in the other part of my mail, doing so enables
us to remap the GDT to a less security sensitive virtual address that does
not leak the kernel's randomized address:

> > Your observation in the changelog and your patch:
> > 
>  It is page-sized because of paravirt. [...]
> > 
> > ... conflicts with the intention to mark (remap) the primary GDT address
> > read-only on native kernels as well.
> > 
> > So what we should do instead is to use the page alignment properly and
> > remap the GDT to a read-only location, and load that one.
> 
> If we'd have a small GDT (i.e. what my patch does), we still can remap the
> entire page which contains small GDT, and simply don't care that some other
> data is also visible through that RO page.

That's generally considered fragile: suppose an attacker has a limited
information leak that can read absolute addresses with system privilege but
he doesn't know the kernel's randomized base offset. With a 'partial page'
mapping there could be function pointers near the GDT, part of the page the
GDT happens to be on, that leak this information.

(Same goes for crypto keys or other critical information (like canary
information, salts, etc.) accidentally ending up nearby.)

Arguably it's a bit tenuous, but when playing remapping games it's generally
considered good to be page aligned and page sized, with zero padding.

> > This would have a couple of advantages:
> > 
> >  - This would give kernel address randomization more teeth on x86.
> > 
> >  - An additional advantage would be that rootkits overwriting the GDT
> >    would have a bit more work to do.
> > 
> >  - A third advantage would be that for NUMA systems we could 'mirror' the
> >    GDT into node-local memory and load those. This makes GDT load
> >    cache-misses a bit less expensive.
> 
> GDT is per-cpu. Isn't per-cpu memory already NUMA-local?

Indeed it is:

fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
SGDT: 88103fa09000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa69000 / 007f

I confused it with the IDT, which is still global.

This also means that the GDT in itself does not leak kernel addresses at the 
moment, except it leaks the layout of the percpu area.
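
(For reference, a probe like the 'sgdt' tool used above fits in a few lines;
this is a hedged reconstruction, not the actual tool:)

#include <stdio.h>
#include <stdint.h>

struct __attribute__((packed)) gdt_ptr {
	uint16_t limit;
	uint64_t base;
};

int main(void)
{
	struct gdt_ptr g;

	/* SGDT is unprivileged, so any user can read the GDT's address */
	asm volatile("sgdt %0" : "=m" (g));
	printf("SGDT: %012lx / %04x\n", (unsigned long)g.base, g.limit);
	return 0;
}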

So my suggestion would be to:

 - make the GDT unconditionally page aligned and sized, then remap it to a
   read-only address unconditionally as well, like we do it for the IDT.

 - make the IDT per CPU as well, for performance reasons.

Thanks,

Ingo


Re: [PATCH] KVM: nVMX: expose VPID capability to L1

2015-09-29 Thread Paolo Bonzini


On 29/09/2015 03:15, Wanpeng Li wrote:
>>>
>> Hi Wanpeng, the comment above is about invept, but the same applies
>> applies to invvpid.  We can set only VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT.
> 
> Agreed. I see the patch has already in kvm/queue, if I need to send out
> another patch or you can adjust it for me? :-)

Please resend the patch.

Paolo


Re: [PATCH v4 15/15] KVM: arm: enable trapping of all debug registers

2015-09-29 Thread Christoffer Dall
On Tue, Sep 29, 2015 at 01:41:45PM +0800, Zhichao Huang wrote:
> 
> 
> On 2015/9/3 0:08, Christoffer Dall wrote:
> > On Mon, Aug 10, 2015 at 09:26:07PM +0800, Zhichao Huang wrote:
> >> Enable trapping of the debug registers unconditionally, allowing guests to
> >> use the debug infrastructure.
> >>
> >> Signed-off-by: Zhichao Huang 
> >> Reviewed-by: Christoffer Dall 
> >> ---
> >>  arch/arm/kvm/interrupts_head.S | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/arm/kvm/interrupts_head.S b/arch/arm/kvm/interrupts_head.S
> >> index 7ad0adf..494991d 100644
> >> --- a/arch/arm/kvm/interrupts_head.S
> >> +++ b/arch/arm/kvm/interrupts_head.S
> >> @@ -792,7 +792,7 @@ ARM_BE8(revr6, r6  )
> >>   * (hardware reset value is 0) */
> >>  .macro set_hdcr operation
> >>mrc p15, 4, r2, c1, c1, 1
> >> -  ldr r3, =(HDCR_TPM|HDCR_TPMCR)
> >> +  ldr r3, =(HDCR_TPM|HDCR_TPMCR|HDCR_TDRA|HDCR_TDOSA|HDCR_TDA)
> > 
> > eh, but I thought we didn't have to trap accesses to the debug registers
> > if we switch them, because the guest is likely to be using them?
> > 
> > Maybe I am getting confused, can you repeat for me exactly when we
> > context-switch the registers and when we trap accesses to them?
> > 
> 
> Since we don't want to world-switch the dangerous register (DBGDSCR), we
> have to trap accesses all the time, to prevent the guest from writing to
> the real register.
> 
ok, so this is in line with my comment to your previous patch, but you
did have world-switching code for DBGDSCR in this series, hence my
confusion.  So you would need to get rid of this for the new version of
the series.

Thanks,
-Christoffer


Re: [PATCH v4 10/15] KVM: arm: implement world switch for debug registers

2015-09-29 Thread Christoffer Dall
On Tue, Sep 29, 2015 at 01:32:35PM +0800, Zhichao Huang wrote:
> 
> 
> On 2015/9/2 22:53, Christoffer Dall wrote:
> >> +/* Reads cp14 registers from hardware.
> >> + * Writes cp14 registers in-order to the CP14 struct pointed to by r10
> >> + *
> >> + * Assumes vcpu pointer in vcpu reg
> >> + *
> >> + * Clobbers r2-r12
> >> + */
> >> +.macro save_debug_state
> >> +  read_hw_dbg_num
> >> +  cp14_read_and_str r10, 4, cp14_DBGBVR0, r11
> >> +  cp14_read_and_str r10, 5, cp14_DBGBCR0, r11
> >> +  cp14_read_and_str r10, 6, cp14_DBGWVR0, r12
> >> +  cp14_read_and_str r10, 7, cp14_DBGWCR0, r12
> >> +
> >> +  /* DBGDSCR reg */
> >> +  mrc p14, 0, r2, c0, c1, 0
> >> +  str r2, [r10, #CP14_OFFSET(cp14_DBGDSCRext)]
> > 
> > so again we're touching the scary register on every world-switch.  Since
> > it sounds like we have experience telling us that this can cause
> > troubles, I'm wondering if we can get around it by:
> > 
> > Only ever allow the guest to use debugging registers if we managed to
> > enter_monitor_mode on the host, and in that case only allow guest
> > debugging with the configuration of DBGDSCR that the host has.
> > 
> > If the host never managed to enable debugging, the guest probably won't
> > succeed either, and we should just trap all guest accesses to the debug
> > registers.
> > 
> > Does this work?
> > 
> 
> I think it works. Since the register is dangerous, we will try not to
> world-switch it. It means that the guest will not be able to write the
> register, and will always see what the host has set. So the guest will not
> be able to use the hardware debug feature if the host disables it.
> 
I talked to Will about this the last day of Linaro Connect and we
arrived at the conclusion that since you cannot trap on only a subset of
the debug registers (namely the DBGDSCR and the registers related to the
OS lock etc.), there is simply no safe and robust way to do this on
ARMv7.

Therefore, this patch series probably needs to be reworked so that we
*always* set TDA for the guest, and when the guest programs a break- or
watchpoint, then we program the hardware registers directly in KVM
without telling perf or hw_breakpoints about this, and we only do any of
this if monitor_mode_enabled() is true.  If not, then debugging inside
the guest simply won't work.

I also don't think that checking the host's breakpoint and watchpoint
registers is necessary after all, since they are context-switched per
process on the host, so this will only be a problem for guests when the
user uses GDB on the VCPU thread (in which case he's asking for trouble
anyway) or when using the weird perf CPU-wide breakpoints, which is very
far from the common case, so I wouldn't worry about this for now.

Hopefully this makes sense.

Thanks,
-Christoffer


Re: [PATCH v2] KVM: arm/arm64: BUG FIX: Do not inject spurious interrupts

2015-09-29 Thread Christoffer Dall
On Fri, Sep 25, 2015 at 05:00:29PM +0300, Pavel Fedin wrote:
> Commit 71760950bf3dc796e5e53ea3300dec724a09f593
> ("arm/arm64: KVM: add a common vgic_queue_irq_to_lr fn") introduced
> vgic_queue_irq_to_lr() function with additional vgic_dist_irq_is_pending()
> check before setting LR_STATE_PENDING bit. In some cases it started
> causing the following situation if the userland quickly drops a
> level-sensitive IRQ back to inactive state for some reason:
> 1. Userland injects an IRQ with level == 1; this ends up in
>    vgic_update_irq_pending(), which in turn calls
>    vgic_dist_irq_set_pending() for this IRQ.
> 2. vCPU gets kicked. But the kernel does not manage to reschedule it
>    quickly (!!!)
> 3. Userland quickly resets the IRQ to level == 0. vgic_update_irq_pending()
>    in this case will call vgic_dist_irq_clear_pending() and reset the
>    pending flag.
> 4. vCPU finally wakes up. It successfully rolls through
>    __kvm_vgic_flush_hwstate(), which populates vGIC registers. However,
>    since neither pending nor active flags are now set for this IRQ,
>    vgic_queue_irq_to_lr() does not set any state bits on this LR at all.
>    Since this is a level-sensitive IRQ, we end up with an LR containing
>    only the LR_EOI_INT bit, causing an unnecessary immediate exit from
>    the guest.
> 
> This patch fixes the problem by adding the forgotten vgic_cpu_irq_clear()
> call. This causes the IRQ not to be included in any lists if it has been
> picked up after getting dropped to the inactive level. Since this is a
> level-sensitive IRQ, this is correct behavior. Additionally,
> irq_pending_on_cpu will also be reset if this was the only pending
> interrupt, saving us from unnecessary wakeups.
> 
> The bug was caught on ARM64 kernel v4.1.6, running qemu "virt" guest,
> where it was caused by emulated pl011.
> 
> Signed-off-by: Pavel Fedin 
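
(For context, a sketch of where the forgotten call lands in
vgic_queue_irq_to_lr(); the surrounding lines are inferred from the
description above, not copied from the actual diff:)

	if (vgic_dist_irq_is_pending(vcpu, irq)) {
		vlr.state |= LR_STATE_PENDING;
		/* ... existing pending bookkeeping ... */
	} else {
		/* the level went back down before the vCPU ran, so take the
		 * IRQ off the CPU's pending tracking (the forgotten call) */
		vgic_cpu_irq_clear(vcpu, irq);
	}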

I reworked the commit message and applied this patch.

Thanks,
-Christoffer


Inconsistent guest OS disk size compared to volume.img size

2015-09-29 Thread Jay Fishman
I have looked all over the internet but I cannot even find a
reference to this issue.


I have installed the following on Linux Mint 17.1
QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.19), Fabrice Bellard

On that, I have created a Ubuntu 14.04.3 LTS guest and created a
storage volume of 12.88GB. The format that I used was raw.

The host uses a physical mirrored drive and I did NOT use LVM (ext4
was the format type)

When installing the guest, I selected to "use entire disk" and again I
did NOT use LVM (ext4 was the format type)


After installation, the guest reports I am using 23.8% of 4.84GB. Why
is the disk size 4.84GB instead of 12.88GB?

The size of the guest virtual disk seems to have been reduced to little more than a third of what I allocated?
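
(One sanity check, assuming the volume sits at a path like the placeholder
below, is to compare the image's virtual size as QEMU sees it:)

qemu-img info /path/to/volume.img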