Re: [RFC 00/16] KVM protected memory extension

2020-05-26 Thread Liran Alon



On 26/05/2020 9:17, Mike Rapoport wrote:

On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:

On 22/05/2020 15:51, Kirill A. Shutemov wrote:

Furthermore, I would like to point out that just unmapping guest data from the
kernel direct-map is not sufficient to prevent all guest-to-guest info-leaks
via a kernel memory info-leak vulnerability. This is because the host kernel VA
space has other regions which contain guest sensitive data. For example, the
KVM per-vCPU struct (which holds vCPU state) is allocated on the slab and is
therefore still leakable.

Objects allocated from slab use the direct map, vmalloc() is another story.
It doesn't matter. This patch series, like XPFO, only removes guest memory
pages from the direct-map; not things such as the KVM per-vCPU structs.
That's why Julian & Marius (AWS) created the "Process local kernel VA region"
patch-series, which declares a single PGD entry, mapping a kernelspace region,
to have a different PFN between different tasks.
For more information, see the KVM Forum talk slides I gave in my previous
reply and the related AWS patch-series:

https://patchwork.kernel.org/cover/10990403/



   - Touching direct mapping leads to fragmentation. We need to be able to
 recover from it. I have a buggy patch that aims at recovering 2M/1G page.
 It has to be fixed and tested properly

As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
will lead to holes in the kernel direct-map, which force it to no longer be
mapped as a series of 1GB huge-pages.
This has a non-trivial performance cost. Thus, I am not sure addressing this
use-case is valuable.

Out of curiosity, do we actually have some numbers for the "non-trivial
performance cost"? For instance, for the KVM use-case?


Dig into XPFO mailing-list discussions to find out...
I just remember that this was one of the main concerns regarding XPFO.

-Liran



Re: [RFC 00/16] KVM protected memory extension

2020-05-25 Thread Liran Alon



On 25/05/2020 17:46, Kirill A. Shutemov wrote:

On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:

On 22/05/2020 15:51, Kirill A. Shutemov wrote:

== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

   - Host kernel ”accidental” access to guest data (think speculation)

Just to clarify: This is any host kernel memory info-leak vulnerability. Not
just speculative execution memory info-leaks. Also architectural ones.

In addition, note that removing guest data from the host kernel VA space also
makes guest<->host memory exploits more difficult.
E.g. the guest cannot use an already-available memory buffer in kernel VA space
for ROP, or for placing valuable guest-controlled code/data in general.


   - Host kernel induced access to guest data (write(fd, _data_ptr, len))

   - Host userspace access to guest data (compromised qemu)

I don't quite understand what the benefit is of preventing userspace VMM
access to guest data while the host kernel can still access it.

Let me clarify: the guest memory mapped into host userspace is not
accessible by either the host kernel or userspace. The host still has a way to
access it via a new interface: GUP(FOLL_KVM). The GUP will give you a struct
page that the kernel has to map (temporarily) if it needs to access the data.
So only blessed codepaths would know how to deal with the memory.
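For illustration, a blessed in-kernel access path would look roughly like the
sketch below. This is only a sketch, assuming the series' FOLL_KVM GUP flag
and the temporary-mapping behaviour described above; it is not code from the
patchset itself.

    /* Sketch: pin the protected guest page via GUP(FOLL_KVM), map it
     * temporarily, copy the data out, then drop the mapping and the pin. */
    static int read_protected_guest(unsigned long uaddr, void *dst, int len)
    {
            struct page *page;
            void *kaddr;

            if (get_user_pages_unlocked(uaddr, 1, &page, FOLL_KVM) != 1)
                    return -EFAULT;

            kaddr = kmap(page);     /* temporary kernel mapping */
            memcpy(dst, kaddr + offset_in_page(uaddr), len);
            kunmap(page);
            put_page(page);
            return 0;
    }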

Yes, I understood that. I meant explicit host kernel access.


It can help prevent some host->guest attacks on a compromised host.
E.g. if a VM has successfully attacked the host, it cannot attack other
VMs as easily.


We have mechanisms to sandbox the userspace VMM process for that.

You need to be more specific about the attack scenario you attempt to address
here that is not covered by existing mechanisms. i.e. Be crystal clear on the
extra value of the feature of not exposing guest data to the userspace VMM.



It would also help to protect against guest->host attacks by removing one
more place where the guest's data is mapped on the host.
Because the guest has an explicit interface to request which guest pages can
be mapped into the userspace VMM, the value of this is very small.


The guest already has the ability to map guest-controlled code/data into the
userspace VMM, either via this interface or by forcing the userspace VMM to
create various objects during device emulation handling. The only extra
property this patch-series provides is that only a small portion of guest
pages will be mapped to host userspace instead of all of them, resulting in
smaller regions for exploits that require guessing a virtual address. But:
(a) Userspace VMM device emulation may still allow the guest to spray the
userspace heap with objects containing guest-controlled data. (b) How is the
userspace VMM supposed to limit which guest pages should not be mapped into it
even though the guest has explicitly requested them to be mapped? (E.g.
because they are valid DMA sources/targets for virtual devices, or because
they are a vGPU frame-buffer.)

QEMU is more easily compromised than the host kernel because its
guest<->host attack surface is larger (e.g. various device emulation).
But this compromise comes from the guest itself, not from other guests. In
contrast, an info-leak in the host kernel attack surface can be exploited
from one guest to leak another guest's data.

Consider the case where an unprivileged guest user exploits a bug in QEMU
device emulation to gain access to data it cannot normally access within the
guest. With the feature it would be able to see only other shared regions of
guest memory, such as DMA and IO buffers, but not the rest.
This is a scenario where an unprivileged guest userspace process has direct
access to a virtual device and is able to exploit a bug in device emulation
handling in a way that allows it to compromise the security *inside* the
guest. i.e. leak guest kernel data or other guest userspace processes' data.


That's true. Good point. This is a very important argument that is missing
from the cover-letter.


Now the trade-off considered here is crystal clear:
Is the extra complexity and perf cost of the mechanism in this patch-series
worth it to protect against the scenario of a userspace VMM vulnerability,
reachable by an unprivileged guest userspace process, being used to leak other
*in-guest* data that is not otherwise accessible to that process?


-Liran




Re: [RFC 00/16] KVM protected memory extension

2020-05-25 Thread Liran Alon



On 22/05/2020 15:51, Kirill A. Shutemov wrote:

== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

  - Host kernel ”accidental” access to guest data (think speculation)


Just to clarify: This is any host kernel memory info-leak vulnerability.
Not just speculative execution memory info-leaks. Also architectural ones.


In addition, note that removing guest data from the host kernel VA space
also makes guest<->host memory exploits more difficult.
E.g. the guest cannot use an already-available memory buffer in kernel VA
space for ROP, or for placing valuable guest-controlled code/data in general.




  - Host kernel induced access to guest data (write(fd, _data_ptr, len))

  - Host userspace access to guest data (compromised qemu)


I don't quite understand what the benefit is of preventing userspace VMM
access to guest data while the host kernel can still access it.


QEMU is more easily compromised than the host kernel because its
guest<->host attack surface is larger (e.g. various device emulation).
But this compromise comes from the guest itself, not from other guests. In
contrast, an info-leak in the host kernel attack surface can be exploited
from one guest to leak another guest's data.


== What does this set NOT mitigate? ==

  - Full host kernel compromise.  Kernel will just map the pages again.

  - Hardware attacks


The patchset is RFC-quality: it works but has known issues that must be
addressed before it can be considered for applying.

We are looking for high-level feedback on the concept.  Some open
questions:

  - This protects from some kernel and host userspace read-only attacks,
but does not place the host kernel outside the trust boundary. Is it
still valuable?
I don't currently see a good argument for preventing host userspace access to
guest data while the host kernel can still access it.
But there is definitely a strong benefit in mitigating kernel info-leaks
exploitable from one guest to leak another guest's data.


  - Can this approach be used to avoid cache-coherency problems with
hardware encryption schemes that repurpose physical bits?

  - The guest kernel must be modified for this to work.  Is that a deal
breaker, especially for public clouds?

  - Are the costs of removing pages from the direct map too high to be
feasible?


If I remember correctly, this perf cost was considered too high for the XPFO
(eXclusive Page Frame Ownership) patch-series.

This created two major perf costs:
1) Removing pages from the direct-map prevented the direct-map from simply
being entirely mapped as 1GB huge-pages.
2) Frequent allocation/free of userspace pages resulted in frequent TLB
invalidations.


Having said that, (1) can be mitigated in case guest data is completely
allocated from 1GB hugetlbfs, to guarantee it will not create smaller holes in
the direct-map. And (2) is not relevant for the QEMU/KVM use-case.
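Regarding (1), a VMM could back guest RAM with 1GB huge pages along these
lines (a minimal userspace sketch, not taken from this patch-series; it
assumes the usual MAP_HUGETLB/MAP_HUGE_1GB mmap flags and a host configured
with 1GB hugetlbfs pages):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <linux/mman.h>         /* MAP_HUGE_1GB */

    /* Allocate guest RAM from 1GB huge pages (size must be a multiple of
     * 1GB) so that removing it from the kernel direct-map only punches
     * 1GB-aligned holes. */
    static void *alloc_guest_ram(size_t size)
    {
            void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS |
                             MAP_HUGETLB | MAP_HUGE_1GB,
                             -1, 0);

            return mem == MAP_FAILED ? NULL : mem;
    }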


This makes me wonder:
The XPFO patch-series, applied to the QEMU/KVM context, seems to provide
exactly the functionality of this patch-series, with the exception of the
additional "feature" of preventing guest data from also being accessible to
the host userspace VMM.
i.e. XPFO will unmap guest pages from the host kernel direct-map while still
keeping them mapped in the host userspace VMM page-tables.


If I understand correctly, this "feature" is what brings most of the extra
complexity of this patch-series compared to XPFO.
It requires guest modifications to explicitly specify to the host which pages
can be accessed by the userspace VMM, it requires changes to add the new
VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it creates issues with
Live-Migration support.


So if there is no strong convincing argument for the motivation to prevent
userspace VMM access to guest data *while the host kernel can still access
guest data*, I don't see a good reason for using this approach.


Furthermore, I would like to point out that just unmapping guest data from the
kernel direct-map is not sufficient to prevent all guest-to-guest info-leaks
via a kernel memory info-leak vulnerability. This is because the host kernel
VA space has other regions which contain guest sensitive data. For example,
the KVM per-vCPU struct (which holds vCPU state) is allocated on the slab and
therefore still leakable.

I recommend you have a look at my (and Alexandre Chartre's) KVM Forum 2019
talk on KVM ASI, which provides extensive background on the various attempts
made by the community to mitigate host kernel memory info-leaks exploitable by
a guest to leak other guests' data:

https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf



== Series Overview ==

The hardware features protect guest data by encrypting it and then
ensuring that only the right guest can decrypt it.  

Re: [PATCH v1 00/15] Add support for Nitro Enclaves

2020-04-28 Thread Liran Alon



On 28/04/2020 18:25, Alexander Graf wrote:



On 27.04.20 13:44, Liran Alon wrote:


On 27/04/2020 10:56, Paraschiv, Andra-Irina wrote:


On 25/04/2020 18:25, Liran Alon wrote:


On 23/04/2020 16:19, Paraschiv, Andra-Irina wrote:


The memory and CPUs are carved out of the primary VM, they are
dedicated for the enclave. The Nitro hypervisor running on the host
ensures memory and CPU isolation between the primary VM and the
enclave VM.

I hope you properly take into consideration Hyper-Threading
speculative side-channel vulnerabilities here.
i.e. Usually cloud providers designate each CPU core to run only vCPUs of a
specific guest, to avoid sharing a single CPU core between multiple guests.
To handle this properly, you need to use some kind of core-scheduling
mechanism (such that each CPU core either runs only vCPUs of the enclave
or only vCPUs of the primary VM at any given point in time).

In addition, can you elaborate more on how the enclave memory is
carved out of the primary VM?
Does this involve performing a memory hot-unplug operation from the
primary VM, or just unmapping enclave-assigned guest physical pages from the
primary VM's SLAT (EPT/NPT) and mapping them only in the enclave's SLAT?


Correct, we take into consideration the HT setup. The enclave gets
dedicated physical cores. The primary VM and the enclave VM don't run
on CPU siblings of a physical core.

The way I would imagine this to work is that the Primary-VM just specifies
how many vCPUs the Enclave-VM will have, and those vCPUs will be set with
affinity to run on the same physical CPU cores as the Primary-VM,
with the exception that the scheduler is modified to not run vCPUs of the
Primary-VM and the Enclave-VM as siblings on the same physical CPU core
(core-scheduling). i.e. This is different from the Primary-VM permanently
losing physical CPU cores for as long as the Enclave-VM is running.
Or maybe this should even be controlled by a knob in the virtual PCI device
interface, to give the customer the flexibility to decide whether the
Enclave-VM needs dedicated CPU cores or whether it is ok to share them with
the Primary-VM, as long as core-scheduling is used to guarantee proper
isolation.


Running both parent and enclave on the same core can *potentially* 
lead to L2 cache leakage, so we decided not to go with it :).

Haven't thought about the L2 cache. Makes sense. Ack.




Regarding the memory carve out, the logic includes page table entries
handling.

As I thought. Thanks for the confirmation.


IIRC, memory hot-unplug can be used for the memory blocks that were
previously hot-plugged.

https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html





I don't quite understand why the Enclave VM needs to be
provisioned/torn down during the primary VM's runtime.

For example, an alternative could have been to just provision both the
primary VM and the Enclave VM on primary VM startup.
Then, wait for the primary VM to set up a communication channel with the
Enclave VM (e.g. via virtio-vsock).
Then, the primary VM is free to request the Enclave VM to perform various
tasks, when required, in the isolated environment.

Such a setup mimics a common Enclave setup, such as Microsoft Windows VBS
EPT-based Enclaves (which all run on VTL1). It is also similar to TEEs
running on ARM TrustZone.
i.e. In my alternative proposed solution, the Enclave VM is similar to
VTL1/TrustZone.
It would also avoid requiring the introduction of a new PCI device and driver.


True, this can be another option, to provision the primary VM and the
enclave VM at launch time.

In the proposed setup, the primary VM starts with the initial
allocated resources (memory, CPUs). The launch path of the enclave VM,
as it's spawned on the same host, is done via the ioctl interface -
PCI device - host hypervisor path. Short-running or long-running
enclave can be bootstrapped during primary VM lifetime. Depending on
the use case, a custom set of resources (memory and CPUs) is set for
an enclave and then given back when the enclave is terminated; these
resources can be used for another enclave spawned later on or the
primary VM tasks.


Yes, I already understood this is how the mechanism works. I'm
questioning whether this is indeed a good approach that should also be
taken by upstream.


I thought the point of Linux was to support devices that exist, rather 
than change the way the world works around it? ;)
I agree. Just poking around to see if upstream wants to implement a
different approach for Enclaves, regardless of accepting the Nitro
Enclave virtual PCI driver for the AWS use-case, of course.



The use-case for Nitro Enclaves is a Confidential-Computing
service. i.e. The ability to provision a compute instance that can be
trusted to perform a bunch of computation on sensitive
information with high confidence that it cannot be compromised, as it's
highly isolated. Some technologies such as Intel SGX and AMD SEV
attempted to achieve this even with guarantees

Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Liran Alon



> On 27 Sep 2019, at 17:27, Sean Christopherson 
>  wrote:
> 
> On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
>> 
>> 
>>> On 27 Sep 2019, at 0:43, Sean Christopherson 
>>>  wrote:
>>> 
>>> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
>>> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
>>> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
>>> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
>>> VM-Exit occurs without actually entering L2, e.g. if the nested run
>>> is squashed because L2 is being put into HLT.
>> 
>> I would rephrase to “If an emulated VMEntry is squashed because L1 sets
>> vmcs12->guest_activity_state to HLT”.  I think it’s a bit more explicit.
>> 
>>> 
>>> In an ideal world where EPT *requires* unrestricted guest (and vice
>>> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
>>> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
>>> the unrestricted guest silliness complicates the dirty tracking logic
>>> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
>>> VM-Enter is a simpler overall implementation.
>>> 
>>> Cc: sta...@vger.kernel.org
>>> Reported-by: Reto Buerki 
>>> Signed-off-by: Sean Christopherson 
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 8 
>>> arch/x86/kvm/vmx/vmx.c| 9 ++---
>>> 2 files changed, 14 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 41abc62c9a8a..971a24134081 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
>>> struct vmcs12 *vmcs12,
>>> entry_failure_code))
>>> return -EINVAL;
>>> 
>>> +   /*
>>> +* Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>>> +* on nested VM-Exit, which can occur without actually running L2, e.g.
>>> +* if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
>>> +*/
>> 
>> If I understand correctly, it’s not exactly if L2 is entering HLT state in
>> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
>> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
>> with a HLT guest-activity-state. I suggest rephrasing comment.
> 
> I deliberately worded the comment so that it remains valid if there are
> more conditions in the future that cause KVM to skip running L2.  What if
> I split the difference and make the changelog more explicit, but leave the
> comment as is?

I think what is confusing in the comment is that it seems to also refer to the
case where L2 directly enters HLT state without an L1 intercept, which isn't
related.
So I would explicitly mention that it's when L1 enters L2 but KVM doesn't
physically enter the guest with vmcs02 because L2 is in HLT state.
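E.g. something along these lines (just a suggested wording, not a final
patch):

    /*
     * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
     * on nested VM-Exit, which can occur when L1 enters L2 with
     * vmcs12->guest_activity_state == HLT: in that case KVM never physically
     * enters the guest with vmcs02 and thus never reaches vmx_set_cr3().
     */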

-Liran

> 
>>> +   if (enable_ept)
>>> +   vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
>>> +
>>> /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>>> if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>>> is_pae_paging(vcpu)) {
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d4575ffb3cec..b530950a9c2b 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
>>> cr3)
>>> {
>>> struct kvm *kvm = vcpu->kvm;
>>> unsigned long guest_cr3;
>>> +   bool skip_cr3 = false;
>>> u64 eptp;
>>> 
>>> guest_cr3 = cr3;
>>> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned 
>>> long cr3)
>>> spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>>> }
>>> 
>>> -   if (enable_unrestricted_guest || is_paging(vcpu) ||
>>> -   is_guest_mode(vcpu))
>>> +   if (is_guest_mode(vcpu))
>>> +   skip_cr3 = true;
>>> +   else if (enable_unrestricted_guest || is_paging(vcpu))
>>> guest_cr3 = kvm_read_cr3(vcpu);
>>> else
>>> guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>>> ept_load_pdptrs(vcpu);
>>> }
>>> 
>>> -   vmcs_writel(GUEST_CR3, guest_cr3);
>>> +   if (!skip_cr3)
>> 
>> Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool
>> write_guest_cr3”.
>> 
>> Anyway, code seems valid to me. Nice catch.
>> Reviewed-by: Liran Alon 
>> 
>> -Liran
>> 
>>> +   vmcs_writel(GUEST_CR3, guest_cr3);
>>> }
>>> 
>>> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>> -- 
>>> 2.22.0
>>> 
>> 



Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-26 Thread Liran Alon



> On 27 Sep 2019, at 0:43, Sean Christopherson 
>  wrote:
> 
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because L2 is being put into HLT.

I would rephrase to “If an emulated VMEntry is squashed because L1 sets 
vmcs12->guest_activity_state to HLT”.
I think it’s a bit more explicit.

> 
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Reto Buerki 
> Signed-off-by: Sean Christopherson 
> ---
> arch/x86/kvm/vmx/nested.c | 8 
> arch/x86/kvm/vmx/vmx.c| 9 ++---
> 2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 41abc62c9a8a..971a24134081 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
>   entry_failure_code))
>   return -EINVAL;
> 
> + /*
> +  * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> +  * on nested VM-Exit, which can occur without actually running L2, e.g.
> +  * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> +  */

If I understand correctly, it’s not exactly if L2 is entering HLT state in 
general.
(E.g. issue doesn’t occur if L2 runs HLT directly which is not configured to be 
intercepted by vmcs12).
It’s specifically when L1 enters L2 with a HLT guest-activity-state. I suggest 
rephrasing comment.

> + if (enable_ept)
> + vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> +
>   /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>   if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>   is_pae_paging(vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d4575ffb3cec..b530950a9c2b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
> {
>   struct kvm *kvm = vcpu->kvm;
>   unsigned long guest_cr3;
> + bool skip_cr3 = false;
>   u64 eptp;
> 
>   guest_cr3 = cr3;
> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
>   spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>   }
> 
> - if (enable_unrestricted_guest || is_paging(vcpu) ||
> - is_guest_mode(vcpu))
> + if (is_guest_mode(vcpu))
> + skip_cr3 = true;
> + else if (enable_unrestricted_guest || is_paging(vcpu))
>   guest_cr3 = kvm_read_cr3(vcpu);
>   else
>   guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>   ept_load_pdptrs(vcpu);
>   }
> 
> - vmcs_writel(GUEST_CR3, guest_cr3);
> + if (!skip_cr3)

Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool 
write_guest_cr3”.

Anyway, code seems valid to me. Nice catch.
Reviewed-by: Liran Alon 

-Liran

> + vmcs_writel(GUEST_CR3, guest_cr3);
> }
> 
> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> -- 
> 2.22.0
> 



Re: [PATCH] KVM: nVMX: cleanup and fix host 64-bit mode checks

2019-09-26 Thread Liran Alon



> On 25 Sep 2019, at 19:34, Paolo Bonzini  wrote:
> 
> KVM was incorrectly checking vmcs12->host_ia32_efer even if the "load
> IA32_EFER" exit control was reset.  Also, some checks were not using
> the new CC macro for tracing.
> 
> Cleanup everything so that the vCPU's 64-bit mode is determined
> directly from EFER_LMA and the VMCS checks are based on that, which
> matches section 26.2.4 of the SDM.
> 
> Cc: Sean Christopherson 
> Cc: Jim Mattson 
> Cc: Krish Sadhukhan 
> Fixes: 5845038c111db27902bc220a4f70070fe945871c
> Signed-off-by: Paolo Bonzini 
> ---

Reviewed-by: Liran Alon 




Re: [PATCH v3] KVM: x86: Disable posted interrupts for odd IRQs

2019-09-05 Thread Liran Alon



> On 5 Sep 2019, at 15:58, Alexander Graf  wrote:
> 
> We can easily route hardware interrupts directly into VM context when
> they target the "Fixed" or "LowPriority" delivery modes.
> 
> However, on modes such as "SMI" or "Init", we need to go via KVM code
> to actually put the vCPU into a different mode of operation, so we can
> not post the interrupt
> 
> Add code in the VMX and SVM PI logic to explicitly refuse to establish
> posted mappings for advanced IRQ deliver modes. This reflects the logic
> in __apic_accept_irq() which also only ever passes Fixed and LowPriority
> interrupts as posted interrupts into the guest.
> 
> This fixes a bug I have with code which configures real hardware to
> inject virtual SMIs into my guest.
> 
> Signed-off-by: Alexander Graf 

Reviewed-by: Liran Alon 

> 
> ---
> 
> v1 -> v2:
> 
>  - Make error message more unique
>  - Update commit message to point to __apic_accept_irq()
> 
> v2 -> v3:
> 
>  - Use if() rather than switch()
>  - Move abort logic into existing if() branch for broadcast irqs
>  -> remove the updated error message again (thus remove R-B tag from Liran)
>  - Fold VMX and SVM changes into single commit
>  - Combine postability check into helper function kvm_irq_is_postable()
> ---
> arch/x86/include/asm/kvm_host.h | 7 +++
> arch/x86/kvm/svm.c  | 4 +++-
> arch/x86/kvm/vmx/vmx.c  | 6 +-
> 3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 44a5ce57a905..5b14aa1fbeeb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1581,6 +1581,13 @@ bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct 
> kvm_lapic_irq *irq,
> void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
>struct kvm_lapic_irq *irq);
> 
> +static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
> +{
> + /* We can only post Fixed and LowPrio IRQs */
> + return (irq->delivery_mode == dest_Fixed ||
> + irq->delivery_mode == dest_LowestPrio);
> +}
> +
> static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
> {
>   if (kvm_x86_ops->vcpu_blocking)
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1f220a85514f..f5b03d0c9bc6 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5260,7 +5260,8 @@ get_pi_vcpu_info(struct kvm *kvm, struct 
> kvm_kernel_irq_routing_entry *e,
> 
>   kvm_set_msi_irq(kvm, e, &irq);
> 
> - if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
> + if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> + !kvm_irq_is_postable(&irq)) {
>   pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
>__func__, irq.vector);
>   return -1;
> @@ -5314,6 +5315,7 @@ static int svm_update_pi_irte(struct kvm *kvm, unsigned 
> int host_irq,
>* 1. When cannot target interrupt to a specific vcpu.
>* 2. Unsetting posted interrupt.
>* 3. APIC virtialization is disabled for the vcpu.
> +  * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
>*/
>   if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
>   kvm_vcpu_apicv_active(&svm->vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 570a233e272b..63f3d88b36cc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7382,10 +7382,14 @@ static int vmx_update_pi_irte(struct kvm *kvm, 
> unsigned int host_irq,
>* irqbalance to make the interrupts single-CPU.
>*
>* We will support full lowest-priority interrupt later.
> +  *
> +  * In addition, we can only inject generic interrupts using
> +  * the PI mechanism, refuse to route others through it.
>*/
> 
>   kvm_set_msi_irq(kvm, e, &irq);
> - if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
> + if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> + !kvm_irq_is_postable(&irq)) {
>   /*
>* Make sure the IRTE is in remapped mode if
>* we don't handle it in posted mode.
> -- 
> 2.17.1
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
> 



Re: [PATCH 2/2] KVM: SVM: Disable posted interrupts for odd IRQs

2019-09-03 Thread Liran Alon



> On 3 Sep 2019, at 17:29, Alexander Graf  wrote:
> 
> We can easily route hardware interrupts directly into VM context when
> they target the "Fixed" or "LowPriority" delivery modes.
> 
> However, on modes such as "SMI" or "Init", we need to go via KVM code
> to actually put the vCPU into a different mode of operation, so we can
> not post the interrupt
> 
> Add code in the SVM PI logic to explicitly refuse to establish posted
> mappings for advanced IRQ deliver modes.
> 
> This fixes a bug I have with code which configures real hardware to
> inject virtual SMIs into my guest.
> 
> Signed-off-by: Alexander Graf 

Nit: I prefer to squash both commits into one that changes both VMX & SVM,
as it's exactly the same change.

> ---
> arch/x86/kvm/svm.c | 16 
> 1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1f220a85514f..9a6ea78c3239 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5266,6 +5266,21 @@ get_pi_vcpu_info(struct kvm *kvm, struct 
> kvm_kernel_irq_routing_entry *e,
>   return -1;
>   }
> 
> + switch (irq.delivery_mode) {
> + case dest_Fixed:
> + case dest_LowestPrio:
> + break;
> + default:
> + /*
> +  * For non-trivial interrupt events, we need to go
> +  * through the full KVM IRQ code, so refuse to take
> +  * any direct PI assignments here.
> +  */
> + pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
> +  __func__, irq.vector);
> + return -1;
> + }
> +

Prefer changing the printed string to something different from the
!kvm_intr_is_single_vcpu() case, to assist debugging.
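E.g. something like the following (the wording is only a suggestion):

    pr_debug("SVM: %s: use legacy intr remap mode for non-postable irq %u\n",
             __func__, irq.vector);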

Having said that,
Reviewed-by: Liran Alon 

-Liran

>   pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
>irq.vector);
>   *svm = to_svm(vcpu);
> @@ -5314,6 +5329,7 @@ static int svm_update_pi_irte(struct kvm *kvm, unsigned 
> int host_irq,
>* 1. When cannot target interrupt to a specific vcpu.
>* 2. Unsetting posted interrupt.
>* 3. APIC virtialization is disabled for the vcpu.
> +  * 4. IRQ has extended delivery mode (SMI, INIT, etc)
>*/
>   if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
>   kvm_vcpu_apicv_active(&svm->vcpu)) {
> -- 
> 2.17.1
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
> 



Re: [PATCH 1/2] KVM: VMX: Disable posted interrupts for odd IRQs

2019-09-03 Thread Liran Alon



> On 3 Sep 2019, at 17:29, Alexander Graf  wrote:
> 
> We can easily route hardware interrupts directly into VM context when
> they target the "Fixed" or "LowPriority" delivery modes.
> 
> However, on modes such as "SMI" or "Init", we need to go via KVM code
> to actually put the vCPU into a different mode of operation, so we can
> not post the interrupt

I would also mention in the commit message that one can see this is also
true in KVM's vLAPIC code. i.e. __apic_accept_irq() calls
kvm_x86_ops->deliver_posted_interrupt() only in case the delivery-mode is
either "Fixed" or "LowPriority".

> 
> Add code in the VMX PI logic to explicitly refuse to establish posted
> mappings for advanced IRQ deliver modes.
> 
> This fixes a bug I have with code which configures real hardware to
> inject virtual SMIs into my guest.
> 
> Signed-off-by: Alexander Graf 

With some small improvements I wrote inline below:
Reviewed-by: Liran Alon 

> ---
> arch/x86/kvm/vmx/vmx.c | 22 ++
> 1 file changed, 22 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 570a233e272b..d16c4ae8f685 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7401,6 +7401,28 @@ static int vmx_update_pi_irte(struct kvm *kvm, 
> unsigned int host_irq,
>   continue;
>   }
> 
> + switch (irq.delivery_mode) {
> + case dest_Fixed:
> + case dest_LowestPrio:
> + break;
> + default:
> + /*
> +  * For non-trivial interrupt events, we need to go
> +  * through the full KVM IRQ code, so refuse to take
> +  * any direct PI assignments here.
> +  */
> +
> + ret = irq_set_vcpu_affinity(host_irq, NULL);
> + if (ret < 0) {
> + printk(KERN_INFO
> +"failed to back to remapped mode, irq: %u\n",
> +host_irq);
> + goto out;

I recommend we choose to print a string here that is different from the
!kvm_intr_is_single_vcpu() case, to make it easier to diagnose which case
exactly failed.
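E.g. something along these lines (the wording is only a suggestion):

    printk(KERN_INFO
           "failed to revert non-postable irq %u to remapped mode\n",
           host_irq);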

-Liran

> + }
> +
> + continue;
> + }
> +
>   vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
>   vcpu_info.vector = irq.vector;
> 
> -- 
> 2.17.1
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
> 



Re: [PATCH v2 05/14] KVM: x86: Move #GP injection for VMware into x86_emulate_instruction()

2019-08-27 Thread Liran Alon


> On 28 Aug 2019, at 0:40, Sean Christopherson 
>  wrote:
> 
> Immediately inject a #GP when VMware emulation fails and return
> EMULATE_DONE instead of propagating EMULATE_FAIL up the stack.  This
> helps pave the way for removing EMULATE_FAIL altogether.
> 
> Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to document that the x86
> emulator is called to handle VMware #GP interception, e.g. why a #GP
> is injected on emulation failure for EMULTYPE_VMWARE_GP.
> 
> Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type.  The "no #UD on fail"
> is used only in the VMWare case and is obsoleted by having the emulator
> itself reinject #GP.
> 
> Signed-off-by: Sean Christopherson 

Reviewed-by: Liran Alon 

> ---
> arch/x86/include/asm/kvm_host.h |  3 +--
> arch/x86/kvm/svm.c  | 10 ++
> arch/x86/kvm/vmx/vmx.c  | 10 ++
> arch/x86/kvm/x86.c  | 14 +-
> 4 files changed, 14 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 44a5ce57a905..d1d5b5ca1195 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1318,8 +1318,7 @@ enum emulation_result {
> #define EMULTYPE_TRAP_UD  (1 << 1)
> #define EMULTYPE_SKIP (1 << 2)
> #define EMULTYPE_ALLOW_RETRY  (1 << 3)
> -#define EMULTYPE_NO_UD_ON_FAIL   (1 << 4)
> -#define EMULTYPE_VMWARE  (1 << 5)
> +#define EMULTYPE_VMWARE_GP   (1 << 5)
> int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
>   void *insn, int insn_len);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7242142573d6..c4b72db48bc5 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2768,7 +2768,6 @@ static int gp_interception(struct vcpu_svm *svm)
> {
>   struct kvm_vcpu *vcpu = &svm->vcpu;
>   u32 error_code = svm->vmcb->control.exit_info_1;
> - int er;
> 
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> 
> @@ -2780,13 +2779,8 @@ static int gp_interception(struct vcpu_svm *svm)
>   kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>   return 1;
>   }
> - er = kvm_emulate_instruction(vcpu,
> - EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
> - if (er == EMULATE_USER_EXIT)
> - return 0;
> - else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> - return 1;
> + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP) !=
> + EMULATE_USER_EXIT;
> }
> 
> static bool is_erratum_383(void)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8a65e1122376..c6ba452296e3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4492,7 +4492,6 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>   u32 intr_info, ex_no, error_code;
>   unsigned long cr2, rip, dr6;
>   u32 vect_info;
> - enum emulation_result er;
> 
>   vect_info = vmx->idt_vectoring_info;
>   intr_info = vmx->exit_intr_info;
> @@ -4519,13 +4518,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>   kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>   return 1;
>   }
> - er = kvm_emulate_instruction(vcpu,
> - EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
> - if (er == EMULATE_USER_EXIT)
> - return 0;
> - else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> - return 1;
> + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP) !=
> + EMULATE_USER_EXIT;
>   }
> 
>   /*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe847f8eb947..228ca71d5b01 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6210,8 +6210,10 @@ static int handle_emulation_failure(struct kvm_vcpu 
> *vcpu, int emulation_type)
>   ++vcpu->stat.insn_emulation_fail;
>   trace_kvm_emulate_insn_failed(vcpu);
> 
> - if (emulation_type & EMULTYPE_NO_UD_ON_FAIL)
> - return EMULATE_FAIL;
> + if (emulation_type & EMULTYPE_VMWARE_GP) {
> + kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> + return EMULATE_DONE;
> + }
> 
>   kvm_queue_exception(vcpu, UD_VECTOR);
> 
> @@ -6543,9 +

Re: [RESEND PATCH 07/13] KVM: x86: Add explicit flag for forced emulation on #UD

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 17:44, Sean Christopherson 
>  wrote:
> 
> On Fri, Aug 23, 2019 at 04:47:14PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 23 Aug 2019, at 4:07, Sean Christopherson 
>>>  wrote:
>>> 
>>> Add an explicit emulation type for forced #UD emulation and use it to
>>> detect that KVM should unconditionally inject a #UD instead of falling
>>> into its standard emulation failure handling.
>>> 
>>> Signed-off-by: Sean Christopherson 
>> 
>> The name "forced emulation on #UD" is not clear to me.
>> 
>> If I understand correctly, EMULTYPE_TRAP_UD is currently used to indicate
>> that in case the x86 emulator fails to decode instruction, the caller would
>> like the x86 emulator to fail early such that it can handle this condition
>> properly.  Thus, I would rename it EMULTYPE_TRAP_DECODE_FAILURE.
> 
> EMULTYPE_TRAP_UD is used when KVM intercepts a #UD from hardware.  KVM
> only emulates select instructions in this case in order to minmize the
> emulator attack surface, e.g.:
> 
>   if (unlikely(ctxt->ud) && likely(!(ctxt->d & EmulateOnUD)))
>   return EMULATION_FAILED;
> 
> To enable testing of the emulator, KVM recognizes a special "opcode" that
> triggers full emulation on #UD, e.g. ctxt->ud is false when the #UD was
> triggered with the magic prefix.  The prefix is only recognized when the
> module param force_emulation_prefix is toggled on, hence the name
> EMULTYPE_TRAP_UD_FORCED.

Ah-ha. This makes sense. Thanks for the explanation.
I would say it's worth putting a comment about this in the code…
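For reference, the magic prefix matches the sig[] check in handle_ud()
("\xf\xbkvm", i.e. ud2 followed by the ASCII string "kvm"). A guest-side test
could force emulation of an instruction roughly like this (sketch only,
assuming force_emulation_prefix=1 is set on the kvm module):

    /* ud2; .ascii "kvm" -- the forced-emulation magic prefix */
    #define KVM_FEP ".byte 0x0f, 0x0b, 0x6b, 0x76, 0x6d;"

    static inline unsigned long fep_rdtsc(void)
    {
            unsigned int lo, hi;

            /* KVM strips the prefix and fully emulates the instruction. */
            asm volatile(KVM_FEP "rdtsc" : "=a"(lo), "=d"(hi));
            return ((unsigned long)hi << 32) | lo;
    }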

> 
>> But this new flag seems to do the same. So I’m left confused.  I’m probably
>> missing something trivial here.



Re: [RESEND PATCH 08/13] KVM: x86: Move #UD injection for failed emulation into emulation code

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:07, Sean Christopherson 
>  wrote:
> 
> Immediately inject a #UD and return EMULATE done if emulation fails when
> handling an intercepted #UD.  This helps pave the way for removing
> EMULATE_FAIL altogether.
> 
> Signed-off-by: Sean Christopherson 

I suggest squashing this commit with the previous one.

-Liran

> ---
> arch/x86/kvm/x86.c | 14 +-
> 1 file changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a1f9e36b2d58..bff3320aa78e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5328,7 +5328,6 @@ EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
> int handle_ud(struct kvm_vcpu *vcpu)
> {
>   int emul_type = EMULTYPE_TRAP_UD;
> - enum emulation_result er;
>   char sig[5]; /* ud2; .ascii "kvm" */
>   struct x86_exception e;
> 
> @@ -5340,12 +5339,7 @@ int handle_ud(struct kvm_vcpu *vcpu)
>   emul_type = EMULTYPE_TRAP_UD_FORCED;
>   }
> 
> - er = kvm_emulate_instruction(vcpu, emul_type);
> - if (er == EMULATE_USER_EXIT)
> - return 0;
> - if (er != EMULATE_DONE)
> - kvm_queue_exception(vcpu, UD_VECTOR);
> - return 1;
> + return kvm_emulate_instruction(vcpu, emul_type) != EMULATE_USER_EXIT;
> }
> EXPORT_SYMBOL_GPL(handle_ud);
> 
> @@ -6533,8 +6527,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   ++vcpu->stat.insn_emulation;
>   if (r != EMULATION_OK)  {
>   if ((emulation_type & EMULTYPE_TRAP_UD) ||
> - (emulation_type & EMULTYPE_TRAP_UD_FORCED))
> - return EMULATE_FAIL;
> + (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return EMULATE_DONE;
> + }
>   if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
>   emulation_type))
>   return EMULATE_DONE;
> -- 
> 2.22.0
> 



Re: [RESEND PATCH 07/13] KVM: x86: Add explicit flag for forced emulation on #UD

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:07, Sean Christopherson 
>  wrote:
> 
> Add an explicit emulation type for forced #UD emulation and use it to
> detect that KVM should unconditionally inject a #UD instead of falling
> into its standard emulation failure handling.
> 
> Signed-off-by: Sean Christopherson 

The name "forced emulation on #UD" is not clear to me.

If I understand correctly, EMULTYPE_TRAP_UD is currently used to indicate
that in case the x86 emulator fails to decode an instruction, the caller would
like the x86 emulator to fail early such that it can handle this condition
properly. Thus, I would rename it EMULTYPE_TRAP_DECODE_FAILURE.

But this new flag seems to do the same. So I’m left confused.
I’m probably missing something trivial here.

-Liran

> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c  | 5 +++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d1d5b5ca1195..a38c93362945 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1318,6 +1318,7 @@ enum emulation_result {
> #define EMULTYPE_TRAP_UD  (1 << 1)
> #define EMULTYPE_SKIP (1 << 2)
> #define EMULTYPE_ALLOW_RETRY  (1 << 3)
> +#define EMULTYPE_TRAP_UD_FORCED  (1 << 4)
> #define EMULTYPE_VMWARE_GP(1 << 5)
> int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 228ca71d5b01..a1f9e36b2d58 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5337,7 +5337,7 @@ int handle_ud(struct kvm_vcpu *vcpu)
>   sig, sizeof(sig), &e) == 0 &&
>   memcmp(sig, "\xf\xbkvm", sizeof(sig)) == 0) {
>   kvm_rip_write(vcpu, kvm_rip_read(vcpu) + sizeof(sig));
> - emul_type = 0;
> + emul_type = EMULTYPE_TRAP_UD_FORCED;
>   }
> 
>   er = kvm_emulate_instruction(vcpu, emul_type);
> @@ -6532,7 +6532,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   trace_kvm_emulate_insn_start(vcpu);
>   ++vcpu->stat.insn_emulation;
>   if (r != EMULATION_OK)  {
> - if (emulation_type & EMULTYPE_TRAP_UD)
> + if ((emulation_type & EMULTYPE_TRAP_UD) ||
> + (emulation_type & EMULTYPE_TRAP_UD_FORCED))
>   return EMULATE_FAIL;
>   if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
>   emulation_type))
> -- 
> 2.22.0
> 



Re: [RESEND PATCH 04/13] KVM: x86: Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 16:21, Liran Alon  wrote:
> 
> 
> 
>> On 23 Aug 2019, at 4:07, Sean Christopherson 
>>  wrote:
>> 
>> The "no #UD on fail" is used only in the VMWare case, and for the VMWare
>> scenario it really means "#GP instead of #UD on fail".  Remove the flag
>> in preparation for moving all fault injection into the emulation flow
>> itself, which in turn will allow eliminating EMULATE_DONE and company.
>> 
>> Signed-off-by: Sean Christopherson 
> 
> When I created the commit which introduced this,
> e23661712005 ("KVM: x86: Add emulation_type to not raise #UD on emulation
> failure"),
> I intentionally introduced a new emulation_type flag instead of using
> EMULTYPE_VMWARE, as I thought it's weird to couple this behaviour
> specifically with VMware emulation. It made sense to me that there could be
> more scenarios in which some VMExit handler would like to use the x86
> emulator but, in case of failure, wants to decide from the outside what the
> failure handling should be. I also didn't want the x86 emulator to be aware
> of VMware interception internals.
> 
> Having said that, one could argue that the x86 emulator already knows about
> the VMware interception internals because of how x86_emulate_instruction()
> uses is_vmware_backdoor_opcode(), and from the mere existence of
> EMULTYPE_VMWARE. So I think it's legit to decide that we will just move all
> the VMware interception logic into the x86 emulator, including handling
> emulation failures. But then, I would have this patch of yours also modify
> handle_emulation_failure() to queue a #GP to the guest directly, instead of
> having the #GP intercept in VMX/SVM do so.
> I see you do this in a later patch, "KVM: x86: Move #GP injection for VMware
> into x86_emulate_instruction()",
> but I think it should just be squashed with this patch to make sense.
> 
> To sum up, I agree with your approach, but I recommend you squash this patch
> and patch 6 of the series into one, and change the commit message to explain
> that you move the entire handling of VMware interception into the x86
> emulator, instead of providing explanations such as VMware emulation being
> the only one that uses "no #UD on fail".

After reading patch 5 as well, I would recommend first applying patch 5
(filter out #GP with error-code != 0) and only then applying 4+6.

-Liran

> 
> The diff itself looks fine to me, therefore:
> Reviewed-by: Liran Alon 
> 
> -Liran
> 
> 
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 -
>> arch/x86/kvm/svm.c  | 3 +--
>> arch/x86/kvm/vmx/vmx.c  | 3 +--
>> arch/x86/kvm/x86.c  | 2 +-
>> 4 files changed, 3 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/kvm_host.h 
>> b/arch/x86/include/asm/kvm_host.h
>> index 44a5ce57a905..dd6bd9ed0839 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1318,7 +1318,6 @@ enum emulation_result {
>> #define EMULTYPE_TRAP_UD (1 << 1)
>> #define EMULTYPE_SKIP(1 << 2)
>> #define EMULTYPE_ALLOW_RETRY (1 << 3)
>> -#define EMULTYPE_NO_UD_ON_FAIL  (1 << 4)
>> #define EMULTYPE_VMWARE  (1 << 5)
>> int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
>> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 1f220a85514f..5a42f9c70014 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -2772,8 +2772,7 @@ static int gp_interception(struct vcpu_svm *svm)
>> 
>>  WARN_ON_ONCE(!enable_vmware_backdoor);
>> 
>> -er = kvm_emulate_instruction(vcpu,
>> -EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
>> +er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
>>  if (er == EMULATE_USER_EXIT)
>>  return 0;
>>  else if (er != EMULATE_DONE)
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 18286e5b5983..6ecf773825e2 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -4509,8 +4509,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>> 
>>  if (!vmx->rmode.vm86_active && is_gp_fault(intr_info)) {
>>  WARN_ON_ONCE(!enable_vmware_backdoor);
>> -er = kvm_emulate_instruction(vcpu,
>> -EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
>> +er = kvm_emulate_instruc

Re: [RESEND PATCH 06/13] KVM: x86: Move #GP injection for VMware into x86_emulate_instruction()

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:07, Sean Christopherson 
>  wrote:
> 
> Immediately inject a #GP when VMware emulation fails and return
> EMULATE_DONE instead of propagating EMULATE_FAIL up the stack.  This
> helps pave the way for removing EMULATE_FAIL altogether.
> 
> Rename EMULTYPE_VMWARE to EMULTYPE_VMWARE_GP to help document why a #GP
> is injected on emulation failure.

I would rephrase this to say that the rename is in order to document that the
x86 emulator is called to handle VMware #GP interception. In theory, VMware
could have also added weird behaviour to #UD interception as well. :P

Besides minor comments inline below:
Reviewed-by: Liran Alon 

-Liran

> 
> Signed-off-by: Sean Christopherson 
> ---
> arch/x86/include/asm/kvm_host.h |  2 +-
> arch/x86/kvm/svm.c  |  9 ++---
> arch/x86/kvm/vmx/vmx.c  |  9 ++---
> arch/x86/kvm/x86.c  | 14 +-
> 4 files changed, 14 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dd6bd9ed0839..d1d5b5ca1195 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1318,7 +1318,7 @@ enum emulation_result {
> #define EMULTYPE_TRAP_UD  (1 << 1)
> #define EMULTYPE_SKIP (1 << 2)
> #define EMULTYPE_ALLOW_RETRY  (1 << 3)
> -#define EMULTYPE_VMWARE  (1 << 5)
> +#define EMULTYPE_VMWARE_GP   (1 << 5)
> int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
>   void *insn, int insn_len);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index b96a119690f4..97562c2c8b7b 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2768,7 +2768,6 @@ static int gp_interception(struct vcpu_svm *svm)
> {
>   struct kvm_vcpu *vcpu = &svm->vcpu;
>   u32 error_code = svm->vmcb->control.exit_info_1;
> - int er;
> 
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> 
> @@ -2776,12 +2775,8 @@ static int gp_interception(struct vcpu_svm *svm)
>   kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>   return 1;
>   }
> - er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
> - if (er == EMULATE_USER_EXIT)
> - return 0;
> - else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> - return 1;
> + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP) !=
> + EMULATE_USER_EXIT;
> }
> 
> static bool is_erratum_383(void)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3ee0dd304bc7..25410c58c758 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4492,7 +4492,6 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>   u32 intr_info, ex_no, error_code;
>   unsigned long cr2, rip, dr6;
>   u32 vect_info;
> - enum emulation_result er;
> 
>   vect_info = vmx->idt_vectoring_info;
>   intr_info = vmx->exit_intr_info;
> @@ -4514,12 +4513,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>   kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>   return 1;
>   }
> - er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
> - if (er == EMULATE_USER_EXIT)
> - return 0;
> - else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> - return 1;
> + return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP) !=
> + EMULATE_USER_EXIT;
>   }
> 
>   /*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e0f0e14d8fac..228ca71d5b01 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6210,8 +6210,10 @@ static int handle_emulation_failure(struct kvm_vcpu 
> *vcpu, int emulation_type)
>   ++vcpu->stat.insn_emulation_fail;
>   trace_kvm_emulate_insn_failed(vcpu);
> 
> - if (emulation_type & EMULTYPE_VMWARE)
> - return EMULATE_FAIL;
> + if (emulation_type & EMULTYPE_VMWARE_GP) {
> + kvm_queue_exception_e(vcpu, GP_VECTOR, 0);

I would add a comment here explaining why you can assume the #GP error-code
is 0. i.e. Explain that it's because VMware #GP interception only covers the
IN{S}, OUT{S} and RDPMC instructions, all of which raise #GP with an
error-code of 0.
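Something along these lines (the comment wording is only a suggestion):

    if (emulation_type & EMULTYPE_VMWARE_GP) {
            /*
             * The VMware backdoor only intercepts #GP for IN{S}, OUT{S} and
             * RDPMC, all of which raise #GP with an error code of 0, so it
             * is safe to hard-code the error code here.
             */
            kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
            return EMULATE_DONE;
    }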

> + return EMULATE_DONE;
> + }
> 
>   kvm_queue_excep

Re: [RESEND PATCH 04/13] KVM: x86: Drop EMULTYPE_NO_UD_ON_FAIL as a standalone type

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:07, Sean Christopherson 
>  wrote:
> 
> The "no #UD on fail" is used only in the VMWare case, and for the VMWare
> scenario it really means "#GP instead of #UD on fail".  Remove the flag
> in preparation for moving all fault injection into the emulation flow
> itself, which in turn will allow eliminating EMULATE_DONE and company.
> 
> Signed-off-by: Sean Christopherson 

When I created the commit which introduced this,
e23661712005 ("KVM: x86: Add emulation_type to not raise #UD on emulation
failure"),
I intentionally introduced a new emulation_type flag instead of using
EMULTYPE_VMWARE, as I thought it's weird to couple this behaviour
specifically with VMware emulation. It made sense to me that there could be
more scenarios in which some VMExit handler would like to use the x86
emulator but, in case of failure, wants to decide from the outside what the
failure handling should be. I also didn't want the x86 emulator to be aware
of VMware interception internals.

Having said that, one could argue that the x86 emulator already knows about
the VMware interception internals because of how x86_emulate_instruction()
uses is_vmware_backdoor_opcode(), and from the mere existence of
EMULTYPE_VMWARE. So I think it's legit to decide that we will just move all
the VMware interception logic into the x86 emulator, including handling
emulation failures. But then, I would have this patch of yours also modify
handle_emulation_failure() to queue a #GP to the guest directly, instead of
having the #GP intercept in VMX/SVM do so.
I see you do this in a later patch, "KVM: x86: Move #GP injection for VMware
into x86_emulate_instruction()",
but I think it should just be squashed with this patch to make sense.

To sum up, I agree with your approach, but I recommend you squash this patch
and patch 6 of the series into one, and change the commit message to explain
that you move the entire handling of VMware interception into the x86
emulator, instead of providing explanations such as VMware emulation being
the only one that uses "no #UD on fail".

The diff itself looks fine to me, therefore:
Reviewed-by: Liran Alon 

-Liran


> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/svm.c  | 3 +--
> arch/x86/kvm/vmx/vmx.c  | 3 +--
> arch/x86/kvm/x86.c  | 2 +-
> 4 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 44a5ce57a905..dd6bd9ed0839 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1318,7 +1318,6 @@ enum emulation_result {
> #define EMULTYPE_TRAP_UD  (1 << 1)
> #define EMULTYPE_SKIP (1 << 2)
> #define EMULTYPE_ALLOW_RETRY  (1 << 3)
> -#define EMULTYPE_NO_UD_ON_FAIL   (1 << 4)
> #define EMULTYPE_VMWARE   (1 << 5)
> int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
> int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1f220a85514f..5a42f9c70014 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2772,8 +2772,7 @@ static int gp_interception(struct vcpu_svm *svm)
> 
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> 
> - er = kvm_emulate_instruction(vcpu,
> - EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
> + er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
>   if (er == EMULATE_USER_EXIT)
>   return 0;
>   else if (er != EMULATE_DONE)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 18286e5b5983..6ecf773825e2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4509,8 +4509,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> 
>   if (!vmx->rmode.vm86_active && is_gp_fault(intr_info)) {
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> - er = kvm_emulate_instruction(vcpu,
> - EMULTYPE_VMWARE | EMULTYPE_NO_UD_ON_FAIL);
> + er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
>   if (er == EMULATE_USER_EXIT)
>   return 0;
>   else if (er != EMULATE_DONE)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe847f8eb947..e0f0e14d8fac 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6210,7 +6210,7 @@ static int handle_emulation_failure(struct kvm_vcpu 
> *vcpu, int emulation_type)
>   ++vcpu->stat.insn_emulation_fail;
>   trace_kvm_emulate_insn_failed(vcpu);
> 
> - if (emulation_type & EMULTYPE_NO_UD_ON_FAIL)
> + if (emulation_type & EMULTYPE_VMWARE)
>   return EMULATE_FAIL;
> 
>   kvm_queue_exception(vcpu, UD_VECTOR);
> -- 
> 2.22.0
> 



Re: [RESEND PATCH 05/13] KVM: x86: Don't attempt VMWare emulation on #GP with non-zero error code

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:07, Sean Christopherson 
>  wrote:
> 
> The VMware backdoor hooks #GP faults on IN{S}, OUT{S}, and RDPMC, none
> of which generate a non-zero error code for their #GP.  Re-injecting #GP
> instead of attempting emulation on a non-zero error code will allow a
> future patch to move #GP injection (for emulation failure) into
> kvm_emulate_instruction() without having to plumb in the error code.
> 
> Signed-off-by: Sean Christopherson 

Reviewed-by: Liran Alon 

-Liran

> ---
> arch/x86/kvm/svm.c | 6 +-
> arch/x86/kvm/vmx/vmx.c | 7 ++-
> 2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 5a42f9c70014..b96a119690f4 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2772,11 +2772,15 @@ static int gp_interception(struct vcpu_svm *svm)
> 
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> 
> + if (error_code) {
> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + return 1;
> + }
>   er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
>   if (er == EMULATE_USER_EXIT)
>   return 0;
>   else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
>   return 1;
> }
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6ecf773825e2..3ee0dd304bc7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4509,11 +4509,16 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> 
>   if (!vmx->rmode.vm86_active && is_gp_fault(intr_info)) {
>   WARN_ON_ONCE(!enable_vmware_backdoor);
> +
> + if (error_code) {
> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + return 1;
> + }
>   er = kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE);
>   if (er == EMULATE_USER_EXIT)
>   return 0;
>   else if (er != EMULATE_DONE)
> - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
>   return 1;
>   }
> 
> -- 
> 2.22.0
> 



Re: [RESEND PATCH 03/13] KVM: x86: Refactor kvm_vcpu_do_singlestep() to remove out param

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 4:06, Sean Christopherson 
>  wrote:
> 
> Return the single-step emulation result directly instead of via an out
> param.  Presumably at some point in the past kvm_vcpu_do_singlestep()
> could be called with *r==EMULATE_USER_EXIT, but that is no longer the
> case, i.e. all callers are happy to overwrite their own return variable.
> 
> Signed-off-by: Sean Christopherson 
> ---
> arch/x86/kvm/x86.c | 12 ++--
> 1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c6de5bc4fa5e..fe847f8eb947 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6377,7 +6377,7 @@ static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 
> type, u32 dr7,
>   return dr6;
> }
> 
> -static void kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu, int *r)
> +static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu)
> {
>   struct kvm_run *kvm_run = vcpu->run;
> 
> @@ -6386,10 +6386,10 @@ static void kvm_vcpu_do_singlestep(struct kvm_vcpu 
> *vcpu, int *r)
>   kvm_run->debug.arch.pc = vcpu->arch.singlestep_rip;
>   kvm_run->debug.arch.exception = DB_VECTOR;
>   kvm_run->exit_reason = KVM_EXIT_DEBUG;
> - *r = EMULATE_USER_EXIT;
> - } else {
> - kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS);
> + return EMULATE_USER_EXIT;
>   }
> + kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS);
> + return EMULATE_DONE;
> }
> 
> int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> @@ -6410,7 +6410,7 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>* that sets the TF flag".
>*/
>   if (unlikely(rflags & X86_EFLAGS_TF))
> - kvm_vcpu_do_singlestep(vcpu, &r);
> + r = kvm_vcpu_do_singlestep(vcpu);
>   return r == EMULATE_DONE;
> }
> EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
> @@ -6613,7 +6613,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
>   kvm_rip_write(vcpu, ctxt->eip);
>   if (r == EMULATE_DONE && ctxt->tf)
> - kvm_vcpu_do_singlestep(vcpu, &r);
> + r = kvm_vcpu_do_singlestep(vcpu);
>   if (!ctxt->have_exception ||
>   exception_type(ctxt->exception.vector) == EXCPT_TRAP)
>   __kvm_set_rflags(vcpu, ctxt->eflags);
> -- 
> 2.22.0
> 

Reviewed-by: Liran Alon 

-Liran




Re: [RESEND PATCH 02/13] KVM: x86: Clean up handle_emulation_failure()

2019-08-23 Thread Liran Alon



> On 23 Aug 2019, at 12:23, Vitaly Kuznetsov  wrote:
> 
> Sean Christopherson  writes:
> 
>> When handling emulation failure, return the emulation result directly
>> instead of capturing it in a local variable.  Future patches will move
>> additional cases into handle_emulation_failure(), clean up the cruft
>> before so there isn't an ugly mix of setting a local variable and
>> returning directly.
>> 
>> Signed-off-by: Sean Christopherson 
>> ---
>> arch/x86/kvm/x86.c | 10 --
>> 1 file changed, 4 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index cd425f54096a..c6de5bc4fa5e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6207,24 +6207,22 @@ EXPORT_SYMBOL_GPL(kvm_inject_realmode_interrupt);
>> 
>> static int handle_emulation_failure(struct kvm_vcpu *vcpu, int 
>> emulation_type)
>> {
>> -int r = EMULATE_DONE;
>> -
>>  ++vcpu->stat.insn_emulation_fail;
>>  trace_kvm_emulate_insn_failed(vcpu);
>> 
>>  if (emulation_type & EMULTYPE_NO_UD_ON_FAIL)
>>  return EMULATE_FAIL;
>> 
>> +kvm_queue_exception(vcpu, UD_VECTOR);
>> +
>>  if (!is_guest_mode(vcpu) && kvm_x86_ops->get_cpl(vcpu) == 0) {
>>  vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>>  vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
>>  vcpu->run->internal.ndata = 0;
>> -r = EMULATE_USER_EXIT;
>> +return EMULATE_USER_EXIT;
>>  }
>> 
>> -kvm_queue_exception(vcpu, UD_VECTOR);
>> -
>> -return r;
>> +return EMULATE_DONE;
>> }
>> 
>> static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
> 
> No functional change,
> 
> Reviewed-by: Vitaly Kuznetsov 
> 
> Just for self-education, what sane userspace is supposed to do when it
> sees KVM_EXIT_INTERNAL_ERROR other than kill the guest? Why does it make
> sense to still prepare to inject '#UD’
> 
> -- 
> Vitaly

The commit which introduced this behaviour seems to be
6d77dbfc88e3 ("KVM: inject #UD if instruction emulation fails and exit to 
userspace")

I actually agree with Vitaly. It would have made more sense for the ABI to be that
on internal emulation failure, we just return to userspace and allow it to
handle
the scenario however it likes. If it wishes to queue a #UD on the vCPU and resume
the guest in case CPL==3, then it would have made sense for that logic to live only
in userspace.
Thus, there would be no need for KVM to queue a #UD from the kernel in this scenario...

What’s even weirder is that this ABI was then further broken by 2 later commits:
First, fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space")
changed the behaviour to avoid reporting the emulation error in case the vCPU is
in guest-mode.
Then, a2b9e6c1a35a ("KVM: x86: Don't report guest userspace emulation error to
userspace")
changed the behaviour similarly to avoid reporting the emulation error in case the
vCPU runs with CPL!=0.
In both cases, only a #UD is injected to the guest without userspace being aware
of it.

The problem is that if we changed this ABI to not queue a #UD on emulation error,
we would definitely break userspace VMMs that rely on it when they re-enter the
guest
in this scenario and expect a #UD to be injected.
Therefore, the only way to change this behaviour is to introduce a new KVM_CAP
that needs to be explicitly enabled from userspace.
But because most userspace VMMs most likely just terminate the guest in case
of an emulation failure, it’s probably not worth it and Sean’s commit is good
enough.

For the commit itself:
Reviewed-by: Liran Alon 

-Liran





Re: [PATCH v2] KVM: nVMX: do not use dangling shadow VMCS after guest reset

2019-07-19 Thread Liran Alon



> On 20 Jul 2019, at 1:21, Paolo Bonzini  wrote:
> 
> On 20/07/19 00:06, Liran Alon wrote:
>> 
>> 
>>> On 20 Jul 2019, at 0:39, Paolo Bonzini  wrote:
>>> 
>>> If a KVM guest is reset while running a nested guest, free_nested will
>>> disable the shadow VMCS execution control in the vmcs01.  However,
>>> on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
>>> the VMCS12 to the shadow VMCS which has since been freed.
>>> 
>>> This causes a vmptrld of a NULL pointer on my machine, but Jan reports
>>> the host to hang altogether.  Let's see how much this trivial patch fixes.
>>> 
>>> Reported-by: Jan Kiszka 
>>> Cc: Liran Alon 
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: Paolo Bonzini 
>> 
>> 1) Are we sure we prefer WARN_ON() instead of WARN_ON_ONCE()?
> 
> I don't think you can get it to be called in a loop, the calls are
> generally guarded by ifs.
> 
>> 2) Should we also check for WARN_ON(!vmcs12)? As free_nested() also 
>> kfree(vmx->nested.cached_vmcs12).
> 
> Well, it doesn't NULL it but it does NULL shadow_vmcs so the extra
> warning wouldn't add much.
> 
>> In fact, because free_nested() don’t put NULL in cached_vmcs12 after kfree() 
>> it, I wonder if we shouldn’t create a separate patch that does:
>> (a) Modify free_nested() to put NULL in cached_vmcs12 after kfree().
>> (b) Put BUG_ON(!cached_vmcs12) in get_vmcs12() before returning value.
> 
> This is useful but a separate improvement (and not a bugfix, I want this
> patch to be small so it applies to older trees).
> 
> Paolo

ACK on all the above. :)
Reviewed-by:  Liran Alon 

-Liran

> 
>> -Liran
>> 
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 8 +++-
>>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 4f23e34f628b..0f1378789bd0 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -194,6 +194,7 @@ static void vmx_disable_shadow_vmcs(struct vcpu_vmx 
>>> *vmx)
>>> {
>>> secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
>>> vmcs_write64(VMCS_LINK_POINTER, -1ull);
>>> +   vmx->nested.need_vmcs12_to_shadow_sync = false;
>>> }
>>> 
>>> static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
>>> @@ -1341,6 +1342,9 @@ static void copy_shadow_to_vmcs12(struct vcpu_vmx 
>>> *vmx)
>>> unsigned long val;
>>> int i;
>>> 
>>> +   if (WARN_ON(!shadow_vmcs))
>>> +   return;
>>> +
>>> preempt_disable();
>>> 
>>> vmcs_load(shadow_vmcs);
>>> @@ -1373,6 +1377,9 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx 
>>> *vmx)
>>> unsigned long val;
>>> int i, q;
>>> 
>>> +   if (WARN_ON(!shadow_vmcs))
>>> +   return;
>>> +
>>> vmcs_load(shadow_vmcs);
>>> 
>>> for (q = 0; q < ARRAY_SIZE(fields); q++) {
>>> @@ -4436,7 +4443,6 @@ static inline void nested_release_vmcs12(struct 
>>> kvm_vcpu *vcpu)
>>> /* copy to memory all shadowed fields in case
>>>they were modified */
>>> copy_shadow_to_vmcs12(vmx);
>>> -   vmx->nested.need_vmcs12_to_shadow_sync = false;
>>> vmx_disable_shadow_vmcs(vmx);
>>> }
>>> vmx->nested.posted_intr_nv = -1;
>>> -- 
>>> 1.8.3.1
>>> 
>> 
> 



Re: [PATCH v2] KVM: nVMX: do not use dangling shadow VMCS after guest reset

2019-07-19 Thread Liran Alon



> On 20 Jul 2019, at 0:39, Paolo Bonzini  wrote:
> 
> If a KVM guest is reset while running a nested guest, free_nested will
> disable the shadow VMCS execution control in the vmcs01.  However,
> on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
> the VMCS12 to the shadow VMCS which has since been freed.
> 
> This causes a vmptrld of a NULL pointer on my machine, but Jan reports
> the host to hang altogether.  Let's see how much this trivial patch fixes.
> 
> Reported-by: Jan Kiszka 
> Cc: Liran Alon 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 

1) Are we sure we prefer WARN_ON() instead of WARN_ON_ONCE()?
2) Should we also check for WARN_ON(!vmcs12)? As free_nested() also 
kfree(vmx->nested.cached_vmcs12).
In fact, because free_nested() doesn’t put NULL in cached_vmcs12 after kfree()ing
it, I wonder if we shouldn’t create a separate patch that does:
(a) Modify free_nested() to put NULL in cached_vmcs12 after kfree().
(b) Put BUG_ON(!cached_vmcs12) in get_vmcs12() before returning value.
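e.g. for (b), a rough sketch (untested):

static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
{
	/* Catch use-after-free of cached_vmcs12 early and loudly. */
	BUG_ON(!to_vmx(vcpu)->nested.cached_vmcs12);
	return to_vmx(vcpu)->nested.cached_vmcs12;
}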

-Liran

> ---
> arch/x86/kvm/vmx/nested.c | 8 +++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4f23e34f628b..0f1378789bd0 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -194,6 +194,7 @@ static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx)
> {
>   secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
>   vmcs_write64(VMCS_LINK_POINTER, -1ull);
> + vmx->nested.need_vmcs12_to_shadow_sync = false;
> }
> 
> static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
> @@ -1341,6 +1342,9 @@ static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
>   unsigned long val;
>   int i;
> 
> + if (WARN_ON(!shadow_vmcs))
> + return;
> +
>   preempt_disable();
> 
>   vmcs_load(shadow_vmcs);
> @@ -1373,6 +1377,9 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
>   unsigned long val;
>   int i, q;
> 
> + if (WARN_ON(!shadow_vmcs))
> + return;
> +
>   vmcs_load(shadow_vmcs);
> 
>   for (q = 0; q < ARRAY_SIZE(fields); q++) {
> @@ -4436,7 +4443,6 @@ static inline void nested_release_vmcs12(struct 
> kvm_vcpu *vcpu)
>   /* copy to memory all shadowed fields in case
>  they were modified */
>   copy_shadow_to_vmcs12(vmx);
> - vmx->nested.need_vmcs12_to_shadow_sync = false;
>   vmx_disable_shadow_vmcs(vmx);
>   }
>   vmx->nested.posted_intr_nv = -1;
> -- 
> 1.8.3.1
> 



Re: [PATCH] KVM: VMX: dump VMCS on failed entry

2019-07-19 Thread Liran Alon



> On 19 Jul 2019, at 19:42, Paolo Bonzini  wrote:
> 
> This is useful for debugging, and is ratelimited nowadays.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Liran Alon 

> ---
> arch/x86/kvm/vmx/vmx.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 69536553446d..c7ee5ead1565 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5829,6 +5829,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>   }
> 
>   if (unlikely(vmx->fail)) {
> + dump_vmcs();
>   vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>   vcpu->run->fail_entry.hardware_entry_failure_reason
>   = vmcs_read32(VM_INSTRUCTION_ERROR);
> -- 
> 1.8.3.1
> 



Re: [PATCH] KVM: nVMX: do not use dangling shadow VMCS after guest reset

2019-07-19 Thread Liran Alon



> On 19 Jul 2019, at 19:42, Paolo Bonzini  wrote:
> 
> If a KVM guest is reset while running a nested guest, free_nested will
> disable the shadow VMCS execution control in the vmcs01.  However,
> on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
> the VMCS12 to the shadow VMCS which has since been freed.
> 
> This causes a vmptrld of a NULL pointer on my machine, but Jan reports
> the host to hang altogether.  Let's see how much this trivial patch fixes.
> 
> Reported-by: Jan Kiszka 
> Signed-off-by: Paolo Bonzini 

First, nested_release_vmcs12() also sets need_vmcs12_to_shadow_sync to false 
explicitly. This can now be removed.

Second, I suggest putting a WARN_ON_ONCE() in copy_vmcs12_to_shadow() in case
shadow_vmcs==NULL,
to assist in catching these kinds of errors more easily in the future.

Besides that, the fix seems correct to me.
Reviewed-by: Liran Alon 

-Liran

> ---
> arch/x86/kvm/vmx/nested.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 6e88f459b323..6119b30347c6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -194,6 +194,7 @@ static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx)
> {
>   secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
>   vmcs_write64(VMCS_LINK_POINTER, -1ull);
> + vmx->nested.need_vmcs12_to_shadow_sync = false;
> }
> 
> static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
> -- 
> 1.8.3.1
> 



Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC

2019-07-05 Thread Liran Alon



> On 5 Jul 2019, at 15:14, Paolo Bonzini  wrote:
> 
> kvm-unit-tests were adjusted to match bare metal behavior, but KVM
> itself was not doing what bare metal does; fix that.
> 
> Signed-off-by: Paolo Bonzini 
> ---
> arch/x86/kvm/lapic.c | 6 +-
> 1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index d6ca5c4f29f1..2e4470f2685a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1318,7 +1318,7 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 
> offset, int len,
>   unsigned char alignment = offset & 0xf;
>   u32 result;
>   /* this bitmask has a bit cleared for each reserved register */
> - static const u64 rmask = 0x43ff01ffe70cULL;
> + u64 rmask = 0x43ff01ffe70cULL;

Why not rename this to “used_bits_mask” and calculate it properly with macros?
It seems a lot nicer than having a pre-calculated magic number.
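Something along these lines (illustrative sketch only; the register list is not
complete and APIC_REG_BIT is a made-up helper name):

#define APIC_REG_BIT(reg)	(1ULL << ((reg) >> 4))

	/* One bit set per implemented (readable) register offset. */
	u64 used_bits_mask = APIC_REG_BIT(APIC_ID) | APIC_REG_BIT(APIC_LVR) |
			     APIC_REG_BIT(APIC_TASKPRI) | APIC_REG_BIT(APIC_ARBPRI) |
			     APIC_REG_BIT(APIC_PROCPRI) | APIC_REG_BIT(APIC_LDR) |
			     APIC_REG_BIT(APIC_SPIV);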

-Liran

> 
>   if ((alignment + len) > 4) {
>   apic_debug("KVM_APIC_READ: alignment error %x %d\n",
> @@ -1326,6 +1326,10 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 
> offset, int len,
>   return 1;
>   }
> 
> + /* ARBPRI is also reserved on x2APIC */
> + if (apic_x2apic_mode(apic))
> + rmask &= ~(1 << (APIC_ARBPRI >> 4));
> +
>   if (offset > 0x3f0 || !(rmask & (1ULL << (offset >> 4 {
>   apic_debug("KVM_APIC_READ: read reserved register %x\n",
>  offset);
> -- 
> 1.8.3.1
> 



Re: [PATCH 2/4] kvm: x86: allow set apic and ioapic debug dynamically

2019-07-03 Thread Liran Alon



> On 3 Jul 2019, at 19:23, Paolo Bonzini  wrote:
> 
> On 01/07/19 08:21, Yi Wang wrote:
>> There are two *_debug() macros in kvm apic source file:
>> - ioapic_debug, which is disable using #if 0
>> - apic_debug, which is commented
>> 
>> Maybe it's better to control these two macros using CONFIG_KVM_DEBUG,
>> which can be set in make menuconfig.
>> 
>> Signed-off-by: Yi Wang 
>> ---
>> arch/x86/kvm/ioapic.c | 2 +-
>> arch/x86/kvm/lapic.c  | 5 -
>> 2 files changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> index 1add1bc..8099253 100644
>> --- a/arch/x86/kvm/ioapic.c
>> +++ b/arch/x86/kvm/ioapic.c
>> @@ -45,7 +45,7 @@
>> #include "lapic.h"
>> #include "irq.h"
>> 
>> -#if 0
>> +#ifdef CONFIG_KVM_DEBUG
>> #define ioapic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg)
>> #else
>> #define ioapic_debug(fmt, arg...)
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 4924f83..dfff5c6 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -54,8 +54,11 @@
>> #define PRIu64 "u"
>> #define PRIo64 "o"
>> 
>> -/* #define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */
>> +#ifdef CONFIG_KVM_DEBUG
>> +#define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg)
>> +#else
>> #define apic_debug(fmt, arg...) do {} while (0)
>> +#endif
>> 
>> /* 14 is the version for Xeon and Pentium 8.4.8*/
>> #define APIC_VERSION (0x14UL | ((KVM_APIC_LVT_NUM - 1) << 
>> 16))
>> 
> 
> I would just drop all of them.  I've never used them in years, the kvm
> tracepoints are enough.
> 
> Paolo

As someone who has done a lot of LAPIC/IOAPIC debugging, I tend to agree. :)

-Liran



Re: [PATCH 2/3] KVM: nVMX: allow setting the VMFUNC controls MSR

2019-07-02 Thread Liran Alon



> On 2 Jul 2019, at 18:04, Paolo Bonzini  wrote:
> 
> Allow userspace to set a custom value for the VMFUNC controls MSR, as long
> as the capabilities it advertises do not exceed those of the host.
> 
> Fixes: 27c42a1bb ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 
> 2017-08-03)
> Cc: sta...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Liran Alon 

> ---
> arch/x86/kvm/vmx/nested.c | 5 +
> 1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c4e29ef0b21e..163d226efa96 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1234,6 +1234,11 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 
> msr_index, u64 data)
>   case MSR_IA32_VMX_VMCS_ENUM:
>   vmx->nested.msrs.vmcs_enum = data;
>   return 0;
> + case MSR_IA32_VMX_VMFUNC:
> + if (data & ~vmx->nested.msrs.vmfunc_controls)
> + return -EINVAL;
> + vmx->nested.msrs.vmfunc_controls = data;
> + return 0;
>   default:
>   /*
>* The rest of the VMX capability MSRs do not support restore.
> -- 
> 1.8.3.1
> 
> 



Re: [PATCH 1/3] KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS

2019-07-02 Thread Liran Alon



> On 2 Jul 2019, at 18:04, Paolo Bonzini  wrote:
> 
> Some secondary controls are automatically enabled/disabled based on the CPUID
> values that are set for the guest.  However, they are still available at a
> global level and therefore should be present when KVM_GET_MSRS is sent to
> /dev/kvm.
> 
> Fixes: 1389309c811 ("KVM: nVMX: expose VMX capabilities for nested 
> hypervisors to userspace", 2018-02-26)
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Liran Alon 

> ---
> arch/x86/kvm/vmx/nested.c | 7 ++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 990e543f4531..c4e29ef0b21e 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5750,10 +5750,15 @@ void nested_vmx_setup_ctls_msrs(struct 
> nested_vmx_msrs *msrs, u32 ept_caps,
>   msrs->secondary_ctls_low = 0;
>   msrs->secondary_ctls_high &=
>   SECONDARY_EXEC_DESC |
> + SECONDARY_EXEC_RDTSCP |
>   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> + SECONDARY_EXEC_WBINVD_EXITING |
>   SECONDARY_EXEC_APIC_REGISTER_VIRT |
>   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
> - SECONDARY_EXEC_WBINVD_EXITING;
> + SECONDARY_EXEC_RDRAND_EXITING |
> + SECONDARY_EXEC_ENABLE_INVPCID |
> + SECONDARY_EXEC_RDSEED_EXITING |
> + SECONDARY_EXEC_XSAVES;
> 
>   /*
>* We can emulate "VMCS shadowing," even if the hardware
> -- 
> 1.8.3.1
> 
> 



Re: [PATCH] x86/kvm/nVMCS: fix VMCLEAR when Enlightened VMCS is in use

2019-06-25 Thread Liran Alon



> On 25 Jun 2019, at 14:15, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 25 Jun 2019, at 11:51, Vitaly Kuznetsov  wrote:
>>> 
>>> Liran Alon  writes:
>>> 
>>>>> On 24 Jun 2019, at 16:30, Vitaly Kuznetsov  wrote:
>>>>> 
>>>>> 
>>>>> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr)
>>>> 
>>>> I prefer to rename evmptr to evmcs_ptr. I think it’s more readable and 
>>>> sufficiently short.
>>>> In addition, I think you should return either -1ull or 
>>>> assist_page.current_nested_vmcs.
>>>> i.e. Don’t return evmcs_ptr by pointer but instead as a return-value
>>>> and get rid of the bool.
>>> 
>>> Actually no, sorry, I'm having second thoughts here: in handle_vmclear()
>>> we don't care about the value of evmcs_ptr, we only want to check that
>>> enlightened vmentry bit is enabled in assist page. If we switch to
>>> checking evmcs_ptr against '-1', for example, we will make '-1' a magic
>>> value which is not in the TLFS. Windows may decide to use it for
>>> something else - and we will get a hard-to-debug bug again.
>> 
>> I’m not sure I understand.
>> You are worried that when guest have setup a valid assist-page and set
>> enlighten_vmentry to true,
>> that assist_page.current_nested_vmcs can be -1ull and still be considered a 
>> valid eVMCS?
>> I don't think that's reasonable.
> 
> No, -1ull is not a valid eVMCS - but this shouldn't change VMCLEAR
> semantics as VMCLEAR has it's own argument. It's perfectly valid to try
> to put a eVMCS which was previously used on a different vCPU (and thus
> which is 'active') to non-active state. The fact that we don't have an
> active eVMCS on the vCPU doing VMCLEAR shouldn't matter at all.
> 
> -- 
> Vitaly

Oh oops sure. Yes you are right.
I forgot about the larger context here for a moment.
Sorry for the confusion. :)

-Liran



Re: [PATCH] x86/kvm/nVMCS: fix VMCLEAR when Enlightened VMCS is in use

2019-06-25 Thread Liran Alon



> On 25 Jun 2019, at 11:51, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 24 Jun 2019, at 16:30, Vitaly Kuznetsov  wrote:
>>> 
>>> 
>>> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr)
>> 
>> I prefer to rename evmptr to evmcs_ptr. I think it’s more readable and 
>> sufficiently short.
>> In addition, I think you should return either -1ull or 
>> assist_page.current_nested_vmcs.
>> i.e. Don’t return evmcs_ptr by pointer but instead as a return-value
>> and get rid of the bool.
> 
> Actually no, sorry, I'm having second thoughts here: in handle_vmclear()
> we don't care about the value of evmcs_ptr, we only want to check that
> enlightened vmentry bit is enabled in assist page. If we switch to
> checking evmcs_ptr against '-1', for example, we will make '-1' a magic
> value which is not in the TLFS. Windows may decide to use it for
> something else - and we will get a hard-to-debug bug again.

I’m not sure I understand.
You are worried that when the guest has set up a valid assist page and set
enlighten_vmentry to true,
assist_page.current_nested_vmcs can be -1ull and still be considered a
valid eVMCS?
I don't think that's reasonable.

i.e. I thought about having this version of the method:

+u64 nested_enlightened_vmentry(struct kvm_vcpu *vcpu)
+{
+   struct hv_vp_assist_page assist_page;
+
+   if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
+   return -1ull;
+
+   if (unlikely(!assist_page.enlighten_vmentry))
+   return -1ull;
+
+   return assist_page.current_nested_vmcs;
+}
+

-Liran

> 
> If you still dislike nested_enlightened_vmentry() having the side effect
> of fetching evmcs_ptr I can get rid of it by splitting the function into
> two, however, it will be less efficient for
> nested_vmx_handle_enlightened_vmptrld(). Or we can just leave things as
> they are there and use the newly introduced function in handle_vmclear()
> only.
> 
> -- 
> Vitaly



Re: [PATCH] x86/kvm/nVMCS: fix VMCLEAR when Enlightened VMCS is in use

2019-06-24 Thread Liran Alon



> On 24 Jun 2019, at 17:16, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 24 Jun 2019, at 16:30, Vitaly Kuznetsov  wrote:
>>> 
>>> When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
>>> according to TLFS, this should "transition an enlightened VMCS from the
>>> active to the non-active state". It is, however, wrong to assume that
>>> it is only valid to do VMCLEAR for the eVMCS which is currently active
>>> on the vCPU performing VMCLEAR.
>>> 
>>> Currently, the logic in handle_vmclear() is broken: in case, there is no
>>> active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal'
>>> VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly
>>> corrupts the memory area.
>>> 
>>> So, in case the VMCLEAR argument is not the current active eVMCS on the
>>> vCPU, how can we know if the area it is pointing to is a normal or an
>>> enlightened VMCS?
>>> Thanks to the bug in Hyper-V (see commit 72aeb60c52bf7 ("KVM: nVMX: Verify
>>> eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can
>>> not, the revision can't be used to distinguish between them. So let's
>>> assume it is always enlightened in case enlightened vmentry is enabled in
>>> the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to
>>> minimize the impact for 'unenlightened' workloads.
>>> 
>>> Fixes: b8bbab928fb1 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
>>> Signed-off-by: Vitaly Kuznetsov 
>>> ---
>>> arch/x86/kvm/vmx/evmcs.c  | 18 ++
>>> arch/x86/kvm/vmx/evmcs.h  |  1 +
>>> arch/x86/kvm/vmx/nested.c | 19 ---
>>> 3 files changed, 27 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
>>> index 1a6b3e1581aa..eae636ec0cc8 100644
>>> --- a/arch/x86/kvm/vmx/evmcs.c
>>> +++ b/arch/x86/kvm/vmx/evmcs.c
>>> @@ -3,6 +3,7 @@
>>> #include 
>>> #include 
>>> 
>>> +#include "../hyperv.h"
>>> #include "evmcs.h"
>>> #include "vmcs.h"
>>> #include "vmx.h"
>>> @@ -309,6 +310,23 @@ void evmcs_sanitize_exec_ctrls(struct vmcs_config 
>>> *vmcs_conf)
>>> }
>>> #endif
>>> 
>>> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr)
>> 
>> I prefer to rename evmptr to evmcs_ptr. I think it’s more readable and 
>> sufficiently short.
>> In addition, I think you should return either -1ull or 
>> assist_page.current_nested_vmcs.
>> i.e. Don’t return evmcs_ptr by pointer but instead as a return-value
>> and get rid of the bool.
> 
> Sure, can do in v2.
> 
>> 
>>> +{
>>> +   struct hv_vp_assist_page assist_page;
>>> +
>>> +   *evmptr = -1ull;
>>> +
>>> +   if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
>>> +   return false;
>>> +
>>> +   if (unlikely(!assist_page.enlighten_vmentry))
>>> +   return false;
>>> +
>>> +   *evmptr = assist_page.current_nested_vmcs;
>>> +
>>> +   return true;
>>> +}
>>> +
>>> uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu)
>>> {
>>>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
>>> index e0fcef85b332..c449e79a9c4a 100644
>>> --- a/arch/x86/kvm/vmx/evmcs.h
>>> +++ b/arch/x86/kvm/vmx/evmcs.h
>>> @@ -195,6 +195,7 @@ static inline void evmcs_sanitize_exec_ctrls(struct 
>>> vmcs_config *vmcs_conf) {}
>>> static inline void evmcs_touch_msr_bitmap(void) {}
>>> #endif /* IS_ENABLED(CONFIG_HYPERV) */
>>> 
>>> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr);
>>> uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu);
>>> int nested_enable_evmcs(struct kvm_vcpu *vcpu,
>>> uint16_t *vmcs_version);
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 9214b3aea1f9..ee8dda7d8a03 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -1765,26 +1765,21 @@ static int 
>>> nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu,
>>>  bool from_launch)
>>> {
>>>

Re: [PATCH] x86/kvm/nVMCS: fix VMCLEAR when Enlightened VMCS is in use

2019-06-24 Thread Liran Alon



> On 24 Jun 2019, at 16:30, Vitaly Kuznetsov  wrote:
> 
> When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
> according to TLFS, this should "transition an enlightened VMCS from the
> active to the non-active state". It is, however, wrong to assume that
> it is only valid to do VMCLEAR for the eVMCS which is currently active
> on the vCPU performing VMCLEAR.
> 
> Currently, the logic in handle_vmclear() is broken: in case, there is no
> active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal'
> VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly
> corrupts the memory area.
> 
> So, in case the VMCLEAR argument is not the current active eVMCS on the
> vCPU, how can we know if the area it is pointing to is a normal or an
> enlightened VMCS?
> Thanks to the bug in Hyper-V (see commit 72aeb60c52bf7 ("KVM: nVMX: Verify
> eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can
> not, the revision can't be used to distinguish between them. So let's
> assume it is always enlightened in case enlightened vmentry is enabled in
> the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to
> minimize the impact for 'unenlightened' workloads.
> 
> Fixes: b8bbab928fb1 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
> Signed-off-by: Vitaly Kuznetsov 
> ---
> arch/x86/kvm/vmx/evmcs.c  | 18 ++
> arch/x86/kvm/vmx/evmcs.h  |  1 +
> arch/x86/kvm/vmx/nested.c | 19 ---
> 3 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c
> index 1a6b3e1581aa..eae636ec0cc8 100644
> --- a/arch/x86/kvm/vmx/evmcs.c
> +++ b/arch/x86/kvm/vmx/evmcs.c
> @@ -3,6 +3,7 @@
> #include 
> #include 
> 
> +#include "../hyperv.h"
> #include "evmcs.h"
> #include "vmcs.h"
> #include "vmx.h"
> @@ -309,6 +310,23 @@ void evmcs_sanitize_exec_ctrls(struct vmcs_config 
> *vmcs_conf)
> }
> #endif
> 
> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr)

I prefer to rename evmptr to evmcs_ptr. I think it’s more readable and 
sufficiently short.
In addition, I think you should return either -1ull or 
assist_page.current_nested_vmcs.
i.e. Don’t return evmcs_ptr by pointer but instead as a return-value and get 
rid of the bool.

> +{
> + struct hv_vp_assist_page assist_page;
> +
> + *evmptr = -1ull;
> +
> + if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
> + return false;
> +
> + if (unlikely(!assist_page.enlighten_vmentry))
> + return false;
> +
> + *evmptr = assist_page.current_nested_vmcs;
> +
> + return true;
> +}
> +
> uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu)
> {
>struct vcpu_vmx *vmx = to_vmx(vcpu);
> diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
> index e0fcef85b332..c449e79a9c4a 100644
> --- a/arch/x86/kvm/vmx/evmcs.h
> +++ b/arch/x86/kvm/vmx/evmcs.h
> @@ -195,6 +195,7 @@ static inline void evmcs_sanitize_exec_ctrls(struct 
> vmcs_config *vmcs_conf) {}
> static inline void evmcs_touch_msr_bitmap(void) {}
> #endif /* IS_ENABLED(CONFIG_HYPERV) */
> 
> +bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmptr);
> uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu);
> int nested_enable_evmcs(struct kvm_vcpu *vcpu,
>   uint16_t *vmcs_version);
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 9214b3aea1f9..ee8dda7d8a03 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1765,26 +1765,21 @@ static int 
> nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu,
>bool from_launch)
> {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
> - struct hv_vp_assist_page assist_page;
> + u64 evmptr;

I prefer to rename evmptr to evmcs_ptr. I think it’s more readable and 
sufficiently short.

> 
>   if (likely(!vmx->nested.enlightened_vmcs_enabled))
>   return 1;
> 
> - if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
> + if (!nested_enlightened_vmentry(vcpu, &evmptr))
>   return 1;
> 
> - if (unlikely(!assist_page.enlighten_vmentry))
> - return 1;
> -
> - if (unlikely(assist_page.current_nested_vmcs !=
> -  vmx->nested.hv_evmcs_vmptr)) {
> -
> + if (unlikely(evmptr != vmx->nested.hv_evmcs_vmptr)) {
>   if (!vmx->nested.hv_evmcs)
>   vmx->nested.current_vmptr = -1ull;
> 
>   nested_release_evmcs(vcpu);
> 
> - if (kvm_vcpu_map(vcpu, 
> gpa_to_gfn(assist_page.current_nested_vmcs),
> + if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmptr),
> &vmx->nested.hv_evmcs_map))
>   return 0;
> 
> @@ -1826,7 +1821,7 @@ static int nested_vmx_handle_enlightened_vmptrld(struct 
> kvm_vcpu *vcpu,
>*/
>   vmx->nested.hv_evmcs->hv_clean_fields &=
>  

Re: [PATCH v2] KVM: x86: Modify struct kvm_nested_state to have explicit fields for data

2019-06-19 Thread Liran Alon



> On 19 Jun 2019, at 13:45, Paolo Bonzini  wrote:
> 
> On 19/06/19 00:36, Liran Alon wrote:
>> 
>> 
>>> On 18 Jun 2019, at 19:24, Paolo Bonzini  wrote:
>>> 
>>> From: Liran Alon 
>>> 
>>> Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
>>> of VMX nested state data in a struct.
>>> 
>>> In order to avoid changing the ioctl values of
>>> KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
>>> sizeof(struct kvm_nested_state). This is done by defining the data
>>> struct as "data.vmx[0]". It was the most elegant way I found to
>>> preserve struct size while still keeping struct readable and easy to
>>> maintain. It does have a misfortunate side-effect that now it has to be
>>> accessed as "data.vmx[0]" rather than just "data.vmx".
>>> 
>>> Because we are already modifying these structs, I also modified the
>>> following:
>>> * Define the "format" field values as macros.
>>> * Rename vmcs_pa to vmcs12_pa for better readability.
>>> 
>>> Signed-off-by: Liran Alon 
>>> [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
>> 
>> 1) Why should we remove SVM stubs? I think it makes the interface intention 
>> more clear.
>> Do you see any disadvantage of having them?
> 
> In its current state I think it would not require any state apart from
> the global flags, because MSRs can be extracted independent of
> KVM_GET_NESTED_STATE; this may change as things are cleaned up, but if
> that remains the case there would be no need for SVM structs at all.

Hmm yes I see your point. Ok I agree.

> 
>> 2) What is the advantage of defining a separate 
>> KVM_STATE_NESTED_VMX_VMCS12_SIZE
>> rather than just moving VMCS12_SIZE to userspace header?
> 
> It's just for namespace cleanliness.  I'm keeping VMCS12_SIZE for the
> arch/x86/kvm/vmx/ code because it's shorter and we're used to it, but
> userspace headers should use a more specific name.

Ok then.
I will submit my next version of QEMU patches according to this version of the 
headers.

Reviewed-by: Liran Alon 

> 
> Paolo



Re: [PATCH v2] KVM: x86: Modify struct kvm_nested_state to have explicit fields for data

2019-06-18 Thread Liran Alon



> On 18 Jun 2019, at 19:24, Paolo Bonzini  wrote:
> 
> From: Liran Alon 
> 
> Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
> of VMX nested state data in a struct.
> 
> In order to avoid changing the ioctl values of
> KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
> sizeof(struct kvm_nested_state). This is done by defining the data
> struct as "data.vmx[0]". It was the most elegant way I found to
> preserve struct size while still keeping struct readable and easy to
> maintain. It does have a misfortunate side-effect that now it has to be
> accessed as "data.vmx[0]" rather than just "data.vmx".
> 
> Because we are already modifying these structs, I also modified the
> following:
> * Define the "format" field values as macros.
> * Rename vmcs_pa to vmcs12_pa for better readability.
> 
> Signed-off-by: Liran Alon 
> [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]

1) Why should we remove SVM stubs? I think it makes the interface intention 
more clear.
Do you see any disadvantage of having them?

2) What is the advantage of defining a separate KVM_STATE_NESTED_VMX_VMCS12_SIZE
rather than just moving VMCS12_SIZE to userspace header?

-Liran

> Signed-off-by: Paolo Bonzini 
> ---



Re: [PATCH 16/43] KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01

2019-06-15 Thread Liran Alon
You should apply something like the following instead of the original fix by Sean
to play nicely on upstream without additional dependencies:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f1a69117ac0f..3fc44852ed4f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2234,12 +2234,9 @@ static void prepare_vmcs02_full(struct vcpu_vmx *vmx, 
struct vmcs12 *vmcs12)

set_cr4_guest_host_mask(vmx);

-   if (kvm_mpx_supported()) {
-   if (vmx->nested.nested_run_pending &&
-   (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
-   vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
-   else
-   vmcs_write64(GUEST_BNDCFGS, 
vmx->nested.vmcs01_guest_bndcfgs);
+   if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
+   (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) {
+   vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
}
 }

@@ -2283,6 +2280,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.vmcs01_debugctl);
}
+   if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
+   !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) {
+   vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+   }
vmx_set_rflags(vcpu, vmcs12->guest_rflags);

/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the

-Liran

> On 16 Jun 2019, at 1:16, Sasha Levin  wrote:
> 
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 62cf9bd8118c KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS.
> 
> The bot has tested the following trees: v5.1.9, v4.19.50.
> 
> v5.1.9: Build OK!
> v4.19.50: Failed to apply! Possible dependencies:
>09abb5e3e5e5 ("KVM: nVMX: call kvm_skip_emulated_instruction in 
> nested_vmx_{fail,succeed}")
>09abe3200266 ("KVM: nVMX: split pieces of prepare_vmcs02() to 
> prepare_vmcs02_early()")
>1438921c6dc1 ("KVM: nVMX: Flush TLB entries tagged by dest EPTP on L1<->L2 
> transitions")
>199b118ab3d5 ("KVM: VMX: Alphabetize the includes in vmx.c")
>1abf23fb42f5 ("KVM: nVMX: use vm_exit_controls_init() to write exit 
> controls for vmcs02")
>327c072187f7 ("KVM: nVMX: Flush linear and combined mappings on VPID02 
> related flushes")
>3d5bdae8b164 ("KVM: nVMX: Use correct VPID02 when emulating L1 INVVPID")
>3df5c37e55c8 ("KVM: nVMX: try to set EFER bits correctly when initializing 
> controls")
>55d2375e58a6 ("KVM: nVMX: Move nested code to dedicated files")
>5b8ba41dafd7 ("KVM: nVMX: move vmcs12 EPTP consistency check to 
> check_vmentry_prereqs()")
>609363cf81fc ("KVM: nVMX: Move vmcs12 code to dedicated files")
>75edce8a4548 ("KVM: VMX: Move eVMCS code to dedicated files")
>7671ce21b13b ("KVM: nVMX: move check_vmentry_postreqs() call to 
> nested_vmx_enter_non_root_mode()")
>945679e301ea ("KVM: nVMX: add enlightened VMCS state")
>a633e41e7362 ("KVM: nVMX: assimilate nested_vmx_entry_failure() into 
> nested_vmx_enter_non_root_mode()")
>a821bab2d1ee ("KVM: VMX: Move VMX specific files to a "vmx" subdirectory")
>b8bbab928fb1 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
>d63907dc7dd1 ("KVM: nVMX: rename enter_vmx_non_root_mode to 
> nested_vmx_enter_non_root_mode")
>efebf0aaec3d ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 
> uses VPID and EPT")
> 
> 
> How should we proceed with this patch?
> 
> --
> Thanks,
> Sasha



Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets

2019-06-13 Thread Liran Alon



> On 12 Jun 2019, at 21:25, Sean Christopherson 
>  wrote:
> 
> On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
>> The Linux kernel has a global address space that is the same for any
>> kernel code. This address space becomes a liability in a world with
>> processor information leak vulnerabilities, such as L1TF. With the right
>> cache load gadget, an attacker-controlled hyperthread pair can leak
>> arbitrary data via L1TF. Disabling hyperthreading is one recommended
>> mitigation, but it comes with a large performance hit for a wide range
>> of workloads.
>> 
>> An alternative mitigation is to not make certain data in the kernel
>> globally visible, but only when the kernel executes in the context of
>> the process where this data belongs to.
>> 
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. Page
>> tables and mappings in that region will be exclusive to one address
>> space, instead of implicitly shared between all kernel address spaces.
>> Any data placed in that region will be out of reach of cache load
>> gadgets that execute in different address spaces. To implement
>> process-local memory, we introduce a new interface kmalloc_proclocal() /
>> kfree_proclocal() that allocates and maps pages exclusively into the
>> current kernel address space. As a first use case, we move architectural
>> state of guest CPUs in KVM out of reach of other kernel address spaces.
> 
> Can you briefly describe what types of attacks this is intended to
> mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
> make comments based on my potentially bad assumptions.

I think I can assist in the explanation.

Consider the following scenario:
1) Hyperthread A in a CPU core runs in guest and triggers a VMExit which is
handled by the host kernel.
While hyperthread A runs the VMExit handler, it populates the CPU core cache /
internal resources (e.g. MDS buffers)
with some sensitive data it has speculatively/architecturally accessed.
2) While hyperthread A runs in the host kernel, hyperthread B on the same CPU core
runs in guest and uses
some CPU speculative-execution vulnerability to leak the sensitive host data
populated by hyperthread A
into the CPU core cache / internal resources.

Current CPU microcode mitigations (L1D/MDS flush) only handle the case of a 
single hyperthread and don’t
provide a mechanism to mitigate this hyperthreading attack scenario.

Assuming there is some guest-triggerable speculative-load gadget in some VMExit
path,
it can be used to force any data that is mapped into the kernel address space to be
loaded into a CPU resource that is subject to leak.
Therefore, there have been multiple attempts to reduce the sensitive information
mapped into the kernel address space
that is accessible to this VMExit path.

One attempt was XPFO, which attempts to remove from the kernel direct-map any page
that is currently used only by userspace.
Unfortunately, XPFO currently exhibits multiple performance issues that
make it impractical as far as I know.

Another attempt is this patch series, which removes the state of other guests'
vCPUs from the host kernel address space of a given vCPU thread.
That is very specific, but I personally
have additional ideas on how this patch series can be further used.
For example, vhost-net needs to kmap entire guest memory into kernel-space to
write ingress packet data into guest memory.
Thus, the vCPU thread's kernel address space now maps another guest's entire memory,
which can be leaked using the technique described above.
Therefore, it should be useful to also move this kmap() to the
process-local kernel virtual address region.

One could argue, however, that there is still a much bigger issue because of the
kernel direct-map, which maps all physical pages that the kernel
manages (i.e. that have a struct page) into the kernel virtual address space. All of
those pages can theoretically be leaked.
However, this could be handled by complementary techniques such as booting the host
kernel with “mem=X” and mapping guest memory
by directly mmap()ing the relevant portion of /dev/mem.
This is probably what AWS does, given these upstream KVM patches they have
contributed:
bd53cb35a3e9 X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
e45adf665a53 KVM: Introduce a new guest mapping API
0c55671f84ff kvm, x86: Properly check whether a pfn is an MMIO or not
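For illustration, the userspace side of such a setup could look roughly like this
(hypothetical helper; the physical base and size of the reserved region are made up
and would come from the host configuration):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Illustrative only: host booted with mem=X, so physical memory above X is
 * not managed by the kernel direct-map; map part of it as guest RAM. */
static void *map_guest_ram(off_t reserved_phys_base, size_t guest_size)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);

	if (fd < 0)
		return NULL;
	return mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, reserved_phys_base);
}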

Also note that when using such a “mem=X” technique, you can also avoid
performance penalties introduced by CPU microcode mitigations.
E.g. you can avoid doing an L1D flush on VMEntry if the VMExit handler ran only in
the kernel and didn’t context-switch, as you assume the kernel address
space doesn’t map any host-sensitive data.

It’s also worth mentioning that another alternative to this “mem=X” technique
that I have attempted
was to create an isolated address space that is only used when running KVM
VMExit handlers.
For more information, refer to:

Re: [PATCH] KVM: x86: move MSR_IA32_POWER_CTL handling to common code

2019-06-06 Thread Liran Alon


> On 6 Jun 2019, at 15:33, Paolo Bonzini  wrote:
> 
> Make it available to AMD hosts as well, just in case someone is trying
> to use an Intel processor's CPUID setup.

I’m actually quite surprised that such a setup works properly.

> 
> Suggested-by: Sean Christopherson 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Liran Alon 

> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/vmx/vmx.c  | 6 --
> arch/x86/kvm/vmx/vmx.h  | 2 --
> arch/x86/kvm/x86.c  | 6 ++
> 4 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a86026969b19..35e7937cc9ac 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -689,6 +689,7 @@ struct kvm_vcpu_arch {
>   u32 virtual_tsc_mult;
>   u32 virtual_tsc_khz;
>   s64 ia32_tsc_adjust_msr;
> + u64 msr_ia32_power_ctl;
>   u64 tsc_scaling_ratio;
> 
>   atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cccf73a91e88..5d903f8909d1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1695,9 +1695,6 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   case MSR_IA32_SYSENTER_ESP:
>   msr_info->data = vmcs_readl(GUEST_SYSENTER_ESP);
>   break;
> - case MSR_IA32_POWER_CTL:
> - msr_info->data = vmx->msr_ia32_power_ctl;
> - break;
>   case MSR_IA32_BNDCFGS:
>   if (!kvm_mpx_supported() ||
>   (!msr_info->host_initiated &&
> @@ -1828,9 +1825,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   case MSR_IA32_SYSENTER_ESP:
>   vmcs_writel(GUEST_SYSENTER_ESP, data);
>   break;
> - case MSR_IA32_POWER_CTL:
> - vmx->msr_ia32_power_ctl = data;
> - break;
>   case MSR_IA32_BNDCFGS:
>   if (!kvm_mpx_supported() ||
>   (!msr_info->host_initiated &&
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 61128b48c503..1cdaa5af8245 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -260,8 +260,6 @@ struct vcpu_vmx {
> 
>   unsigned long host_debugctlmsr;
> 
> - u64 msr_ia32_power_ctl;
> -
>   /*
>* Only bits masked by msr_ia32_feature_control_valid_bits can be set in
>* msr_ia32_feature_control. FEATURE_CONTROL_LOCKED is always included
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 145df9778ed0..5ec87ded17db 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2563,6 +2563,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   return 1;
>   vcpu->arch.smbase = data;
>   break;
> + case MSR_IA32_POWER_CTL:
> + vcpu->arch.msr_ia32_power_ctl = data;
> + break;
>   case MSR_IA32_TSC:
>   kvm_write_tsc(vcpu, msr_info);
>   break;
> @@ -2822,6 +2825,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   return 1;
>   msr_info->data = vcpu->arch.arch_capabilities;
>   break;
> + case MSR_IA32_POWER_CTL:
> + msr_info->data = vcpu->arch.msr_ia32_power_ctl;
> + break;
>   case MSR_IA32_TSC:
>   msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + 
> vcpu->arch.tsc_offset;
>   break;
> -- 
> 1.8.3.1
> 



Re: [PATCH v2 2/3] KVM: X86: Implement PV sched yield hypercall

2019-05-29 Thread Liran Alon



> On 28 May 2019, at 3:53, Wanpeng Li  wrote:
> 
> From: Wanpeng Li 
> 
> The target vCPUs are in runnable state after vcpu_kick and suitable 
> as a yield target. This patch implements the sched yield hypercall.
> 
> 17% performance increase of ebizzy benchmark can be observed in an 
> over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush 
> call-function IPI-many since call-function is not easy to be triggered 
> by userspace workload).
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
> arch/x86/kvm/x86.c | 24 
> 1 file changed, 24 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e7e57de..2ceef51 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7172,6 +7172,26 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
>   kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
> }
> 
> +void kvm_sched_yield(struct kvm *kvm, u64 dest_id)
> +{
> + struct kvm_vcpu *target;
> + struct kvm_apic_map *map;
> +
> + rcu_read_lock();
> + map = rcu_dereference(kvm->arch.apic_map);
> +
> + if (unlikely(!map))
> + goto out;
> +

We should have a bounds-check here on “dest_id”.
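e.g. something like this (untested sketch; assuming max_apic_id is the right
bound tracked in struct kvm_apic_map):

	/* dest_id comes from the guest, so validate it before indexing. */
	if (dest_id > map->max_apic_id)
		goto out;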

-Liran

> + if (map->phys_map[dest_id]->vcpu) {
> + target = map->phys_map[dest_id]->vcpu;
> + kvm_vcpu_yield_to(target);
> + }
> +
> +out:
> + rcu_read_unlock();
> +}
> +
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
>   unsigned long nr, a0, a1, a2, a3, ret;
> @@ -7218,6 +7238,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   case KVM_HC_SEND_IPI:
>   ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
>   break;
> + case KVM_HC_SCHED_YIELD:
> + kvm_sched_yield(vcpu->kvm, a0);
> + ret = 0;
> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> -- 
> 2.7.4
> 



Re: Question about MDS mitigation

2019-05-16 Thread Liran Alon
Indeed, those CPU resources are shared between sibling hyperthreads on the same CPU
core.
There is currently no mechanism merged upstream that completely mitigates
SMT-enabled scenarios.
Note that this is also true for L1TF.

There are several proposals to address this, but they are still in early research
mode.
For example, see this KVM address space isolation patch series developed by
myself and Alexandre:
https://lkml.org/lkml/2019/5/13/515
(which should be integrated with a mechanism that kicks sibling hyperthreads
when switching from the KVM isolated address space to the full kernel address space)
This partially mimics Microsoft work regarding HyperClear which you can read 
more about it here:
https://techcommunity.microsoft.com/t5/Virtualization/Hyper-V-HyperClear-Mitigation-for-L1-Terminal-Fault/ba-p/382429

-Liran

> On 16 May 2019, at 5:42, wencongyang (A)  wrote:
> 
> Hi all
> 
> Fill buffers, load ports are shared between threads on the same physical core.
> We need to run more than one vm on the same physical core.
> Is there any complete mitigation for environments utilizing SMT?
> 



Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-14 Thread Liran Alon



> On 14 May 2019, at 5:07, Andy Lutomirski  wrote:
> 
> On Mon, May 13, 2019 at 2:09 PM Liran Alon  wrote:
>> 
>> 
>> 
>>> On 13 May 2019, at 21:17, Andy Lutomirski  wrote:
>>> 
>>>> I expect that the KVM address space can eventually be expanded to include
>>>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>>>> to the process userland page table (which should be safe to do because the
>>>> KVM address space doesn't have any secret), we could potentially handle the
>>>> KVM ioctl without having to switch to the kernel pagetable (thus 
>>>> effectively
>>>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>>>> to be handled using the full kernel address space.
>>>> 
>>> 
>>> In the hopefully common case where a VM exits and then gets re-entered
>>> without needing to load full page tables, what code actually runs?
>>> I'm trying to understand when the optimization of not switching is
>>> actually useful.
>>> 
>>> Allowing ioctl() without switching to kernel tables sounds...
>>> extremely complicated.  It also makes the dubious assumption that user
>>> memory contains no secrets.
>> 
>> Let me attempt to clarify what we were thinking when creating this patch 
>> series:
>> 
>> 1) It is never safe to execute one hyperthread inside guest while it’s 
>> sibling hyperthread runs in a virtual address space which contains secrets 
>> of host or other guests.
>> This is because we assume that using some speculative gadget (such as 
>> half-Spectrev2 gadget), it will be possible to populate *some* CPU core 
>> resource which could then be *somehow* leaked by the hyperthread running 
>> inside guest. In case of L1TF, this would be data populated to the L1D cache.
>> 
>> 2) Because of (1), every time a hyperthread runs inside host kernel, we must 
>> make sure it’s sibling is not running inside guest. i.e. We must kick the 
>> sibling hyperthread outside of guest using IPI.
>> 
>> 3) From (2), we should have theoretically deduced that for every #VMExit, 
>> there is a need to kick the sibling hyperthread also outside of guest until 
>> the #VMExit is completed. Such a patch series was implemented at some point 
>> but it had (obviously) significant performance hit.
>> 
>> 
> 4) The main goal of this patch series is to preserve (2), but to avoid
> the overhead specified in (3).
>> 
>> The way this patch series achieves (4) is by observing that during the run 
>> of a VM, most #VMExits can be handled rather quickly and locally inside KVM 
>> and doesn’t need to reference any data that is not relevant to this VM or 
>> KVM code. Therefore, if we will run these #VMExits in an isolated virtual 
>> address space (i.e. KVM isolated address space), there is no need to kick 
>> the sibling hyperthread from guest while these #VMExits handlers run.
> 
> Thanks!  This clarifies a lot of things.
> 
>> The hope is that the very vast majority of #VMExit handlers will be able to 
>> completely run without requiring to switch to full address space. Therefore, 
>> avoiding the performance hit of (2).
>> However, for the very few #VMExits that does require to run in full kernel 
>> address space, we must first kick the sibling hyperthread outside of guest 
>> and only then switch to full kernel address space and only once all 
>> hyperthreads return to KVM address space, then allow then to enter into 
>> guest.
> 
> What exactly does "kick" mean in this context?  It sounds like you're
> going to need to be able to kick sibling VMs from extremely atomic
> contexts like NMI and MCE.

Yes that’s true.
“kick” in this context will probably mean sending an IPI to all sibling 
hyperthreads.
This IPI will cause these sibling hyperthreads to exit from guest to host on 
EXTERNAL_INTERRUPT
and wait for a condition that again allows them to enter back into guest.
This condition will be met once all hyperthreads of the CPU core are again 
running only within the KVM isolated address space of this VM.
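
To make this concrete, here is a minimal, hypothetical sketch of such a
kick-and-wait scheme (not code from the patch series; the per-core state and
the ipi_sibling_hyperthreads() helper are made up purely for illustration):

/* Hypothetical per-core bookkeeping: how many SMT siblings of this core are
 * currently running in the KVM isolated address space (and thus may be in
 * guest). Not actual KVM code - only to illustrate the "kick" semantics. */
struct core_isolation_state {
        atomic_t        siblings_in_isolated_mm;
        int             nr_siblings;
};

/* Called by a hyperthread right before it switches from the KVM isolated
 * address space to the full kernel address space. */
static void kvm_isolation_kick_siblings(struct core_isolation_state *core)
{
        /* The matching atomic_inc() happens when this hyperthread switches
         * back to the KVM isolated address space (not shown). */
        atomic_dec(&core->siblings_in_isolated_mm);

        /* Hypothetical helper: IPI the sibling hyperthreads. Receiving the
         * IPI forces a VM-exit on EXTERNAL_INTERRUPT, so the siblings stop
         * executing guest instructions. */
        ipi_sibling_hyperthreads(core);
}

/* Called right before VM-entry: only enter guest once every sibling of this
 * core is again running in the KVM isolated address space of this VM. */
static void kvm_isolation_wait_for_siblings(struct core_isolation_state *core)
{
        while (atomic_read(&core->siblings_in_isolated_mm) != core->nr_siblings)
                cpu_relax();
}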

-Liran





Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-14 Thread Liran Alon



> On 14 May 2019, at 10:29, Peter Zijlstra  wrote:
> 
> 
> (please, wrap our emails at 78 chars)
> 
> On Tue, May 14, 2019 at 12:08:23AM +0300, Liran Alon wrote:
> 
>> 3) From (2), we should have theoretically deduced that for every
>> #VMExit, there is a need to kick the sibling hyperthread also outside
>> of guest until the #VMExit is completed.
> 
> That's not in fact quite true; all you have to do is send the IPI.
> Having one sibling IPI the other sibling carries enough guarantees that
> the receiving sibling will not execute any further guest instructions.
> 
> That is, you don't have to wait on the VMExit to complete; you can just
> IPI and get on with things. Now, this is still expensive, But it is
> heaps better than doing a full sync up between siblings.
> 

I agree.

I didn’t say you need to do a full sync. You just need to IPI the sibling
hyperthreads before switching to the full kernel address space.
But you need to make sure these sibling hyperthreads don’t get back into
the guest until all hyperthreads are running with the KVM isolated address space.

It is still very expensive if done for every #VMExit, which, as I explained,
can be avoided if we use the KVM isolated address space technique.

-Liran



Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-13 Thread Liran Alon



> On 14 May 2019, at 0:42, Nakajima, Jun  wrote:
> 
> 
> 
>> On May 13, 2019, at 2:16 PM, Liran Alon  wrote:
>> 
>>> On 13 May 2019, at 22:31, Nakajima, Jun  wrote:
>>> 
>>> On 5/13/19, 7:43 AM, "kvm-ow...@vger.kernel.org on behalf of Alexandre 
>>> Chartre" wrote:
>>> 
>>>   Proposal
>>>   
>>> 
>>>   To handle both these points, this series introduce the mechanism of KVM
>>>   address space isolation. Note that this mechanism completes (a)+(b) and
>>>   don't contradict. In case this mechanism is also applied, (a)+(b) should
>>>   still be applied to the full virtual address space as a defence-in-depth).
>>> 
>>>   The idea is that most of KVM #VMExit handlers code will run in a special
>>>   KVM isolated address space which maps only KVM required code and per-VM
>>>   information. Only once KVM needs to architectually access other 
>>> (sensitive)
>>>   data, it will switch from KVM isolated address space to full standard
>>>   host address space. At this point, KVM will also need to kick all sibling
>>>   hyperthreads to get out of guest (note that kicking all sibling 
>>> hyperthreads
>>>   is not implemented in this serie).
>>> 
>>>   Basically, we will have the following flow:
>>> 
>>> - qemu issues KVM_RUN ioctl
>>> - KVM handles the ioctl and calls vcpu_run():
>>>   . KVM switches from the kernel address to the KVM address space
>>>   . KVM transfers control to VM (VMLAUNCH/VMRESUME)
>>>   . VM returns to KVM
>>>   . KVM handles VM-Exit:
>>> . if handling need full kernel then switch to kernel address space
>>> . else continues with KVM address space
>>>   . KVM loops in vcpu_run() or return
>>> - KVM_RUN ioctl returns
>>> 
>>>   So, the KVM_RUN core function will mainly be executed using the KVM 
>>> address
>>>   space. The handling of a VM-Exit can require access to the kernel space
>>>   and, in that case, we will switch back to the kernel address space.
>>> 
>>> Once all sibling hyperthreads are in the host (either using the full kernel 
>>> address space or user address space), what happens to the other sibling 
>>> hyperthreads if one of them tries to do VM entry? That VCPU will switch to 
>>> the KVM address space prior to VM entry, but others continue to run? Do you 
>>> think (a) + (b) would be sufficient for that case?
>> 
>> The description here is missing and important part: When a hyperthread needs 
>> to switch from KVM isolated address space to kernel full address space, it 
>> should first kick all sibling hyperthreads outside of guest and only then 
>> safety switch to full kernel address space. Only once all sibling 
>> hyperthreads are running with KVM isolated address space, it is safe to 
>> enter guest.
>> 
> 
> Okay, it makes sense. So, it will require some synchronization among the 
> siblings there.

Definitely.
Currently the kicking of sibling hyperthreads is not yet integrated with this 
patch series. But it should be at some point.

-Liran

> 
>> The main point of this address space is to avoid kicking all sibling 
>> hyperthreads on *every* VMExit from guest but instead only kick them when 
>> switching address space. The assumption is that the vast majority of exits 
>> can be handled in KVM isolated address space and therefore do not require to 
>> kick the sibling hyperthreads outside of guest.
> 
> 
> ---
> Jun
> Intel Open Source Technology Center



Re: [RFC KVM 24/27] kvm/isolation: KVM page fault handler

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 18:15, Peter Zijlstra  wrote:
> 
> On Mon, May 13, 2019 at 04:38:32PM +0200, Alexandre Chartre wrote:
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 46df4c6..317e105 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -33,6 +33,10 @@
>> #define CREATE_TRACE_POINTS
>> #include 
>> 
>> +bool (*kvm_page_fault_handler)(struct pt_regs *regs, unsigned long 
>> error_code,
>> +   unsigned long address);
>> +EXPORT_SYMBOL(kvm_page_fault_handler);
> 
> NAK NAK NAK NAK
> 
> This is one of the biggest anti-patterns around.

I agree.
I think that mm should expose a mm_set_kvm_page_fault_handler() or something 
(give it a better name), similar to how arch/x86/kernel/irq.c has 
kvm_set_posted_intr_wakeup_handler().
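
Something along these lines is what I have in mind (a sketch only; apart from
the name I suggested above, nothing here exists today):

/* Sketch for arch/x86/mm/fault.c: keep the handler pointer private to mm and
 * let KVM register/unregister it, instead of exporting the raw variable. */
typedef bool (*kvm_page_fault_handler_t)(struct pt_regs *regs,
                                         unsigned long error_code,
                                         unsigned long address);

static kvm_page_fault_handler_t kvm_page_fault_handler;

void mm_set_kvm_page_fault_handler(kvm_page_fault_handler_t handler)
{
        /* Passing NULL unregisters the handler (e.g. on kvm module unload). */
        WRITE_ONCE(kvm_page_fault_handler, handler);
}
EXPORT_SYMBOL_GPL(mm_set_kvm_page_fault_handler);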

-Liran




Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 22:31, Nakajima, Jun  wrote:
> 
> On 5/13/19, 7:43 AM, "kvm-ow...@vger.kernel.org on behalf of Alexandre 
> Chartre" wrote:
> 
>Proposal
>
> 
>To handle both these points, this series introduce the mechanism of KVM
>address space isolation. Note that this mechanism completes (a)+(b) and
>don't contradict. In case this mechanism is also applied, (a)+(b) should
>still be applied to the full virtual address space as a defence-in-depth).
> 
>The idea is that most of KVM #VMExit handlers code will run in a special
>KVM isolated address space which maps only KVM required code and per-VM
>information. Only once KVM needs to architectually access other (sensitive)
>data, it will switch from KVM isolated address space to full standard
>host address space. At this point, KVM will also need to kick all sibling
>hyperthreads to get out of guest (note that kicking all sibling 
> hyperthreads
>is not implemented in this serie).
> 
>Basically, we will have the following flow:
> 
>  - qemu issues KVM_RUN ioctl
>  - KVM handles the ioctl and calls vcpu_run():
>. KVM switches from the kernel address to the KVM address space
>. KVM transfers control to VM (VMLAUNCH/VMRESUME)
>. VM returns to KVM
>. KVM handles VM-Exit:
>  . if handling need full kernel then switch to kernel address space
>  . else continues with KVM address space
>. KVM loops in vcpu_run() or return
>  - KVM_RUN ioctl returns
> 
>So, the KVM_RUN core function will mainly be executed using the KVM address
>space. The handling of a VM-Exit can require access to the kernel space
>and, in that case, we will switch back to the kernel address space.
> 
> Once all sibling hyperthreads are in the host (either using the full kernel 
> address space or user address space), what happens to the other sibling 
> hyperthreads if one of them tries to do VM entry? That VCPU will switch to 
> the KVM address space prior to VM entry, but others continue to run? Do you 
> think (a) + (b) would be sufficient for that case?

The description here is missing an important part: when a hyperthread needs to 
switch from the KVM isolated address space to the full kernel address space, it 
should first kick all sibling hyperthreads outside of guest and only then safely 
switch to the full kernel address space. Only once all sibling hyperthreads are 
running with the KVM isolated address space is it safe to enter guest.

The main point of this address space is to avoid kicking all sibling 
hyperthreads on *every* VMExit from guest, but instead only kick them when 
switching address space. The assumption is that the vast majority of exits can 
be handled in the KVM isolated address space and therefore do not require kicking 
the sibling hyperthreads outside of guest.

-Liran

> 
> ---
> Jun
> Intel Open Source Technology Center
> 
> 



Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 21:17, Andy Lutomirski  wrote:
> 
>> I expect that the KVM address space can eventually be expanded to include
>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>> to the process userland page table (which should be safe to do because the
>> KVM address space doesn't have any secret), we could potentially handle the
>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>> to be handled using the full kernel address space.
>> 
> 
> In the hopefully common case where a VM exits and then gets re-entered
> without needing to load full page tables, what code actually runs?
> I'm trying to understand when the optimization of not switching is
> actually useful.
> 
> Allowing ioctl() without switching to kernel tables sounds...
> extremely complicated.  It also makes the dubious assumption that user
> memory contains no secrets.

Let me attempt to clarify what we were thinking when creating this patch series:

1) It is never safe to execute one hyperthread inside guest while its sibling 
hyperthread runs in a virtual address space which contains secrets of host or 
other guests.
This is because we assume that using some speculative gadget (such as 
half-Spectrev2 gadget), it will be possible to populate *some* CPU core 
resource which could then be *somehow* leaked by the hyperthread running inside 
guest. In case of L1TF, this would be data populated to the L1D cache.

2) Because of (1), every time a hyperthread runs inside host kernel, we must 
make sure its sibling is not running inside guest. i.e. We must kick the 
sibling hyperthread outside of guest using IPI.

3) From (2), we should have theoretically deduced that for every #VMExit, there 
is a need to kick the sibling hyperthread also outside of guest until the 
#VMExit is completed. Such a patch series was implemented at some point but it 
had (obviously) a significant performance hit.

4) The main goal of this patch series is to preserve (2), but to avoid the 
overhead specified in (3).

The way this patch series achieves (4) is by observing that during the run of a 
VM, most #VMExits can be handled rather quickly and locally inside KVM and 
don’t need to reference any data that is not relevant to this VM or KVM code. 
Therefore, if we run these #VMExits in an isolated virtual address space 
(i.e. the KVM isolated address space), there is no need to kick the sibling 
hyperthread from guest while these #VMExit handlers run.
The hope is that the very vast majority of #VMExit handlers will be able to 
completely run without requiring a switch to the full address space, therefore 
avoiding the performance hit of (2).
However, for the very few #VMExits that do require running in the full kernel 
address space, we must first kick the sibling hyperthread outside of guest, 
only then switch to the full kernel address space, and only once all 
hyperthreads return to the KVM address space allow them to enter guest again.

For this reason, I think the above paragraph (that was added to my original 
cover letter) is incorrect.
I believe that we should by design treat all exits to the userspace VMM 
(e.g. QEMU) as a slow-path that should not be optimised, and for which it is 
therefore OK to switch address space (and therefore also kick the sibling 
hyperthreads). Similarly, all ioctl handlers are also slow-path, so it should 
be OK for them to also not run in the KVM isolated address space.
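
To make the intended fast-path/slow-path split concrete, here is a rough
pseudo-code sketch of the flow (every helper name below is hypothetical; this
is not the actual patch code):

/* Rough sketch of the intended vcpu_run() flow with a KVM isolated address
 * space. All helpers below are made up and only illustrate the idea. */
static void vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
        kvm_isolation_enter(vcpu);              /* switch to KVM isolated mm */

        for (;;) {
                vmenter(vcpu);                  /* VMLAUNCH/VMRESUME */

                if (handle_exit_isolated(vcpu))
                        continue;               /* fast path: no sibling kick */

                /*
                 * Slow path: the exit handler needs data outside the isolated
                 * address space, so first kick sibling hyperthreads out of
                 * guest, then switch to the full kernel address space.
                 */
                kick_sibling_hyperthreads(vcpu);
                kvm_isolation_exit(vcpu);
                handle_exit_full(vcpu);

                if (need_userspace_exit(vcpu))
                        break;                  /* KVM_RUN returns to QEMU */

                kvm_isolation_enter(vcpu);
                /* VM-entry then waits until all siblings of this core are
                 * back in the KVM isolated address space (not shown). */
        }
}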

-Liran











Re: [RFC KVM 00/27] KVM Address Space Isolation

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 17:38, Alexandre Chartre  
> wrote:
> 
> Hi,
> 
> This series aims to introduce the concept of KVM address space isolation.
> This is done as part of the upstream community effort to have exploit
> mitigations for CPU info-leaks vulnerabilities such as L1TF. 
> 
> These patches are based on an original patches from Liran Alon, completed
> with additional patches to effectively create KVM address space different
> from the full kernel address space.

Great job pushing this forward! Thank you!

> 
> The current code is just an early POC, and it is not fully stable at the
> moment (unfortunately you can expect crashes/hangs, see the "Issues"
> section below). However I would like to start a discussion get feedback
> and opinions about this approach.
> 
> Context
> ===
> 
> The most naive approach to handle L1TF SMT-variant exploit is to just disable
> hyper-threading. But that is not practical for public cloud providers. As a
> second next best alternative, there is an approach to combine coscheduling
> together with flushing L1D cache on every VMEntry. By coscheduling I refer
> to some mechanism which on every VMExit from guest, kicks all sibling
> hyperthreads from guest aswell.
> 
> However, this approach have some open issues:
> 
> 1. Kicking all sibling hyperthreads for every VMExit have significant
>   performance hit for some compute shapes (e.g. Emulated and PV).
> 
> 2. It assumes only CPU core resource which could be leaked by some
>   vulnerability is L1D cache. But future vulnerabilities may also be able
>   to leak other CPU core resources. Therefore, we would prefer to have a
>   mechanism which prevents these resources to be able to be loaded with
>   sensitive data to begin with.
> 
> To better address (2), upstream community has discussed some mechanisms
> related to reducing data that is mapped on kernel virtual address space.
> Specifically:
> 
> a. XPFO: Removes from physmap pages that currently should only be accessed
>   by userspace.
> 
> b. Process-local memory allocations: Allows having a memory area in kernel
>   virtual address space that maps different content per-process. Then,
>   allocations made on this memory area can be hidden from other tasks in
>   the system running in kernel space. Most obvious use it to allocate
>   there per-vCPU and per-VM KVM structures.
> 
> However, both (a)+(b) work in a black-list approach (where we decide which
> data is considered dangerous and remove it from kernel virtual address
> space) and don't address performance hit described at (1).

+Cc Stefan from AWS and Kaya from Google.
(I have sent them my original patch series to review and to discuss this 
subject with them.)
Stefan: Do you know Julian's current email address so we can Cc him as well?

> 
> 
> Proposal
> 
> 
> To handle both these points, this series introduce the mechanism of KVM
> address space isolation. Note that this mechanism completes (a)+(b) and
> don't contradict. In case this mechanism is also applied, (a)+(b) should
> still be applied to the full virtual address space as a defence-in-depth).
> 
> The idea is that most of KVM #VMExit handlers code will run in a special
> KVM isolated address space which maps only KVM required code and per-VM
> information. Only once KVM needs to architectually access other (sensitive)
> data, it will switch from KVM isolated address space to full standard
> host address space. At this point, KVM will also need to kick all sibling
> hyperthreads to get out of guest (note that kicking all sibling hyperthreads
> is not implemented in this serie).
> 
> Basically, we will have the following flow:
> 
>  - qemu issues KVM_RUN ioctl
>  - KVM handles the ioctl and calls vcpu_run():
>. KVM switches from the kernel address to the KVM address space
>. KVM transfers control to VM (VMLAUNCH/VMRESUME)
>. VM returns to KVM
>. KVM handles VM-Exit:
>  . if handling need full kernel then switch to kernel address space

*AND* kick sibling hyperthreads before switching to that address space.
I think it’s important to emphasise that one of the main points of this KVM 
address space isolation mechanism is to minimise the number of times we need to 
kick sibling hyperthreads outside of guest, hopefully by having the vast 
majority of VMExits handled in the KVM isolated address space.

>  . else continues with KVM address space
>. KVM loops in vcpu_run() or return
>  - KVM_RUN ioctl returns
> 
> So, the KVM_RUN core function will mainly be executed using the KVM address
> space. The handling of a VM-Exit can require access to the kernel space
> and, in that case, we will switch back to the kernel address space.
>

Re: [RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 18:15, Peter Zijlstra  wrote:
> 
> On Mon, May 13, 2019 at 04:38:09PM +0200, Alexandre Chartre wrote:
>> From: Liran Alon 
>> 
>> Export symbols needed to create, manage, populate and switch
>> a mm from a kernel module (kvm in this case).
>> 
>> This is a hacky way for now to start.
>> This should be changed to some suitable memory-management API.
> 
> This should not be exported at all, ever, end of story.
> 
> Modules do not get to play with address spaces like that.

I agree… No doubt about that. This should never be merged like this.
It’s just an initial PoC of the concept so we can:
1) Measure the performance impact of the concept.
2) Get feedback on an appropriate design and APIs from the community (a rough 
sketch of the kind of API I have in mind is below).
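
Purely to illustrate the direction for (2) -- none of these functions exist,
all names here are invented:

/* Hypothetical mm-provided interface for the isolated address space, instead
 * of exporting internal mm symbols to a module. */
struct mm_struct *kvm_isolated_mm_create(void);
void kvm_isolated_mm_destroy(struct mm_struct *mm);

/* Map a kernel virtual address range into the isolated address space. */
int kvm_isolated_mm_map_range(struct mm_struct *mm, unsigned long addr,
                              size_t size);

/* Switch the current CPU to/from the isolated address space. */
void kvm_isolated_mm_enter(struct mm_struct *mm);
void kvm_isolated_mm_exit(void);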

-Liran



Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

2019-04-01 Thread Liran Alon



> On 1 Apr 2019, at 11:39, Vitaly Kuznetsov  wrote:
> 
> Paolo Bonzini  writes:
> 
>> On 29/03/19 16:32, Liran Alon wrote:
>>> Paolo I am not sure this is the case here. Please read my other
>>> replies in this email thread.
>>> 
>>> I think this is just a standard issue of a level-triggered interrupt
>>> handler in L1 (Hyper-V) that performs EOI before it lowers the
>>> irq-line. I don’t think vector 96 is even related to the issue at
>>> hand here. This is why after it was already handled, the loop of
>>> EXTERNAL_INTERRUPT happens on vector 80 and not vector 96.
>> 
>> Hmm... Vitaly, what machine were you testing on---does it have APIC-v?
>> If not, then you should have seen either an EOI for irq 96 or a TPR
>> below threshold vmexit.  However, if it has APIC-v then you wouldn't
>> have seen any of this (you only see the EOI for irq 80 because it's
>> level triggered) and Liran is probably right.
>> 
> 
> It does, however, the issue is reproducible with and without
> it. Moreover, I think the second simultaneous IRQ is just a red herring;
> Here is another trace (enable_apicv). Posting it non-stripped and hope
> your eyes will catch something I'm missing:
> 
> [001] 513675.736316: kvm_exit: reason VMRESUME rip 
> 0xf80002cae115 info 0 0
> [001] 513675.736321: kvm_entry:vcpu 0
> [001] 513675.736565: kvm_exit: reason EXTERNAL_INTERRUPT rip 
> 0xf80362dcd26d info 0 80ec
> [001] 513675.736566: kvm_nested_vmexit:rip f80362dcd26d reason 
> EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80ec int_info_err 0
> [001] 513675.736568: kvm_entry:vcpu 0
> [001] 513675.736650: kvm_exit: reason EPT_VIOLATION rip 
> 0xf80362dcd230 info 182 0
> [001] 513675.736651: kvm_nested_vmexit:rip f80362dcd230 reason 
> EPT_VIOLATION info1 182 info2 0 int_info 0 int_info_err 0
> [001] 513675.736651: kvm_page_fault:   address 26120 error_code 182
> 
> -> injecting
> 
> [008] 513675.737059: kvm_set_irq:  gsi 23 level 1 source 0
> [008] 513675.737061: kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
> [008] 513675.737062: kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737233: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT 
> info1 0 info2 0 int_info 8050 int_info_err 0
> [001] 513675.737239: kvm_entry:vcpu 0
> [001] 513675.737243: kvm_exit: reason EOI_INDUCED rip 
> 0xf80002c85e1a info 50 0
> 
> -> immediate EOI causing re-injection (even preemption timer is not
> involved here).
> 
> [001] 513675.737244: kvm_eoi:  apicid 0 vector 80
> [001] 513675.737245: kvm_fpu:  unload
> [001] 513675.737246: kvm_userspace_exit:   reason KVM_EXIT_IOAPIC_EOI (26)
> [001] 513675.737256: kvm_set_irq:  gsi 23 level 1 source 0
> [001] 513675.737259: kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
> [001] 513675.737260: kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737264: kvm_fpu:  load
> [001] 513675.737265: kvm_entry:vcpu 0
> [001] 513675.737271: kvm_exit: reason VMRESUME rip 
> 0xf80002cae115 info 0 0
> [001] 513675.737278: kvm_entry:vcpu 0
> [001] 513675.737282: kvm_exit: reason PREEMPTION_TIMER rip 
> 0xf80362dcc2d0 info 0 0
> [001] 513675.737283: kvm_nested_vmexit:rip f80362dcc2d0 reason 
> PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
> [001] 513675.737285: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT 
> info1 0 info2 0 int_info 8050 int_info_err 0
> [001] 513675.737289: kvm_entry:vcpu 0
> [001] 513675.737293: kvm_exit: reason EOI_INDUCED rip 
> 0xf80002c85e1a info 50 0
> [001] 513675.737293: kvm_eoi:  apicid 0 vector 80
> [001] 513675.737294: kvm_fpu:  unload
> [001] 513675.737295: kvm_userspace_exit:   reason KVM_EXIT_IOAPIC_EOI (26)
> [001] 513675.737299: kvm_set_irq:  gsi 23 level 1 source 0
> [001] 513675.737299: kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
> [001] 513675.737300: kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737302: kvm_fpu:  load
> [001] 513675.737303: kvm_entry:vcpu 0
> [001] 513675.737307: kvm_exit: reason VMRESUME rip 
> 0xf80002cae115 info 0 0
> 
> ...
> 
> -- 
> Vitaly

So to sum up: this matches what I mentioned in my previous emails, right?
That vector 96 is not related, and the only issue here is that the level-triggered 
interrupt handler for vector 80 is doing EOI before lowering the irq-line.
Which cause vector 

Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

2019-03-29 Thread Liran Alon



> On 29 Mar 2019, at 18:01, Paolo Bonzini  wrote:
> 
> On 29/03/19 15:40, Vitaly Kuznetsov wrote:
>> Paolo Bonzini  writes:
>> 
>>> On 28/03/19 21:31, Vitaly Kuznetsov wrote:
 
 The 'hang' scenario develops like this:
 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
 of them is level-triggered. KVM injects the edge-triggered one and
 requests immediate exit to inject the level-triggered:
 
 kvm_set_irq:  gsi 23 level 1 source 0
 kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
 kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
 kvm_msi_set_irq:  dst 0 vec 96 (Fixed|physical|edge)
 kvm_apic_accept_irq:  apicid 0 vec 96 (Fixed|edge)
 kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 
 int_info 8060 int_info_err 0
 
 2) Hyper-V requires one of its VMs to run to handle the situation but
 immediate exit happens:
 
 kvm_entry:vcpu 0
 kvm_exit: reason VMRESUME rip 0xf80006a40115 info 0 0
 kvm_entry:vcpu 0
 kvm_exit: reason PREEMPTION_TIMER rip 0xf8022f3d8350 info 
 0 0
 kvm_nested_vmexit:rip f8022f3d8350 reason PREEMPTION_TIMER info1 0 
 info2 0 int_info 0 int_info_err 0
 kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 
 int_info 8050 int_info_err 0
>>> 
>>> I supposed before this there was an eoi for vector 96?
>> 
>> AFAIR: no, it seems that it is actually the VM it is trying to resume
>> (Windows partition?) which needs to do some work and with the preemtion
>> timer of 0 we don't allow it to.
> 
> kvm_apic_accept_irq placed IRQ 96 in IRR, and Hyper-V should be running
> with "acknowledge interrupt on exit" since int_info is nonzero in
> kvm_nested_vmexit_inject.
> 
> Therefore, at the kvm_nested_vmexit_inject tracepoint KVM should have
> set bit 96 in ISR; and because PPR is now 96, interrupt 80 should have
> never been delivered.  Unless 96 is an auto-EOI interrupt, in which case
> this comment would apply
> 
>  /*
>   * For auto-EOI interrupts, there might be another pending
>   * interrupt above PPR, so check whether to raise another
>   * KVM_REQ_EVENT.
>   */
> 
> IIRC there was an enlightenment to tell Windows "I support auto-EOI but
> please don't use it".  If this is what's happening, that would also fix it.
> 
> Thanks,
> 
> Paolo

Paolo, I am not sure this is the case here.
Please read my other replies in this email thread.

I think this is just a standard issue of a level-triggered interrupt handler in 
L1 (Hyper-V) that performs EOI before it lowers the irq-line.
I don’t think vector 96 is even related to the issue at hand here. This is why 
after it was already handled, the loop of EXTERNAL_INTERRUPT
happens on vector 80 and not vector 96.

In addition, there is a missing optimisation in Hyper-V: after it handles 
an EXTERNAL_INTERRUPT exit, it doesn’t enable interrupts to receive other 
pending host interrupts (in our case, the pending vector 80), and will 
therefore only receive it once it enters back into L2, which will cause 
another EXTERNAL_INTERRUPT exit, but this time on vector 80.
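
For reference, the pattern I have in mind is roughly what KVM's own run loop
does (a loose sketch based on the vcpu_enter_guest() flow mentioned above, not
exact code):

/* Loose sketch of the pattern: after an EXTERNAL_INTERRUPT exit, dispatch the
 * acknowledged vector, then re-enable interrupts before resuming the guest so
 * any other pending host interrupt (vector 80 here) is delivered now instead
 * of forcing yet another exit. */
kvm_x86_ops->handle_external_intr(vcpu);        /* dispatches vector 96 */
local_irq_enable();                             /* lets pending vector 80 in */
/* ... only then resume L2 ... */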

-Liran




Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

2019-03-29 Thread Liran Alon



> On 29 Mar 2019, at 12:14, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov  wrote:
>>> 
>>> This is embarassing but we have another Windows/Hyper-V issue to workaround
>>> in KVM (or QEMU). Hope "RFC" makes it less offensive.
>>> 
>>> It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
>>> 'piix4-usb-uhci' device is attached. The problem with this device is that
>>> it uses level-triggered interrupts.
>>> 
>>> The 'hang' scenario develops like this:
>>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>>> of them is level-triggered. KVM injects the edge-triggered one and
>>> requests immediate exit to inject the level-triggered:
>>> 
>>> kvm_set_irq:  gsi 23 level 1 source 0
>>> kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
>>> kvm_msi_set_irq:  dst 0 vec 96 (Fixed|physical|edge)
>>> kvm_apic_accept_irq:  apicid 0 vec 96 (Fixed|edge)
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 
>>> int_info 8060 int_info_err 0
>> 
>> There is no immediate-exit here.
>> Here QEMU just set two pending irqs: vector 80 and vector 96.
>> Because vCPU 0 is running at non-root-mode, KVM emulates an exit from L2 to 
>> L1 on EXTERNAL_INTERRUPT.
>> Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96 which is the 
>> higher vector which is pending which is correct.
>> 
>> BTW, I don’t know why both are set in LAPIC as edge-triggered and not 
>> level-triggered.
>> But it can be seen from trace pattern that these interrupts are both 
>> level-triggered. (See QEMU’s ioapic_service()).
>> How did you deduce that one is edge-triggered and the other is
>> level-triggered?
> 
> "kvm_apic_accept_irq" event misreports level-triggered interrupts as
> edge-triggered, see my "KVM: x86: avoid misreporting level-triggered
> irqs as edge-triggered in tracing" patch.
> 
> Other than that I double-checked, both in Qemu and KVM (and there's a
> lot of additional debug prints stripped) and I'm certain there's no
> disagreement anywhere: gsi 23/vec 80 is a level-triggered interrupt.
> 
>> 
>>> 
>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>> immediate exit happens:
>>> 
>>> kvm_entry:vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xf80006a40115 info 0 0
>>> kvm_entry:vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xf8022f3d8350 info 0 >>> 0
>>> kvm_nested_vmexit:rip f8022f3d8350 reason PREEMPTION_TIMER info1 0 
>>> info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 
>>> int_info 8050 int_info_err 0
>> 
>> I assume that as part of Hyper-V VMExit handler for EXTERNAL_INTERRUPT, it 
>> will forward the interrupt to the host.
>> As done in KVM vcpu_enter_guest() calling 
>> kvm_x86_ops->handle_external_intr().
>> Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with 
>> vector 80 pending.
>> 
>> I also assume that Hyper-V utilise VM_EXIT_ACK_INTR_ON_EXIT and thus vector 
>> 96 is cleared from LAPIC IRR
>> and the bit in LAPIC ISR for vector 96 is set.
>> This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().
>> 
>> I further assume that at the point that vector 96 runs in L1, interrupts are 
>> disabled.
>> Afterwards I would expect L1 to enable interrupts (Similar to 
>> vcpu_enter_guest() calling local_irq_enable() after 
>> kvm_x86_ops->handle_external_intr()).
>> 
>> I would expect Hyper-V handler for vector 96 at some point to do EOI such 
>> that when interrupts are later enabled, vector 80 will also get injected.
>> All of this before attempting to resume back into L2.
>> 
>> However, it can be seen that indeed at this resume, you receive, after an 
>> immediate-exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
>> As if Hyper-V never enabled interrupts after handling vector 96 before doing 
>> a resume again to L2.
>> 
>> This is still valid of course but just a bit bizarre and
>> inefficient. Oh well. :)
> 
> Reverse-engineering is always fun :-)

We can report this to Microsoft as well. :)

> 
>> 
>>> 
>>> 3) Hyper-V doesn't want to deal wit

Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

2019-03-28 Thread Liran Alon



> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov  wrote:
> 
> This is embarassing but we have another Windows/Hyper-V issue to workaround
> in KVM (or QEMU). Hope "RFC" makes it less offensive.
> 
> It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
> 'piix4-usb-uhci' device is attached. The problem with this device is that
> it uses level-triggered interrupts.
> 
> The 'hang' scenario develops like this:
> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
> of them is level-triggered. KVM injects the edge-triggered one and
> requests immediate exit to inject the level-triggered:
> 
> kvm_set_irq:  gsi 23 level 1 source 0
> kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
> kvm_msi_set_irq:  dst 0 vec 96 (Fixed|physical|edge)
> kvm_apic_accept_irq:  apicid 0 vec 96 (Fixed|edge)
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 
> 8060 int_info_err 0

There is no immediate-exit here.
Here QEMU just sets two pending irqs: vector 80 and vector 96.
Because vCPU 0 is running in non-root mode, KVM emulates an exit from L2 to L1 
on EXTERNAL_INTERRUPT.
Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96, which is the highest 
pending vector, which is correct.

BTW, I don’t know why both are set in the LAPIC as edge-triggered and not 
level-triggered.
But it can be seen from the trace pattern that these interrupts are both 
level-triggered. (See QEMU’s ioapic_service().)
How did you deduce that one is edge-triggered and the other is level-triggered?

> 
> 2) Hyper-V requires one of its VMs to run to handle the situation but
> immediate exit happens:
> 
> kvm_entry:vcpu 0
> kvm_exit: reason VMRESUME rip 0xf80006a40115 info 0 0
> kvm_entry:vcpu 0
> kvm_exit: reason PREEMPTION_TIMER rip 0xf8022f3d8350 info 0 0
> kvm_nested_vmexit:rip f8022f3d8350 reason PREEMPTION_TIMER info1 0 
> info2 0 int_info 0 int_info_err 0
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 
> 8050 int_info_err 0

I assume that as part of Hyper-V's VMExit handler for EXTERNAL_INTERRUPT, it will 
forward the interrupt to the host,
as done in KVM's vcpu_enter_guest() calling kvm_x86_ops->handle_external_intr().
Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with 
vector 80 pending.

I also assume that Hyper-V utilises VM_EXIT_ACK_INTR_ON_EXIT and thus vector 96 
is cleared from LAPIC IRR
and the bit in LAPIC ISR for vector 96 is set.
This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().

I further assume that at the point that vector 96 runs in L1, interrupts are 
disabled.
Afterwards I would expect L1 to enable interrupts (Similar to 
vcpu_enter_guest() calling local_irq_enable() after 
kvm_x86_ops->handle_external_intr()).

I would expect Hyper-V handler for vector 96 at some point to do EOI such that 
when interrupts are later enabled, vector 80 will also get injected.
All of this before attempting to resume back into L2.

However, it can be seen that indeed at this resume, you receive, after an 
immediate-exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
As if Hyper-V never enabled interrupts after handling vector 96 before doing a 
resume again to L2.

This is still valid of course but just a bit bizarre and inefficient. Oh well. 
:)

> 
> 3) Hyper-V doesn't want to deal with the second irq (as its VM still didn't
> process the first one)

Both interrupts are for L1 not L2.

> so it just does 'EOI' for level-triggered interrupt
> and this causes re-injection:
> 
> kvm_exit: reason EOI_INDUCED rip 0xf80006a17e1a info 50 0
> kvm_eoi:  apicid 0 vector 80
> kvm_userspace_exit:   reason KVM_EXIT_IOAPIC_EOI (26)
> kvm_set_irq:  gsi 23 level 1 source 0
> kvm_msi_set_irq:  dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq:  apicid 0 vec 80 (Fixed|edge)
> kvm_entry:vcpu 0

What happens here is that Hyper-V, as a response to the second EXTERNAL_INTERRUPT 
on vector 80, invokes the vector 80 handler, which performs an EOI, which is 
configured in ioapic_exit_bitmap to cause an EOI_INDUCED exit to L0.
The EOI_INDUCED handler will reach handle_apic_eoi_induced() -> 
kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() -> 
kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT),
which will cause the exit on KVM_EXIT_IOAPIC_EOI to QEMU as required.

As part of QEMU’s handling of this exit (ioapic_eoi_broadcast()), it will note 
that the pin’s irr is still set (the irq-line was not lowered by the vector 80 
interrupt handler before the EOI), and thus vector 80 is re-injected by the 
IOAPIC at ioapic_service().

If this is indeed a level-triggered interrupt, then it seems buggy to me that 
the vector 80 handler hasn’t lowered the irq-line before the EOI.
I would suggest adding a trace to QEMU’s ioapic_set_irq() for when vector=80 
and level=0 and 

Re: [PATCH] KVM: x86: nVMX: allow RSM to restore VMXE CR4 flag

2019-03-26 Thread Liran Alon



> On 26 Mar 2019, at 15:48, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 26 Mar 2019, at 15:07, Vitaly Kuznetsov  wrote:
>>> - Instread of putting the temporary HF_SMM_MASK drop to
>>> rsm_enter_protected_mode() (as was suggested by Liran), move it to
>>> emulator_set_cr() modifying its interface. emulate.c seems to be
>>> vcpu-specifics-free at this moment, we may want to keep it this way.
>>> - It seems that Hyper-V+UEFI on KVM is still broken, I'm observing sporadic
>>> hangs even with this patch. These hangs, however, seem to be unrelated to
>>> rsm.
>> 
>> Feel free to share details on these hangs ;)
>> 
> 
> You've asked for it)
> 
> The immediate issue I'm observing is some sort of a lockup which is easy
> to trigger with e.g. "-usb -device usb-tablet" on Qemu command line; it
> seems we get too many interrupts and combined with preemtion timer for
> L2 we're not making any progress:
> 
> kvm_userspace_exit:   reason KVM_EXIT_IOAPIC_EOI (26)
> kvm_set_irq:  gsi 18 level 1 source 0
> kvm_msi_set_irq:  dst 0 vec 177 (Fixed|physical|level)
> kvm_apic_accept_irq:  apicid 0 vec 177 (Fixed|edge)
> kvm_fpu:  load
> kvm_entry:vcpu 0
> kvm_exit: reason VMRESUME rip 0xf8848115 info 0 0
> kvm_entry:vcpu 0
> kvm_exit: reason PREEMPTION_TIMER rip 0xf800f4448e01 info 0 0
> kvm_nested_vmexit:rip f800f4448e01 reason PREEMPTION_TIMER info1 0 
> info2 0 int_info 0 int_info_err 0
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 
> 80b1 int_info_err 0
> kvm_entry:vcpu 0
> kvm_exit: reason APIC_ACCESS rip 0xf881fe11 info 10b0 0
> kvm_apic: apic_write APIC_EOI = 0x0
> kvm_eoi:  apicid 0 vector 177
> kvm_fpu:  unload
> kvm_userspace_exit:   reason KVM_EXIT_IOAPIC_EOI (26)
> ...
> (and the pattern repeats)
> 
> Maybe it is a usb-only/Qemu-only problem, maybe not.
> 
> -- 
> Vitaly

The trace of kvm_apic_accept_irq should indicate that __apic_accept_irq() was 
called to inject an interrupt into the L1 guest.
(I know that we are now running in L1 because the next exit is a VMRESUME.)

However, it is surprising to see that on the next entry to guest, no interrupt 
was injected by vmx_inject_irq().
It may be because the L1 guest is currently running with interrupts disabled and 
therefore only an IRQ-window was requested.
(Too bad we don’t have a trace for this…)

Next, we got an exit from the L1 guest on VMRESUME. As part of its handling, the 
active VMCS was changed from vmcs01 to vmcs02.
I believe the later immediate exit on the preemption-timer was because the 
immediate-exit-request mechanism was invoked, which is now implemented by setting 
a VMX preemption-timer with a value of 0 (thanks to Sean).
(See vmx_vcpu_run() -> vmx_update_hv_timer() -> vmx_arm_hv_timer(vmx, 0).)
(Note that the pending interrupt was evaluated because of a recent patch of 
mine to nested_vmx_enter_non_root_mode() to request KVM_REQ_EVENT when vmcs01 
has requested an IRQ-window.)

Therefore, when entering L2, you immediately get an exit on PREEMPTION_TIMER, 
which will eventually cause L0 to call vmx_check_nested_events(), which now 
notices the pending interrupt that should have been injected to L1 before, and 
now exits from L2 to L1 on EXTERNAL_INTERRUPT on vector 0xb1.

Then L1 handles the interrupt by performing an EOI to the LAPIC, which propagates 
an EOI to the IOAPIC, which immediately re-injects the interrupt (after clearing 
the remote_irr) as the irq-line is still set. 
i.e. QEMU’s ioapic_eoi_broadcast() calls ioapic_service() immediately after it 
clears remote-irr for this pin.

Also note that in the trace we see only a single kvm_set_irq to level 1, but we 
don’t immediately see another kvm_set_irq to level 0.
This should indicate that in QEMU’s IOAPIC redirection table, this pin is 
configured as a level-triggered interrupt.
However, the trace of kvm_apic_accept_irq indicates that this interrupt is 
raised as an edge-triggered interrupt.

To sum up:
1) I would create a patch to add a trace to vcpu_enter_guest() when calling 
enable_smi_window() / enable_nmi_window() / enable_irq_window() (a rough sketch 
follows after this list).
2) It is worth investigating why MSI trigger-mode is edge-triggered instead of 
level-triggered.
3) If this is indeed a level-triggered interrupt, it is worth investigating how 
the interrupt source behaves. i.e. What causes this device to lower the irq-line?
(As we don’t see any I/O Port or MMIO access by L1 guest interrupt-handler 
before performing the EOI)
4) Does this issue reproduce also when running with kernel-irqchip? (Instead of 
split-irqchip)
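
For point (1), even something quick and dirty would do as a first step (a
sketch only; a proper patch would add real tracepoints in
arch/x86/kvm/trace.h instead):

/* Sketch: temporary trace_printk() in vcpu_enter_guest() (arch/x86/kvm/x86.c)
 * next to the existing calls, to see when KVM opens an IRQ/NMI/SMI window
 * instead of injecting immediately. */
trace_printk("vcpu%d: enable_irq_window\n", vcpu->vcpu_id);
kvm_x86_ops->enable_irq_window(vcpu);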

-Liran






Re: [PATCH] KVM: x86: nVMX: allow RSM to restore VMXE CR4 flag

2019-03-26 Thread Liran Alon



> On 26 Mar 2019, at 15:07, Vitaly Kuznetsov  wrote:
> 
> Commit 5bea5123cbf0 ("KVM: VMX: check nested state and CR4.VMXE against
> SMM") introduced a check to vmx_set_cr4() forbidding to set VMXE from SMM.
> The check is correct, however, there is a special case when RSM is called
> to leave SMM: rsm_enter_protected_mode() is called with HF_SMM_MASK still
> set and in case VMXE was set before entering SMM we're failing to return.
> 
> Resolve the issue by temporary dropping HF_SMM_MASK around set_cr4() calls
> when ops->set_cr() is called from RSM.
> 
> Reported-by: Jon Doron 
> Suggested-by: Liran Alon 
> Fixes: 5bea5123cbf0 ("KVM: VMX: check nested state and CR4.VMXE against SMM")
> Signed-off-by: Vitaly Kuznetsov 

Patch looks good to me.
Reviewed-by: Liran Alon 

> ---
> - Instread of putting the temporary HF_SMM_MASK drop to
>  rsm_enter_protected_mode() (as was suggested by Liran), move it to
>  emulator_set_cr() modifying its interface. emulate.c seems to be
>  vcpu-specifics-free at this moment, we may want to keep it this way.
> - It seems that Hyper-V+UEFI on KVM is still broken, I'm observing sporadic
>  hangs even with this patch. These hangs, however, seem to be unrelated to
>  rsm.

Feel free to share details on these hangs ;)

Great work,
-Liran

> ---
> arch/x86/include/asm/kvm_emulate.h |  3 ++-
> arch/x86/kvm/emulate.c | 27 ++-
> arch/x86/kvm/x86.c | 12 +++-
> 3 files changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_emulate.h 
> b/arch/x86/include/asm/kvm_emulate.h
> index 93c4bf598fb0..6c33caa82fa5 100644
> --- a/arch/x86/include/asm/kvm_emulate.h
> +++ b/arch/x86/include/asm/kvm_emulate.h
> @@ -203,7 +203,8 @@ struct x86_emulate_ops {
>   void (*set_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
>   void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
>   ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr);
> - int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val);
> + int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val,
> +   bool from_rsm);
>   int (*cpl)(struct x86_emulate_ctxt *ctxt);
>   int (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest);
>   int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index c338984c850d..a6204105d4d7 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -2413,7 +2413,7 @@ static int rsm_enter_protected_mode(struct 
> x86_emulate_ctxt *ctxt,
>   cr3 &= ~0xfff;
>   }
> 
> - bad = ctxt->ops->set_cr(ctxt, 3, cr3);
> + bad = ctxt->ops->set_cr(ctxt, 3, cr3, true);
>   if (bad)
>   return X86EMUL_UNHANDLEABLE;
> 
> @@ -2422,20 +2422,20 @@ static int rsm_enter_protected_mode(struct 
> x86_emulate_ctxt *ctxt,
>* Then enable protected mode.  However, PCID cannot be enabled
>* if EFER.LMA=0, so set it separately.
>*/
> - bad = ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
> + bad = ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE, true);
>   if (bad)
>   return X86EMUL_UNHANDLEABLE;
> 
> - bad = ctxt->ops->set_cr(ctxt, 0, cr0);
> + bad = ctxt->ops->set_cr(ctxt, 0, cr0, true);
>   if (bad)
>   return X86EMUL_UNHANDLEABLE;
> 
>   if (cr4 & X86_CR4_PCIDE) {
> - bad = ctxt->ops->set_cr(ctxt, 4, cr4);
> + bad = ctxt->ops->set_cr(ctxt, 4, cr4, true);
>   if (bad)
>   return X86EMUL_UNHANDLEABLE;
>   if (pcid) {
> - bad = ctxt->ops->set_cr(ctxt, 3, cr3 | pcid);
> + bad = ctxt->ops->set_cr(ctxt, 3, cr3 | pcid, true);
>   if (bad)
>   return X86EMUL_UNHANDLEABLE;
>   }
> @@ -2581,7 +2581,7 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
> 
>   /* Zero CR4.PCIDE before CR0.PG.  */
>   if (cr4 & X86_CR4_PCIDE) {
> - ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
> + ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE, true);
>   cr4 &= ~X86_CR4_PCIDE;
>   }
> 
> @@ -2595,11 +2595,12 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
>   /* For the 64-bit case, this will clear EFER.LMA.  */
>   cr0 = ctxt->ops->get_cr(ctxt, 0);
>   if (cr0 & X86_

Re: [PATCH] x86/kvm/hyper-v: tweak HYPERV_CPUID_ENLIGHTMENT_INFO

2019-01-24 Thread Liran Alon



> On 24 Jan 2019, at 19:39, Vitaly Kuznetsov  wrote:
> 
> Liran Alon  writes:
> 
>>> On 24 Jan 2019, at 19:15, Vitaly Kuznetsov  wrote:
>>> 
>>> We shouldn't probably be suggesting using Enlightened VMCS when it's not
>>> enabled (not supported from guest's point of view). System reset through
>>> synthetic MSR is not recommended neither by genuine Hyper-V nor my QEMU.
>>> 
>>> Windows seems to be fine either way but let's be consistent.
>>> 
>>> Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce 
>>> KVM_GET_SUPPORTED_HV_CPUID")
>>> Signed-off-by: Vitaly Kuznetsov 
>>> ---
>>> arch/x86/kvm/hyperv.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>>> index ac44a681f065..4730fcaa70cf 100644
>>> --- a/arch/x86/kvm/hyperv.c
>>> +++ b/arch/x86/kvm/hyperv.c
>>> @@ -1847,11 +1847,11 @@ int kvm_vcpu_ioctl_get_hv_cpuid(struct kvm_vcpu 
>>> *vcpu, struct kvm_cpuid2 *cpuid,
>>> case HYPERV_CPUID_ENLIGHTMENT_INFO:
>>> ent->eax |= HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
>>> ent->eax |= HV_X64_APIC_ACCESS_RECOMMENDED;
>>> -   ent->eax |= HV_X64_SYSTEM_RESET_RECOMMENDED;
>>> ent->eax |= HV_X64_RELAXED_TIMING_RECOMMENDED;
>>> ent->eax |= HV_X64_CLUSTER_IPI_RECOMMENDED;
>>> ent->eax |= HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED;
>>> -   ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>>> +   if (evmcs_ver)
>>> +   ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>>> 
>>> /*
>>>  * Default number of spinlock retry attempts, matches
>>> -- 
>>> 2.20.1
>>> 
>> 
>> Seems to me that there are 2 unrelated separated patches here. Why not
>> split them?
> 
> They seem to be too small :-) No problem, I'll split them up in v2.

I don’t think it generally matters how small they are.
Separating into small logical patches allows better bisection, easier review and 
cleaner reverts. So better overall. :)

> 
>> For content itself: Reviewed-by: Liran Alon 
>> 
> 
> Thanks!
> 
> -- 
> Vitaly



Re: [PATCH] x86/kvm/hyper-v: tweak HYPERV_CPUID_ENLIGHTMENT_INFO

2019-01-24 Thread Liran Alon



> On 24 Jan 2019, at 19:15, Vitaly Kuznetsov  wrote:
> 
> We shouldn't probably be suggesting using Enlightened VMCS when it's not
> enabled (not supported from guest's point of view). System reset through
> synthetic MSR is not recommended neither by genuine Hyper-V nor my QEMU.
> 
> Windows seems to be fine either way but let's be consistent.
> 
> Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID")
> Signed-off-by: Vitaly Kuznetsov 
> ---
> arch/x86/kvm/hyperv.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index ac44a681f065..4730fcaa70cf 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1847,11 +1847,11 @@ int kvm_vcpu_ioctl_get_hv_cpuid(struct kvm_vcpu 
> *vcpu, struct kvm_cpuid2 *cpuid,
>   case HYPERV_CPUID_ENLIGHTMENT_INFO:
>   ent->eax |= HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
>   ent->eax |= HV_X64_APIC_ACCESS_RECOMMENDED;
> - ent->eax |= HV_X64_SYSTEM_RESET_RECOMMENDED;
>   ent->eax |= HV_X64_RELAXED_TIMING_RECOMMENDED;
>   ent->eax |= HV_X64_CLUSTER_IPI_RECOMMENDED;
>   ent->eax |= HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED;
> - ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> + if (evmcs_ver)
> + ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> 
>   /*
>* Default number of spinlock retry attempts, matches
> -- 
> 2.20.1
> 

Seems to me that there are 2 unrelated, separate patches here. Why not split 
them?
For content itself: Reviewed-by: Liran Alon 



Re: [PATCH v1 3/8] kvm:vmx Enable loading CET state bit while guest CR4.CET is being set.

2018-12-26 Thread Liran Alon



> On 26 Dec 2018, at 10:15, Yang Weijiang  wrote:
> 
> This bit controls whether guest CET states will be loaded on guest entry.
> 
> Signed-off-by: Zhang Yi Z 
> Signed-off-by: Yang Weijiang 
> ---
> arch/x86/kvm/vmx.c | 19 +++
> 1 file changed, 19 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7bbb8b26e901..25fa6bd2fb95 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1045,6 +1045,8 @@ struct vcpu_vmx {
> 
>   bool req_immediate_exit;
> 
> + bool vcpu_cet_on;
> +
>   /* Support for PML */
> #define PML_ENTITY_NUM512
>   struct page *pml_pg;
> @@ -5409,6 +5411,23 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
> long cr4)
>   return 1;
>   }
> 
> + /*
> +  * When CET.CR4 is being set, it means we're enabling CET for

You probably meant to write CR4.CET here.

> +  * the guest, then enable loading CET state bit in entry control.
> +  * Otherwise, clear loading CET bit to disable guest CET.
> +  */
> + if (cr4 & X86_CR4_CET) {
> + if (!to_vmx(vcpu)->vcpu_cet_on) {
> + vmcs_set_bits(VM_ENTRY_CONTROLS,
> +   VM_ENTRY_LOAD_GUEST_CET_STATE);
> + to_vmx(vcpu)->vcpu_cet_on = 1;
> + }
> + } else if (to_vmx(vcpu)->vcpu_cet_on) {
> + vmcs_clear_bits(VM_ENTRY_CONTROLS,
> + VM_ENTRY_LOAD_GUEST_CET_STATE);
> + to_vmx(vcpu)->vcpu_cet_on = 0;
> + }
> +
>   if (to_vmx(vcpu)->nested.vmxon && !nested_cr4_valid(vcpu, cr4))
>   return 1;
> 
> -- 
> 2.17.1
> 

I haven’t seen a patch in the series that modifies kvm_set_cr4() to verify 
CR4.CET is not set when CET is not reported as supported by CPUID.
I think that is missing from the series.
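
Something along these lines in kvm_set_cr4() should be enough (a sketch only;
X86_FEATURE_SHSTK is just a placeholder for whichever guest CPUID feature bit
this series defines for CET):

/* Sketch for kvm_set_cr4() in arch/x86/kvm/x86.c: reject CR4.CET when the
 * guest CPUID does not report CET support. The feature bit used here is a
 * placeholder for the one the series should actually check. */
if ((cr4 & X86_CR4_CET) && !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK))
        return 1;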

-Liran




Re: [PATCH] KVM: x86: Trace changes to active TSC offset regardless if vCPU in guest-mode

2018-11-25 Thread Liran Alon



> On 25 Nov 2018, at 19:53, Paolo Bonzini  wrote:
> 
> For some reason, kvm_x86_ops->write_l1_tsc_offset() skipped trace
> of change to active TSC offset in case vCPU is in guest-mode.
> This patch changes write_l1_tsc_offset() behavior to trace any change
> to active TSC offset to aid debugging.  The VMX code is changed to
> look more similar to SVM, which is in my opinion nicer.
> 
> Based on a patch by Liran Alon.
> 
> Signed-off-by: Paolo Bonzini 

I would have applied this refactoring change on top of my original version of 
this patch. Easier to read and review.
But I guess it’s a matter of taste…
Anyway, code looks correct to me. Therefore:
Reviewed-by: Liran Alon 

> ---
>   Untested still, but throwing it out because it seems pretty
>   obvious...
> 
> arch/x86/kvm/svm.c |  9 +
> arch/x86/kvm/vmx.c | 34 +-
> 2 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index a24733aade4c..0d1a74069a9e 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1456,10 +1456,11 @@ static u64 svm_write_l1_tsc_offset(struct kvm_vcpu 
> *vcpu, u64 offset)
>   g_tsc_offset = svm->vmcb->control.tsc_offset -
>  svm->nested.hsave->control.tsc_offset;
>   svm->nested.hsave->control.tsc_offset = offset;
> - } else
> - trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> -svm->vmcb->control.tsc_offset,
> -offset);
> + }
> +
> + trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> +svm->vmcb->control.tsc_offset - g_tsc_offset,
> +offset);
> 
>   svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 764c23dc444f..e7d3f7d35355 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3466,24 +3466,24 @@ static u64 vmx_read_l1_tsc_offset(struct kvm_vcpu 
> *vcpu)
> 
> static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> {
> - u64 active_offset = offset;
> - if (is_guest_mode(vcpu)) {
> - /*
> -  * We're here if L1 chose not to trap WRMSR to TSC. According
> -  * to the spec, this should set L1's TSC; The offset that L1
> -  * set for L2 remains unchanged, and still needs to be added
> -  * to the newly set TSC to get L2's TSC.
> -  */
> - struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> - if (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING))
> - active_offset += vmcs12->tsc_offset;
> - } else {
> - trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> -vmcs_read64(TSC_OFFSET), offset);
> - }
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> + u64 g_tsc_offset = 0;
> +
> + /*
> +  * We're here if L1 chose not to trap WRMSR to TSC. According
> +  * to the spec, this should set L1's TSC; The offset that L1
> +  * set for L2 remains unchanged, and still needs to be added
> +  * to the newly set TSC to get L2's TSC.
> +  */
> + if (is_guest_mode(vcpu) &&
> + (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING))
> + g_tsc_offset = vmcs12->tsc_offset;
> 
> - vmcs_write64(TSC_OFFSET, active_offset);
> - return active_offset;
> + trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> +vcpu->arch.tsc_offset - g_tsc_offset,
> +offset);
> + vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> + return offset + g_tsc_offset;
> }
> 
> /*
> -- 
> 1.8.3.1
> 






Re: [PATCH] KVM: VMX: re-add ple_gap module parameter

2018-11-23 Thread Liran Alon



> On 23 Nov 2018, at 19:02, Luiz Capitulino  wrote:
> 
> 
> Apparently, the ple_gap parameter was accidentally removed
> by commit c8e88717cfc6b36bedea22368d97667446318291. Add it
> back.
> 
> Signed-off-by: Luiz Capitulino 

Weird that nobody noticed this when the patch was applied… Thanks.
Reviewed-by: Liran Alon 

> ---
> arch/x86/kvm/vmx.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4555077d69ce..be6f13f1c25f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -174,6 +174,7 @@ module_param_named(preemption_timer, 
> enable_preemption_timer, bool, S_IRUGO);
>  * refer SDM volume 3b section 21.6.13 & 22.1.3.
>  */
> static unsigned int ple_gap = KVM_DEFAULT_PLE_GAP;
> +module_param(ple_gap, uint, 0444);
> 
> static unsigned int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> module_param(ple_window, uint, 0444);
> -- 
> 2.17.2
> 



Re: KMSAN: kernel-infoleak in kvm_arch_vcpu_ioctl

2018-11-16 Thread Liran Alon



> On 17 Nov 2018, at 0:09, syzbot 
>  wrote:
> 
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:006aa39cddee kmsan: don't instrument fixup_bad_iret()
> git tree:   https://github.com/google/kmsan.git/master
> console output: https://syzkaller.appspot.com/x/log.txt?x=101dcb0b40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f388ea1732f3c473
> dashboard link: https://syzkaller.appspot.com/bug?extid=cfbc368e283d381f8cef
> compiler:   clang version 8.0.0 (trunk 343298)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10c56fbd40
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=153c8a4740
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+cfbc368e283d381f8...@syzkaller.appspotmail.com
> 
> IPVS: ftp: loaded support on port[0] = 21
> L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and 
> https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html
>  for details.
> ==
> BUG: KMSAN: kernel-infoleak in _copy_to_user+0x19a/0x230 lib/usercopy.c:31
> CPU: 0 PID: 6697 Comm: syz-executor853 Not tainted 4.20.0-rc2+ #85
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x32d/0x480 lib/dump_stack.c:113
> kmsan_report+0x19f/0x300 mm/kmsan/kmsan.c:911
> kmsan_internal_check_memory+0x35b/0x3b0 mm/kmsan/kmsan.c:993
> kmsan_copy_to_user+0x7c/0xe0 mm/kmsan/kmsan_hooks.c:552
> _copy_to_user+0x19a/0x230 lib/usercopy.c:31
> copy_to_user include/linux/uaccess.h:183 [inline]
> kvm_vcpu_ioctl_enable_cap arch/x86/kvm/x86.c:3834 [inline]
> kvm_arch_vcpu_ioctl+0x5dee/0x7680 arch/x86/kvm/x86.c:4132
> kvm_vcpu_ioctl+0xca3/0x1f90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2748
> do_vfs_ioctl+0xfbc/0x2f70 fs/ioctl.c:46
> ksys_ioctl fs/ioctl.c:713 [inline]
> __do_sys_ioctl fs/ioctl.c:720 [inline]
> __se_sys_ioctl+0x1da/0x270 fs/ioctl.c:718
> __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:718
> do_syscall_64+0xcf/0x110 arch/x86/entry/common.c:291
> entry_SYSCALL_64_after_hwframe+0x63/0xe7
> RIP: 0033:0x4471b9
> Code: e8 fc b9 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 3b 07 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f1e22946da8 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 006f0038 RCX: 004471b9
> RDX: 2000 RSI: 4068aea3 RDI: 0005
> RBP: 006f0030 R08:  R09: 
> R10:  R11: 0246 R12: 006f003c
> R13: 6d766b2f7665642f R14: 7f1e229479c0 R15: 03e8
> 
> Local variable description: __pu_val@kvm_arch_vcpu_ioctl
> Variable was created at:
> kvm_arch_vcpu_ioctl+0x29d/0x7680 arch/x86/kvm/x86.c:3848
> kvm_vcpu_ioctl+0xca3/0x1f90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2748
> 
> Bytes 0-1 of 2 are uninitialized
> Memory access of size 2 starts at 8881967ffbb0
> Data copied to user address 00706000
> ==
> 

The info-leak bug is very simple, What happens is the 
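
For illustration (hypothetical names, not the exact KVM code), the general shape of the
bug KMSAN flags above is an on-stack value that gets copied to userspace on a path
where the callee may have failed before ever writing it:

/* Hypothetical sketch of the uninitialized-copy pattern reported above. */
static int example_enable_cap(struct kvm_vcpu *vcpu, u16 __user *uaddr)
{
	u16 value;	/* never initialized here */
	int r;

	/* May return an error before writing 'value'. */
	r = example_helper_that_may_fail(vcpu, &value);

	/* BAD: on failure this copies two bytes of stack garbage to userspace. */
	if (put_user(value, uaddr))
		return -EFAULT;

	return r;
}

The defensive version copies the value out only once it is known to be initialized,
e.g. "if (!r && put_user(value, uaddr)) r = -EFAULT;", or pre-initializes the local.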

Re: KMSAN: kernel-infoleak in kvm_write_guest_page

2018-11-07 Thread Liran Alon



> On 7 Nov 2018, at 20:58, syzbot 
>  wrote:
> 
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:7438a3b20295 kmsan: print user address when reporting info..
> git tree:   https://github.com/google/kmsan.git/master
> console output: https://syzkaller.appspot.com/x/log.txt?x=10d782f540
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8df5fc509a1b351b
> dashboard link: https://syzkaller.appspot.com/bug?extid=a8ef68d71211ba264f56
> compiler:   clang version 8.0.0 (trunk 343298)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f0913340
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15a39e0540
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a8ef68d71211ba264...@syzkaller.appspotmail.com
> 
> L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and 
> https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html
>  for details.
> ==
> BUG: KMSAN: kernel-infoleak in __copy_to_user include/linux/uaccess.h:121 
> [inline]
> BUG: KMSAN: kernel-infoleak in __kvm_write_guest_page 
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1849 [inline]
> BUG: KMSAN: kernel-infoleak in kvm_write_guest_page+0x373/0x500 
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1861
> CPU: 1 PID: 6274 Comm: syz-executor149 Not tainted 4.19.0+ #78
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x32d/0x480 lib/dump_stack.c:113
> kmsan_report+0x19f/0x300 mm/kmsan/kmsan.c:911
> kmsan_internal_check_memory+0x35f/0x450 mm/kmsan/kmsan.c:993
> kmsan_copy_to_user+0x7c/0xe0 mm/kmsan/kmsan_hooks.c:552
> __copy_to_user include/linux/uaccess.h:121 [inline]
> __kvm_write_guest_page arch/x86/kvm/../../../virt/kvm/kvm_main.c:1849 [inline]
> kvm_write_guest_page+0x373/0x500 
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1861
> kvm_write_guest+0x1e1/0x360 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1883
> kvm_pv_clock_pairing arch/x86/kvm/x86.c:6793 [inline]
> kvm_emulate_hypercall+0x1c96/0x21b0 arch/x86/kvm/x86.c:6866
> handle_vmcall+0x41/0x50 arch/x86/kvm/vmx.c:7487
> vmx_handle_exit+0x1e81/0xbac0 arch/x86/kvm/vmx.c:10128
> vcpu_enter_guest arch/x86/kvm/x86.c:7667 [inline]
> vcpu_run arch/x86/kvm/x86.c:7730 [inline]
> kvm_arch_vcpu_ioctl_run+0xac32/0x11d80 arch/x86/kvm/x86.c:7930
> kvm_vcpu_ioctl+0xfb1/0x1f90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2590
> do_vfs_ioctl+0xf77/0x2d30 fs/ioctl.c:46
> ksys_ioctl fs/ioctl.c:702 [inline]
> __do_sys_ioctl fs/ioctl.c:709 [inline]
> __se_sys_ioctl+0x1da/0x270 fs/ioctl.c:707
> __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:707
> do_syscall_64+0xcf/0x110 arch/x86/entry/common.c:291
> entry_SYSCALL_64_after_hwframe+0x63/0xe7
> RIP: 0033:0x442b39
> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 1b 0c fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7ffcb4e05478 EFLAGS: 0217 ORIG_RAX: 0010
> RAX: ffda RBX: 004002c8 RCX: 00442b39
> RDX:  RSI: ae80 RDI: 0007
> RBP: 006cd018 R08: 
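
The call chain above ends in kvm_write_guest() from kvm_pv_clock_pairing(), so the
usual suspect for this kind of report is an on-stack structure whose padding or unused
fields are never written before being copied into guest memory. A minimal sketch of
the defensive pattern follows; the struct layout and names are illustrative and this is
not necessarily the exact upstream fix:

/* Illustrative only: zero the whole on-stack struct before copying it out. */
struct example_clock_pairing {
	__s64 sec;
	__s64 nsec;
	__u64 tsc;
	__u32 flags;
	__u32 pad[9];	/* uninitialized padding would reach the guest too */
};

static int example_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr)
{
	struct example_clock_pairing cp;

	memset(&cp, 0, sizeof(cp));	/* avoid leaking stack contents */
	/* ... fill sec/nsec/tsc/flags with real values here ... */

	if (kvm_write_guest(vcpu->kvm, paddr, &cp, sizeof(cp)))
		return -KVM_EFAULT;

	return 0;
}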

Re: KMSAN: kernel-infoleak in kvm_vcpu_write_guest_page

2018-11-07 Thread Liran Alon



> On 7 Nov 2018, at 14:47, Paolo Bonzini  wrote:
> 
> On 07/11/2018 13:10, Alexander Potapenko wrote:
>> This appears to be a real bug in KVM.
>> Please see a simplified reproducer attached.
> 
> Thanks, I agree it's a real bug.  The basic issue is that the
> kvm_state->size member is too small (1040) in the KVM_SET_NESTED_STATE
> ioctl, aka 0x4080aebf.
> 
> One way to fix it would be to just change kmalloc to kzalloc when
> allocating cached_vmcs12 and cached_shadow_vmcs12, but really the ioctl
> is wrong and should be rejected.  And the case where a shadow VMCS has
> to be loaded is even more wrong, and we have to fix it anyway, so I
> don't really like the idea of papering over the bug in the allocation.
> 
> I'll test this patch and submit it formally:
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c645f777b425..c546f0b1f3e0 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -14888,10 +14888,13 @@ static int vmx_set_nested_state(struct
> kvm_vcpu *vcpu,
>   if (ret)
>   return ret;
> 
> - /* Empty 'VMXON' state is permitted */
> - if (kvm_state->size < sizeof(kvm_state) + sizeof(*vmcs12))
> + /* Empty 'VMXON' state is permitted.  A partial VMCS12 is not.  */
> + if (kvm_state->size == sizeof(kvm_state))
>   return 0;
> 
> + if (kvm_state->size < sizeof(kvm_state) + VMCS12_SIZE)
> + return -EINVAL;
> +

I don’t think that this test is sufficient to fully resolve the issue.
What if malicious userspace supplies a valid size, but the pages containing
nested_state->vmcs12 are unmapped?
This will result in vmx_set_nested_state() still calling set_current_vmptr()
but failing on copy_from_user(),
which still leaks cached_vmcs12 on the guest's next VMPTRLD.

Therefore, I think that the correct patch should be to change 
vmx_set_nested_state() to
first gather all relevant information from userspace and validate it,
and only then start applying it to KVM’s internal vCPU state.
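
A rough sketch of that ordering (simplified and illustrative: the example_* names are
not real KVM functions, and error handling is reduced to the essentials). Everything is
copied from userspace into temporaries and validated first; KVM's internal vCPU state
is only touched once no remaining step can fail:

/* Illustrative "gather and validate first, apply last" ordering. */
static int example_set_nested_state(struct kvm_vcpu *vcpu,
				    struct kvm_nested_state __user *user_state,
				    struct kvm_nested_state *kvm_state)
{
	struct vmcs12 *tmp_vmcs12;
	int ret = -EINVAL;

	if (kvm_state->size < sizeof(*kvm_state) + VMCS12_SIZE)
		return -EINVAL;

	tmp_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL);
	if (!tmp_vmcs12)
		return -ENOMEM;

	/* 1. Gather: a failure here has not touched vCPU state at all. */
	if (copy_from_user(tmp_vmcs12, user_state->data, VMCS12_SIZE)) {
		ret = -EFAULT;
		goto out;
	}

	/* 2. Validate the temporary copy (revision id, reserved bits, ...). */
	if (!example_vmcs12_is_valid(tmp_vmcs12))
		goto out;

	/* 3. Apply: only now mutate KVM's internal state. */
	set_current_vmptr(to_vmx(vcpu), kvm_state->vmx.vmcs_pa);
	memcpy(get_vmcs12(vcpu), tmp_vmcs12, VMCS12_SIZE);
	ret = 0;
out:
	kfree(tmp_vmcs12);
	return ret;
}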

>   if (kvm_state->vmx.vmcs_pa != -1ull) {
>   if (kvm_state->vmx.vmcs_pa == kvm_state->vmx.vmxon_pa ||
>   !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
> @@ -14917,6 +14920,7 @@ static int vmx_set_nested_state(struct kvm_vcpu
> *vcpu,
>   }
> 
>   vmcs12 = get_vmcs12(vcpu);
> + BUILD_BUG_ON(sizeof(*vmcs12) > VMCS12_SIZE);

Why put this BUILD_BUG_ON() specifically here?
There are many places which assume cached_vmcs12 is of size VMCS12_SIZE
(such as nested_release_vmcs12() and handle_vmptrld()).

>   if (copy_from_user(vmcs12, user_kvm_nested_state->data, 
> sizeof(*vmcs12)))
>   return -EFAULT;
> 
> @@ -14932,7 +14936,7 @@ static int vmx_set_nested_state(struct kvm_vcpu
> *vcpu,
>   if (nested_cpu_has_shadow_vmcs(vmcs12) &&
>   vmcs12->vmcs_link_pointer != -1ull) {
>   struct vmcs12 *shadow_vmcs12 = get_shadow_vmcs12(vcpu);
> - if (kvm_state->size < sizeof(kvm_state) + 2 * sizeof(*vmcs12))
> + if (kvm_state->size < sizeof(kvm_state) + 2 * VMCS12_SIZE)
>   return -EINVAL;
> 
>   if (copy_from_user(shadow_vmcs12,
> 
> Paolo

-Liran




Re: KMSAN: kernel-infoleak in kvm_vcpu_write_guest_page

2018-11-07 Thread Liran Alon



> On 7 Nov 2018, at 14:10, Alexander Potapenko  wrote:
> 
> On Wed, Nov 7, 2018 at 2:38 AM syzbot
>  wrote:
>> 
>> Hello,
>> 
>> syzbot found the following crash on:
>> 
>> HEAD commit:88b95ef4c780 kmsan: use MSan assembly instrumentation
>> git tree:   https://github.com/google/kmsan.git/master
>> console output: https://syzkaller.appspot.com/x/log.txt?x=12505e3340
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=8df5fc509a1b351b
>> dashboard link: https://syzkaller.appspot.com/bug?extid=ded1696f6b50b615b630
>> compiler:   clang version 8.0.0 (trunk 343298)
>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15ce62f540
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=174efca340
>> 
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+ded1696f6b50b615b...@syzkaller.appspotmail.com
>> 
>> ==
>> BUG: KMSAN: kernel-infoleak in __copy_to_user include/linux/uaccess.h:121
>> [inline]
>> BUG: KMSAN: kernel-infoleak in __kvm_write_guest_page
>> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1849 [inline]
>> BUG: KMSAN: kernel-infoleak in kvm_vcpu_write_guest_page+0x39a/0x510
>> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1870
>> CPU: 0 PID: 7918 Comm: syz-executor542 Not tainted 4.19.0+ #77
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x32d/0x480 lib/dump_stack.c:113
>>  kmsan_report+0x1a2/0x2e0 mm/kmsan/kmsan.c:911
>>  kmsan_internal_check_memory+0x34c/0x430 mm/kmsan/kmsan.c:991
>>  kmsan_copy_to_user+0x85/0xe0 mm/kmsan/kmsan_hooks.c:552
>>  __copy_to_user include/linux/uaccess.h:121 [inline]
>>  __kvm_write_guest_page arch/x86/kvm/../../../virt/kvm/kvm_main.c:1849
>> [inline]
>>  kvm_vcpu_write_guest_page+0x39a/0x510
>> arch/x86/kvm/../../../virt/kvm/kvm_main.c:1870
>>  nested_release_vmcs12 arch/x86/kvm/vmx.c:8441 [inline]
>>  handle_vmptrld+0x2384/0x26b0 arch/x86/kvm/vmx.c:8907
>>  vmx_handle_exit+0x1e81/0xbac0 arch/x86/kvm/vmx.c:10128
>>  vcpu_enter_guest arch/x86/kvm/x86.c:7667 [inline]
>>  vcpu_run arch/x86/kvm/x86.c:7730 [inline]
>>  kvm_arch_vcpu_ioctl_run+0xac32/0x11d80 arch/x86/kvm/x86.c:7930
>>  kvm_vcpu_ioctl+0xfb1/0x1f90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2590
>>  do_vfs_ioctl+0xf77/0x2d30 fs/ioctl.c:46
>>  ksys_ioctl fs/ioctl.c:702 [inline]
>>  __do_sys_ioctl fs/ioctl.c:709 [inline]
>>  __se_sys_ioctl+0x1da/0x270 fs/ioctl.c:707
>>  __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:707
>>  do_syscall_64+0xcf/0x110 arch/x86/entry/common.c:291
>>  entry_SYSCALL_64_after_hwframe+0x63/0xe7
>> RIP: 0033:0x44b6e9
>> Code: e8 dc e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
>> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
>> ff 0f 83 2b ff fb ff c3 66 2e 0f 1f 84 00 00 00 00
>> RSP: 002b:7f096b292ce8 EFLAGS: 0206 ORIG_RAX: 0010
>> RAX: ffda RBX: 006e3c48 RCX: 0044b6e9
>> RDX:  RSI: ae80 RDI: 0005
>> RBP: 006e3c40 R08:  R09: 
>> R10:  R11: 0206 R12: 006e3c4c
>> R13: 7ffd978aeb2f R14: 7f096b2939c0 R15: 006e3d4c
>> 
>> Uninit was created at:
>>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:252 [inline]
>>  kmsan_internal_poison_shadow+0xc8/0x1e0 mm/kmsan/kmsan.c:177
>>  kmsan_kmalloc+0x98/0x110 

Re: [PATCH] KVM: VMX: enable nested virtualization by default

2018-10-16 Thread Liran Alon



> On 17 Oct 2018, at 1:55, Paolo Bonzini  wrote:
> 
> With live migration support and finally a good solution for CR2/DR6
> exception payloads, nested VMX should finally be ready for having a stable

And a good solution for setting/getting vCPU events from userspace with the correct
pending/injected state.

> userspace ABI.  The results of syzkaller fuzzing are not perfect but not
> horrible either (and might be partially due to running on GCE, so that
> effectively we're testing three-level nesting on a fork of upstream KVM!).
> Enabling it by default seems like a nice way to conclude the 4.20
> pull request. :)
> 
> Unfortunately, enabling nested SVM in 2009 was a bit premature.  However,

Don’t you wish to mention the commit which enabled it?

> until live migration support is in place we can reasonably expect that
> it does not offer much in terms of ABI guarantees.  Therefore we are
> still in time to break things and conform as much as possible to the
> interface used for VMX.
> 
> Suggested-by: Jim Mattson 
> Suggested-by: Liran Alon 
> Signed-off-by: Paolo Bonzini 
> ---
> arch/x86/kvm/vmx.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e665aa7167cf..89fc2a744d7f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -107,7 +107,7 @@ module_param_named(enable_shadow_vmcs, 
> enable_shadow_vmcs, bool, S_IRUGO);
>  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>  * use VMX instructions.
>  */
> -static bool __read_mostly nested = 0;
> +static bool __read_mostly nested = 1;
> module_param(nested, bool, S_IRUGO);
> 
> static u64 __read_mostly host_xss;
> -- 
> 2.17.1
> 

Woohoo! :)
Reviewed-by: Liran Alon 




Re: [PATCH] KVM: VMX: enable nested virtualization by default

2018-10-16 Thread Liran Alon



> On 17 Oct 2018, at 1:55, Paolo Bonzini  wrote:
> 
> With live migration support and finally a good solution for CR2/DR6
> exception payloads, nested VMX should finally be ready for having a stable

And good solution for setting/getting vCPU events from userspace with correct 
pending/injected state.

> userspace ABI.  The results of syzkaller fuzzing are not perfect but not
> horrible either (and might be partially due to running on GCE, so that
> effectively we're testing three-level nesting on a fork of upstream KVM!).
> Enabling it by default seems like a nice way to conclude the 4.20
> pull request. :)
> 
> Unfortunately, enabling nested SVM in 2009 was a bit premature.  However,

Don’t you wish to mention commit which enabled it?

> until live migration support is in place we can reasonably expect that
> it does not offer much in terms of ABI guarantees.  Therefore we are
> still in time to break things and conform as much as possible to the
> interface used for VMX.
> 
> Suggested-by: Jim Mattson 
> Suggested-by: Liran Alon 
> Signed-off-by: Paolo Bonzini 
> ---
> arch/x86/kvm/vmx.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e665aa7167cf..89fc2a744d7f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -107,7 +107,7 @@ module_param_named(enable_shadow_vmcs, 
> enable_shadow_vmcs, bool, S_IRUGO);
>  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>  * use VMX instructions.
>  */
> -static bool __read_mostly nested = 0;
> +static bool __read_mostly nested = 1;
> module_param(nested, bool, S_IRUGO);
> 
> static u64 __read_mostly host_xss;
> -- 
> 2.17.1
> 

Woohoo! :)
Reviewed-by: Liran Alon 




Re: [PATCH] KVM: LAPIC: Tune lapic_timer_advance_ns automatically

2018-10-08 Thread Liran Alon



> On 8 Oct 2018, at 13:59, Wanpeng Li  wrote:
> 
> On Mon, 8 Oct 2018 at 05:02, Liran Alon  wrote:
>> 
>> 
>> 
>>> On 28 Sep 2018, at 9:12, Wanpeng Li  wrote:
>>> 
>>> From: Wanpeng Li 
>>> 
>>> In cloud environment, lapic_timer_advance_ns is needed to be tuned for 
>>> every CPU
>>> generations, and every host kernel versions(the 
>>> kvm-unit-tests/tscdeadline_latency.flat
>>> is 5700 cycles for upstream kernel and 9600 cycles for our 3.10 product 
>>> kernel,
>>> both preemption_timer=N, Skylake server).
>>> 
>>> This patch adds the capability to automatically tune lapic_timer_advance_ns
>>> step by step, the initial value is 1000ns as d0659d946be05 (KVM: x86: add
>>> option to advance tscdeadline hrtimer expiration) recommended, it will be
>>> reduced when it is too early, and increased when it is too late. The 
>>> guest_tsc
>>> and tsc_deadline are hard to equal, so we assume we are done when the delta
>>> is within a small scope e.g. 100 cycles. This patch reduces latency
>>> (kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer enabled)
>>> from ~2600 cyles to ~1200 cyles on our Skylake server.
>>> 
>>> Cc: Paolo Bonzini 
>>> Cc: Radim Krčmář 
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>> arch/x86/kvm/lapic.c | 7 +++
>>> arch/x86/kvm/x86.c   | 2 +-
>>> 2 files changed, 8 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index fbb0e6d..b756f12 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -70,6 +70,8 @@
>>> #define APIC_BROADCAST0xFF
>>> #define X2APIC_BROADCAST  0xul
>>> 
>>> +static bool __read_mostly lapic_timer_advance_adjust_done = false;
>>> +
>>> static inline int apic_test_vector(int vec, void *bitmap)
>>> {
>>>  return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
>>> @@ -1492,6 +1494,11 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>  if (guest_tsc < tsc_deadline)
>>>  __delay(min(tsc_deadline - guest_tsc,
>>>  nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>>> + if (!lapic_timer_advance_adjust_done) {
>>> + lapic_timer_advance_ns += (s64)(guest_tsc - tsc_deadline) / 8;
>> 
>> I don’t understand how this “/ 8” converts between guest TSC units and host
>> nanoseconds.
> 
> Oh, I miss it. In addition, /8 here I mean adjust
> lapic_timer_advance_ns step by step. I can observe big fluctuated

If that’s the case, I would also put the “8” as a #define to make its purpose
clearer.

> value between early and late when running real guest os like linux
> instead of kvm-unit-tests. After more testing, I saw
> lapic_timer_advance_ns can be overflow since the delta between
> guest_tsc and tsc_deadline is too huge.
> 
>> 
>> I think that instead you should do something like:
>> s64 ns = (s64)(guest_tsc - tsc_deadline) * 1000000ULL;
>> do_div(ns, vcpu->arch.virtual_tsc_khz);
>> lapic_timer_advance_ns += ns;
>> 
>>> + if (abs(guest_tsc - tsc_deadline) < 100)
>> 
>> I would put this “100” hard-coded value as some “#define” to make code more 
>> clear.
> 
> How about something like below:
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fbb0e6d..354eb13c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -70,6 +70,9 @@
> #define APIC_BROADCAST0xFF
> #define X2APIC_BROADCAST0xul
> 
> +static bool __read_mostly lapic_timer_advance_adjust_done = false;
> +#define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
> +
> static inline int apic_test_vector(int vec, void *bitmap)
> {
> return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
> @@ -1472,7 +1475,7 @@ static bool lapic_timer_int_injected(struct
> kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> -u64 guest_tsc, tsc_deadline;
> +u64 guest_tsc, tsc_deadline, ns;
> 
> if (!lapic_in_kernel(vcpu))
> return;
> @@ -1492,6 +1495,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> if (guest_tsc < tsc_deadline)
> __delay(min(tsc_deadline - guest_tsc,
> nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +if (!lapic_timer_advance_adjust_done) {
> +if (guest_

Re: [PATCH] KVM: LAPIC: Tune lapic_timer_advance_ns automatically

2018-10-07 Thread Liran Alon



> On 28 Sep 2018, at 9:12, Wanpeng Li  wrote:
> 
> From: Wanpeng Li 
> 
> In cloud environment, lapic_timer_advance_ns is needed to be tuned for every 
> CPU 
> generations, and every host kernel versions(the 
> kvm-unit-tests/tscdeadline_latency.flat 
> is 5700 cycles for upstream kernel and 9600 cycles for our 3.10 product 
> kernel, 
> both preemption_timer=N, Skylake server).
> 
> This patch adds the capability to automatically tune lapic_timer_advance_ns
> step by step, the initial value is 1000ns as d0659d946be05 (KVM: x86: add 
> option to advance tscdeadline hrtimer expiration) recommended, it will be 
> reduced when it is too early, and increased when it is too late. The 
> guest_tsc 
> and tsc_deadline are hard to equal, so we assume we are done when the delta 
> is within a small scope e.g. 100 cycles. This patch reduces latency 
> (kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer enabled)
> from ~2600 cyles to ~1200 cyles on our Skylake server.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
> arch/x86/kvm/lapic.c | 7 +++
> arch/x86/kvm/x86.c   | 2 +-
> 2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fbb0e6d..b756f12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -70,6 +70,8 @@
> #define APIC_BROADCAST0xFF
> #define X2APIC_BROADCAST  0xul
> 
> +static bool __read_mostly lapic_timer_advance_adjust_done = false;
> +
> static inline int apic_test_vector(int vec, void *bitmap)
> {
>   return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
> @@ -1492,6 +1494,11 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>   if (guest_tsc < tsc_deadline)
>   __delay(min(tsc_deadline - guest_tsc,
>   nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> + if (!lapic_timer_advance_adjust_done) {
> + lapic_timer_advance_ns += (s64)(guest_tsc - tsc_deadline) / 8;

I don’t understand how this “/ 8” converts between guest TSC units and host
nanoseconds.

I think that instead you should do something like:
s64 ns = (s64)(guest_tsc - tsc_deadline) * 1000000ULL;
do_div(ns, vcpu->arch.virtual_tsc_khz);
lapic_timer_advance_ns += ns;

> + if (abs(guest_tsc - tsc_deadline) < 100)

I would put this “100” hard-coded value as some “#define” to make code more 
clear.
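
Putting the two suggestions together, one possible shape of the adjustment step is the
sketch below. The constants and the function name are illustrative rather than taken
from any final patch, and since do_div() is an unsigned division the sign of the error
has to be handled explicitly:

#define LAPIC_TIMER_ADVANCE_ADJUST_STEP	8	/* damping factor per adjustment */
#define LAPIC_TIMER_ADVANCE_ADJUST_DONE	100	/* "close enough" window, in TSC cycles */

static void example_adjust_timer_advance(struct kvm_vcpu *vcpu,
					 u64 guest_tsc, u64 tsc_deadline)
{
	s64 delta = (s64)(guest_tsc - tsc_deadline);	/* > 0: interrupt fired late */
	u64 ns = abs(delta) * 1000000ULL;		/* TSC cycles -> nanoseconds */

	do_div(ns, vcpu->arch.virtual_tsc_khz);
	ns /= LAPIC_TIMER_ADVANCE_ADJUST_STEP;

	if (delta > 0)
		lapic_timer_advance_ns += ns;
	else
		lapic_timer_advance_ns -= min_t(u64, ns, lapic_timer_advance_ns);

	if (abs(delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
		lapic_timer_advance_adjust_done = true;
}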

> + lapic_timer_advance_adjust_done = true;
> + }
> }
> 
> static void start_sw_tscdeadline(struct kvm_lapic *apic)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index edbf00e..e865d12 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -136,7 +136,7 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> 
> /* lapic timer advance (tscdeadline mode only) in nanoseconds */
> -unsigned int __read_mostly lapic_timer_advance_ns = 0;
> +unsigned int lapic_timer_advance_ns = 1000;
> module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
> 
> -- 
> 2.7.4
> 



Re: [PATCH] KVM: LAPIC: Fix pv ipis out-of-bounds access

2018-08-29 Thread Liran Alon



> On 29 Aug 2018, at 13:29, Dan Carpenter  wrote:
> 
> On Wed, Aug 29, 2018 at 06:23:08PM +0800, Wanpeng Li wrote:
>> On Wed, 29 Aug 2018 at 18:18, Dan Carpenter  wrote:
>>> 
>>> On Wed, Aug 29, 2018 at 01:12:05PM +0300, Dan Carpenter wrote:
>>>> On Wed, Aug 29, 2018 at 12:05:06PM +0300, Liran Alon wrote:
>>>>>> arch/x86/kvm/lapic.c | 17 +
>>>>>> 1 file changed, 13 insertions(+), 4 deletions(-)
>>>>>> 
>>>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>>>> index 0cefba2..86e933c 100644
>>>>>> --- a/arch/x86/kvm/lapic.c
>>>>>> +++ b/arch/x86/kvm/lapic.c
>>>>>> @@ -571,18 +571,27 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long 
>>>>>> ipi_bitmap_low,
>>>>>>  rcu_read_lock();
>>>>>>  map = rcu_dereference(kvm->arch.apic_map);
>>>>>> 
>>>>>> + if (unlikely((s32)(map->max_apic_id - __fls(ipi_bitmap_low)) < min))
>>>>>> + goto out;
>>>>> 
>>>>> I personally think “if ((min + __fls(ipi_bitmap_low)) > 
>>>>> map->max_apic_id)” is more readable.
>>>>> But that’s just a matter of taste :)
>>>> 
>>>> That's an integer overflow.
>>>> 
>>>> But I do prefer to put the variable on the left.  The truth is that some
>>>> Smatch checks just ignore code which is backwards written because
>>>> otherwise you have to write duplicate code and the most code is written
>>>> with the variable on the left.
>>>> 
>>>>  if (min > (s32)(map->max_apic_id - __fls(ipi_bitmap_low))
>>> 
>>> Wait, the (s32) cast doesn't make sense.  We want negative min values to
>>> be treated as invalid.
>> 
>> In v2, how about:
>> 
>> if (unlikely(min > map->max_apic_id || (min + __fls(ipi_bitmap_low)) >
>> map->max_apic_id))
>>goto out;
> 
> That works, too.  It still has the off by one and we should set
> "count = -KVM_EINVAL;".
> 
> Is the unlikely() really required?  I don't know what the fast paths are
> in KVM, so I don't know.
> 
> regards,
> dan carpenter

Why is “min” defined as “int” instead of “unsigned int”?
It represents the lowest APIC ID in bitmap so it can’t be negative…

"if (unlikely(min > map->max_apic_id || (min + __fls(ipi_bitmap_low)) > 
map->max_apic_id))”
should indeed be ok.
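
Putting the pieces of this discussion together, the low-bitmap check could end up looking roughly like this (untested sketch; it also assumes "min" is made unsigned so negative values cannot slip past the comparison):

	rcu_read_lock();
	map = rcu_dereference(kvm->arch.apic_map);

	if (unlikely(min > map->max_apic_id ||
		     min + __fls(ipi_bitmap_low) > map->max_apic_id)) {
		count = -KVM_EINVAL;
		goto out;
	}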

-Liran




Re: [PATCH] KVM: LAPIC: Fix pv ipis out-of-bounds access

2018-08-29 Thread Liran Alon



> On 29 Aug 2018, at 8:52, Wanpeng Li  wrote:
> 
> From: Wanpeng Li 
> 
> Dan Carpenter reported that the untrusted data returns from 
> kvm_register_read()
> results in the following static checker warning:
>  arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
>  error: buffer underflow 'map->phys_map' 's32min-s32max'
> 
> KVM guest can easily trigger this by executing the following assembly 
> sequence 
> in Ring0:
> 
> mov $10, %rax
> mov $0xffffffff, %rbx
> mov $0xffffffff, %rdx
> mov $0, %rsi
> vmcall
> 
> As this will cause KVM to execute the following code-path:
> vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> 
> kvm_pv_send_ipi()
> which will reach out-of-bounds access.
> 
> This patch fixes it by adding a check to kvm_pv_send_ipi() against 
> map->max_apic_id 
> and also checking whether or not map->phys_map[min + i] is NULL, since 
> max_apic_id 
> is set according to the max APIC ID; however, some phys_map entries may be 
> NULL when the APIC ID space is sparse. In addition, kvm also unconditionally 
> sets max_apic_id to 255 to reserve enough space for any xAPIC ID.
> 
> Reported-by: Dan Carpenter 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Liran Alon 
> Cc: Dan Carpenter 
> Signed-off-by: Wanpeng Li 
> ---
> arch/x86/kvm/lapic.c | 17 +
> 1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0cefba2..86e933c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -571,18 +571,27 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long 
> ipi_bitmap_low,
>   rcu_read_lock();
>   map = rcu_dereference(kvm->arch.apic_map);
> 
> + if (unlikely((s32)(map->max_apic_id - __fls(ipi_bitmap_low)) < min))
> + goto out;

I personally think “if ((min + __fls(ipi_bitmap_low)) > map->max_apic_id)” is 
more readable.
But that’s just a matter of taste :)

>   /* Bits above cluster_size are masked in the caller.  */
>   for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) {
> - vcpu = map->phys_map[min + i]->vcpu;
> - count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + if (map->phys_map[min + i]) {
> + vcpu = map->phys_map[min + i]->vcpu;
> + count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + }
>   }
> 
>   min += cluster_size;
> + if (unlikely((s32)(map->max_apic_id - __fls(ipi_bitmap_high)) < min))
> + goto out;
>   for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) {
> - vcpu = map->phys_map[min + i]->vcpu;
> - count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + if (map->phys_map[min + i]) {
> + vcpu = map->phys_map[min + i]->vcpu;
> + count += kvm_apic_set_irq(vcpu, &irq, NULL);
> + }
>   }
> 
> +out:
>   rcu_read_unlock();
>   return count;
> }
> -- 
> 2.7.4
> 

Reviewed-By: Liran Alon 




Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread Liran Alon



> On 21 Aug 2018, at 17:22, David Woodhouse  wrote:
> 
> On Tue, 2018-08-21 at 17:01 +0300, Liran Alon wrote:
>> 
>>> On 21 Aug 2018, at 12:57, David Woodhouse 
>> wrote:
>>>  
>>> Another alternative... I'm told POWER8 does an interesting thing
>> with
>>> hyperthreading and gang scheduling for KVM. The host kernel doesn't
>>> actually *see* the hyperthreads at all, and KVM just launches the
>> full
>>> set of siblings when it enters a guest, and gathers them again when
>> any
>>> of them exits. That's definitely worth investigating as an option
>> for
>>> x86, too.
>> 
>> I actually think that such scheduling mechanism which prevents
>> leaking cache entries to sibling hyperthreads should co-exist
>> together with the KVM address space isolation to fully mitigate L1TF
>> and other similar vulnerabilities. The address space isolation should
>> prevent VMExit handlers code gadgets from loading arbitrary host
>> memory to the cache. Once VMExit code path switches to full host
>> address space, then we should also make sure that no other sibling
>> hyperthread is running in the guest.
> 
> The KVM POWER8 solution (see arch/powerpc/kvm/book3s_hv.c) does that.
> The siblings are *never* running host kernel code; they're all torn
> down when any of them exits the guest. And it's always the *same*
> guest.
> 

I wasn’t aware of this KVM Power8 mechanism. Thanks for the pointer.
(371fefd6f2dc ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes”))

Note though that my point regarding the co-existence of the isolated address 
space together with such scheduling mechanism is still valid.
The scheduling mechanism should not be seen as an alternative to the isolated 
address space if we wish to reduce the frequency of events
in which we need to kick sibling hyperthreads from guest.

>> Focusing on the scheduling mechanism, we must make sure that when a
>> logical processor runs guest code, all siblings logical processors
>> must run code which do not populate L1D cache with information
>> unrelated to this VM. This includes forbidding one logical processor
>> to run guest code while sibling is running a host task such as a NIC
>> interrupt handler.
>> Thus, when a vCPU thread exits the guest into the host and VMExit
>> handler reaches code flow which could populate L1D cache with this
>> information, we should force an exit from the guest of the siblings
>> logical processors, such that they will be allowed to resume only on
>> a core which we can promise that the L1D cache is free from
>> information unrelated to this VM.
>> 
>> At first, I have created a patch series which attempts to implement
>> such mechanism in KVM. However, it became clear to me that this may
>> need to be implemented in the scheduler itself. This is because:
>> 1. It is difficult to handle all new scheduling constraints only in
>> KVM.
>> 2. This mechanism should be relevant for any Type-2 hypervisor which
>> runs inside Linux besides KVM (Such as VMware Workstation or
>> VirtualBox).
>> 3. This mechanism could also be used to prevent future “core-cache-
>> leaking” vulnerabilities to be exploited between processes of
>> different security domains which run as siblings on the same core.
> 
> I'm not sure I agree. If KVM is handling "only let siblings run the
> *same* guest" and the siblings aren't visible to the host at all,
> that's quite simple. Any other hypervisor can also do it.
> 
> Now, the down-side of this is that the siblings aren't visible to the
> host. They can't be used to run multiple threads of the same userspace
> processes; only multiple threads of the same KVM guest. A truly generic
> core scheduler would cope with userspace threads too.
> 
> BUT I strongly suspect there's a huge correlation between the set of
> people who care enough about the KVM/L1TF issue to enable a costly
> XFPO-like solution, and the set of people who mostly don't give a shit
> about having sibling CPUs available to run the host's userspace anyway.
> 
> This is not the "I happen to run a Windows VM on my Linux desktop" use
> case...

If I understand your proposal correctly, you suggest doing something similar to 
the KVM Power8 solution:
1. Disable HyperThreading for use by host user space.
2. Use sibling hyperthreads only in KVM and schedule group of vCPUs that run on 
a single core as a “gang” to enter and exit guest together.

This solution may work well for KVM-based cloud providers that match the 
following criteria:
1. All compute instances run with SR-IOV and IOMMU Posted-Interrupts.
2. Configure affinity such that host dedicat

Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

2018-08-21 Thread Liran Alon


> On 21 Aug 2018, at 12:57, David Woodhouse  wrote:
> 
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such scheduling mechanism which prevents leaking cache 
entries to sibling hyperthreads should co-exist together with the KVM address 
space isolation to fully mitigate L1TF and other similar vulnerabilities. The 
address space isolation should prevent VMExit handlers code gadgets from 
loading arbitrary host memory to the cache. Once VMExit code path switches to 
full host address space, then we should also make sure that no other sibling 
hyperthread is running in the guest.

Focusing on the scheduling mechanism, we must make sure that when a logical 
processor runs guest code, all siblings logical processors must run code which 
do not populate L1D cache with information unrelated to this VM. This includes 
forbidding one logical processor to run guest code while sibling is running a 
host task such as a NIC interrupt handler.
Thus, when a vCPU thread exits the guest into the host and VMExit handler 
reaches code flow which could populate L1D cache with this information, we 
should force an exit from the guest of the siblings logical processors, such 
that they will be allowed to resume only on a core which we can promise that 
the L1D cache is free from information unrelated to this VM.

At first, I have created a patch series which attempts to implement such 
mechanism in KVM. However, it became clear to me that this may need to be 
implemented in the scheduler itself. This is because:
1. It is difficult to handle all new scheduling constraints only in KVM.
2. This mechanism should be relevant for any Type-2 hypervisor which runs 
inside Linux besides KVM (Such as VMware Workstation or VirtualBox).
3. This mechanism could also be used to prevent future “core-cache-leaking” 
vulnerabilities to be exploited between processes of different security domains 
which run as siblings on the same core.

The main idea is a mechanism which is very similar to Microsoft's "core 
scheduler" which they implemented to mitigate this vulnerability. The mechanism 
should work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security domain 
id.
3. Tasks will inherit their security domain id from their parent task.
3.1. First task in system will have security domain id of 0. Thus, if 
nothing special is done, all tasks will be assigned with security domain id of 
0.
4. Tasks will be able to allocate a new security domain id from the scheduler 
and assign it to another task dynamically.
5. Linux scheduler will prevent scheduling tasks on a core with a different 
security domain id:
5.0. CPU core security domain id will be set to the security domain id of 
the tasks which currently run on it.
5.1. The scheduler will attempt to first schedule a task on a core with 
required security domain id if such exists.
5.2. Otherwise, will need to decide if it wishes to kick all tasks running 
on some core to run the task with a different security domain id on that core.

The above mechanism can be used to mitigate the L1TF HT variant by just 
assigning vCPU tasks with a security domain id which is unique per VM and also 
different than the security domain id of the host which is 0.
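
To make the intended policy concrete, here is a tiny standalone C model of steps 1-5 above (plain userspace C; every name in it is invented for illustration and nothing below exists in the kernel today):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct task { uint32_t secid; };              /* step 2/3: per-task security domain id */
struct core { uint32_t secid; int ntasks; };  /* step 1: per-core security domain id   */

/* step 5: a task may run on a core only if the domains match,
 * or if the core is idle and can simply adopt the task's domain. */
static bool can_place(struct core *c, const struct task *t)
{
	if (c->ntasks == 0) {
		c->secid = t->secid;
		return true;
	}
	return c->secid == t->secid;
}

int main(void)
{
	struct core core0 = { .secid = 0, .ntasks = 1 };  /* host task already running        */
	struct task vcpu  = { .secid = 42 };              /* vCPU tagged with its VM's domain */

	printf("vCPU on a busy host core: %s\n",
	       can_place(&core0, &vcpu) ? "allowed" : "refused");
	return 0;
}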

I would be glad to hear feedback on the above suggestion.
If this should better be discussed on a separate email thread, please say so 
and I will open a new thread.

Thanks,
-Liran




[PATCH] net: net_failover: fix typo in net_failover_slave_register()

2018-06-18 Thread Liran Alon
Sync both unicast and multicast lists instead of unicast twice.

Fixes: cfc80d9a116 ("net: Introduce net_failover driver")
Reviewed-by: Joao Martins 
Signed-off-by: Liran Alon 
---
 drivers/net/net_failover.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 83f7420ddea5..4f390fa557e4 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -527,7 +527,7 @@ static int net_failover_slave_register(struct net_device 
*slave_dev,
 
netif_addr_lock_bh(failover_dev);
dev_uc_sync_multiple(slave_dev, failover_dev);
-   dev_uc_sync_multiple(slave_dev, failover_dev);
+   dev_mc_sync_multiple(slave_dev, failover_dev);
netif_addr_unlock_bh(failover_dev);
 
err = vlan_vids_add_by_dev(slave_dev, failover_dev);
-- 
1.9.1



Re: [PATCH 1/5] KVM: hyperv: define VP assist page helpers

2018-06-14 Thread Liran Alon
 u64 data);
> +int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned
> long len);
>  void kvm_lapic_init(void);
>  void kvm_lapic_exit(void);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 06dd4cdb2ca8..a57766b940a5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2442,7 +2442,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
>  
>   break;
>   case MSR_KVM_PV_EOI_EN:
> - if (kvm_lapic_enable_pv_eoi(vcpu, data))
> + if (kvm_lapic_enable_pv_eoi(vcpu, data, sizeof(u8)))
>   return 1;
>   break;
>  
> -- 
> 2.14.4

Reviewed-By: Liran Alon 


Re: [PATCH 5/5] KVM: nVMX: optimize prepare_vmcs02{,_full} for Enlightened VMCS case

2018-06-14 Thread Liran Alon


- vkuzn...@redhat.com wrote:

> When Enlightened VMCS is in use by L1 hypervisor we can avoid
> vmwriting
> VMCS fields which did not change.
> 
> Our first goal is to achieve minimal impact on traditional VMCS case
> so
> we're not wrapping each vmwrite() with an if-changed checker. We also
> can't
> utilize static keys as Enlightened VMCS usage is per-guest.
> 
> This patch implements the simplest solution: checking fields in
> groups.
> We skip single vmwrite() statements as doing the check will cost us
> something even in non-evmcs case and the win is tiny. Unfortunately,
> this
> makes prepare_vmcs02{,_full}() code Enlightened VMCS-dependent
> (and
> a bit ugly).
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/vmx.c | 143
> ++---
>  1 file changed, 82 insertions(+), 61 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6802ba91468c..9a7d76c5c92b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11619,50 +11619,79 @@ static int nested_vmx_load_cr3(struct
> kvm_vcpu *vcpu, unsigned long cr3, bool ne
>   return 0;
>  }
>  
> +/*
> + * Check if L1 hypervisor changed the particular field in
> Enlightened
> + * VMCS and avoid redundant vmwrite if it didn't. Can only be used
> when
> + * the value we're about to write is unchanged vmcs12->field.
> + */
> +#define evmcs_needs_write(vmx, clean_field)
> ((vmx)->nested.dirty_vmcs12 ||\
> + !(vmx->nested.hv_evmcs->hv_clean_fields &\
> +   HV_VMX_ENLIGHTENED_CLEAN_FIELD_##clean_field))

Why declare this as a macro instead of a small static inline function?
Just to shorten the name of the clean-field constant?
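
For comparison, the static inline variant would look roughly like this (untested sketch; callers would then pass the full HV_VMX_ENLIGHTENED_CLEAN_FIELD_* constant themselves):

static inline bool evmcs_needs_write(struct vcpu_vmx *vmx, u32 clean_field)
{
	return vmx->nested.dirty_vmcs12 ||
	       !(vmx->nested.hv_evmcs->hv_clean_fields & clean_field);
}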

> +
>  static void prepare_vmcs02_full(struct kvm_vcpu *vcpu, struct vmcs12
> *vmcs12)
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs;
> +
> + if (!hv_evmcs || evmcs_needs_write(vmx, GUEST_GRP2)) {
> + vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
> + vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> + vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
> + vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
> + vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
> + vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
> + vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
> + vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
> + vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
> + vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
> + vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
> + vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
> + vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
> + vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
> + vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
> + vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
> + vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
> + vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
> + vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
> + vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
> + vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
> + vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
> + vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
> + vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
> + vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
> + vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
> + vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
> + vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
> + vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
> + vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
> + vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
> + }
> +
> + if (!hv_evmcs || evmcs_needs_write(vmx, GUEST_GRP1)) {
> + vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
> + vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> + vmcs12->guest_pending_dbg_exceptions);
> + vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
> + vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
> +
> + if (vmx_mpx_supported())
> + vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>  
> - vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
> - vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> - vmcs_write16(GUEST_DS_SELECTOR, 

Re: [PATCH 3/5] KVM: nVMX: add enlightened VMCS state

2018-06-14 Thread Liran Alon


- vkuzn...@redhat.com wrote:

> Adds hv_evmcs pointer and implements copy_enlightened_to_vmcs12() and
> copy_vmcs12_to_enlightened().
> 
> prepare_vmcs02()/prepare_vmcs02_full() separation is not valid for
> Enlightened VMCS, do full sync for now.
> 
> Suggested-by: Ladi Prosek 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/vmx.c | 431
> +++--
>  1 file changed, 417 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 51749207cef1..e7fa9f9c6e36 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -640,10 +640,10 @@ struct nested_vmx {
>*/
>   struct vmcs12 *cached_vmcs12;
>   /*
> -  * Indicates if the shadow vmcs must be updated with the
> -  * data hold by vmcs12
> +  * Indicates if the shadow vmcs or enlightened vmcs must be updated
> +  * with the data held by struct vmcs12.
>*/
> - bool sync_shadow_vmcs;
> + bool need_vmcs12_sync;
>   bool dirty_vmcs12;
>  
>   bool change_vmcs01_virtual_apic_mode;
> @@ -689,6 +689,8 @@ struct nested_vmx {
>   /* in guest mode on SMM entry? */
>   bool guest_mode;
>   } smm;
> +
> + struct hv_enlightened_vmcs *hv_evmcs;
>  };
>  
>  #define POSTED_INTR_ON  0
> @@ -8010,7 +8012,7 @@ static inline void nested_release_vmcs12(struct
> vcpu_vmx *vmx)
>   /* copy to memory all shadowed fields in case
>  they were modified */
>   copy_shadow_to_vmcs12(vmx);
> - vmx->nested.sync_shadow_vmcs = false;
> + vmx->nested.need_vmcs12_sync = false;
>   vmx_disable_shadow_vmcs(vmx);
>   }
>   vmx->nested.posted_intr_nv = -1;
> @@ -8187,6 +8189,393 @@ static inline int vmcs12_write_any(struct
> kvm_vcpu *vcpu,
>  
>  }
>  
> +static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, bool
> full)
> +{
> + struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12;
> + struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs;
> +
> + /* HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE */
> + vmcs12->tpr_threshold = evmcs->tpr_threshold;
> + vmcs12->guest_rip = evmcs->guest_rip;
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC))) {
> + vmcs12->guest_rsp = evmcs->guest_rsp;
> + vmcs12->guest_rflags = evmcs->guest_rflags;
> + vmcs12->guest_interruptibility_info =
> + evmcs->guest_interruptibility_info;
> + }
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) {
> + vmcs12->cpu_based_vm_exec_control =
> + evmcs->cpu_based_vm_exec_control;
> + }
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) {
> + vmcs12->exception_bitmap = evmcs->exception_bitmap;
> + }
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY))) {
> + vmcs12->vm_entry_controls = evmcs->vm_entry_controls;
> + }
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT))) {
> + vmcs12->vm_entry_intr_info_field =
> + evmcs->vm_entry_intr_info_field;
> + vmcs12->vm_entry_exception_error_code =
> + evmcs->vm_entry_exception_error_code;
> + vmcs12->vm_entry_instruction_len =
> + evmcs->vm_entry_instruction_len;
> + }
> +
> + if (unlikely(full || !(evmcs->hv_clean_fields &
> +   HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1))) {
> + vmcs12->host_ia32_pat = evmcs->host_ia32_pat;
> + vmcs12->host_ia32_efer = evmcs->host_ia32_efer;
> + vmcs12->host_cr0 = evmcs->host_cr0;
> + vmcs12->host_cr3 = evmcs->host_cr3;
> + vmcs12->host_cr4 = evmcs->host_cr4;
> + vmcs12->host_ia32_sysenter_esp = evmcs->host_ia32_sysenter_esp;
> + vmcs12->host_ia32_sysenter_eip = evmcs->host_ia32_sysenter_eip;
> + vmcs12->host_rip = evmcs->host_rip;
> + vmcs12->host_ia32_sysenter_cs = evmcs->host_ia32_sysenter_cs;
> + vmcs12->host_es_selector = evmcs->host_es_selector;
> + vmcs12->host_cs_selector = evmcs->host_cs_selector;
> + vmcs12->host_ss_selector = evmcs->host_ss_selector;
> + vmcs12->host_ds_selector = evmcs->host_ds_selector;
> + vmcs12->host_fs_selector = evmcs->host_fs_selector;
> + vmcs12->host_gs_selector = evmcs->host_gs_selector;
> + vmcs12->host_tr_selector = evmcs->host_tr_selector;
> + }
> +
> + if 

Re: [PATCH 4/5] KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR

2018-06-14 Thread Liran Alon
(vmx->nested.current_vmptr != vmptr) {
>   struct vmcs12 *new_vmcs12;
>   struct page *page;
> @@ -8847,6 +8876,55 @@ static int handle_vmptrld(struct kvm_vcpu
> *vcpu)
>   return kvm_skip_emulated_instruction(vcpu);
>  }
>  
> +/*
> + * This is an equivalent of the nested hypervisor executing the
> vmptrld
> + * instruction.
> + */
> +static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu
> *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct hv_vp_assist_page assist_page;
> +
> + if (likely(!vmx->nested.enlightened_vmcs_enabled))
> + return 1;
> +
> + if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
> + return 1;
> +
> + if (unlikely(!assist_page.enlighten_vmentry))
> + return 1;
> +
> + if (unlikely(assist_page.current_nested_vmcs !=
> +  vmx->nested.hv_evmcs_vmptr)) {
> +
> + if (!vmx->nested.hv_evmcs)
> + vmx->nested.current_vmptr = -1ull;
> +
> + nested_release_evmcs(vmx);
> +
> + vmx->nested.hv_evmcs_page = kvm_vcpu_gpa_to_page(
> + vcpu, assist_page.current_nested_vmcs);
> +
> + if (unlikely(is_error_page(vmx->nested.hv_evmcs_page)))
> + return 0;
> +
> + vmx->nested.hv_evmcs = kmap(vmx->nested.hv_evmcs_page);
> + vmx->nested.dirty_vmcs12 = true;
> + vmx->nested.hv_evmcs_vmptr = assist_page.current_nested_vmcs;
> +
> + /*
> +  * Unlike normal vmcs12, enlightened vmcs12 is not fully
> +  * reloaded from guest's memory (read only fields, fields not
> +  * present in struct hv_enlightened_vmcs, ...). Make sure there
> +  * are no leftovers.
> +  */
> + memset(vmx->nested.cached_vmcs12, 0,
> +sizeof(*vmx->nested.cached_vmcs12));
> +
> + }
> + return 1;
> +}
> +
>  /* Emulate the VMPTRST instruction */
>  static int handle_vmptrst(struct kvm_vcpu *vcpu)
>  {
> @@ -8858,6 +8936,9 @@ static int handle_vmptrst(struct kvm_vcpu
> *vcpu)
>   if (!nested_vmx_check_permission(vcpu))
>       return 1;
>  
> + if (unlikely(to_vmx(vcpu)->nested.hv_evmcs))
> + return 1;
> +
>   if (get_vmx_mem_address(vcpu, exit_qualification,
>   vmx_instruction_info, true, &vmcs_gva))
>   return 1;
> @@ -12148,7 +12229,10 @@ static int nested_vmx_run(struct kvm_vcpu
> *vcpu, bool launch)
>   if (!nested_vmx_check_permission(vcpu))
>   return 1;
>  
> - if (!nested_vmx_check_vmcs12(vcpu))
> + if (!nested_vmx_handle_enlightened_vmptrld(vcpu))
> + return 1;
> +
> + if (!vmx->nested.hv_evmcs && !nested_vmx_check_vmcs12(vcpu))
>   goto out;
>  
>   vmcs12 = get_vmcs12(vcpu);
> -- 
> 2.14.4

Reviewed-By: Liran Alon 


Re: [PATCH 2/5] KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability

2018-06-14 Thread Liran Alon
ature for simplicity. */
> + if (vmx->nested.enlightened_vmcs_enabled)
> + return 0;
> +
> + vmx->nested.enlightened_vmcs_enabled = true;
> + *vmcs_version = (KVM_EVMCS_VERSION << 8) | 1;

Please add a comment here explaining the "<< 8) | 1" part.
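
Something along these lines, for example (my wording; it assumes the intent is "supported version range", low byte = minimal version, high byte = maximum supported version):

	/*
	 * vmcs_version encodes the range of supported Enlightened VMCS
	 * versions: the low 8 bits hold the minimal version (1) and the
	 * high 8 bits hold the maximum supported version (KVM_EVMCS_VERSION).
	 */
	*vmcs_version = (KVM_EVMCS_VERSION << 8) | 1;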

> +
> + vmx->nested.msrs.pinbased_ctls_high &= ~EVMCS1_UNSUPPORTED_PINCTRL;
> + vmx->nested.msrs.entry_ctls_high &=
> ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
> + vmx->nested.msrs.exit_ctls_high &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
> + vmx->nested.msrs.secondary_ctls_high &=
> ~EVMCS1_UNSUPPORTED_2NDEXEC;
> + vmx->nested.msrs.vmfunc_controls &= ~EVMCS1_UNSUPPORTED_VMFUNC;
> +
> + return 0;
> +}
> +
>  static inline bool is_exception_n(u32 intr_info, u8 vector)
>  {
>   return (intr_info & (INTR_INFO_INTR_TYPE_MASK |
> INTR_INFO_VECTOR_MASK |
> @@ -13039,6 +13053,8 @@ static struct kvm_x86_ops vmx_x86_ops
> __ro_after_init = {
>   .pre_enter_smm = vmx_pre_enter_smm,
>   .pre_leave_smm = vmx_pre_leave_smm,
>   .enable_smi_window = enable_smi_window,
> +
> + .nested_enable_evmcs = nested_enable_evmcs,
>  };
>  
>  static int __init vmx_init(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a57766b940a5..51488019dec2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2873,6 +2873,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> *kvm, long ext)
>   case KVM_CAP_HYPERV_VP_INDEX:
>   case KVM_CAP_HYPERV_EVENTFD:
>   case KVM_CAP_HYPERV_TLBFLUSH:
> + case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
>   case KVM_CAP_PCI_SEGMENT:
>   case KVM_CAP_DEBUGREGS:
>   case KVM_CAP_X86_ROBUST_SINGLESTEP:
> @@ -3650,6 +3651,10 @@ static int kvm_set_guest_paused(struct kvm_vcpu
> *vcpu)
>  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>struct kvm_enable_cap *cap)
>  {
> + int r;
> + uint16_t vmcs_version;
> + void __user *user_ptr;
> +
>   if (cap->flags)
>   return -EINVAL;
>  
> @@ -3662,6 +3667,16 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> kvm_vcpu *vcpu,
>   return -EINVAL;
>   return kvm_hv_activate_synic(vcpu, cap->cap ==
>    KVM_CAP_HYPERV_SYNIC2);
> + case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
> + r = kvm_x86_ops->nested_enable_evmcs(vcpu, &vmcs_version);
> + if (!r) {
> + user_ptr = (void __user *)(uintptr_t)cap->args[0];
> + if (copy_to_user(user_ptr, &vmcs_version,
> +  sizeof(vmcs_version)))
> + r = -EFAULT;
> + }
> + return r;
> +
>   default:
>   return -EINVAL;
>   }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3b38e9..5c4b79c1af19 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -949,6 +949,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_GET_MSR_FEATURES 153
>  #define KVM_CAP_HYPERV_EVENTFD 154
>  #define KVM_CAP_HYPERV_TLBFLUSH 155
> +#define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 156
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> -- 
> 2.14.4

Besides above comments,
Reviewed-By: Liran Alon 


Re: [PATCH v2] KVM: X86: Fix CR3 reserve bits

2018-05-13 Thread Liran Alon

- kernel...@gmail.com wrote:

> From: Wanpeng Li <wanpen...@tencent.com>
> 
> MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4. 
> It should be checked when PCIDE bit is not set, however commit 
> 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on 
> its physical address width")' removes the bit 63 checking 
> unconditionally. This patch fixes it by checking bit 63 of CR3 
> when PCIDE bit is not set in CR4.
> 
> Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based on
> its physical address width)
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Radim Krčmář <rkrc...@redhat.com>
> Cc: Junaid Shahid <juna...@google.com>
> Cc: Liran Alon <liran.a...@oracle.com>
> Signed-off-by: Wanpeng Li <wanpen...@tencent.com>
> ---
> v1 -> v2:
>  * remove CR3_PCID_INVD in rsvd when PCIDE is 1 instead of 
>removing CR3_PCID_INVD in new_value
> 
>  arch/x86/kvm/emulate.c | 4 +++-
>  arch/x86/kvm/x86.c | 2 +-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index b3705ae..143b7ae 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4189,7 +4189,9 @@ static int check_cr_write(struct
> x86_emulate_ctxt *ctxt)
>   maxphyaddr = eax & 0xff;
>   else
>   maxphyaddr = 36;
> - rsvd = rsvd_bits(maxphyaddr, 62);
> + rsvd = rsvd_bits(maxphyaddr, 63);
> + if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_PCIDE)
> + rsvd &= ~CR3_PCID_INVD;
>   }
>  
>   if (new_val & rsvd)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 87e4805..9a90668 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -863,7 +863,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned
> long cr3)
>   }
>  
>   if (is_long_mode(vcpu) &&
> - (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 62)))
> + (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
>   return 1;
>   else if (is_pae(vcpu) && is_paging(vcpu) &&
>  !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
> -- 
> 2.7.4

Reviewed-by: Liran Alon <liran.a...@oracle.com>
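
To make the widened mask concrete, here is a small standalone sketch; rsvd_bits() below is a local stand-in equivalent to KVM's helper (it sets bits s..e inclusive), and a MAXPHYADDR of 46 is just an example value.

#include <stdint.h>
#include <stdio.h>

#define CR3_PCID_INVD (1ULL << 63)

/* Local equivalent of KVM's rsvd_bits(): bits s..e inclusive. */
static uint64_t rsvd_bits(int s, int e)
{
        return (~0ULL << s) & (~0ULL >> (63 - e));
}

int main(void)
{
        int maxphyaddr = 46;                    /* example CPU */
        uint64_t cr3 = CR3_PCID_INVD | 0x1000;  /* bit 63 set, PML4 at 4K */
        uint64_t rsvd = rsvd_bits(maxphyaddr, 63);

        /* CR4.PCIDE = 0: bit 63 is reserved, so this CR3 must #GP. */
        printf("PCIDE=0: %s\n", (cr3 & rsvd) ? "inject #GP" : "accept");

        /* CR4.PCIDE = 1: bit 63 is the no-flush hint, carved out of the mask. */
        rsvd &= ~CR3_PCID_INVD;
        printf("PCIDE=1: %s\n", (cr3 & rsvd) ? "inject #GP" : "accept");

        return 0;
}

With the previous rsvd_bits(maxphyaddr, 62) mask, the PCIDE=0 case above would wrongly print "accept" for a CR3 with bit 63 set, which is exactly the gap this patch closes.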


Re: [PATCH 1/2] KVM: X86: Fix CR3 reserve bits

2018-05-13 Thread Liran Alon

- kernel...@gmail.com wrote:

> 2018-05-13 16:28 GMT+08:00 Liran Alon <liran.a...@oracle.com>:
> >
> > - kernel...@gmail.com wrote:
> >
> >> 2018-05-13 15:53 GMT+08:00 Liran Alon <liran.a...@oracle.com>:
> >> >
> >> > - kernel...@gmail.com wrote:
> >> >
> >> >> From: Wanpeng Li <wanpen...@tencent.com>
> >> >>
> >> >> MSB of CR3 is a reserved bit if the PCIDE bit is not set in
> CR4.
> >> >> It should be checked when PCIDE bit is not set, however commit
> >> >> 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based
> on
> >> >> its physical address width")' removes the bit 63 checking
> >> >> unconditionally. This patch fixes it by checking bit 63 of CR3
> >> >> when PCIDE bit is not set in CR4.
> >> >>
> >> >> Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits
> based
> >> on
> >> >> its physical address width)
> >> >> Cc: Paolo Bonzini <pbonz...@redhat.com>
> >> >> Cc: Radim Krčmář <rkrc...@redhat.com>
> >> >> Cc: Junaid Shahid <juna...@google.com>
> >> >> Signed-off-by: Wanpeng Li <wanpen...@tencent.com>
> >> >> ---
> >> >>  arch/x86/kvm/emulate.c | 4 +++-
> >> >>  arch/x86/kvm/x86.c | 2 +-
> >> >>  2 files changed, 4 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> >> >> index b3705ae..b21f427 100644
> >> >> --- a/arch/x86/kvm/emulate.c
> >> >> +++ b/arch/x86/kvm/emulate.c
> >> >> @@ -4189,7 +4189,9 @@ static int check_cr_write(struct
> >> >> x86_emulate_ctxt *ctxt)
> >> >>   maxphyaddr = eax & 0xff;
> >> >>   else
> >> >>   maxphyaddr = 36;
> >> >> - rsvd = rsvd_bits(maxphyaddr, 62);
> >> >> + if (ctxt->ops->get_cr(ctxt, 4) &
> >> X86_CR4_PCIDE)
> >> >> + new_val &= ~CR3_PCID_INVD;
> >> >> + rsvd = rsvd_bits(maxphyaddr, 63);
> >> >
> >> > I would prefer instead to do this:
> >> > if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_PCIDE)
> >> > rsvd &= ~CR3_PCID_INVD;
> >> > It makes more sense as opposed to temporary removing the
> >> CR3_PCID_INVD bit from new_val.
> >>
> >> It tries the same way
> >>
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=c19986fea873f3c745122bf79013a872a190f212
> >> pointed out.
> >>
> >> Regards,
> >> Wanpeng Li
> >
> > Yes but there it makes sense as new CR3 value should not have bit 63
> set in vcpu->arch.cr3.
> 
> When X86_CR4_PCIDE == 0 and CR3 63 bit is set, a #GP is missing in
> your suggestion.
> 
> Regards,
> Wanpeng Li

Why?

I suggest the following change:
- rsvd = rsvd_bits(maxphyaddr, 62);
+ rsvd = rsvd_bits(maxphyaddr, 63);
+ if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_PCIDE)
+ rsvd &= ~CR3_PCID_INVD;

In this case, if PCIDE=0 then bit 63 is set in rsvd and therefore 
check_cr_write() will emulate_gp() as needed.
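
A standalone sketch of the two orderings (illustrative only, again with a local rsvd_bits() stand-in) shows they agree on all four PCIDE/bit-63 combinations, including the PCIDE=0, bit-63-set case that must raise #GP:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CR3_PCID_INVD (1ULL << 63)

static uint64_t rsvd_bits(int s, int e)
{
        return (~0ULL << s) & (~0ULL >> (63 - e));
}

/* v1 ordering: mask the hint bit out of the value being checked. */
static bool gp_v1(uint64_t new_val, bool pcide, int maxphyaddr)
{
        uint64_t rsvd = rsvd_bits(maxphyaddr, 63);

        if (pcide)
                new_val &= ~CR3_PCID_INVD;
        return (new_val & rsvd) != 0;
}

/* Suggested ordering: mask the hint bit out of the reserved-bit mask. */
static bool gp_suggested(uint64_t new_val, bool pcide, int maxphyaddr)
{
        uint64_t rsvd = rsvd_bits(maxphyaddr, 63);

        if (pcide)
                rsvd &= ~CR3_PCID_INVD;
        return (new_val & rsvd) != 0;
}

int main(void)
{
        uint64_t vals[] = { 0x1000, 0x1000 | CR3_PCID_INVD };

        for (int pcide = 0; pcide <= 1; pcide++)
                for (int bit63 = 0; bit63 <= 1; bit63++)
                        printf("PCIDE=%d bit63=%d: v1=%d suggested=%d\n",
                               pcide, bit63,
                               gp_v1(vals[bit63], pcide, 46),
                               gp_suggested(vals[bit63], pcide, 46));
        return 0;
}

The only behavioral difference is none; the v1 variant temporarily rewrites new_val while the suggested variant leaves it untouched, which is the readability point being argued above.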


Re: [PATCH 1/2] KVM: X86: Fix CR3 reserve bits

2018-05-13 Thread Liran Alon

- kernel...@gmail.com wrote:

> 2018-05-13 15:53 GMT+08:00 Liran Alon <liran.a...@oracle.com>:
> >
> > - kernel...@gmail.com wrote:
> >
> >> From: Wanpeng Li <wanpen...@tencent.com>
> >>
> >> MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4.
> >> It should be checked when PCIDE bit is not set, however commit
> >> 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on
> >> its physical address width")' removes the bit 63 checking
> >> unconditionally. This patch fixes it by checking bit 63 of CR3
> >> when PCIDE bit is not set in CR4.
> >>
> >> Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based
> on
> >> its physical address width)
> >> Cc: Paolo Bonzini <pbonz...@redhat.com>
> >> Cc: Radim Krčmář <rkrc...@redhat.com>
> >> Cc: Junaid Shahid <juna...@google.com>
> >> Signed-off-by: Wanpeng Li <wanpen...@tencent.com>
> >> ---
> >>  arch/x86/kvm/emulate.c | 4 +++-
> >>  arch/x86/kvm/x86.c | 2 +-
> >>  2 files changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> >> index b3705ae..b21f427 100644
> >> --- a/arch/x86/kvm/emulate.c
> >> +++ b/arch/x86/kvm/emulate.c
> >> @@ -4189,7 +4189,9 @@ static int check_cr_write(struct
> >> x86_emulate_ctxt *ctxt)
> >>   maxphyaddr = eax & 0xff;
> >>   else
> >>   maxphyaddr = 36;
> >> - rsvd = rsvd_bits(maxphyaddr, 62);
> >> + if (ctxt->ops->get_cr(ctxt, 4) &
> X86_CR4_PCIDE)
> >> + new_val &= ~CR3_PCID_INVD;
> >> + rsvd = rsvd_bits(maxphyaddr, 63);
> >
> > I would prefer instead to do this:
> > if (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_PCIDE)
> > rsvd &= ~CR3_PCID_INVD;
> > It makes more sense as opposed to temporary removing the
> CR3_PCID_INVD bit from new_val.
> 
> It tries the same way
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/commit/?id=c19986fea873f3c745122bf79013a872a190f212
> pointed out.
> 
> Regards,
> Wanpeng Li

Yes but there it makes sense as new CR3 value should not have bit 63 set in 
vcpu->arch.cr3.

> 
> >
> >>   }
> >>
> >>   if (new_val & rsvd)
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 87e4805..9a90668 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -863,7 +863,7 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu,
> unsigned
> >> long cr3)
> >>   }
> >>
> >>   if (is_long_mode(vcpu) &&
> >> - (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 62)))
> >> + (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
> >>   return 1;
> >>   else if (is_pae(vcpu) && is_paging(vcpu) &&
> >>  !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
> >> --
> >> 2.7.4

