Re: [PATCH] mm: sparse: Skip no-map regions in memblocks_present

2019-07-13 Thread Raslan, KarimAllah
On Fri, 2019-07-12 at 23:09 +, Wei Yang wrote:
> On Fri, Jul 12, 2019 at 10:51:31AM +0200, KarimAllah Ahmed wrote:
> > 
> > Do not mark regions that are marked with nomap as present; otherwise
> > these memblocks cause unnecessary allocation of metadata.
> > 
> > Cc: Andrew Morton 
> > Cc: Pavel Tatashin 
> > Cc: Oscar Salvador 
> > Cc: Michal Hocko 
> > Cc: Mike Rapoport 
> > Cc: Baoquan He 
> > Cc: Qian Cai 
> > Cc: Wei Yang 
> > Cc: Logan Gunthorpe 
> > Cc: linux...@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > mm/sparse.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> > 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index fd13166..33810b6 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -256,6 +256,10 @@ void __init memblocks_present(void)
> > struct memblock_region *reg;
> > 
> > for_each_memblock(memory, reg) {
> > +
> > +   if (memblock_is_nomap(reg))
> > +   continue;
> > +
> > memory_present(memblock_get_region_node(reg),
> >memblock_region_memory_base_pfn(reg),
> >memblock_region_memory_end_pfn(reg));
> 
> 
> The logic looks good, though I am not sure this would take effect, since the
> metadata is SECTION-size aligned while memblocks are not.
> 
> If I am correct, on arm64, we mark nomap memblock in map_mem()
> 
> memblock_mark_nomap(kernel_start, kernel_end - kernel_start);

The nomap marking is also done by the EFI code in ${src}/drivers/firmware/efi/arm-init.c

.. and hopefully in the future by this:
https://lkml.org/lkml/2019/7/12/126

So it is not strictly associated with map_mem().

How much memory ends up marked nomap is therefore highly platform-dependent.
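
To make the section-alignment point above concrete, a rough sketch that is not
part of the patch (region_sections() is a made-up name for illustration):
memory_present() works at SPARSEMEM section granularity, so skipping a nomap
memblock only avoids the per-section metadata when no mapped memory shares
those sections:

	/*
	 * Illustration only: count the sections a memblock region touches.
	 * Metadata is allocated per section, so a nomap region smaller than
	 * a section saves nothing if a mapped region overlaps the section.
	 */
	static unsigned long __init region_sections(struct memblock_region *reg)
	{
		unsigned long start = memblock_region_memory_base_pfn(reg) & PAGE_SECTION_MASK;
		unsigned long end = ALIGN(memblock_region_memory_end_pfn(reg), PAGES_PER_SECTION);

		return (end - start) / PAGES_PER_SECTION;
	}

Given that arm64 sections are much larger than the kernel image, the
kernel-text nomap alone would not change anything here; the EFI regions (and
our downstream users) are where it could matter.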

> 
> And kernel text area is less than 40M, if I am right. This means
> memblocks_present would still mark the section present. 
> 
> Would you mind showing how much memory range it is marked nomap?

We actually have some downstream patches that use this nomap flag for more
than the use-cases I described above, which would inflate the nomap regions
a bit :)

> 
> > 
> > -- 
> > 2.7.4
> 







Re: [PATCH] arm: Extend the check for RAM in /dev/mem

2019-07-12 Thread Raslan, KarimAllah
On Fri, 2019-07-12 at 16:34 +0100, Will Deacon wrote:
> On Fri, Jul 12, 2019 at 03:13:38PM +0000, Raslan, KarimAllah wrote:
> > 
> > On Fri, 2019-07-12 at 15:57 +0100, Will Deacon wrote:
> > > 
> > > On Fri, Jul 12, 2019 at 12:21:21AM +0200, KarimAllah Ahmed wrote:
> > > > 
> > > > 
> > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > index 3645f29..cdc3e8e 100644
> > > > --- a/arch/arm64/mm/mmu.c
> > > > +++ b/arch/arm64/mm/mmu.c
> > > > @@ -78,7 +78,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
> > > >  pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> > > >   unsigned long size, pgprot_t vma_prot)
> > > >  {
> > > > -   if (!pfn_valid(pfn))
> > > > +   if (!memblock_is_memory(__pfn_to_phys(pfn)))
> > > 
> > > This looks broken to me, since it will end up returning 'true' for nomap
> > > memory and we really don't want to map that using writeback attributes.
> > 
> > True, I will fix this by using memblock_is_map_memory instead. That said, do
> > you have any concerns about this approach in general?
> 
> If you do that, I don't understand why you need the patch at all given our
> implementation of pfn_valid() in arch/arm64/mm/init.c.

Oops! Right, I guess that would not work either.

Let me dig into a better way to do that.
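
(For reference, this is roughly what arm64's pfn_valid() looks like at the
moment — quoted from memory, so treat it as a sketch:

	int pfn_valid(unsigned long pfn)
	{
		phys_addr_t addr = pfn << PAGE_SHIFT;

		if ((addr >> PAGE_SHIFT) != pfn)
			return 0;
		return memblock_is_map_memory(addr);
	}

.. so switching the check to memblock_is_map_memory() would indeed just
re-implement pfn_valid() here.)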

> 
> Will







Re: [PATCH] arm: Extend the check for RAM in /dev/mem

2019-07-12 Thread Raslan, KarimAllah
On Fri, 2019-07-12 at 15:57 +0100, Will Deacon wrote:
> On Fri, Jul 12, 2019 at 12:21:21AM +0200, KarimAllah Ahmed wrote:
> > 
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 3645f29..cdc3e8e 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -78,7 +78,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
> >  pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> >   unsigned long size, pgprot_t vma_prot)
> >  {
> > -   if (!pfn_valid(pfn))
> > +   if (!memblock_is_memory(__pfn_to_phys(pfn)))
> 
> This looks broken to me, since it will end up returning 'true' for nomap
> memory and we really don't want to map that using writeback attributes.

True, I will fix this by using memblock_is_map_memory instead. That said, do
you have any concerns about this approach in general?

> 
> Will







Re: [PATCH] arm: Extend the check for RAM in /dev/mem

2019-07-12 Thread Raslan, KarimAllah
On Fri, 2019-07-12 at 09:56 +0100, Russell King - ARM Linux admin wrote:
> On Fri, Jul 12, 2019 at 02:58:18AM +0000, Raslan, KarimAllah wrote:
> > 
> > On Fri, 2019-07-12 at 08:06 +0530, Anshuman Khandual wrote:
> > > 
> > > 
> > > On 07/12/2019 03:51 AM, KarimAllah Ahmed wrote:
> > > > 
> > > > 
> > > > Some valid RAM can live outside kernel control (e.g. using mem= kernel
> > > > command-line). For these regions, pfn_valid would return "false" causing
> > > > system RAM to be mapped as uncached. Use memblock instead to identify 
> > > > RAM.
> > > 
> > > Once the remaining memory is outside of the kernel (as the admin would 
> > > have
> > > intended with mem= command line) what is the particular concern regarding
> > > the way those get mapped (cached or not) ? It is not to be used any way.
> > 
> > They can be used by user-space which might lead to them being used by the 
> > kernel. One use-case would be using them as guest memory for KVM as I 
> > detailed 
> > here:
> > 
> > https://lwn.net/Articles/778240/
> 
> From the 32-bit ARM point of view...
> 
> What if someone's already doing something similar with a non-coherent
> DSP and is relying on the current behaviour?  This change is a user
> visible behavioural change that could end up breaking userspace.
> 
> In other words, it isn't something we should rush into.

Yes, that makes sense. How about adding a command-line option for this new 
behavior instead? Would this be more reasonable?







Re: [PATCH] arm: Extend the check for RAM in /dev/mem

2019-07-11 Thread Raslan, KarimAllah
On Fri, 2019-07-12 at 08:06 +0530, Anshuman Khandual wrote:
> 
> On 07/12/2019 03:51 AM, KarimAllah Ahmed wrote:
> > 
> > Some valid RAM can live outside kernel control (e.g. using mem= kernel
> > command-line). For these regions, pfn_valid would return "false" causing
> > system RAM to be mapped as uncached. Use memblock instead to identify RAM.
> 
> Once the remaining memory is outside of the kernel (as the admin would have
> intended with mem= command line) what is the particular concern regarding
> the way those get mapped (cached or not) ? It is not to be used any way.

They can be used by user-space which might lead to them being used by the 
kernel. One use-case would be using them as guest memory for KVM as I detailed 
here:

https://lwn.net/Articles/778240/

> 
> > 
> > 
> > Cc: Russell King 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Mike Rapoport 
> > Cc: Andrew Morton 
> > Cc: Anders Roxell 
> > Cc: Enrico Weigelt 
> > Cc: Thomas Gleixner 
> > Cc: KarimAllah Ahmed 
> > Cc: Mark Rutland 
> > Cc: James Morse 
> > Cc: Anshuman Khandual 
> > Cc: Jun Yao 
> > Cc: Yu Zhao 
> > Cc: Robin Murphy 
> > Cc: Ard Biesheuvel 
> > Cc: linux-arm-ker...@lists.infradead.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/arm/mm/mmu.c   | 2 +-
> >  arch/arm64/mm/mmu.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
> > index 1aa2586..492774b 100644
> > --- a/arch/arm/mm/mmu.c
> > +++ b/arch/arm/mm/mmu.c
> > @@ -705,7 +705,7 @@ static void __init build_mem_type_table(void)
> >  pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> >   unsigned long size, pgprot_t vma_prot)
> >  {
> > -   if (!pfn_valid(pfn))
> > +   if (!memblock_is_memory(__pfn_to_phys(pfn)))
> > return pgprot_noncached(vma_prot);
> > else if (file->f_flags & O_SYNC)
> > return pgprot_writecombine(vma_prot);
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 3645f29..cdc3e8e 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -78,7 +78,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
> >  pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> >   unsigned long size, pgprot_t vma_prot)
> >  {
> > -   if (!pfn_valid(pfn))
> > +   if (!memblock_is_memory(__pfn_to_phys(pfn)))
> 
> pfn_valid() on arm64 checks if the memblock region is mapped i.e does it have
> a linear mapping or not. If a segment of RAM is outside linear mapping due to
> mem= directive and lacks a linear mapping then why should it be mapped 
> similarly
> like system RAM on this path ?

I actually struggled a bit here because there is really no *explicit*
documentation of what the expected behavior is, so for me it was open to
interpretation.

It seems that for you the deciding factor between cached and uncached is the
existence of a linear mapping. For me, however, the deciding factor is whether
it is RAM or not. I chose this interpretation because it helps in the KVM
scenario that I mentioned above :)








Re: cputime takes cstate into consideration

2019-06-26 Thread Raslan, KarimAllah
On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> 
> > 
> > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > still be ticking in the host once every second for housekeeping, right?
> > Wouldn't updating the mwait time once a second be enough here?
> 
> People are trying very hard to get rid of that remnant tick. Lets not
> add dependencies to it.
> 
> IMO this is a really stupid issue, 100% time is correct if the guest
> does idle in pinned vcpu mode.

One use case for proper accounting (obviously for a slightly relaxed definition
of *proper*) is *external* monitoring of CPU utilization for scaling groups
(i.e. more VMs will be launched when you reach a certain CPU utilization).
These external monitoring tools need to account for CPU utilization properly.







Re: cputime takes cstate into consideration

2019-06-26 Thread Raslan, KarimAllah
On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > 
> > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > 
> > > After exposing mwait/monitor into kvm guest, the guest can make
> > > physical cpu enter deeper cstate through mwait instruction, however,
> > > the top command on host still observe 100% cpu utilization since qemu
> > > process is running even though guest who has the power management
> > > capability executes mwait. Actually we can observe the physical cpu
> > > has already enter deeper cstate by powertop on host. Could we take
> > > cstate into consideration when accounting cputime etc?
> > 
> > If MWAIT can be used inside the guest then the host cannot distinguish
> > between execution and stuck in mwait.
> > 
> > It'd need to poll the power monitoring MSRs on every occasion where the
> > accounting happens.
> > 
> > This completely falls apart when you have zero exit guest. (think
> > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > the per CPU MSRs.
> > 
> > I assume a lot of people will be happy about all that :)
> 
> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> counters (in the host) to sample the guest and construct a better
> accounting idea of what the guest does. That way the dashboard
> from the host would not show 100% CPU utilization.

You can either use the UNHALTED-cycles perf counter or the MPERF/APERF MSRs
for that. (Sorry, I got distracted and forgot to send the patch.)
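
A minimal sketch of the MPERF idea (illustrative only — the sampling loop and
how the result gets plumbed into the accounting are assumptions on my side):
MPERF only ticks in C0, so delta(MPERF)/delta(TSC) over an interval gives the
fraction of time the pCPU was actually executing rather than sitting in mwait:

	u64 mperf0, mperf1, tsc0, tsc1, busy_pct;

	rdmsrl(MSR_IA32_MPERF, mperf0);
	tsc0 = rdtsc();
	/* ... one sampling interval ... */
	rdmsrl(MSR_IA32_MPERF, mperf1);
	tsc1 = rdtsc();

	/* percentage of the interval spent in C0 (not halted/mwaiting) */
	busy_pct = div64_u64(100 * (mperf1 - mperf0), tsc1 - tsc0);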

> 
> But the patches that Marcelo posted (" cpuidle-haltpoll driver") in 
> "solves" the problem for Linux. That is the guest wants awesome latency and
> one way was to expose MWAIT to the guest, or just tweak the guest to do the
> idling a bit different.
> 
> Marcelo patches are all good for Linux, but Windows is still an issue.
> 
> Ankur, would you be OK sharing some of your ideas?
> > 
> > 
> > Thanks,
> > 
> > tglx
> > 







Re: cputime takes cstate into consideration

2019-06-26 Thread Raslan, KarimAllah
On Wed, 2019-06-26 at 20:41 +0200, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Konrad Rzeszutek Wilk wrote:
> > 
> > On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> > > 
> > > On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 
> > > > There were some ideas that Ankur (CC-ed) mentioned to me of using the 
> > > > perf
> > > > counters (in the host) to sample the guest and construct a better
> > > > accounting idea of what the guest does. That way the dashboard
> > > > from the host would not show 100% CPU utilization.
> > > 
> > > But then you generate extra noise and vmexits on those cpus, just to get
> > > this accounting sorted, which sounds like a bad trade.
> > 
> > Considering that the CPUs aren't doing anything and if you do say the 
> > IPIs "only" 100/second - that would be so small but give you a big benefit
> > in properly accounting the guests.
> 
> The host doesn't know what the guest CPUs are doing. And if you have a full
> zero exit setup and the guest is computing stuff or doing that network
> offloading thing then they will notice the 100/s vmexits and complain.

If the host is completely in nohz_full mode and the pCPU is dedicated to a
single vCPU/task (and the guest is 100% CPU bound and never exits), you would
still be ticking in the host once every second for housekeeping, right? Wouldn't
updating the mwait time once a second be enough here?

> 
> > 
> > But perhaps there are other ways too to "snoop" if a guest is sitting on
> > an MWAIT?
> 
> No idea.
> 
> Thanks,
> 
>   tglx
> 
> 







Re: [PATCH v2 1/1] PCI/IOV: Fix incorrect cfg_size for VF > 0

2019-06-12 Thread Raslan, KarimAllah
On Wed, 2019-06-12 at 12:03 -0700, Raj, Ashok wrote:
> On Wed, Jun 12, 2019 at 12:58:17PM -0600, Alex Williamson wrote:
> > 
> > On Wed, 12 Jun 2019 11:41:36 -0700
> > sathyanarayanan kuppuswamy 
> > wrote:
> > 
> > > 
> > > On 6/12/19 11:19 AM, Alex Williamson wrote:
> > > > 
> > > > On Wed, 12 Jun 2019 10:06:47 -0700
> > > > sathyanarayanan.kuppusw...@linux.intel.com wrote:
> > > >  
> > > > > 
> > > > > From: Kuppuswamy Sathyanarayanan 
> > > > > 
> > > > > 
> > > > > Commit 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for
> > > > > other VFs") calculates and caches the cfg_size for VF0 device before
> > > > > initializing the pcie_cap of the device which results in using 
> > > > > incorrect
> > > > > cfg_size for all VF devices > 0. So set pcie_cap of the device before
> > > > > calculating the cfg_size of VF0 device.
> > > > > 
> > > > > Fixes: 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for
> > > > > other VFs")
> > > > > Cc: Ashok Raj 
> > > > > Suggested-by: Mike Campin 
> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > > > > 
> > > > > ---
> > > > > 
> > > > > Changes since v1:
> > > > >   * Fixed a typo in commit message.
> > > > > 
> > > > >   drivers/pci/iov.c | 1 +
> > > > >   1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > index 3aa115ed3a65..2869011c0e35 100644
> > > > > --- a/drivers/pci/iov.c
> > > > > +++ b/drivers/pci/iov.c
> > > > > @@ -160,6 +160,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int 
> > > > > id)
> > > > >   virtfn->device = iov->vf_device;
> > > > >   virtfn->is_virtfn = 1;
> > > > >   virtfn->physfn = pci_dev_get(dev);
> > > > > + virtfn->pcie_cap = pci_find_capability(virtfn, PCI_CAP_ID_EXP);
> > > > >   
> > > > >   if (id == 0)
> > > > >   pci_read_vf_config_common(virtfn);  
> > > > Why not re-order until after we've setup pcie_cap?
> > > > 
> > > > https://lore.kernel.org/linux-pci/20190604143617.0a226...@x1.home/T/#  
> > > 
> > > pci_read_vf_config_common() also caches values for properties like 
> > > class, hdr_type, subsystem_vendor/device. These values are read/used in 
> > > pci_setup_device(). So if we can use cached values in 
> > > pci_setup_device(), we don't have to read them from registers twice for 
> > > each device.
> > 
> > Sorry, I missed that dependency, a bit too subtle.  It's still pretty
> > ugly that pci_setup_device()->set_pcie_port_type() is the canonical
> > location for setting pcie_cap and now we need to kludge it earlier.
> > What about the question in the self follow-up to my patch in the link
> > above, can we simply assume 4K config space on a VF?  Thanks,
> 
> There should be no issue simply reading them once? I don't know
> what that exact optimization saves, unless some broken VFs didn't
> actually expose all the capabilities in config space and this happens
> to work around the problem.

The original patch was meant to save time when you have hundreds of VFs in the
system; re-reading the same config registers for each one of them is just a
waste of time.
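
For context, the caching looks roughly like this (paraphrased from the commit
being fixed here, so field names may not be exact): VF0's common config
registers are read once and then reused for every other VF instead of hitting
config space again:

	static void pci_read_vf_config_common(struct pci_dev *virtfn)
	{
		struct pci_dev *physfn = virtfn->physfn;

		/* These registers are identical across all VFs of a PF. */
		pci_read_config_dword(virtfn, PCI_CLASS_REVISION,
				      &physfn->sriov->class);
		pci_read_config_byte(virtfn, PCI_HEADER_TYPE,
				     &physfn->sriov->hdr_type);
		pci_read_config_word(virtfn, PCI_SUBSYSTEM_VENDOR_ID,
				     &physfn->sriov->subsystem_vendor);
		pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
				     &physfn->sriov->subsystem_device);
	}

With hundreds of VFs, that avoids several config-space reads per VF during
enumeration.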

> 
> + Karim
> 
> Cheers,
> Ashok







Re: [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-31 Thread Raslan, KarimAllah
On Fri, 2019-05-31 at 11:06 +0200, Alexander Graf wrote:
> On 17.05.19 17:41, Sironi, Filippo wrote:
> > 
> > > 
> > > On 16. May 2019, at 15:50, Graf, Alexander  wrote:
> > > 
> > > On 14.05.19 08:16, Filippo Sironi wrote:
> > > > 
> > > > Start populating /sys/hypervisor with KVM entries when we're running on
> > > > KVM. This is to replicate functionality that's available when we're
> > > > running on Xen.
> > > > 
> > > > Start with /sys/hypervisor/uuid, which users prefer over
> > > > /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
> > > > machine, since it's also available when running on Xen HVM and on Xen PV
> > > > and, on top of that doesn't require root privileges by default.
> > > > Let's create arch-specific hooks so that different architectures can
> > > > provide different implementations.
> > > > 
> > > > Signed-off-by: Filippo Sironi 
> > > I think this needs something akin to
> > > 
> > >   https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-hypervisor-xen
> > > 
> > > to document which files are available.
> > > 
> > > > 
> > > > ---
> > > > v2:
> > > > * move the retrieval of the VM UUID out of uuid_show and into
> > > >   kvm_para_get_uuid, which is a weak function that can be overwritten
> > > > 
> > > > drivers/Kconfig  |  2 ++
> > > > drivers/Makefile |  2 ++
> > > > drivers/kvm/Kconfig  | 14 ++
> > > > drivers/kvm/Makefile |  1 +
> > > > drivers/kvm/sys-hypervisor.c | 30 ++
> > > > 5 files changed, 49 insertions(+)
> > > > create mode 100644 drivers/kvm/Kconfig
> > > > create mode 100644 drivers/kvm/Makefile
> > > > create mode 100644 drivers/kvm/sys-hypervisor.c
> > > > 
> > > [...]
> > > 
> > > > 
> > > > +
> > > > +__weak const char *kvm_para_get_uuid(void)
> > > > +{
> > > > +   return NULL;
> > > > +}
> > > > +
> > > > +static ssize_t uuid_show(struct kobject *obj,
> > > > +struct kobj_attribute *attr,
> > > > +char *buf)
> > > > +{
> > > > +   const char *uuid = kvm_para_get_uuid();
> > > > +   return sprintf(buf, "%s\n", uuid);
> > > The usual return value for the Xen /sys/hypervisor interface is
> > > "". Wouldn't it make sense to follow that pattern for the KVM
> > > one too? Currently, if we can not determine the UUID this will just
> > > return (null).
> > > 
> > > Otherwise, looks good to me. Are you aware of any other files we should
> > > provide? Also, is there any reason not to implement ARM as well while at 
> > > it?
> > > 
> > > Alex
> > This originated from a customer request that was using /sys/hypervisor/uuid.
> > My guess is that we would want to expose "type" and "version" moving
> > forward and that's when we hypervisor hooks will be useful on top
> > of arch hooks.
> > 
> > On a different note, any idea how to check whether the OS is running
> > virtualized on KVM on ARM and ARM64?  kvm_para_available() isn't an
> 
> 
> Yeah, ARM doesn't have any KVM PV FWIW. I also can't find any explicit 
> hint passed into guests that we are indeed running in KVM. The closest 
> thing I can see is the SMBIOS product identifier in QEMU which gets 
> patched to "KVM Virtual Machine". Maybe we'll have to do with that for 
> the sake of backwards compatibility ...

How about "psci_ops.conduit" (PSCI_CONDUIT_HVC vs PSCI_CONDUIT_SMC)?

> 
> 
> > 
> > option and the same is true for S390 where kvm_para_available()
> > always returns true and it would even if a KVM enabled kernel would
> > be running on bare metal.
> 
> 
> For s390, you can figure the topology out using the sthyi instruction. 
> I'm not sure if there is a nice in-kernel API to leverage that though. 
> In fact, kvm_para_available() probably should check sthyi output to 
> determine whether we really can use it, no? Christian?
> 
> 
> Alex
> 
> 
> > 
> > 
> > I think we will need another arch hook to call a function that says
> > whether the OS is running virtualized on KVM.
> > 
> > > 
> > > > 
> > > > +}
> > > > +
> > > > +static struct kobj_attribute uuid = __ATTR_RO(uuid);
> > > > +
> > > > +static int __init uuid_init(void)
> > > > +{
> > > > +   if (!kvm_para_available())
> > > > +   return 0;
> > > > +   return sysfs_create_file(hypervisor_kobj, &uuid.attr);
> > > > +}
> > > > +
> > > > +device_initcall(uuid_init);







Re: [PATCH] sched: introduce configurable delay before entering idle

2019-05-13 Thread Raslan, KarimAllah
On Mon, 2019-05-13 at 07:31 -0400, Konrad Rzeszutek Wilk wrote:
> On May 13, 2019 5:20:37 AM EDT, Wanpeng Li  wrote:
> > 
> > On Wed, 8 May 2019 at 02:57, Marcelo Tosatti 
> > wrote:
> > > 
> > > 
> > > 
> > > Certain workloads perform poorly on KVM compared to baremetal
> > > due to baremetal's ability to perform mwait on NEED_RESCHED
> > > bit of task flags (therefore skipping the IPI).
> > 
> > KVM supports expose mwait to the guest, if it can solve this?
> > 
> 
> 
> There is a bit of problem with that. The host will see 100% CPU utilization 
> even if the guest is idle and taking long naps..
> 
> Which depending on your dashboard can look like the machine is on fire.

This can also be fixed. I have a patch that exposes proper information about
the *real* utilization here, if that would help.

> 
> CCing Ankur and Boris
> 
> > 
> > Regards,
> > Wanpeng Li
> > 
> > > 
> > > 
> > > This patch introduces a configurable busy-wait delay before entering
> > the
> > > 
> > > architecture delay routine, allowing wakeup IPIs to be skipped
> > > (if the IPI happens in that window).
> > > 
> > > The real-life workload which this patch improves performance
> > > is SAP HANA (by 5-10%) (for which case setting idle_spin to 30
> > > is sufficient).
> > > 
> > > This patch improves the attached server.py and client.py example
> > > as follows:
> > > 
> > > Host:   31.814230202231556
> > > Guest:  38.1771876513   (83 %)
> > > Guest, idle_spin=50us:  33.31770989804  (95 %)
> > > Guest, idle_spin=220us: 32.2782655149   (98 %)
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  kernel/sched/idle.c |   86
> > ++
> > > 
> > >  1 file changed, 86 insertions(+)
> > > 
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index f5516bae0c1b..bca7656a7ea0 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -216,6 +216,29 @@ static void cpuidle_idle_call(void)
> > > rcu_idle_exit();
> > >  }
> > > 
> > > +static unsigned int spin_before_idle_us;
> > > 
> > > +static void do_spin_before_idle(void)
> > > +{
> > > +   ktime_t now, end_spin;
> > > +
> > > +   now = ktime_get();
> > > +   end_spin = ktime_add_ns(now, spin_before_idle_us*1000);
> > > +
> > > +   rcu_idle_enter();
> > > +   local_irq_enable();
> > > +   stop_critical_timings();
> > > +
> > > +   do {
> > > +   cpu_relax();
> > > +   now = ktime_get();
> > > +   } while (!tif_need_resched() && ktime_before(now, end_spin));
> > > +
> > > +   start_critical_timings();
> > > +   rcu_idle_exit();
> > > +   local_irq_disable();
> > > +}
> > > +
> > >  /*
> > >   * Generic idle loop implementation
> > >   *
> > > @@ -259,6 +282,8 @@ static void do_idle(void)
> > > tick_nohz_idle_restart_tick();
> > > cpu_idle_poll();
> > > } else {
> > > +   if (spin_before_idle_us)
> > > +   do_spin_before_idle();
> > > cpuidle_idle_call();
> > > }
> > > arch_cpu_idle_exit();
> > > @@ -465,3 +490,64 @@ const struct sched_class idle_sched_class = {
> > > .switched_to= switched_to_idle,
> > > .update_curr= update_curr_idle,
> > >  };
> > > +
> > > +
> > > +static ssize_t store_idle_spin(struct kobject *kobj,
> > > +  struct kobj_attribute *attr,
> > > +  const char *buf, size_t count)
> > > +{
> > > +   unsigned int val;
> > > +
> > > +   if (kstrtouint(buf, 10, &val) < 0)
> > > +   return -EINVAL;
> > > +
> > > +   if (val > USEC_PER_SEC)
> > > +   return -EINVAL;
> > > +
> > > +   spin_before_idle_us = val;
> > > +   return count;
> > > +}
> > > +
> > > +static ssize_t show_idle_spin(struct kobject *kobj,
> > > + struct kobj_attribute *attr,
> > > + char *buf)
> > > +{
> > > +   ssize_t ret;
> > > +
> > > +   ret = sprintf(buf, "%d\n", spin_before_idle_us);
> > > +
> > > +   return ret;
> > > +}
> > > +
> > > +static struct kobj_attribute idle_spin_attr =
> > > +   __ATTR(idle_spin, 0644, show_idle_spin, store_idle_spin);
> > > +
> > > +static struct attribute *sched_attrs[] = {
> > > +   &idle_spin_attr.attr,
> > > +   NULL,
> > > +};
> > > +
> > > +static const struct attribute_group sched_attr_group = {
> > > +   .attrs = sched_attrs,
> > > +};
> > > +
> > > +static struct kobject *sched_kobj;
> > > +
> > > +static int __init sched_sysfs_init(void)
> > > +{
> > > +   int error;
> > > +
> > > +   sched_kobj = kobject_create_and_add("sched", kernel_kobj);
> > > +   if (!sched_kobj)
> > > +   return -ENOMEM;
> > > +
> > > +  

Re: [PATCH v6 00/14] KVM/X86: Introduce a new guest mapping interface

2019-03-18 Thread Raslan, KarimAllah
On Mon, 2019-03-18 at 10:22 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 18, 2019 at 01:10:24PM +0000, Raslan, KarimAllah wrote:
> > 
> > I guess this patch series missed the 5.1 merge window? :)
> 
> Were there any outstanding fixes that had to be addressed?

Not as far as I can remember. This version addressed all requests raised in 
'v5'.

> 
> > 
> > 
> > On Thu, 2019-01-31 at 21:24 +0100, KarimAllah Ahmed wrote:
> > > 
> > > Guest memory can either be directly managed by the kernel (i.e. have a 
> > > "struct
> > > page") or they can simply live outside kernel control (i.e. do not have a
> > > "struct page"). KVM mostly support these two modes, except in a few places
> > > where the code seems to assume that guest memory must have a "struct 
> > > page".
> > > 
> > > This patchset introduces a new mapping interface to map guest memory into 
> > > host
> > > kernel memory which also supports PFN-based memory (i.e. memory without 
> > > 'struct
> > > page'). It also converts all offending code to this interface or simply
> > > read/write directly from guest memory. Patch 2 is additionally fixing an
> > > incorrect page release and marking the page as dirty (i.e. as a 
> > > side-effect of
> > > using the helper function to write).
> > > 
> > > As far as I can see all offending code is now fixed except the 
> > > APIC-access page
> > > which I will handle in a seperate series along with dropping
> > > kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
> > > 
> > > The current implementation of the new API uses memremap to map memory 
> > > that does
> > > not have a "struct page". This proves to be very slow for high frequency
> > > mappings. Since this does not affect the normal use-case where a "struct 
> > > page"
> > > is available, the performance of this API will be handled by a seperate 
> > > patch
> > > series.
> > > 
> > > So the simple way to use memory outside kernel control is:
> > > 
> > > 1- Pass 'mem=' in the kernel command-line to limit the amount of memory 
> > > managed 
> > >by the kernel.
> > > 2- Map this physical memory you want to give to the guest with:
> > >mmap("/dev/mem", physical_address_offset, ..)
> > > 3- Use the user-space virtual address as the "userspace_addr" field in
> > >KVM_SET_USER_MEMORY_REGION ioctl.
> > > 
> > > v5 -> v6:
> > > - Added one extra patch to ensure that support for this mem= case is 
> > > complete
> > >   for x86.
> > > - Added a helper function to check if the mapping is mapped or not.
> > > - Added more comments on the struct.
> > > - Setting ->page to NULL on unmap and to a poison ptr if unused during map
> > > - Checking for map ptr before using it.
> > > - Change kvm_vcpu_unmap to also mark page dirty for LM. That requires
> > >   passing the vCPU pointer again to this function.
> > > 
> > > v4 -> v5:
> > > - Introduce a new parameter 'dirty' into kvm_vcpu_unmap
> > > - A horrible rebase due to nested.c :)
> > > - Dropped a couple of hyperv patches as the code was fixed already as a
> > >   side-effect of another patch.
> > > - Added a new trivial cleanup patch.
> > > 
> > > v3 -> v4:
> > > - Rebase
> > > - Add a new patch to also fix the newly introduced enlightned VMCS.
> > > 
> > > v2 -> v3:
> > > - Rebase
> > > - Add a new patch to also fix the newly introduced shadow VMCS.
> > > 
> > > Filippo Sironi (1):
> > >   X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
> > > 
> > > KarimAllah Ahmed (13):
> > >   X86/nVMX: handle_vmon: Read 4 bytes from guest memory
> > >   X86/nVMX: Update the PML table without mapping and unmapping the page
> > >   KVM: Introduce a new guest mapping API
> > >   X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from
> > > guest memory
> > >   KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
> > >   KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
> > >   KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt
> > > descriptor table
> > >   KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
> > >   KVM/nSVM: Use the new mapping API for mapping guest

Re: [PATCH v6 00/14] KVM/X86: Introduce a new guest mapping interface

2019-03-18 Thread Raslan, KarimAllah
I guess this patch series missed the 5.1 merge window? :)

On Thu, 2019-01-31 at 21:24 +0100, KarimAllah Ahmed wrote:
> Guest memory can either be directly managed by the kernel (i.e. have a "struct
> page") or they can simply live outside kernel control (i.e. do not have a
> "struct page"). KVM mostly support these two modes, except in a few places
> where the code seems to assume that guest memory must have a "struct page".
> 
> This patchset introduces a new mapping interface to map guest memory into host
> kernel memory which also supports PFN-based memory (i.e. memory without 
> 'struct
> page'). It also converts all offending code to this interface or simply
> read/write directly from guest memory. Patch 2 is additionally fixing an
> incorrect page release and marking the page as dirty (i.e. as a side-effect of
> using the helper function to write).
> 
> As far as I can see all offending code is now fixed except the APIC-access 
> page
> which I will handle in a seperate series along with dropping
> kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
> 
> The current implementation of the new API uses memremap to map memory that 
> does
> not have a "struct page". This proves to be very slow for high frequency
> mappings. Since this does not affect the normal use-case where a "struct page"
> is available, the performance of this API will be handled by a seperate patch
> series.
> 
> So the simple way to use memory outside kernel control is:
> 
> 1- Pass 'mem=' in the kernel command-line to limit the amount of memory 
> managed 
>by the kernel.
> 2- Map this physical memory you want to give to the guest with:
>mmap("/dev/mem", physical_address_offset, ..)
> 3- Use the user-space virtual address as the "userspace_addr" field in
>KVM_SET_USER_MEMORY_REGION ioctl.
> 
> v5 -> v6:
> - Added one extra patch to ensure that support for this mem= case is complete
>   for x86.
> - Added a helper function to check if the mapping is mapped or not.
> - Added more comments on the struct.
> - Setting ->page to NULL on unmap and to a poison ptr if unused during map
> - Checking for map ptr before using it.
> - Change kvm_vcpu_unmap to also mark page dirty for LM. That requires
>   passing the vCPU pointer again to this function.
> 
> v4 -> v5:
> - Introduce a new parameter 'dirty' into kvm_vcpu_unmap
> - A horrible rebase due to nested.c :)
> - Dropped a couple of hyperv patches as the code was fixed already as a
>   side-effect of another patch.
> - Added a new trivial cleanup patch.
> 
> v3 -> v4:
> - Rebase
> - Add a new patch to also fix the newly introduced enlightned VMCS.
> 
> v2 -> v3:
> - Rebase
> - Add a new patch to also fix the newly introduced shadow VMCS.
> 
> Filippo Sironi (1):
>   X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
> 
> KarimAllah Ahmed (13):
>   X86/nVMX: handle_vmon: Read 4 bytes from guest memory
>   X86/nVMX: Update the PML table without mapping and unmapping the page
>   KVM: Introduce a new guest mapping API
>   X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from
> guest memory
>   KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
>   KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
>   KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt
> descriptor table
>   KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
>   KVM/nSVM: Use the new mapping API for mapping guest memory
>   KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
>   KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCS
>   KVM/nVMX: Use page_address_valid in a few more locations
>   kvm, x86: Properly check whether a pfn is an MMIO or not
> 
>  arch/x86/include/asm/e820/api.h |   1 +
>  arch/x86/kernel/e820.c  |  18 -
>  arch/x86/kvm/mmu.c  |   5 +-
>  arch/x86/kvm/paging_tmpl.h  |  38 +++---
>  arch/x86/kvm/svm.c  |  97 
>  arch/x86/kvm/vmx/nested.c   | 160 
> +++-
>  arch/x86/kvm/vmx/vmx.c  |  19 ++---
>  arch/x86/kvm/vmx/vmx.h  |   9 ++-
>  arch/x86/kvm/x86.c  |  14 ++--
>  include/linux/kvm_host.h|  28 +++
>  virt/kvm/kvm_main.c |  64 
>  11 files changed, 267 insertions(+), 186 deletions(-)
> 






Re: [PATCH v5 00/13] KVM/X86: Introduce a new guest mapping interface

2019-01-31 Thread Raslan, KarimAllah
On Wed, 2019-01-30 at 18:14 +0100, Paolo Bonzini wrote:
> On 25/01/19 19:28, Raslan, KarimAllah wrote:
> > 
> > So the simple way to do it is:
> > 
> > 1- Pass 'mem=' in the kernel command-line to limit the amount of memory 
> > managed 
> >    by the kernel.
> > 2- Map this physical memory you want to give to the guest with
> >       mmap("/dev/mem", physical_address_offset, ..)
> > 3- Use the user-space virtual address as the "userspace_addr" field 
> >    in KVM_SET_USER_MEMORY_REGION ioctl.
> > 
> > You will also need this patch (hopefully I will repost next week as well):
> > https://patchwork.kernel.org/patch/9191755/
> 
> I took a look again at that patch and I guess I've changed my mind now
> that the kernel provides e820__mapped_any and e820__mapped_all.
> However, please do use e820__mapped_any instead of adding a new function
> e820_is_ram.

The problem with e820__mapped_* is that they iterate over 'e820_table', which
is already truncated by the 'mem=' and 'memmap=' parameters:

"""
 * - 'e820_table': this is the main E820 table that is massaged by the
 *   low level x86 platform code, or modified by boot parameters, before
 *   passed on to higher level MM layers.
"""

.. so I really still can not use it for this purpose. The structure that I want
to look at is actually 'e820_table_firmware' which is:

"""
 * - 'e820_table_firmware': the original firmware version passed to us by the
 *   bootloader - not modified by the kernel. It is composed of two parts:
 *   the first 128 E820 memory entries in boot_params.e820_table and the
 *   remaining (if any) entries of the SETUP_E820_EXT nodes. We use this to:
 *
 *   - inform the user about the firmware's notion of memory layout
 * via /sys/firmware/memmap
 *
 *   - the hibernation code uses it to generate a kernel-independent MD5
 * fingerprint of the physical memory layout of a system.
"""

The users of e820__mapped_any expect these semantics, so even changing the 
implementation of these functions to use 'e820_table_firmware' to handle this 
will not be an option!

One option here would be to add 'e820__mapped_raw_any' (or whatever other name)
and make it identical to the current implementation of e820__mapped_any. Would
that be slightly more acceptable? :)
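
A sketch of what I have in mind, assuming it simply mirrors the existing
e820__mapped_any() loop but walks e820_table_firmware instead:

	static bool __init __e820__mapped_any(struct e820_table *table,
					      u64 start, u64 end, enum e820_type type)
	{
		int i;

		for (i = 0; i < table->nr_entries; i++) {
			struct e820_entry *entry = &table->entries[i];

			if (type && entry->type != type)
				continue;
			if (entry->addr >= end || entry->addr + entry->size <= start)
				continue;
			return true;
		}

		return false;
	}

	bool __init e820__mapped_raw_any(u64 start, u64 end, enum e820_type type)
	{
		return __e820__mapped_any(e820_table_firmware, start, end, type);
	}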

> 
> Thanks,
> 
> Paolo
> 
> > 
> > I will make sure to expand on this in the cover letter in v6.
> 






Re: [PATCH v5 00/13] KVM/X86: Introduce a new guest mapping interface

2019-01-25 Thread Raslan, KarimAllah
On Wed, 2019-01-23 at 13:16 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:00AM +0100, KarimAllah Ahmed wrote:
> > 
> > Guest memory can either be directly managed by the kernel (i.e. have a 
> > "struct
> > page") or they can simply live outside kernel control (i.e. do not have a
> > "struct page"). KVM mostly support these two modes, except in a few places
> > where the code seems to assume that guest memory must have a "struct page".
> > 
> > This patchset introduces a new mapping interface to map guest memory into 
> > host
> > kernel memory which also supports PFN-based memory (i.e. memory without 
> > 'struct
> > page'). It also converts all offending code to this interface or simply
> > read/write directly from guest memory. Patch 2 is additionally fixing an
> > incorrect page release and marking the page as dirty (i.e. as a side-effect 
> > of
> > using the helper function to write).
> > 
> > As far as I can see all offending code is now fixed except the APIC-access 
> > page
> > which I will handle in a seperate series along with dropping
> > kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
> > 
> > The current implementation of the new API uses memremap to map memory that 
> > does
> > not have a "struct page". This proves to be very slow for high frequency
> > mappings. Since this does not affect the normal use-case where a "struct 
> > page"
> > is available, the performance of this API will be handled by a seperate 
> > patch
> > series.
> 
> Where could one find this patchset?

Let me clean it and send it out as well :)

> 
> Also is there an simple test-case (or a writeup) you have for testing
> this code? Specifically I am thinking about the use-case of "memory
> without the 'struct page'"

So the simple way to do it is:

1- Pass 'mem=' in the kernel command-line to limit the amount of memory managed 
   by the kernel.
2- Map this physical memory you want to give to the guest with
      mmap("/dev/mem", physical_address_offset, ..)
3- Use the user-space virtual address as the "userspace_addr" field 
   in KVM_SET_USER_MEMORY_REGION ioctl.

You will also need this patch (hopefully I will repost next week as well):
https://patchwork.kernel.org/patch/9191755/

I will make sure to expand on this in the cover letter in v6.
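
If it helps, a bare-bones userspace sketch of steps 2 and 3 (illustrative
values, no error handling; the physical offset has to point at the memory that
was carved out with mem=):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/kvm.h>

	/* vm_fd is the fd returned by KVM_CREATE_VM */
	static int add_devmem_slot(int vm_fd, off_t phys_offset, size_t size)
	{
		int fd = open("/dev/mem", O_RDWR | O_SYNC);
		void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
				 fd, phys_offset);

		struct kvm_userspace_memory_region region = {
			.slot            = 0,
			.guest_phys_addr = 0,
			.memory_size     = size,
			.userspace_addr  = (unsigned long)hva,
		};

		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
	}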

> 
> And thank you for posting this patchset. It was a pleasure reviewing the
> code!






Re: [PATCH v5 13/13] KVM/nVMX: Use page_address_valid in a few more locations

2019-01-25 Thread Raslan, KarimAllah
On Wed, 2019-01-23 at 13:18 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:13AM +0100, KarimAllah Ahmed wrote:
> > 
> > Use page_address_valid in a few more locations that is already checking for
> > a page aligned address that does not cross the maximum physical address.
> 
> Where is this page_address_valid declared? The latest linus's tree does
> not have it, nor does your patchset?

It is already defined in the code, I can not see any commits that removed it:

$ git grep page_address_valid
arch/x86/kvm/vmx/nested.c:static bool page_address_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, vmcs12->io_bitmap_a) ||
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, vmcs12->io_bitmap_b))
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, vmcs12->msr_bitmap))
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr))
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, vmcs12->apic_access_addr))
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, vmcs12->pml_address))
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, vmcs12->vmread_bitmap) ||
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, vmcs12->vmwrite_bitmap))
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, vmcs12->eptp_list_address))
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, vmcs12->vmcs_link_pointer))
arch/x86/kvm/vmx/nested.c:  if (!page_address_valid(vcpu, kvm_state->vmx.vmxon_pa))
arch/x86/kvm/vmx/nested.c:  !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))

> > 
> > 
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/x86/kvm/vmx/nested.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index ccb3b63..77aad46 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4203,7 +4203,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
> >  * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
> >  * which replaces physical address width with 32
> >  */
> > -   if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > +   if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failInvalid(vcpu);
> >  
> > if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
> > @@ -4266,7 +4266,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
> > if (nested_vmx_get_vmptr(vcpu, &vmptr))
> > return 1;
> >  
> > -   if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > +   if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failValid(vcpu,
> > VMXERR_VMCLEAR_INVALID_ADDRESS);
> >  
> > @@ -4473,7 +4473,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> > if (nested_vmx_get_vmptr(vcpu, &vmptr))
> > return 1;
> >  
> > -   if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > +   if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failValid(vcpu,
> > VMXERR_VMPTRLD_INVALID_ADDRESS);
> >  
> > -- 
> > 2.7.4
> > 






Re: [PATCH v5 08/13] KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table

2019-01-25 Thread Raslan, KarimAllah
On Wed, 2019-01-23 at 13:03 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:08AM +0100, KarimAllah Ahmed wrote:
> > 
> > Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
> > using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
> > that has a "struct page".
> > 
> > One additional semantic change is that the virtual host mapping lifecycle
> > has changed a bit. It now has the same lifetime of the pinning of the
> > interrupt descriptor table page on the host side.
> 
> Is the description stale? I am not seeing how you are changing the
> semantics here. You follow the same path - map/unmap.
> 
> Could you expand please?

This is pretty much the same case as in 7/13: there were two different
life-cycle changes, I dropped one of them, and the other one is still there :D

> 
> > 
> > 
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v4 -> v5:
> > - unmap with dirty flag
> > 
> > v1 -> v2:
> > - Do not change the lifecycle of the mapping (pbonzini)






Re: [PATCH v5 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page

2019-01-25 Thread Raslan, KarimAllah
On Wed, 2019-01-23 at 12:57 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:07AM +0100, KarimAllah Ahmed wrote:
> > 
> > Use kvm_vcpu_map when mapping the virtual APIC page since using
> > kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> > a "struct page".
> > 
> > One additional semantic change is that the virtual host mapping lifecycle
> > has changed a bit. It now has the same lifetime of the pinning of the
> > virtual APIC page on the host side.
> 
> Could you expand a bit on the 'same lifetime .. on the host side'  to be
> obvious for folks what exactly the semantic is?
> 
> And how does this ring with this comment:
> > 
> > 
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v4 -> v5:
> > - unmap with dirty flag
> > 
> > v1 -> v2:
> > - Do not change the lifecycle of the mapping (pbonzini)
> 
> .. Where Paolo does not want the semantics of the mapping to be changed?
> 
> Code wise feel free to smack my Reviewed-by on it, but obviously the
> question on the above comment needs to be resolved.

Ah, right. So there were two life-cycle changes:

1- Lazily unmap the mapping and only do it on: a) release, or b) if the gfn is
   different. This was done as an optimization (i.e. a cache of a single entry).
   This is the first life-cycle change, which Paolo asked me to do separately
   and suggested creating a PFN cache for instead. This has indeed been dropped
   in the v1 -> v2 switch.

2- The life-cycle change now is the fact that the kvm_vcpu_map interface does
   both: a) map to a virtual address, and b) translate a gfn to a pfn.

   The original code was doing the kmap in one location and the gfn_to_page in
   another. Using kvm_vcpu_map means that kmap+gfn_to_page will be tied
   together and will not be done separately. So far no one complained about
   this one, so I kept it :D
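
So a typical caller ends up looking like this (a sketch of the intended usage
with the v5 signatures):

	struct kvm_host_map map;

	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
		return;		/* no memslot, or the mapping failed */

	/* map.hva is valid here whether or not the pfn has a struct page */
	memcpy(map.hva, data, len);

	kvm_vcpu_unmap(&map, true);	/* true: we dirtied the page */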

> 
> Thank you.
> 
> > 
> > - Use pfn_to_hpa instead of gfn_to_gpa
> > ---
> >  arch/x86/kvm/vmx/nested.c | 32 +++-
> >  arch/x86/kvm/vmx/vmx.c|  5 ++---
> >  arch/x86/kvm/vmx/vmx.h|  2 +-
> >  3 files changed, 14 insertions(+), 25 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 4127ad9..dcff99d 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -229,10 +229,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
> > kvm_release_page_dirty(vmx->nested.apic_access_page);
> > vmx->nested.apic_access_page = NULL;
> > }
> > -   if (vmx->nested.virtual_apic_page) {
> > -   kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> > -   vmx->nested.virtual_apic_page = NULL;
> > -   }
> > +   kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
> > if (vmx->nested.pi_desc_page) {
> > kunmap(vmx->nested.pi_desc_page);
> > kvm_release_page_dirty(vmx->nested.pi_desc_page);
> > @@ -2817,6 +2814,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> > *vcpu)
> >  {
> > struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +   struct kvm_host_map *map;
> > struct page *page;
> > u64 hpa;
> >  
> > @@ -2849,11 +2847,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> > *vcpu)
> > }
> >  
> > if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
> > -   if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
> > -   kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> > -   vmx->nested.virtual_apic_page = NULL;
> > -   }
> > -   page = kvm_vcpu_gpa_to_page(vcpu, 
> > vmcs12->virtual_apic_page_addr);
> > +   map = &vmx->nested.virtual_apic_map;
> >  
> > /*
> >  * If translation failed, VM entry will fail because
> > @@ -2868,11 +2862,9 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> > *vcpu)
> >  * control.  But such a configuration is useless, so
> >  * let's keep the code simple.
> >  */
> > -   if (!is_error_page(page)) {
> > -   vmx->nested.virtual_apic_page = page;
> > -   hpa = page_to_phys(vmx->nested.virtual_apic_page);
> > -   vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
> > -   }
> > +   if (!kvm_vcpu_map(vcpu, 
> > gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
> > +   vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 
> > pfn_to_hpa(map->pfn));
> > +
> > }
> >  
> > if (nested_cpu_has_posted_intr(vmcs12)) {
> > @@ -3279,11 +3271,12 @@ static void 
> > vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> >  
> > max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
> > if (max_irr != 256) {
> > -   vapic_page = kmap(vmx->nested.virtual_apic_page);
> > +   vapic_page = vmx->nested.virtual_apic_map.hva;
> > +   if (!vapic_page)
> > +   return;
> > +
> >   

Re: [PATCH v5 04/13] KVM: Introduce a new guest mapping API

2019-01-25 Thread Raslan, KarimAllah
On Wed, 2019-01-23 at 12:50 -0500, Konrad Rzeszutek Wilk wrote:
> > 
> > +   if (dirty)
> > +   kvm_release_pfn_dirty(map->pfn);
> > +   else
> > +   kvm_release_pfn_clean(map->pfn);
> > +   map->hva = NULL;
> 
> I keep on having this gnawing feeling that we MUST set map->page to
> NULL.
> 
> That is I can see how it is not needed if you are using 'map' and
> 'unmap' together - for that we are good. But what I am worried is that
> some one unmaps it .. and instead of checking map->hva they end up
> checking map->page and think the page is mapped.
> 
> Would you be OK adding that extra statement just as a fail-safe
> mechanism in case someones misues the APIs?

Good point, will do.






Re: [PATCH v5 04/13] KVM: Introduce a new guest mapping API

2019-01-25 Thread Raslan, KarimAllah
On Thu, 2019-01-10 at 14:07 +0100, David Hildenbrand wrote:
> On 09.01.19 10:42, KarimAllah Ahmed wrote:
> > 
> > In KVM, specially for nested guests, there is a dominant pattern of:
> > 
> > => map guest memory -> do_something -> unmap guest memory
> > 
> > In addition to all this unnecessarily noise in the code due to boiler plate
> > code, most of the time the mapping function does not properly handle memory
> > that is not backed by "struct page". This new guest mapping API encapsulate
> > most of this boiler plate code and also handles guest memory that is not
> > backed by "struct page".
> > 
> > The current implementation of this API is using memremap for memory that is
> > not backed by a "struct page" which would lead to a huge slow-down if it
> > was used for high-frequency mapping operations. The API does not have any
> > effect on current setups where guest memory is backed by a "struct page".
> > Further patches are going to also introduce a pfn-cache which would
> > significantly improve the performance of the memremap case.
> > 
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v3 -> v4:
> > - Update the commit message.
> > v1 -> v2:
> > - Drop the caching optimization (pbonzini)
> > - Use 'hva' instead of 'kaddr' (pbonzini)
> > - Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
> >   AMD patch (pbonzini)
> > - Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
> > - Only clear map->hva instead of memsetting the whole structure.
> > - Drop kvm_vcpu_map_valid since it is no longer used.
> > - Fix EXPORT_MODULE naming.
> > ---
> >  include/linux/kvm_host.h |  9 
> >  virt/kvm/kvm_main.c  | 53 
> > 
> >  2 files changed, 62 insertions(+)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c38cc5e..8a2f5fa 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -205,6 +205,13 @@ enum {
> > READING_SHADOW_PAGE_TABLES,
> >  };
> >  
> > +struct kvm_host_map {
> > +   struct page *page;
> 
> Can you add somme comments to what it means when there is a page vs.
> when there is none?
> 
> > 
> > +   void *hva;
> > +   kvm_pfn_t pfn;
> > +   kvm_pfn_t gfn;
> > +};
> > +
> >  /*
> >   * Sometimes a large or cross-page mmio needs to be broken up into separate
> >   * exits for userspace servicing.
> > @@ -710,7 +717,9 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu 
> > *vcpu);
> >  struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, 
> > gfn_t gfn);
> >  kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> > +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map 
> > *map);
> >  struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
> > +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty);
> >  unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, 
> > bool *writable);
> >  int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, 
> > int offset,
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1f888a1..4d8f2e3 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1733,6 +1733,59 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> >  }
> >  EXPORT_SYMBOL_GPL(gfn_to_page);
> >  
> > +static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
> > +struct kvm_host_map *map)
> > +{
> > +   kvm_pfn_t pfn;
> > +   void *hva = NULL;
> > +   struct page *page = NULL;
> 
> nit: I prefer these in a growing line-length fashion.
> 
> > 
> > +
> > +   pfn = gfn_to_pfn_memslot(slot, gfn);
> > +   if (is_error_noslot_pfn(pfn))
> > +   return -EINVAL;
> > +
> > +   if (pfn_valid(pfn)) {
> > +   page = pfn_to_page(pfn);
> > +   hva = kmap(page);
> > +   } else {
> > +   hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
> > +   }
> > +
> > +   if (!hva)
> > +   return -EFAULT;
> > +
> > +   map->page = page;
> > +   map->hva = hva;
> > +   map->pfn = pfn;
> > +   map->gfn = gfn;
> > +
> > +   return 0;
> > +}
> > +
> > +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map 
> > *map)
> > +{
> > +   return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_map);
> > +
> > +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty)
> > +{
> > +   if (!map->hva)
> > +   return;
> > +
> > +   if (map->page)
> > +   kunmap(map->page);
> > +   else
> > +   memunmap(map->hva);
> > +
> > +   if (dirty)
> 
> 
> I am wondering if this would also be the right place for
> kvm_vcpu_mark_page_dirty() to mark the page dirty for migration.

I indeed considered this, however, either I am 
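
For reference, a rough sketch of how the unmap helper itself could do the dirty
marking; this is not the posted patch, and the extra vcpu parameter is an
assumption on my side (kvm_vcpu_mark_page_dirty() needs it to reach the
memslots/dirty bitmap):

void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty)
{
        if (!map->hva)
                return;

        if (map->page)
                kunmap(map->page);      /* page-backed: undo kmap()       */
        else
                memunmap(map->hva);     /* PFN-only memory: undo memremap() */

        if (dirty) {
                /* Assumed extra step: log the write for live migration. */
                kvm_vcpu_mark_page_dirty(vcpu, map->gfn);
                kvm_release_pfn_dirty(map->pfn);
        } else {
                kvm_release_pfn_clean(map->pfn);
        }

        map->hva = NULL;
}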

Re: [PATCH v4 02/14] X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory

2019-01-03 Thread Raslan, KarimAllah
On Fri, 2018-12-21 at 16:20 +0100, Paolo Bonzini wrote:
> On 06/12/18 00:10, Jim Mattson wrote:
> > 
> > On Mon, Dec 3, 2018 at 1:31 AM KarimAllah Ahmed  wrote:
> > > 
> > > 
> > > Copy the VMCS12 directly from guest memory instead of the map->copy->unmap
> > > sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which
> > > assumes that there is a "struct page" for guest memory.
> > > 
> > > Signed-off-by: KarimAllah Ahmed 
> > > ---
> > > v3 -> v4:
> > > - Return VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID on failure (jmattson@)
> > > v1 -> v2:
> > > - Massage commit message a bit.
> > > ---
> > >  arch/x86/kvm/vmx.c | 24 
> > >  1 file changed, 12 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index b84f230..75817cb 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -9301,20 +9301,22 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> > > return 1;
> > > 
> > > if (vmx->nested.current_vmptr != vmptr) {
> > > -   struct vmcs12 *new_vmcs12;
> > > -   struct page *page;
> > > -   page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
> > > -   if (is_error_page(page))
> > > -   return nested_vmx_failInvalid(vcpu);
> > > +   struct vmcs12 *new_vmcs12 = (struct vmcs12 
> > > *)__get_free_page(GFP_KERNEL);
> > > +
> > > +   if (!new_vmcs12 ||
> > > +   kvm_read_guest(vcpu->kvm, vmptr, new_vmcs12,
> > > +  sizeof(*new_vmcs12))) {
> > 
> > Isn't this a lot slower than kmap() when there is a struct page?
> 
> It wouldn't be slower if he read directly into cached_vmcs12.  However,
> as it is now, it's doing two reads instead of one.  By doing this, the
> ENOMEM case also disappears.

I cannot use "cached_vmcs12" directly here because its old contents are still 
needed by "nested_release_vmcs12(..)" a few lines below (i.e. they will be 
flushed into guest memory).

I can just switch this to the new API introduced a few patches later and that 
would have the same semantics for normal use-cases as before.
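
Not the actual follow-up, just a rough sketch of what such a switch could look
like, reusing handle_vmptrld()'s locals and the kvm_vcpu_map()/kvm_vcpu_unmap()
helpers from the later patches in the series (the surrounding revision checks
are assumed to stay as in v4):

        struct kvm_host_map map;
        struct vmcs12 *new_vmcs12;

        if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmptr), &map))
                return nested_vmx_failValid(vcpu,
                        VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);

        new_vmcs12 = map.hva;
        /* ... same revision/shadow-VMCS checks on new_vmcs12 as today ... */

        /*
         * nested_release_vmcs12() (unchanged) flushes the *old* cached_vmcs12
         * back to guest memory first; only then overwrite the cache with the
         * newly mapped contents.
         */
        memcpy(vmx->nested.cached_vmcs12, new_vmcs12, sizeof(*new_vmcs12));

        kvm_vcpu_unmap(&map, false);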

> 
> Paolo



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] KVM/nVMX: Stop mapping the "APIC-access address" page into the kernel

2018-12-03 Thread Raslan, KarimAllah
On Mon, 2018-12-03 at 14:59 +0100, KarimAllah Ahmed wrote:
> The "APIC-access address" is simply a token that the hypervisor puts into
> the PFN of a 4K EPTE (or PTE if using shadow paging) that triggers APIC
> virtualization whenever a page walk terminates with that PFN. This address
> has to be a legal address (i.e.  within the physical address supported by
> the CPU), but it need not have WB memory behind it. In fact, it need not
> have anything at all behind it. When bit 31 ("activate secondary controls")
> of the primary processor-based VM-execution controls is set and bit 0
> ("virtualize APIC accesses") of the secondary processor-based VM-execution
> controls is set, the PFN recorded in the VMCS "APIC-access address" field
> will never be touched. (Instead, the access triggers APIC virtualization,
> which may access the PFN recorded in the "Virtual-APIC address" field of
> the VMCS.)
> 
> So stop mapping the "APIC-access address" page into the kernel and even
> drop the requirements to have a valid page backing it. Instead, just use
> some token that:
> 
> 1) Not one of the valid guest pages.
> 2) Within the physical address supported by the CPU.
> 
> Suggested-by: Jim Mattson 
> Signed-off-by: KarimAllah Ahmed 
> ---
> 
> Thanks Jim for the commit message :)
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.c  | 10 ++
>  arch/x86/kvm/vmx.c  | 71 
> ++---
>  3 files changed, 42 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fbda5a9..7e50196 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1077,6 +1077,7 @@ struct kvm_x86_ops {
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
>   void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
> + bool (*nested_apic_access_addr)(struct kvm_vcpu *vcpu, gpa_t gpa, hpa_t 
> *hpa);
>   void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
>   int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7c03c0f..ae46a8d 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3962,9 +3962,19 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
>  {
> + hpa_t hpa;
>   struct kvm_memory_slot *slot;
>   bool async;
>  
> + if (is_guest_mode(vcpu) &&
> + kvm_x86_ops->nested_apic_access_addr &&
> + kvm_x86_ops->nested_apic_access_addr(vcpu, gfn_to_gpa(gfn), &hpa)) {
> + *pfn = hpa >> PAGE_SHIFT;
> + if (writable)
> + *writable = true;
> + return false;
> + }

Now thinking further about this, I actually still need to validate that the L12 
EPT for this gfn really contains the apic_access address, to ensure that I only 
fix up the fault when the L1 hypervisor sets up both the L12 VMCS APIC_ACCESS 
field and the L12 EPT to point at the same address.

Will fix and send v2.

> +
>   /*
>* Don't expose private memslots to L2.
>*/
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 83a614f..340cf56 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -864,7 +864,6 @@ struct nested_vmx {
>* Guest pages referred to in the vmcs02 with host-physical
>* pointers, so we must keep them pinned while L2 runs.
>*/
> - struct page *apic_access_page;
>   struct kvm_host_map virtual_apic_map;
>   struct kvm_host_map pi_desc_map;
>   struct kvm_host_map msr_bitmap_map;
> @@ -8512,10 +8511,6 @@ static void free_nested(struct kvm_vcpu *vcpu)
>   kfree(vmx->nested.cached_vmcs12);
>   kfree(vmx->nested.cached_shadow_vmcs12);
>   /* Unpin physical memory we referred to in the vmcs02 */
> - if (vmx->nested.apic_access_page) {
> - kvm_release_page_dirty(vmx->nested.apic_access_page);
> - vmx->nested.apic_access_page = NULL;
> - }
>   kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
>   kvm_vcpu_unmap(&vmx->nested.pi_desc_map);
>   vmx->nested.pi_desc = NULL;
> @@ -11901,41 +11896,27 @@ static void vmx_inject_page_fault_nested(struct 
> kvm_vcpu *vcpu,
>  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>struct vmcs12 *vmcs12);
>  
> +static hpa_t vmx_apic_access_addr(void)
> +{
> + /*
> +  * The physical address chosen here has to:
> +  * 1) Never be an address that could be assigned to a guest.
> +  * 2) Within the maximum physical limits of the CPU.
> +  *
> +  * So our choice below is completely 
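
Purely as an illustration, and not the value chosen in the patch: one token
satisfying the two constraints above could be derived from the CPU's physical
address width, assuming boot_cpu_data.x86_phys_bits reports it:

static hpa_t vmx_apic_access_addr(void)
{
        /*
         * Hypothetical token only: the highest page below the CPU's
         * physical address limit. It is within the CPU's limits and is
         * not expected to ever back guest RAM.
         */
        return (1ULL << boot_cpu_data.x86_phys_bits) - PAGE_SIZE;
}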

Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp

2018-10-23 Thread Raslan, KarimAllah
On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote:
> On Fri, Oct 19, 2018 at 07:45:51PM +0000, Raslan, KarimAllah wrote:
> > 
> > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote:
> > > 
> > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> > > > 
> > > > 
> > > > When expedited grace-period is set, both synchronize_sched
> > > > synchronize_rcu_bh can be optimized to have a significantly lower 
> > > > latency.
> > > > 
> > > > Improve wait_rcu_gp handling to also account for expedited grace-period.
> > > > The downside is that wait_rcu_gp will not wait anymore for all RCU 
> > > > variants
> > > > concurrently when an expedited grace-period is set, however, given the
> > > > improved latency it does not really matter.
> > > > 
> > > > Cc: Paul E. McKenney 
> > > > Cc: Josh Triplett 
> > > > Cc: Steven Rostedt 
> > > > Cc: Mathieu Desnoyers 
> > > > Cc: Lai Jiangshan 
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Signed-off-by: KarimAllah Ahmed 
> > > 
> > > Cute!
> > > 
> > > Unfortunately, there are a few problems with this patch:
> > > 
> > > 1.I will be eliminating synchronize_rcu_mult() due to the fact 
> > > that
> > >   the upcoming RCU flavor consolidation eliminates its sole caller.
> > >   See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()")
> > >   in my -rcu tree.  This would of course also eliminate the effects
> > >   of this patch.
> > 
> > Your patch covers our use-case already, but I still think that the 
> > semantics for wait_rcu_gp are not clear to me.
> > 
> > The problem for us was that sched_cpu_deactivate would call
> > synchronize_rcu_mult which does not check for "expedited" at all. So even
> > though we are already using rcu_expedited sysctl variable, 
> > synchronize_rcu_mult 
> > was just ignoring it.
> > 
> > That being said, I indeed overlooked rcu_normal and that it takes 
> > precedence 
> > over expedited and I did not notice at all the deadlock you mentioned below!
> > 
> > That can however be easily fixed by also checking for !rcu_gp_is_normal.
> 
> ???
> 
> The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult()
> with synchronize_rcu(), which really would be subject to the sysfs
> variables.  Of course, this is not yet in mainline, so it perhaps cannot
> solve your immediate problem, which probably involve older kernels in
> any case.  More on this below...
> 
> > 
> > > 
> > > 2.The real-time guys' users are not going to be at all happy
> > >   with the IPIs resulting from the _expedited() API members.
> > >   Yes, they can boot with rcupdate.rcu_normal=1, but they don't
> > >   always need that big a hammer, and use of this kernel parameter
> > >   can slow down boot, hibernation, suspend, network configuration,
> > >   and much else besides.  We therefore don't want them to have to
> > >   use rcupdate.rcu_normal=1 unless absolutely necessary.
> > 
> > I might be missing something here. Why would they need to "explicitly" use 
> > rcu_normal? If rcu_expedited is set, wouldn't the expected behavior be to 
> > call into the expedited version?
> > 
> > My patch should activate *expedited* only if it is set.
> 
> You are right, I was confused.  However...
> 
> > 
> > I think I might be misunderstanding the expected behavior 
> > from synchronize_rcu_mult. My understanding is that something like:
> > 
> > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an 
> > identical behavior, right?
> 
> You would clearly prefer that it did, and the commit log does seem to
> read that way, but synchronize_rcu_mult() is going away anyway, so there
> isn't a whole lot of point in arguing about what it should have done.
> And the eventual implementation (with 5fc9d4e000b1 or its successor)
> will act as you want.
> 
> > 
> > At least in this commit:
> > 
> > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() 
> > de-duplication")
> > 
> > .. the change clearly gives the impression that they can be used 
> > interchangeably. The problem is that this is not true when you look at the 
> > implementation. One of them (i.e. synchronize_rcu) will respect the
> > expedite_rcu flag set 

Re: [PATCH v3 06/13] KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap

2018-10-22 Thread Raslan, KarimAllah
On Mon, 2018-10-22 at 14:42 -0700, Jim Mattson wrote:
> On Sat, Oct 20, 2018 at 3:22 PM, KarimAllah Ahmed  wrote:
> > 
> > Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
> > kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> > a "struct page".
> > 
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v1 -> v2:
> > - Do not change the lifecycle of the mapping (pbonzini)
> > ---
> >  arch/x86/kvm/vmx.c | 14 --
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index d857401..5b15ca2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -847,6 +847,9 @@ struct nested_vmx {
> > struct page *apic_access_page;
> > struct page *virtual_apic_page;
> > struct page *pi_desc_page;
> > +
> > +   struct kvm_host_map msr_bitmap_map;
> > +
> > struct pi_desc *pi_desc;
> > bool pi_pending;
> > u16 posted_intr_nv;
> > @@ -11546,9 +11549,10 @@ static inline bool 
> > nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >  struct vmcs12 *vmcs12)
> >  {
> > int msr;
> > -   struct page *page;
> > unsigned long *msr_bitmap_l1;
> > unsigned long *msr_bitmap_l0 = 
> > to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
> > +   struct kvm_host_map *map = &to_vmx(vcpu)->nested.msr_bitmap_map;
> > +
> > /*
> >  * pred_cmd & spec_ctrl are trying to verify two things:
> >  *
> > @@ -11574,11 +11578,10 @@ static inline bool 
> > nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > !pred_cmd && !spec_ctrl)
> > return false;
> > 
> > -   page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
> > -   if (is_error_page(page))
> > +   if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map))
> 
> Isn't this the sort of high frequency operation that should not use the new 
> API?

With the current implementation of the API, yes. The performance will be 
horrible. This does not affect the current users though (i.e. when guest memory 
is backed by "struct page").

I have a few patches that implement a pfn_cache on top of this, as suggested by 
Paolo. This would allow the API to be used for this type of high-frequency 
mapping.

For example, with this pfn_cache, booting an Ubuntu guest was 10x faster (from 
roughly 2 minutes to 13 seconds).
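
Very roughly, the caching idea looks like this (sketch only; the structure and
helper below are illustrative, not the actual pfn_cache patches): remember the
last translation per mapping site and only redo the map when the gfn or the
memslots generation changes.

/* Illustrative only; field and function names are mine. */
struct kvm_host_map_cache {
        gfn_t gfn;                 /* gfn currently mapped, if any      */
        u64 generation;            /* memslots generation when mapped   */
        struct kvm_host_map map;
};

static int kvm_vcpu_map_cached(struct kvm_vcpu *vcpu, gfn_t gfn,
                               struct kvm_host_map_cache *c)
{
        u64 gen = kvm_vcpu_memslots(vcpu)->generation;

        /* Fast path: same gfn and the memslots have not changed. */
        if (c->map.hva && c->gfn == gfn && c->generation == gen)
                return 0;

        if (c->map.hva)
                kvm_vcpu_unmap(&c->map, false);

        if (kvm_vcpu_map(vcpu, gfn, &c->map))
                return -EFAULT;

        c->gfn = gfn;
        c->generation = gen;
        return 0;
}

Invalidating on the memslots generation keeps such a cache safe across memslot
updates without needing extra notifiers.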

> 
> > 
> > return false;
> > 
> > -   msr_bitmap_l1 = (unsigned long *)kmap(page);
> > +   msr_bitmap_l1 = (unsigned long *)map->hva;
> > if (nested_cpu_has_apic_reg_virt(vmcs12)) {
> > /*
> >  * L0 need not intercept reads for MSRs between 0x800 and 
> > 0x8ff, it
> > @@ -11626,8 +11629,7 @@ static inline bool 
> > nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > MSR_IA32_PRED_CMD,
> > MSR_TYPE_W);
> > 
> > -   kunmap(page);
> > -   kvm_release_page_clean(page);
> > +   kvm_vcpu_unmap(&to_vmx(vcpu)->nested.msr_bitmap_map);
> > 
> > return true;
> >  }
> > --
> > 2.7.4
> > 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v3 07/13] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page

2018-10-21 Thread Raslan, KarimAllah
Sorry! Please ignore this patch in favor of its RESEND. I realized that a few 
lines from it leaked into another patch series. The "RESEND" should have this 
fixed.

On Sun, 2018-10-21 at 00:22 +0200, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map when mapping the virtual APIC page since using
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
> 
> One additional semantic change is that the virtual host mapping lifecycle
> has changed a bit. It now has the same lifetime of the pinning of the
> virtual APIC page on the host side.
> 
> Signed-off-by: KarimAllah Ahmed 
> ---
> v1 -> v2:
> - Do not change the lifecycle of the mapping (pbonzini)
> - Use pfn_to_hpa instead of gfn_to_gpa
> ---
>  arch/x86/kvm/vmx.c | 34 +++---
>  1 file changed, 11 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5b15ca2..83a5e95 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -845,9 +845,8 @@ struct nested_vmx {
>* pointers, so we must keep them pinned while L2 runs.
>*/
>   struct page *apic_access_page;
> - struct page *virtual_apic_page;
> + struct kvm_host_map virtual_apic_map;
>   struct page *pi_desc_page;
> -
>   struct kvm_host_map msr_bitmap_map;
>  
>   struct pi_desc *pi_desc;
> @@ -6152,11 +6151,12 @@ static void 
> vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>  
>   max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
>   if (max_irr != 256) {
> - vapic_page = kmap(vmx->nested.virtual_apic_page);
> + vapic_page = vmx->nested.virtual_apic_map.hva;
> + if (!vapic_page)
> + return;
> +
>   __kvm_apic_update_irr(vmx->nested.pi_desc->pir,
>   vapic_page, &max_irr);
> - kunmap(vmx->nested.virtual_apic_page);
> -
>   status = vmcs_read16(GUEST_INTR_STATUS);
>   if ((u8)max_irr > ((u8)status & 0xff)) {
>   status &= ~0xff;
> @@ -8468,10 +8468,7 @@ static void free_nested(struct vcpu_vmx *vmx)
>   kvm_release_page_dirty(vmx->nested.apic_access_page);
>   vmx->nested.apic_access_page = NULL;
>   }
> - if (vmx->nested.virtual_apic_page) {
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
>   if (vmx->nested.pi_desc_page) {
>   kunmap(vmx->nested.pi_desc_page);
>   kvm_release_page_dirty(vmx->nested.pi_desc_page);
> @@ -11394,6 +11391,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> *vcpu)
>  {
>   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct kvm_host_map *map;
>   struct page *page;
>   u64 hpa;
>  
> @@ -11426,11 +11424,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> *vcpu)
>   }
>  
>   if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
> - if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> - page = kvm_vcpu_gpa_to_page(vcpu, 
> vmcs12->virtual_apic_page_addr);
> + map = &vmx->nested.virtual_apic_map;
>  
>   /*
>* If translation failed, VM entry will fail because
> @@ -11445,11 +11439,8 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu 
> *vcpu)
>* control.  But such a configuration is useless, so
>* let's keep the code simple.
>*/
> - if (!is_error_page(page)) {
> - vmx->nested.virtual_apic_page = page;
> - hpa = page_to_phys(vmx->nested.virtual_apic_page);
> - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
> - }
> + if (!kvm_vcpu_map(vcpu, 
> gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
> + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 
> pfn_to_hpa(map->pfn));
>   }
>  
>   if (nested_cpu_has_posted_intr(vmcs12)) {
> @@ -13353,10 +13344,7 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, 
> u32 exit_reason,
>   kvm_release_page_dirty(vmx->nested.apic_access_page);
>   vmx->nested.apic_access_page = NULL;
>   }
> - if (vmx->nested.virtual_apic_page) {
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map);
>   if (vmx->nested.pi_desc_page) {
>   kunmap(vmx->nested.pi_desc_page);
>   kvm_release_page_dirty(vmx->nested.pi_desc_page);
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen

Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp

2018-10-19 Thread Raslan, KarimAllah
On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote:
> On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> > 
> > When expedited grace-period is set, both synchronize_sched
> > synchronize_rcu_bh can be optimized to have a significantly lower latency.
> > 
> > Improve wait_rcu_gp handling to also account for expedited grace-period.
> > The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> > concurrently when an expedited grace-period is set, however, given the
> > improved latency it does not really matter.
> > 
> > Cc: Paul E. McKenney 
> > Cc: Josh Triplett 
> > Cc: Steven Rostedt 
> > Cc: Mathieu Desnoyers 
> > Cc: Lai Jiangshan 
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> 
> Cute!
> 
> Unfortunately, there are a few problems with this patch:
> 
> 1.I will be eliminating synchronize_rcu_mult() due to the fact that
>   the upcoming RCU flavor consolidation eliminates its sole caller.
>   See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()")
>   in my -rcu tree.  This would of course also eliminate the effects
>   of this patch.

Your patch covers our use-case already, but I still think that the semantics 
for wait_rcu_gp are not clear to me.

The problem for us was that sched_cpu_deactivate would call
synchronize_rcu_mult which does not check for "expedited" at all. So even
though we are already using rcu_expedited sysctl variable, synchronize_rcu_mult 
was just ignoring it.

That being said, I indeed overlooked rcu_normal and that it takes precedence 
over expedited, and I did not notice the deadlock you mentioned below at all!

That can however be easily fixed by also checking for !rcu_gp_is_normal.
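
Roughly, the extra check inside the flavor loop of __wait_rcu_gp() could look
like this (a sketch only, not the posted diff):

        /*
         * Sketch: take the expedited path for the sched flavor only when
         * expediting is requested and not overridden by rcu_normal.
         */
        if (crcu_array[i] == call_rcu_sched &&
            rcu_gp_is_expedited() && !rcu_gp_is_normal()) {
                synchronize_sched_expedited();
                continue;
        }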

> 
> 2.The real-time guys' users are not going to be at all happy
>   with the IPIs resulting from the _expedited() API members.
>   Yes, they can boot with rcupdate.rcu_normal=1, but they don't
>   always need that big a hammer, and use of this kernel parameter
>   can slow down boot, hibernation, suspend, network configuration,
>   and much else besides.  We therefore don't want them to have to
>   use rcupdate.rcu_normal=1 unless absolutely necessary.

I might be missing something here. Why would they need to "explicitly" use 
rcu_normal? If rcu_expedited is set, wouldn't the expected behavior be to call 
into the expedited version?

My patch should activate *expedited* only if it is set.

I think I might be misunderstanding the expected behavior of 
synchronize_rcu_mult. My understanding is that something like

synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should behave 
identically, right?

At least in this commit:

commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication")

.. the change clearly gives the impression that they can be used 
interchangeably. The problem is that this is not true when you look at the 
implementation. One of them (i.e. synchronize_rcu) will respect the
rcu_expedited flag set via sysfs, while the other (i.e. synchronize_rcu_mult) 
simply ignores it.

So my patch is about making sure that both of the variants actually respect 
it.


> 3.If the real-time guys' users were to have booted with
>   rcupdate.rcu_normal=1, then synchronize_sched_expedited()
>   would invoke _synchronize_rcu_expedited, which would invoke
>   wait_rcu_gp(), which would invoke _wait_rcu_gp() which would
>   invoke __wait_rcu_gp(), which, given your patch, would in turn
>   invoke synchronize_sched_expedited().  This situation could
>   well prevent their systems from meeting their response-time
>   requirements.
> 
> So I cannot accept this patch nor for that matter any similar patch.
> 
> But what were you really trying to get done here?  If you were thinking
> of adding another synchronize_rcu_mult(), the flavor consolidation will
> make that unnecessary in most cases.  If you are trying to speed up
> CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable
> when taking a CPU offline.  If something else, please let me know what
> it is so that we can work out how the problem might best be solved.
> 
>   Thanx, Paul
> 
> > 
> > ---
> >  kernel/rcu/update.c | 34 --
> >  1 file changed, 28 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 68fa19a..44b8817 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, 
> > call_rcu_func_t *crcu_array,
> > might_sleep();
> > continue;
> > }
> > -   init_rcu_head_on_stack(&rs_array[i].head);
> > -   init_completion(&rs_array[i].completion);
> > +
> > for (j = 0; j < i; j++)
> > if (crcu_array[j] == crcu_array[i])
> > 

Re: [PATCH v2] PCI/IOV: Use VF0 cached config space size for other VFs

2018-10-11 Thread Raslan, KarimAllah
On Thu, 2018-10-11 at 11:51 -0500, Bjorn Helgaas wrote:
> On Wed, Oct 10, 2018 at 06:00:10PM +0200, KarimAllah Ahmed wrote:
> > 
> > Cache the config space size from VF0 and use it for all other VFs instead
> > of reading it from the config space of each VF. We assume that it will be
> > the same across all associated VFs.
> > 
> > This is an optimization when enabling SR-IOV on a device with many VFs.
> > 
> > Cc: Bjorn Helgaas 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> 
> Applied to pci/virtualization for v4.20, thanks!
> 
> As I mentioned last time, I think CONFIG_PCI_ATS is the wrong symbol to
> test here, so I changed that to CONFIG_PCI_IOV.

Ooops! Sorry, it has been a long time and I forgot :D

> I also moved the #ifdef wrapper so the caller doesn't need an ifdef.
> Please let me know if these break anything.  The patch I applied is appended.

Looks good to me. Thanks!

> > 
> > ---
> > v1 -> v2:
> > - Drop the __pci_cfg_space_size (bhelgaas@)
> > - Extend pci_cfg_space_size to return the cached value for all VFs except
> >   VF0 (bhelgaas@)
> > ---
> >  drivers/pci/iov.c   |  2 ++
> >  drivers/pci/pci.h   |  1 +
> >  drivers/pci/probe.c | 17 +
> >  3 files changed, 20 insertions(+)
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index c5f3cd4e..4238b53 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -133,6 +133,8 @@ static void pci_read_vf_config_common(struct pci_dev 
> > *virtfn)
> >  &physfn->sriov->subsystem_vendor);
> > pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
> >  &physfn->sriov->subsystem_device);
> > +
> > +   physfn->sriov->cfg_size = pci_cfg_space_size(virtfn);
> >  }
> >  
> >  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 6e0d152..2f14542 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -285,6 +285,7 @@ struct pci_sriov {
> > u16 driver_max_VFs; /* Max num VFs driver supports */
> > struct pci_dev  *dev;   /* Lowest numbered PF */
> > struct pci_dev  *self;  /* This PF */
> > +   u32 cfg_size;   /* VF config space size */
> > u32 class;  /* VF device */
> > u8  hdr_type;   /* VF header type */
> > u16 subsystem_vendor; /* VF subsystem vendor */
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 201f9e5..8c0f428 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -1438,12 +1438,29 @@ static int pci_cfg_space_size_ext(struct pci_dev 
> > *dev)
> > return PCI_CFG_SPACE_EXP_SIZE;
> >  }
> >  
> > +#ifdef CONFIG_PCI_ATS
> > +static bool is_vf0(struct pci_dev *dev)
> > +{
> > +   if (pci_iov_virtfn_devfn(dev->physfn, 0) == dev->devfn &&
> > +   pci_iov_virtfn_bus(dev->physfn, 0) == dev->bus->number)
> > +   return true;
> > +
> > +   return false;
> > +}
> > +#endif
> > +
> >  int pci_cfg_space_size(struct pci_dev *dev)
> >  {
> > int pos;
> > u32 status;
> > u16 class;
> >  
> > +#ifdef CONFIG_PCI_ATS
> > +   /* Read cached value for all VFs except for VF0 */
> > +   if (dev->is_virtfn && !is_vf0(dev))
> > +   return dev->physfn->sriov->cfg_size;
> > +#endif
> > +
> > if (dev->bus->bus_flags & PCI_BUS_FLAGS_NO_EXTCFG)
> > return PCI_CFG_SPACE_SIZE;
> >  
> > -- 
> > 2.7.4
> > 
> 
> commit 601f9f6679157b70a7a4e752baa590bd2af69ffb
> Author: KarimAllah Ahmed 
> Date:   Thu Oct 11 11:49:58 2018 -0500
> 
> PCI/IOV: Use VF0 cached config space size for other VFs
> 
> Cache the config space size from VF0 and use it for all other VFs instead
> of reading it from the config space of each VF.  We assume that it will be
> the same across all associated VFs.
> 
> This is an optimization when enabling SR-IOV on a device with many VFs.
> 
> Signed-off-by: KarimAllah Ahmed 
> [bhelgaas: use CONFIG_PCI_IOV (not CONFIG_PCI_ATS), adjust is_vf0() 
> wrapper
> so caller doesn't need ifdef]
> Signed-off-by: Bjorn Helgaas 
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index c5f3cd4ed766..4238b539f9d8 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -133,6 +133,8 @@ static void pci_read_vf_config_common(struct pci_dev 
> *virtfn)
>  &physfn->sriov->subsystem_vendor);
>   pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
>  &physfn->sriov->subsystem_device);
> +
> + physfn->sriov->cfg_size = pci_cfg_space_size(virtfn);
>  }
>  
>  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6e0d1528d471..2f1454209257 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -285,6 +285,7 @@ struct pci_sriov {
>   u16 driver_max_VFs; /* Max num VFs driver supports */
>   

Re: [PATCH v2 00/12] KVM/X86: Introduce a new guest mapping interface

2018-07-10 Thread Raslan, KarimAllah
On Mon, 2018-04-16 at 13:10 +0200, Paolo Bonzini wrote:
> On 15/04/2018 23:53, KarimAllah Ahmed wrote:
> > 
> > Guest memory can either be directly managed by the kernel (i.e. have a 
> > "struct
> > page") or they can simply live outside kernel control (i.e. do not have a
> > "struct page"). KVM mostly support these two modes, except in a few places
> > where the code seems to assume that guest memory must have a "struct page".
> > 
> > This patchset introduces a new mapping interface to map guest memory into 
> > host
> > kernel memory which also supports PFN-based memory (i.e. memory without 
> > 'struct
> > page'). It also converts all offending code to this interface or simply
> > read/write directly from guest memory.
> > 
> > As far as I can see all offending code is now fixed except the APIC-access 
> > page
> > which I will handle in a seperate patch.
> 
> I assume the caching will also be a separate patch.
> 
> It looks good except that I'd squash patches 4 and 9 together.  But I'd
> like a second set of eyes to look at it.

BTW, why did you want to squash these 2 patches specifically? They seem 
quite unrelated to me. The only thing they have in common is that they switch 
from code that only supports "struct page" to code that also supports PFN-only 
memory, but that is common to all the other patches as well.

> 
> Thanks,
> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] KVM: Switch 'requests' to be 64-bit (explicitly)

2018-07-10 Thread Raslan, KarimAllah
On Thu, 2018-07-05 at 14:51 +0100, Mark Rutland wrote:
> On Sun, Apr 15, 2018 at 12:26:44AM +0200, KarimAllah Ahmed wrote:
> > 
> > Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
> > use the size of "requests" instead of the hard-coded '32'.
> > 
> > That gives us a bit more room again for arch-specific requests as we
> > already ran out of space for x86 due to the hard-coded check.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  include/linux/kvm_host.h | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 6930c63..fe4f46b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -129,7 +129,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQUEST_ARCH_BASE 8
> >  
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > -   BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> > +   BUILD_BUG_ON((unsigned)(nr) >= (sizeof(((struct kvm_vcpu 
> > *)0)->requests) * 8) - KVM_REQUEST_ARCH_BASE); \
> > (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
> >  })
> >  #define KVM_ARCH_REQ(nr)   KVM_ARCH_REQ_FLAGS(nr, 0)
> > @@ -223,7 +223,7 @@ struct kvm_vcpu {
> > int vcpu_id;
> > int srcu_idx;
> > int mode;
> > -   unsigned long requests;
> > +   u64 requests;
> 
> The usual thing to do for bitmaps is something like:
> 
> #define KVM_REQUEST_NR(KVM_REQUEST_ARCH_BASE + 
> KVM_REQUEST_ARCH_NR)
> 
>   unsigned long requests[BITS_TO_LONGS(NR_KVM_REQUESTS)];
> 
> > 
> > unsigned long guest_debug;
> >  
> > int pre_pcpu;
> > @@ -1122,7 +1122,7 @@ static inline void kvm_make_request(int req, struct 
> > kvm_vcpu *vcpu)
> >  * caller.  Paired with the smp_mb__after_atomic in kvm_check_request.
> >  */
> > smp_wmb();
> > -   set_bit(req & KVM_REQUEST_MASK, >requests);
> > +   set_bit(req & KVM_REQUEST_MASK, (void *)>requests);
> 
> ... which wouldn't require a void cast to make the bit API functions
> happy (as these expect a pointer to the first unsigned long in the
> bitmap).

Ah, right! Good point.

Paolo,

Would you prefer to switch "requests" to a bitmap and 
update kvm_request_pending to handle this?
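
(For illustration only, a minimal sketch of the bitmap layout Mark is
describing; KVM_REQUEST_ARCH_NR is a made-up placeholder, not an
existing kernel symbol:)

#define KVM_REQUEST_ARCH_NR	24	/* hypothetical arch budget */
#define KVM_REQUEST_NR		(KVM_REQUEST_ARCH_BASE + KVM_REQUEST_ARCH_NR)

struct kvm_vcpu {
	/* ... */
	/* expands to: unsigned long requests[BITS_TO_LONGS(KVM_REQUEST_NR)]; */
	DECLARE_BITMAP(requests, KVM_REQUEST_NR);
	/* ... */
};

static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
{
	smp_wmb();
	/* the array decays to unsigned long *, so no (void *) cast is needed */
	set_bit(req & KVM_REQUEST_MASK, vcpu->requests);
}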

> 
> Thanks,
> Mark.
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] KVM: Switch 'requests' to be 64-bit (explicitly)

2018-07-05 Thread Raslan, KarimAllah
On Tue, 2018-05-22 at 17:47 +0200, Paolo Bonzini wrote:
> On 22/05/2018 17:42, Raslan, KarimAllah wrote:
> > 
> > On Mon, 2018-04-16 at 18:28 +0200, Paolo Bonzini wrote:
> > > 
> > > On 15/04/2018 00:26, KarimAllah Ahmed wrote:
> > > > 
> > > > 
> > > > Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check 
> > > > to
> > > > use the size of "requests" instead of the hard-coded '32'.
> > > > 
> > > > That gives us a bit more room again for arch-specific requests as we
> > > > already ran out of space for x86 due to the hard-coded check.
> > > > 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Radim Krčmář 
> > > > Cc: k...@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Signed-off-by: KarimAllah Ahmed 
> > > 
> > > I'm afraid architectures like ARM 32 need this to be conditional (using
> > > Kconfig).
> > 
> > Why would using a 64-bit 'requests' be a problem for ARM32? Are you 
> > concerned about performance here or is there some symantic problem?
> 
> They don't support atomics on double-word data.

But they support atomics on single words, of which there are two here.
We don't need atomic updates of the whole 64-bit quantity (à la 
cmpxchg). Do we strictly need this to be atomic across the full 64 bits?

Looking at the use cases for "requests":

kvm_clear_request
kvm_test_request
kvm_request_pending
kvm_check_request

... and all of them would still work if the atomicity is only at the 
word level, right?
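
(To make this concrete, a rough sketch of how the helpers keep working
with only word-sized atomicity, assuming a hypothetical two-word bitmap
field such as DECLARE_BITMAP(requests, KVM_REQUEST_NR) with a made-up
KVM_REQUEST_NR:)

static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
	/* bit 'req' lives entirely inside one unsigned long, so a
	 * word-atomic test_bit() is sufficient */
	return test_bit(req & KVM_REQUEST_MASK, vcpu->requests);
}

static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
{
	/* reads each word with an ordinary load; no 64-bit atomic needed */
	return !bitmap_empty(vcpu->requests, KVM_REQUEST_NR);
}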

> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: general protection fault in vmx_vcpu_run

2018-07-04 Thread Raslan, KarimAllah
Dmitry,

Can you share the host kernel version?

I cannot reproduce any of these crash signatures, and I think it's 
really a nested virtualization bug. So I will need the exact host 
kernel version as well.

I am currently getting all sorts of:

"KVM: entry failed, hardware error 0x7"

... instead of the crash signatures that you are posting.

Regards.

On Sat, 2018-06-30 at 08:09 +0000, Raslan, KarimAllah wrote:
> Looking also at the other crash [0]:
> 
>         msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
> 811f65b7:   e8 44 cb 57 00  callq  81773100
> <__sanitizer_cov_trace_pc>
> 811f65bc:   48 8b 54 24 08  mov0x8(%rsp),%rdx
> 811f65c1:   48 b8 00 00 00 00 00movabs
> $0xdc00,%rax
> 811f65c8:   fc ff df
> 811f65cb:   48 c1 ea 03 shr$0x3,%rdx
> 811f65cf:   80 3c 02
> 00 cmpb   $0x0,(%rdx,%rax,1)        <- fault here.
> 811f65d3:   0f 85 36 19 00 00   jne811f7f0f
> 
> 
> %rdx should contain a pointer to loaded_vmcs. It is directly loaded 
> from the stack [0x8(%rsp)]. This same stack location was just used 
> before the inlined assembly for VMRESUME/VMLAUNCH here:
> 
>         vmx->__launched = vmx->loaded_vmcs->launched;
> 811f639f:   e8 5c cd 57 00  callq  81773100
> <__sanitizer_cov_trace_pc>
> 811f63a4:   48 8b 54 24 08  mov0x8(%rsp),%rdx
> 811f63a9:   48 b8 00 00 00 00 00movabs
> $0xdc00,%rax
> 811f63b0:   fc ff df
> 811f63b3:   48 c1 ea 03 shr$0x3,%rdx
> 811f63b7:   80 3c 02
> 00 cmpb   $0x0,(%rdx,%rax,1)        <- used here.
> 
> ... and this stack location was never touched by anything in between! 
> So something must have corrupted the stack itself not really the 
> kvm_vc
> pu struct.
> 
> Obviously the inlined assembly block is using the stack as well, but I 
> can not see anything that would cause this corruption there.
> 
> That being said, looking at the %rsp and %rbp values that are dumped
> in the stack trace:
> 
> RSP: 8801b7d7f380
> RBP: 8801b8260140
> 
> ... they are almost 4.8 MiB apart! Should not these two register be a 
> bit closer to each other? :)
> 
> So 2 possibilities here:
> 
> 1- %rsp is wrong
> 
> That would explain why the loaded_vmcs was NULL. However, it is a bit 
> harder to understand how it became wrong! It should have been restored 
> during the VMEXIT from the HOST_RSP value in the VMCS!
> 
> Is this a nested setup?
> 
> 2- %rbp is wrong
> 
> That would also explain why the loaded_vmcs was NULL. Whatever
> corrupted the stack that caused loaded_vmcs to be NULL could have also
> corrupted the %rbp saved in the stack. That would mean that it happened
> during a function call. All function calls that happened between the
> point when the stack was sane (just before the "asm" block for
> VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
> can not see where the stack would get corrupted though! Obviously
> another source of corruption can be a completely unrelated thread
> directly corruption this thread's memory.
> 
> Maybe it would be easier to just try to repro it first and see which 
> one is true (if at all).
> 
> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
> 
> 
> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
> > 
> >   22: 0f 01 c3  vmresume
> >   25: 48 89 4c 24 08mov%rcx,0x8(%rsp)
> >   2a: 59pop%rcx
> > 
> > :
> >   2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
> >   32: 48 89 81 00 03 00 00 mov%rax,0x300(%rcx)
> >   39: 48 89 99 18 03 00 00 mov%rbx,0x318(%rcx)
> > 
> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
> > canonical: 110035842e78.
> > 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: general protection fault in vmx_vcpu_run

2018-06-30 Thread Raslan, KarimAllah
Looking also at the other crash [0]:

        msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
811f65b7:   e8 44 cb 57 00  callq  81773100
<__sanitizer_cov_trace_pc>
811f65bc:   48 8b 54 24 08  mov0x8(%rsp),%rdx
811f65c1:   48 b8 00 00 00 00 00movabs
$0xdc00,%rax
811f65c8:   fc ff df
811f65cb:   48 c1 ea 03 shr$0x3,%rdx
811f65cf:   80 3c 02
00 cmpb   $0x0,(%rdx,%rax,1)        <- fault here.
811f65d3:   0f 85 36 19 00 00   jne811f7f0f


%rdx should contain a pointer to loaded_vmcs. It is directly loaded 
from the stack [0x8(%rsp)]. This same stack location was just used 
before the inlined assembly for VMRESUME/VMLAUNCH here:

        vmx->__launched = vmx->loaded_vmcs->launched;
811f639f:   e8 5c cd 57 00  callq  81773100
<__sanitizer_cov_trace_pc>
811f63a4:   48 8b 54 24 08  mov0x8(%rsp),%rdx
811f63a9:   48 b8 00 00 00 00 00movabs
$0xdc00,%rax
811f63b0:   fc ff df
811f63b3:   48 c1 ea 03 shr$0x3,%rdx
811f63b7:   80 3c 02
00 cmpb   $0x0,(%rdx,%rax,1)        <- used here.

... and this stack location was never touched by anything in between! 
So something must have corrupted the stack itself, not really the 
kvm_vcpu struct.
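
(For reference, the movabs/shr/cmpb sequence in the dumps above is the
KASAN shadow check the compiler emits before the dereference; a sketch
of the address computation it performs, assuming the stock x86-64 KASAN
parameters:)

static inline u8 *kasan_shadow_of(const void *addr)
{
	/* one shadow byte covers 8 bytes of real memory */
	return (u8 *)(((unsigned long)addr >> 3) + 0xdffffc0000000000UL);
}

/* so the faulting cmpb means the pointer pulled from 0x8(%rsp) was
 * already bogus before the real dereference even happened */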

Obviously the inlined assembly block is using the stack as well, but I 
cannot see anything that would cause this corruption there.

That being said, looking at the %rsp and %rbp values that are dumped
in the stack trace:

RSP: 8801b7d7f380
RBP: 8801b8260140

... they are almost 4.8 MiB apart! Should not these two registers be a 
bit closer to each other? :)

So 2 possibilities here:

1- %rsp is wrong

That would explain why the loaded_vmcs was NULL. However, it is a bit 
harder to understand how it became wrong! It should have been restored 
during the VMEXIT from the HOST_RSP value in the VMCS!
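
(Purely to illustrate that expectation, and only as a sketch since the
exact code in the tree under test may differ: vmx_vcpu_run keeps the
VMCS field in sync with the live stack pointer along these lines, and
the CPU reloads %rsp from HOST_RSP on every VM exit:)

	unsigned long host_rsp;

	/* before VMLAUNCH/VMRESUME, publish the current stack pointer */
	asm volatile("mov %%rsp, %0" : "=r"(host_rsp));
	if (host_rsp != vmx->host_rsp) {	/* vmx->host_rsp: cached copy */
		vmx->host_rsp = host_rsp;
		vmcs_writel(HOST_RSP, host_rsp);	/* restored into %rsp on exit */
	}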

Is this a nested setup?

2- %rbp is wrong

That would also explain why the loaded_vmcs was NULL. Whatever
corrupted the stack that caused loaded_vmcs to be NULL could have also
corrupted the %rbp saved in the stack. That would mean that it happened
during a function call. All function calls that happened between the
point when the stack was sane (just before the "asm" block for
VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
cannot see where the stack would get corrupted, though! Obviously
another source of corruption could be a completely unrelated thread
directly corrupting this thread's memory.

Maybe it would be easier to just try to repro it first and see which 
one is true (if at all).

[0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550


On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
>   22: 0f 01 c3  vmresume
>   25: 48 89 4c 24 08mov%rcx,0x8(%rsp)
>   2a: 59pop%rcx
> 
> :
>   2b: 0f 96 81 88 56 00 00 setbe  0x5688(%rcx)
>   32: 48 89 81 00 03 00 00 mov%rax,0x300(%rcx)
>   39: 48 89 99 18 03 00 00 mov%rbx,0x318(%rcx)
> 
> %rcx should be pointing to the vcpu_vmx structure, but it's not even
> canonical: 110035842e78.
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v2 00/12] KVM/X86: Introduce a new guest mapping interface

2018-05-22 Thread Raslan, KarimAllah
On Tue, 2018-05-15 at 12:06 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 16, 2018 at 02:27:13PM +0200, Paolo Bonzini wrote:
> > 
> > On 16/04/2018 14:09, Raslan, KarimAllah wrote:
> > > 
> > > > 
> > > > I assume the caching will also be a separate patch.
> > > Yup, do you want me to include it in this one? I already have it, I
> > > just thought that I get those bits out first.
> > 
> > It's the same for me.
> > 
> > Paolo
> > 
> > > 
> > > > 
> > > > It looks good except that I'd squash patches 4 and 9 together.
> > > Yup, makes sense. I should have squashed them when I removed the 
> > > lifecycle change!
> > > 
> > > Thanks for the review :)
> > > 
> > > > 
> > > > But I'd like a second set of eyes to look at it.
> 
> Did anybody else end up reviewing these patches? And would it make sense
> to repost a new version with the #4 and #9 squashed? Thanks.

No, no second review yet. I will squash and repost a new version.

> 
> > 
> > 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] KVM: Switch 'requests' to be 64-bit (explicitly)

2018-05-22 Thread Raslan, KarimAllah
On Mon, 2018-04-16 at 18:28 +0200, Paolo Bonzini wrote:
> On 15/04/2018 00:26, KarimAllah Ahmed wrote:
> > 
> > Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
> > use the size of "requests" instead of the hard-coded '32'.
> > 
> > That gives us a bit more room again for arch-specific requests as we
> > already ran out of space for x86 due to the hard-coded check.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> 
> I'm afraid architectures like ARM 32 need this to be conditional (using
> Kconfig).

Why would using a 64-bit 'requests' be a problem for ARM32? Are you 
concerned about performance here or is there some semantic problem?

> 
> Thanks,
> 
> Paolo
> 
> > 
> > ---
> >  include/linux/kvm_host.h | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 6930c63..fe4f46b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -129,7 +129,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQUEST_ARCH_BASE 8
> >  
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > -   BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> > +   BUILD_BUG_ON((unsigned)(nr) >= (sizeof(((struct kvm_vcpu 
> > *)0)->requests) * 8) - KVM_REQUEST_ARCH_BASE); \
> > (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
> >  })
> >  #define KVM_ARCH_REQ(nr)   KVM_ARCH_REQ_FLAGS(nr, 0)
> > @@ -223,7 +223,7 @@ struct kvm_vcpu {
> > int vcpu_id;
> > int srcu_idx;
> > int mode;
> > -   unsigned long requests;
> > +   u64 requests;
> > unsigned long guest_debug;
> >  
> > int pre_pcpu;
> > @@ -1122,7 +1122,7 @@ static inline void kvm_make_request(int req, struct 
> > kvm_vcpu *vcpu)
> >  * caller.  Paired with the smp_mb__after_atomic in kvm_check_request.
> >  */
> > smp_wmb();
> > -   set_bit(req & KVM_REQUEST_MASK, >requests);
> > +   set_bit(req & KVM_REQUEST_MASK, (void *)>requests);
> >  }
> >  
> >  static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
> > @@ -1132,12 +1132,12 @@ static inline bool kvm_request_pending(struct 
> > kvm_vcpu *vcpu)
> >  
> >  static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
> >  {
> > -   return test_bit(req & KVM_REQUEST_MASK, >requests);
> > +   return test_bit(req & KVM_REQUEST_MASK, (void *)>requests);
> >  }
> >  
> >  static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
> >  {
> > -   clear_bit(req & KVM_REQUEST_MASK, >requests);
> > +   clear_bit(req & KVM_REQUEST_MASK, (void *)>requests);
> >  }
> >  
> >  static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
> > 
> 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE

2018-04-16 Thread Raslan, KarimAllah
On Mon, 2018-04-16 at 09:22 -0700, Jim Mattson wrote:
> On Thu, Apr 12, 2018 at 8:12 AM, KarimAllah Ahmed  wrote:
> 
> > 
> > v2 -> v3:
> > - Remove the forced VMExit from L2 after reading the kvm_state. The actual
> >   problem is solved.
> > - Rebase again!
> > - Set nested_run_pending during restore (not sure if it makes sense yet or
> >   not).
> 
> This doesn't actually make sense. Nested_run_pending should only be
> set between L1 doing a VMLAUNCH/VMRESUME and the first instruction
> executing in L2. That is extremely unlikely at a restore point.

Yeah, I am afraid I put very little thought into it as I was focused
on the TSC issue :)

Will handle it properly in next version.

> 
> To deal with nested_run_pending and nested save/restore,
> nested_run_pending should be set to 1 before calling
> enter_vmx_non_root_mode, as it was prior to commit 7af40ad37b3f. That
> means that it has to be cleared when emulating VM-entry to the halted
> state (prior to calling kvm_vcpu_halt). And all of the from_vmentry
> arguments that Paolo added when rebasing commit cf8b84f48a59 should be
> removed, so that nested_run_pending is propagated correctly duting a
> restore.
> 
> It should be possible to eliminate this strange little wart, but I
> haven't looked deeply into it.
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v2 00/12] KVM/X86: Introduce a new guest mapping interface

2018-04-16 Thread Raslan, KarimAllah
On Mon, 2018-04-16 at 13:10 +0200, Paolo Bonzini wrote:
> On 15/04/2018 23:53, KarimAllah Ahmed wrote:
> > 
> > Guest memory can either be directly managed by the kernel (i.e. have a 
> > "struct
> > page") or they can simply live outside kernel control (i.e. do not have a
> > "struct page"). KVM mostly support these two modes, except in a few places
> > where the code seems to assume that guest memory must have a "struct page".
> > 
> > This patchset introduces a new mapping interface to map guest memory into 
> > host
> > kernel memory which also supports PFN-based memory (i.e. memory without 
> > 'struct
> > page'). It also converts all offending code to this interface or simply
> > read/write directly from guest memory.
> > 
> > As far as I can see all offending code is now fixed except the APIC-access 
> > page
> > which I will handle in a seperate patch.
> 
> I assume the caching will also be a separate patch.

Yup, do you want me to include it in this one? I already have it, I
just thought I'd get those bits out first.

> 
> It looks good except that I'd squash patches 4 and 9 together.

Yup, makes sense. I should have squashed them when I removed the 
lifecycle change!

Thanks for the review :)

> But I'd like a second set of eyes to look at it.
> 
> Thanks,
> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v4] X86/KVM: Properly update 'tsc_offset' to represent the running guest

2018-04-15 Thread Raslan, KarimAllah
On Sat, 2018-04-14 at 05:10 +0200, KarimAllah Ahmed wrote:
> Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always
> captures the TSC_OFFSET of the running guest whether it is the L1 or L2
> guest.
> 
> Cc: Jim Mattson 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Paolo Bonzini 
> Signed-off-by: KarimAllah Ahmed 
> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo]
> Signed-off-by: Paolo Bonzini 
> 
> ---
> v3 -> v4:
> - Restore L01 tsc_offset on enter_vmx_non_root_mode failures.
> - Move tsc_offset update for L02 later in nested_vmx_run.
> 
> v2 -> v3:
> - Add AMD bits as well.
> - Fix update_ia32_tsc_adjust_msr.
> 
> v1 -> v2:
> - Rewrote the patch to always update tsc_offset to represent the current
>   guest (pbonzini@)
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  | 17 -
>  arch/x86/kvm/vmx.c  | 29 -
>  arch/x86/kvm/x86.c  |  6 --
>  4 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7a200f6..a40a32e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1016,6 +1016,7 @@ struct kvm_x86_ops {
>  
>   bool (*has_wbinvd_exit)(void);
>  
> + u64 (*read_l1_tsc_offset)(struct kvm_vcpu *vcpu);
>   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>  
>   void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index b58787d..1f00c18 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1423,12 +1423,23 @@ static void init_sys_seg(struct vmcb_seg *seg, 
> uint32_t type)
>   seg->base = 0;
>  }
>  
> +static u64 svm_read_l1_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> +
> + if (is_guest_mode(vcpu))
> + return svm->nested.hsave->control.tsc_offset;
> +
> + return vcpu->arch.tsc_offset;
> +}
> +
>  static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
>   u64 g_tsc_offset = 0;
>  
>   if (is_guest_mode(vcpu)) {
> + /* Write L1's TSC offset.  */
>   g_tsc_offset = svm->vmcb->control.tsc_offset -
>  svm->nested.hsave->control.tsc_offset;
>   svm->nested.hsave->control.tsc_offset = offset;
> @@ -3322,6 +,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>   /* Restore the original control entries */
>   copy_vmcb_control_area(vmcb, hsave);
>  
> + vcpu->arch.tsc_offset = svm->vmcb->control.tsc_offset;

Paolo,

'vcpu' is actually not defined in this context (and in all the other 
occurrences below). Would you like me to send a fixed version of this 
bit, or can you fix it up before applying?
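
(For reference, the fixup would presumably be along these lines at the
top of the affected SVM functions; a sketch, not the applied patch:)

	struct kvm_vcpu *vcpu = &svm->vcpu;	/* 'vcpu' has to be derived from 'svm' */

	/* ... */
	vcpu->arch.tsc_offset = svm->vmcb->control.tsc_offset;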

>   kvm_clear_exception_queue(>vcpu);
>   kvm_clear_interrupt_queue(>vcpu);
>  
> @@ -3482,10 +3494,12 @@ static void enter_svm_guest_mode(struct vcpu_svm 
> *svm, u64 vmcb_gpa,
>   /* We don't want to see VMMCALLs from a nested guest */
>   clr_intercept(svm, INTERCEPT_VMMCALL);
>  
> + vcpu->arch.tsc_offset += nested_vmcb->control.tsc_offset;
> + svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset;
> +
>   svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext;
>   svm->vmcb->control.int_vector = nested_vmcb->control.int_vector;
>   svm->vmcb->control.int_state = nested_vmcb->control.int_state;
> - svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
>   svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
>   svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
>  
> @@ -7102,6 +7116,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = 
> {
>  
>   .has_wbinvd_exit = svm_has_wbinvd_exit,
>  
> + .read_l1_tsc_offset = svm_read_l1_tsc_offset,
>   .write_tsc_offset = svm_write_tsc_offset,
>  
>   .set_tdp_cr3 = set_tdp_cr3,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index b6942de..05ba3c6 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2885,6 +2885,17 @@ static void setup_msrs(struct vcpu_vmx *vmx)
>   vmx_update_msr_bitmap(>vcpu);
>  }
>  
> +static u64 vmx_read_l1_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +
> + if (is_guest_mode(vcpu) &&
> + (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING))
> + return vcpu->arch.tsc_offset - vmcs12->tsc_offset;
> +
> + return vcpu->arch.tsc_offset;
> +}
> +
>  /*
>   * reads and returns guest's timestamp counter "register"
>   * guest_tsc = (host_tsc * tsc multiplier) >> 48 + tsc_offset
> @@ -2,11 +11123,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
>  

Re: [PATCH] KVM: Switch 'requests' to be 64-bit (explicitly)

2018-04-15 Thread Raslan, KarimAllah
On Sun, 2018-04-15 at 00:26 +0200, KarimAllah Ahmed wrote:
> Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
> use the size of "requests" instead of the hard-coded '32'.
> 
> That gives us a bit more room again for arch-specific requests as we
> already ran out of space for x86 due to the hard-coded check.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: KarimAllah Ahmed 
> ---
>  include/linux/kvm_host.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6930c63..fe4f46b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -129,7 +129,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQUEST_ARCH_BASE 8
>  
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> - BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> + BUILD_BUG_ON((unsigned)(nr) >= (sizeof(((struct kvm_vcpu 
> *)0)->requests) * 8) - KVM_REQUEST_ARCH_BASE); \

While looking at some completely unrelated code I realized that there 
is FIELD_SIZEOF, which does exactly what I did here. Will use it in v2.
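
(Roughly, i.e. something like this for v2; a sketch only:)

#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
	BUILD_BUG_ON((unsigned)(nr) >= \
		     FIELD_SIZEOF(struct kvm_vcpu, requests) * 8 - KVM_REQUEST_ARCH_BASE); \
	(unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
})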

>   (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
>  })
>  #define KVM_ARCH_REQ(nr)   KVM_ARCH_REQ_FLAGS(nr, 0)
> @@ -223,7 +223,7 @@ struct kvm_vcpu {
>   int vcpu_id;
>   int srcu_idx;
>   int mode;
> - unsigned long requests;
> + u64 requests;
>   unsigned long guest_debug;
>  
>   int pre_pcpu;
> @@ -1122,7 +1122,7 @@ static inline void kvm_make_request(int req, struct 
> kvm_vcpu *vcpu)
>* caller.  Paired with the smp_mb__after_atomic in kvm_check_request.
>*/
>   smp_wmb();
> - set_bit(req & KVM_REQUEST_MASK, >requests);
> + set_bit(req & KVM_REQUEST_MASK, (void *)>requests);
>  }
>  
>  static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
> @@ -1132,12 +1132,12 @@ static inline bool kvm_request_pending(struct 
> kvm_vcpu *vcpu)
>  
>  static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
>  {
> - return test_bit(req & KVM_REQUEST_MASK, >requests);
> + return test_bit(req & KVM_REQUEST_MASK, (void *)>requests);
>  }
>  
>  static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
>  {
> - clear_bit(req & KVM_REQUEST_MASK, >requests);
> + clear_bit(req & KVM_REQUEST_MASK, (void *)>requests);
>  }
>  
>  static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE

2018-04-14 Thread Raslan, KarimAllah
On Sat, 2018-04-14 at 15:56 +, Raslan, KarimAllah wrote:
> On Thu, 2018-04-12 at 17:12 +0200, KarimAllah Ahmed wrote:
> > 
> > From: Jim Mattson 
> > 
> > For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> > this state can not be captured through the currently available IOCTLs. In
> > fact the state captured through all of these IOCTLs is usually a mix of L1
> > and L2 state. It is also dependent on whether the L2 guest was running at
> > the moment when the process was interrupted to save its state.
> > 
> > With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> > KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> > in VMX operation.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: H. Peter Anvin 
> > Cc: x...@kernel.org
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Jim Mattson 
> > [karahmed@ - rename structs and functions and make them ready for AMD and
> >  address previous comments.
> >- rebase & a bit of refactoring.
> >- Merge 7/8 and 8/8 into one patch.
> >- Force a VMExit from L2 after reading the kvm_state to avoid
> >  mixed state between L1 and L2 on resurrecting the instance. ]
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v2 -> v3:
> > - Remove the forced VMExit from L2 after reading the kvm_state. The actual
> >   problem is solved.
> > - Rebase again!
> > - Set nested_run_pending during restore (not sure if it makes sense yet or
> >   not).
> > - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
> >   to switch everything to u64)
> > 
> > v1 -> v2:
> > - Rename structs and functions and make them ready for AMD and address
> >   previous comments.
> > - Rebase & a bit of refactoring.
> > - Merge 7/8 and 8/8 into one patch.
> > - Force a VMExit from L2 after reading the kvm_state to avoid mixed state
> >   between L1 and L2 on resurrecting the instance.
> > ---
> >  Documentation/virtual/kvm/api.txt |  47 ++
> >  arch/x86/include/asm/kvm_host.h   |   7 ++
> >  arch/x86/include/uapi/asm/kvm.h   |  38 
> >  arch/x86/kvm/vmx.c| 177 
> > +-
> >  arch/x86/kvm/x86.c|  21 +
> >  include/linux/kvm_host.h  |   2 +-
> >  include/uapi/linux/kvm.h  |   5 ++
> >  7 files changed, 292 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt 
> > b/Documentation/virtual/kvm/api.txt
> > index 1c7958b..c51d5d3 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -3548,6 +3548,53 @@ Returns: 0 on success,
> > -ENOENT on deassign if the conn_id isn't registered
> > -EEXIST on assign if the conn_id is already registered
> >  
> > +4.114 KVM_GET_STATE
> > +
> > +Capability: KVM_CAP_STATE
> > +Architectures: x86
> > +Type: vcpu ioctl
> > +Parameters: struct kvm_state (in/out)
> > +Returns: 0 on success, -1 on error
> > +Errors:
> > +  E2BIG: the data size exceeds the value of 'size' specified by
> > + the user (the size required will be written into size).
> > +
> > +struct kvm_state {
> > +   __u16 flags;
> > +   __u16 format;
> > +   __u32 size;
> > +   union {
> > +   struct kvm_vmx_state vmx;
> > +   struct kvm_svm_state svm;
> > +   __u8 pad[120];
> > +   };
> > +   __u8 data[0];
> > +};
> > +
> > +This ioctl copies the vcpu's kvm_state struct from the kernel to userspace.
> > +
> > +4.115 KVM_SET_STATE
> > +
> > +Capability: KVM_CAP_STATE
> > +Architectures: x86
> > +Type: vcpu ioctl
> > +Parameters: struct kvm_state (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +struct kvm_state {
> > +   __u16 flags;
> > +   __u16 format;
> > +   __u32 size;
> > +   union {
> > +   struct kvm_vmx_state vmx;
> > +   struct kvm_svm_state svm;
> > +   __u8 pad[120];
> > +   };
> > +   __u8 data[0];
> > +};
> > +
> > +This copies the vcpu's kvm_state struct from userspace to the kernel.
> > +>>>>>>> 13a7c9e... kvm: nVMX: Introduce KVM_CAP_STATE
> >  
> >  5. The kvm_run structure
> >  
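For completeness, the E2BIG size negotiation described in the quoted
KVM_GET_STATE documentation could be driven from userspace roughly like this.
This is a purely hypothetical sketch: it assumes kernel headers with the patch
above applied (KVM_GET_STATE and struct kvm_state), and the helper name is
made up.

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static struct kvm_state *get_vcpu_state(int vcpu_fd)
{
        struct kvm_state probe;
        struct kvm_state *state;

        memset(&probe, 0, sizeof(probe));
        probe.size = sizeof(probe);

        /* If everything fits in the header-sized buffer, keep a copy of it. */
        if (ioctl(vcpu_fd, KVM_GET_STATE, &probe) == 0) {
                state = malloc(sizeof(probe));
                if (state)
                        memcpy(state, &probe, sizeof(probe));
                return state;
        }

        /* Otherwise the kernel reports the required size via E2BIG. */
        if (errno != E2BIG)
                return NULL;

        state = calloc(1, probe.size);
        if (!state)
                return NULL;

        state->size = probe.size;
        if (ioctl(vcpu_fd, KVM_GET_STATE, state) < 0) {
                free(state);
                return NULL;
        }

        return state;
}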

Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE

2018-04-14 Thread Raslan, KarimAllah
On Thu, 2018-04-12 at 17:12 +0200, KarimAllah Ahmed wrote:
> From: Jim Mattson 
> 
> For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> this state can not be captured through the currently available IOCTLs. In
> fact the state captured through all of these IOCTLs is usually a mix of L1
> and L2 state. It is also dependent on whether the L2 guest was running at
> the moment when the process was interrupted to save its state.
> 
> With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> in VMX operation.
> 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: H. Peter Anvin 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Jim Mattson 
> [karahmed@ - rename structs and functions and make them ready for AMD and
>  address previous comments.
>- rebase & a bit of refactoring.
>- Merge 7/8 and 8/8 into one patch.
>- Force a VMExit from L2 after reading the kvm_state to avoid
>  mixed state between L1 and L2 on resurrecting the instance. ]
> Signed-off-by: KarimAllah Ahmed 
> ---
> v2 -> v3:
> - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>   problem is solved.
> - Rebase again!
> - Set nested_run_pending during restore (not sure if it makes sense yet or
>   not).
> - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
>   to switch everything to u64)
> 
> v1 -> v2:
> - Rename structs and functions and make them ready for AMD and address
>   previous comments.
> - Rebase & a bit of refactoring.
> - Merge 7/8 and 8/8 into one patch.
> - Force a VMExit from L2 after reading the kvm_state to avoid mixed state
>   between L1 and L2 on resurrecting the instance.
> ---
>  Documentation/virtual/kvm/api.txt |  47 ++
>  arch/x86/include/asm/kvm_host.h   |   7 ++
>  arch/x86/include/uapi/asm/kvm.h   |  38 
>  arch/x86/kvm/vmx.c| 177 
> +-
>  arch/x86/kvm/x86.c|  21 +
>  include/linux/kvm_host.h  |   2 +-
>  include/uapi/linux/kvm.h  |   5 ++
>  7 files changed, 292 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 1c7958b..c51d5d3 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3548,6 +3548,53 @@ Returns: 0 on success,
>   -ENOENT on deassign if the conn_id isn't registered
>   -EEXIST on assign if the conn_id is already registered
>  
> +4.114 KVM_GET_STATE
> +
> +Capability: KVM_CAP_STATE
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_state (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  E2BIG: the data size exceeds the value of 'size' specified by
> + the user (the size required will be written into size).
> +
> +struct kvm_state {
> + __u16 flags;
> + __u16 format;
> + __u32 size;
> + union {
> + struct kvm_vmx_state vmx;
> + struct kvm_svm_state svm;
> + __u8 pad[120];
> + };
> + __u8 data[0];
> +};
> +
> +This ioctl copies the vcpu's kvm_state struct from the kernel to userspace.
> +
> +4.115 KVM_SET_STATE
> +
> +Capability: KVM_CAP_STATE
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_state (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_state {
> + __u16 flags;
> + __u16 format;
> + __u32 size;
> + union {
> + struct kvm_vmx_state vmx;
> + struct kvm_svm_state svm;
> + __u8 pad[120];
> + };
> + __u8 data[0];
> +};
> +
> +This copies the vcpu's kvm_state struct from userspace to the kernel.
> +>>> 13a7c9e... kvm: nVMX: Introduce KVM_CAP_STATE
>  
>  5. The kvm_run structure
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9fa4f57..ad2116a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -75,6 +75,7 @@
>  #define KVM_REQ_HV_EXIT  KVM_ARCH_REQ(21)
>  #define KVM_REQ_HV_STIMERKVM_ARCH_REQ(22)
>  #define KVM_REQ_LOAD_EOI_EXITMAP KVM_ARCH_REQ(23)
> +#define KVM_REQ_GET_VMCS12_PAGES KVM_ARCH_REQ(24)
>  
>  #define CR0_RESERVED_BITS   \
>   (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -1084,6 +1085,12 @@ struct kvm_x86_ops {
>  
>   void (*setup_mce)(struct kvm_vcpu *vcpu);
>  
> + int (*get_state)(struct kvm_vcpu *vcpu,
> +  struct kvm_state __user *user_kvm_state);
> + int (*set_state)(struct kvm_vcpu *vcpu,
> +  struct kvm_state __user *user_kvm_state);
> + void 

Re: [PATCH 1/2] X86/KVM: Properly update 'tsc_offset' to represent the running guest

2018-04-14 Thread Raslan, KarimAllah
On Fri, 2018-04-13 at 17:35 +0200, Paolo Bonzini wrote:
> On 13/04/2018 14:40, Raslan, KarimAllah wrote:
> > 
> > > 
> > >  
> > >  static void update_ia32_tsc_adjust_msr(struct kvm_vcpu *vcpu, s64 offset)
> > >  {
> > > - u64 curr_offset = vcpu->arch.tsc_offset;
> > > + u64 curr_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu);
> > I might be missing something but is this really strictly needed or is
> > it really a bug?
> > 
> > I can see update_ia32_tsc_adjust_msr called from kvm_write_tsc only 
> > which is called from a) vmx_set_msr or b) kvm_arch_vcpu_postcreate.
> > The adjust_msr would only be called if !host_initiated. So only 
> > vmx_set_msr which is coming from an L1 write (or a restore but that
> > would not be !host_initiated). So the only that tsc_adjust is called is
> > !is_guest_mode.
> 
> It can also be called from guest mode if the MSR bitmap says there's no
> L1 vmexit for that MSR; that's what the testcases do.

Apparently I will never wrap my head around this nested stuff :D

> 
> Paolo
> 
> > 
> > > 
> > >   vcpu->arch.ia32_tsc_adjust_msr += offset - curr_offset;
> 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 1/2] X86/KVM: Properly update 'tsc_offset' to represent the running guest

2018-04-13 Thread Raslan, KarimAllah
On Fri, 2018-04-13 at 18:04 +0200, Paolo Bonzini wrote:
> On 13/04/2018 18:02, Jim Mattson wrote:
> > 
> > On Fri, Apr 13, 2018 at 4:23 AM, Paolo Bonzini  wrote:
> > > 
> > > From: KarimAllah Ahmed 
> > > 
> > > Update 'tsc_offset' on vmenty/vmexit of L2 guests to ensure that it always
> > > captures the TSC_OFFSET of the running guest whether it is the L1 or L2
> > > guest.
> > > 
> > > Cc: Jim Mattson 
> > > Cc: Paolo Bonzini 
> > > Cc: Radim Krčmář 
> > > Cc: k...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Suggested-by: Paolo Bonzini 
> > > Signed-off-by: KarimAllah Ahmed 
> > > [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo]
> > > Signed-off-by: Paolo Bonzini 
> > 
> > > 
> > > @@ -11489,6 +11497,9 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, 
> > > bool launch)
> > > if (enable_shadow_vmcs)
> > > copy_shadow_to_vmcs12(vmx);
> > > 
> > > +   if (vmcs12->cpu_based_vm_exec_control & 
> > > CPU_BASED_USE_TSC_OFFSETING)
> > > +   vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > > +
> > 
> > This seems a little early, since we don't restore the L1 TSC offset on
> > the nested_vmx_failValid path.
> > 
> 
> Now this can be a nice one to introduce the VMX API tests. :)  I'll try
> to do it on Monday as punishment for not noticing the bug.  In the
> meanwhile, Karim, can you post a fixed fixed version?

done
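For reference, the shape of the fix on the failure path is roughly the
following (illustrative only; the function name is made up and the posted
version may simply do the subtraction inline on the nested_vmx_run()
failure path):

static void nested_vmx_restore_l1_tsc_offset(struct kvm_vcpu *vcpu,
                                             struct vmcs12 *vmcs12)
{
        /* Undo what nested_vmx_run() added before the launch was attempted,
         * so vcpu->arch.tsc_offset reflects L1's offset again. */
        if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
                vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
}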

> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 1/2] X86/KVM: Properly update 'tsc_offset' to represent the running guest

2018-04-13 Thread Raslan, KarimAllah
On Fri, 2018-04-13 at 13:23 +0200, Paolo Bonzini wrote:
> From: KarimAllah Ahmed 
> 
> Update 'tsc_offset' on vmenty/vmexit of L2 guests to ensure that it always
> captures the TSC_OFFSET of the running guest whether it is the L1 or L2
> guest.
> 
> Cc: Jim Mattson 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Paolo Bonzini 
> Signed-off-by: KarimAllah Ahmed 
> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo]
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  | 17 -
>  arch/x86/kvm/vmx.c  | 25 -
>  arch/x86/kvm/x86.c  |  6 --
>  4 files changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 949c977bc4c9..c25775fad4ed 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1013,6 +1013,7 @@ struct kvm_x86_ops {
>  
>   bool (*has_wbinvd_exit)(void);
>  
> + u64 (*read_l1_tsc_offset)(struct kvm_vcpu *vcpu);
>   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>  
>   void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index b3ebc8ad6891..ea7c6d29aca5 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c

Thank you for adding the AMD bits; I did not have a machine to test them on,
so I had left that part untouched :)

> @@ -1424,12 +1424,23 @@ static void init_sys_seg(struct vmcb_seg *seg, 
> uint32_t type)
>   seg->base = 0;
>  }
>  
> +static u64 svm_read_l1_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> +
> + if (is_guest_mode(vcpu))
> + return svm->nested.hsave->control.tsc_offset;
> +
> + return vcpu->arch.tsc_offset;
> +}
> +
>  static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
>   u64 g_tsc_offset = 0;
>  
>   if (is_guest_mode(vcpu)) {
> + /* Write L1's TSC offset.  */
>   g_tsc_offset = svm->vmcb->control.tsc_offset -
>  svm->nested.hsave->control.tsc_offset;
>   svm->nested.hsave->control.tsc_offset = offset;
> @@ -3323,6 +3334,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>   /* Restore the original control entries */
>   copy_vmcb_control_area(vmcb, hsave);
>  
> + vcpu->arch.tsc_offset = svm->vmcb->control.tsc_offset;
>   kvm_clear_exception_queue(>vcpu);
>   kvm_clear_interrupt_queue(>vcpu);
>  
> @@ -3483,10 +3495,12 @@ static void enter_svm_guest_mode(struct vcpu_svm 
> *svm, u64 vmcb_gpa,
>   /* We don't want to see VMMCALLs from a nested guest */
>   clr_intercept(svm, INTERCEPT_VMMCALL);
>  
> + vcpu->arch.tsc_offset += nested_vmcb->control.tsc_offset;
> + svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset;
> +
>   svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext;
>   svm->vmcb->control.int_vector = nested_vmcb->control.int_vector;
>   svm->vmcb->control.int_state = nested_vmcb->control.int_state;
> - svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
>   svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
>   svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
>  
> @@ -7102,6 +7116,7 @@ static int svm_unregister_enc_region(struct kvm *kvm,
>  
>   .has_wbinvd_exit = svm_has_wbinvd_exit,
>  
> + .read_l1_tsc_offset = svm_read_l1_tsc_offset,
>   .write_tsc_offset = svm_write_tsc_offset,
>  
>   .set_tdp_cr3 = set_tdp_cr3,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a13c603bdefb..6553419202ee 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2874,6 +2874,17 @@ static void setup_msrs(struct vcpu_vmx *vmx)
>   vmx_update_msr_bitmap(>vcpu);
>  }
>  
> +static u64 vmx_read_l1_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +
> + if (is_guest_mode(vcpu) &&
> + (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING))
> + return vcpu->arch.tsc_offset - vmcs12->tsc_offset;
> +
> + return vcpu->arch.tsc_offset;
> +}
> +
>  /*
>   * reads and returns guest's timestamp counter "register"
>   * guest_tsc = (host_tsc * tsc multiplier) >> 48 + tsc_offset
> @@ -11175,11 +11186,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
>   vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
>   }
>  
> - if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> - vmcs_write64(TSC_OFFSET,
> - vcpu->arch.tsc_offset + vmcs12->tsc_offset);
> - else
> - vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
> + 

Re: [PATCH 00/10] KVM/X86: Handle guest memory that does not have a struct page

2018-04-12 Thread Raslan, KarimAllah
On Thu, 2018-04-12 at 16:59 +0200, Paolo Bonzini wrote:
> On 21/02/2018 18:47, KarimAllah Ahmed wrote:
> > 
> > For the most part, KVM can handle guest memory that does not have a struct
> > page (i.e. not directly managed by the kernel). However, There are a few 
> > places
> > in the code, specially in the nested code, that does not support that.
> > 
> > Patch 1, 2, and 3 avoid the mapping and unmapping all together and just
> > directly use kvm_guest_read and kvm_guest_write.
> > 
> > Patch 4 introduces a new guest mapping interface that encapsulate all the
> > bioler plate code that is needed to map and unmap guest memory. It also
> > supports guest memory without "struct page".
> > 
> > Patch 5, 6, 7, 8, 9, and 10 switch most of the offending code in VMX and 
> > hyperv
> > to use the new guest mapping API.
> > 
> > This patch series is the first set of fixes. Handling SVM and APIC-access 
> > page
> > will be handled in a different patch series.
> 
> I like the patches and the new API.  However, I'm a bit less convinced
> about the caching aspect; keeping a page pinned is not the nicest thing
> with respect (for example) to memory hot-unplug.
> 
> Since you're basically reinventing kmap_high, or alternatively
> (depending on your background) xc_map_foreign_pages, it's not surprising
> that memremap is slow.  How slow is it really (as seen e.g. with
> vmexit.flat running in L1, on EC2 compared to vanilla KVM)?

I have not actually compared EC2 vs vanilla KVM, but I did compare the
cached and non-cached versions (both in the EC2 setup). The version that
cached the mappings was an order of magnitude better: booting an Ubuntu
L2 guest with QEMU took around 10-13 seconds with the caching and over
5 minutes without it.

I will test with vanilla KVM and post the results.

> 
> Perhaps you can keep some kind of per-CPU cache of the last N remapped
> pfns?  This cache would sit between memremap and __kvm_map_gfn and it
> would be completely transparent to the layer below since it takes raw
> pfns.  This removes the need to store the memslots generation etc.  (If
> you go this way please place it in virt/kvm/pfncache.[ch], since
> kvm_main.c is already way too big).

Yup, that sounds like a good idea. I have actually already implemented
some sort of a per-CPU mapping pool to reduce the overhead when the
vCPUs are over-committed. I will clean it up and post it as you
suggested; a rough sketch of the idea is below.
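(Every name in this sketch is hypothetical and it is not the code I have;
it only illustrates a small direct-mapped pfn cache sitting between
__kvm_map_gfn() and memremap(), as suggested above.)

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/percpu.h>

#define PFN_CACHE_ENTRIES 16

struct pfn_cache_entry {
        unsigned long pfn;      /* raw pfn backing the mapping */
        void *hva;              /* mapping returned by memremap() */
        bool valid;
};

struct pfn_cache {
        struct pfn_cache_entry entry[PFN_CACHE_ENTRIES];
};

static DEFINE_PER_CPU(struct pfn_cache, pfn_cache);

/* Caller is assumed to keep preemption disabled while the mapping is in use. */
static void *pfn_cache_map(unsigned long pfn)
{
        struct pfn_cache_entry *e =
                &this_cpu_ptr(&pfn_cache)->entry[pfn % PFN_CACHE_ENTRIES];

        if (e->valid && e->pfn == pfn)
                return e->hva;                  /* hit: reuse the mapping */

        if (e->valid)
                memunmap(e->hva);               /* evict the previous entry */

        e->hva = memremap((resource_size_t)pfn << PAGE_SHIFT,
                          PAGE_SIZE, MEMREMAP_WB);
        e->pfn = pfn;
        e->valid = e->hva != NULL;

        return e->hva;
}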

> 
> Thanks,
> 
> Paolo
> 
> > 
> > KarimAllah Ahmed (10):
> >   X86/nVMX: handle_vmon: Read 4 bytes from guest memory instead of
> > map->read->unmap sequence
> >   X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
> > instead of map->copy->unmap sequence.
> >   X86/nVMX: Update the PML table without mapping and unmapping the page
> >   KVM: Introduce a new guest mapping API
> >   KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
> >   KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
> >   KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt
> > descriptor table
> >   KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
> >   KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending
> >   KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg
> > 
> >  arch/x86/kvm/hyperv.c|  28 -
> >  arch/x86/kvm/vmx.c   | 144 
> > +++
> >  arch/x86/kvm/x86.c   |  13 ++---
> >  include/linux/kvm_host.h |  15 +
> >  virt/kvm/kvm_main.c  |  50 
> >  5 files changed, 129 insertions(+), 121 deletions(-)
> > 
> 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest

2018-04-12 Thread Raslan, KarimAllah
On Thu, 2018-04-12 at 22:21 +0200, Paolo Bonzini wrote:
> On 12/04/2018 19:21, Raslan, KarimAllah wrote:
> > 
> > Now looking further at the code, it seems that everywhere in the code
> > tsc_offset is treated as the L01 TSC_OFFSET.
> > 
> > Like here:
> > 
> >         if (vmcs12->cpu_based_vm_exec_control &
> > CPU_BASED_USE_TSC_OFFSETING)
> > vmcs_write64(TSC_OFFSET,
> > vcpu->arch.tsc_offset + vmcs12->tsc_offset);
> > 
> > and here:
> > 
> >         vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
> > 
> > and here:
> > 
> > u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> > {
> > return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> > }
> > EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> > 
> > ... would not it be simpler and more inline with the current code to
> > just do what I did above + remove the "+ l1_tsc_offset" + probably
> > document tsc_offset ?
> 
> Problem is, I don't think it's correct. :)  A good start would be to try
> disabling MSR_IA32_TSC interception in KVM, prepare a kvm-unit-tests
> test that reads the MSR, and see if you get the host or guest TSC...

I actually just submitted a patch with your original suggestion (I hope),
because I realized that the TSC-adjust path (update_ia32_tsc_adjust_msr)
was still using the wrong tsc_offset anyway :) A sketch of the corrected
helper is below.
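(Reconstructed from the hunk quoted in the "Properly update 'tsc_offset'"
messages above; treat it as a sketch rather than the final commit.)

static void update_ia32_tsc_adjust_msr(struct kvm_vcpu *vcpu, s64 offset)
{
        /* Use L1's offset even while L2 is running, not the active vmcs02 one. */
        u64 curr_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu);

        vcpu->arch.ia32_tsc_adjust_msr += offset - curr_offset;
}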

> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest

2018-04-12 Thread Raslan, KarimAllah
On Thu, 2018-04-12 at 17:04 +, Raslan, KarimAllah wrote:
> On Thu, 2018-04-12 at 18:35 +0200, Paolo Bonzini wrote:
> > 
> > On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> > > 
> > > 
> > > When the TSC MSR is captured while an L2 guest is running then restored,
> > > the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
> > > TSC_OFFSET. So ensure that this is compensated for when storing the value.
> > > 
> > > Cc: Jim Mattson <jmatt...@google.com>
> > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > Cc: Radim Krčmář <rkrc...@redhat.com>
> > > Cc: k...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: KarimAllah Ahmed <karah...@amazon.de>
> > > ---
> > >  arch/x86/kvm/vmx.c | 12 +---
> > >  arch/x86/kvm/x86.c |  1 -
> > >  2 files changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index cff2f50..2f57571 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
> > >   */
> > >  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >  {
> > > + u64 l1_tsc_offset = 0;
> > > +
> > >   if (is_guest_mode(vcpu)) {
> > >   /*
> > >* We're here if L1 chose not to trap WRMSR to TSC. According
> > > @@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
> > > *vcpu, u64 offset)
> > >* to the newly set TSC to get L2's TSC.
> > >*/
> > >   struct vmcs12 *vmcs12;
> > > +
> > >   /* recalculate vmcs02.TSC_OFFSET: */
> > >   vmcs12 = get_vmcs12(vcpu);
> > > - vmcs_write64(TSC_OFFSET, offset +
> > > - (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > > -  vmcs12->tsc_offset : 0));
> > > +
> > > + l1_tsc_offset = nested_cpu_has(vmcs12, 
> > > CPU_BASED_USE_TSC_OFFSETING) ?
> > > + vmcs12->tsc_offset : 0;
> > > + vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
> > >   } else {
> > >   trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> > >  vmcs_read64(TSC_OFFSET), offset);
> > >   vmcs_write64(TSC_OFFSET, offset);
> > >   }
> > > +
> > > + vcpu->arch.tsc_offset = offset - l1_tsc_offset;
> > 
> > Using both "offset + l1_tsc_offset" and "offset - l1_tsc_offset" in this 
> > function seems wrong to me: if vcpu->arch.tsc_offset must be "offset - 
> > l1_tsc_offset", then "offset" must be written to TSC_OFFSET.
> 
> Ooops! I forgot to remove the + l1_tsc_offset :D
> 
> > 
> > 
> > I think the bug was introduced by commit 3e3f50262.  Before,
> > vmx_read_tsc_offset returned the L02 offset; now it always contains the
> > L01 offset.  So the right fix is to adjust vcpu->arch.tsc_offset on
> > nested vmentry/vmexit.  If is_guest_mode(vcpu), kvm_read_l1_tsc must use
> > a new kvm_x86_ops callback to subtract the L12 offset from the value it
> > returns.
> 
> ack!

Now, looking further, it seems that everywhere in the code tsc_offset is
treated as the L01 TSC_OFFSET.

Like here:

        if (vmcs12->cpu_based_vm_exec_control &
CPU_BASED_USE_TSC_OFFSETING)
vmcs_write64(TSC_OFFSET,
vcpu->arch.tsc_offset + vmcs12->tsc_offset);

and here:

        vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);

and here:

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
}
EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);

... would it not be simpler, and more in line with the current code, to just
do what I did above, remove the "+ l1_tsc_offset", and probably document
tsc_offset? (A toy illustration of the offset arithmetic is below.)
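(A toy, userspace-only illustration of the offsets being discussed; the
numbers are made up and only the arithmetic matters.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t host_tsc   = 1000000;
        uint64_t l01_offset = 500;      /* vmcs01.TSC_OFFSET: what L1 sees */
        uint64_t l12_offset = 30;       /* vmcs12->tsc_offset: set by L1 for L2 */

        /* While L2 runs, vmcs02.TSC_OFFSET is the sum of both offsets. */
        uint64_t l1_tsc = host_tsc + l01_offset;
        uint64_t l2_tsc = host_tsc + l01_offset + l12_offset;

        /* If vcpu->arch.tsc_offset tracks the currently running guest (L02),
         * L1's offset is recovered by subtracting vmcs12->tsc_offset again. */
        uint64_t l01_recovered = (l01_offset + l12_offset) - l12_offset;

        printf("L1 TSC %llu, L2 TSC %llu, recovered L01 offset %llu\n",
               (unsigned long long)l1_tsc,
               (unsigned long long)l2_tsc,
               (unsigned long long)l01_recovered);
        return 0;
}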

> 
> > 
> > 
> > Thanks,
> > 
> > Paolo
> > 
> > > 
> > > 
> > >  }
> > >  
> > >  /*
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index ac42c85..1a2ed92 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> > >  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >  {
> > >   kvm_x86_ops->write_tsc_offset(vcpu, offset);
> > > - vcpu->arch.tsc_offset = offset;
> > >  }
> > >  
> > >  static inline bool kvm_check_tsc_unstable(void)
> > > 
> > 
> > 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest

2018-04-12 Thread Raslan, KarimAllah
On Thu, 2018-04-12 at 18:35 +0200, Paolo Bonzini wrote:
> On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> > 
> > When the TSC MSR is captured while an L2 guest is running then restored,
> > the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
> > TSC_OFFSET. So ensure that this is compensated for when storing the value.
> > 
> > Cc: Jim Mattson 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/x86/kvm/vmx.c | 12 +---
> >  arch/x86/kvm/x86.c |  1 -
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index cff2f50..2f57571 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
> >   */
> >  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  {
> > +   u64 l1_tsc_offset = 0;
> > +
> > if (is_guest_mode(vcpu)) {
> > /*
> >  * We're here if L1 chose not to trap WRMSR to TSC. According
> > @@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
> > *vcpu, u64 offset)
> >  * to the newly set TSC to get L2's TSC.
> >  */
> > struct vmcs12 *vmcs12;
> > +
> > /* recalculate vmcs02.TSC_OFFSET: */
> > vmcs12 = get_vmcs12(vcpu);
> > -   vmcs_write64(TSC_OFFSET, offset +
> > -   (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > -vmcs12->tsc_offset : 0));
> > +
> > +   l1_tsc_offset = nested_cpu_has(vmcs12, 
> > CPU_BASED_USE_TSC_OFFSETING) ?
> > +   vmcs12->tsc_offset : 0;
> > +   vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
> > } else {
> > trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> >vmcs_read64(TSC_OFFSET), offset);
> > vmcs_write64(TSC_OFFSET, offset);
> > }
> > +
> > +   vcpu->arch.tsc_offset = offset - l1_tsc_offset;
> 
> Using both "offset + l1_tsc_offset" and "offset - l1_tsc_offset" in this 
> function seems wrong to me: if vcpu->arch.tsc_offset must be "offset - 
> l1_tsc_offset", then "offset" must be written to TSC_OFFSET.

Ooops! I forgot to remove the + l1_tsc_offset :D

> 
> I think the bug was introduced by commit 3e3f50262.  Before,
> vmx_read_tsc_offset returned the L02 offset; now it always contains the
> L01 offset.  So the right fix is to adjust vcpu->arch.tsc_offset on
> nested vmentry/vmexit.  If is_guest_mode(vcpu), kvm_read_l1_tsc must use
> a new kvm_x86_ops callback to subtract the L12 offset from the value it
> returns.

ack!
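(A sketch of the direction acked here, i.e. kvm_read_l1_tsc() going through a
new read_l1_tsc_offset() callback instead of trusting vcpu->arch.tsc_offset;
the actual patch posted later in the thread may differ in detail.)

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
        /* Ask the vendor code for L1's offset, since vcpu->arch.tsc_offset
         * now follows whichever guest is currently running. */
        u64 l1_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu);

        return l1_offset + kvm_scale_tsc(vcpu, host_tsc);
}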

> 
> Thanks,
> 
> Paolo
> 
> > 
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ac42c85..1a2ed92 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> >  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  {
> > kvm_x86_ops->write_tsc_offset(vcpu, offset);
> > -   vcpu->arch.tsc_offset = offset;
> >  }
> >  
> >  static inline bool kvm_check_tsc_unstable(void)
> > 
> 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v3] X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted

2018-04-11 Thread Raslan, KarimAllah
On Wed, 2018-04-11 at 09:24 +0800, Wanpeng Li wrote:
> 2018-04-10 20:15 GMT+08:00 KarimAllah Ahmed :
> > 
> > The VMX-preemption timer is used by KVM as a way to set deadlines for the
> > guest (i.e. timer emulation). That was safe till very recently when
> > capability KVM_X86_DISABLE_EXITS_MWAIT to disable intercepting MWAIT was
> > introduced. According to Intel SDM 25.5.1:
> > 
> > """
> > The VMX-preemption timer operates in the C-states C0, C1, and C2; it also
> > operates in the shutdown and wait-for-SIPI states. If the timer counts down
> > to zero in any state other than the wait-for SIPI state, the logical
> > processor transitions to the C0 C-state and causes a VM exit; the timer
> > does not cause a VM exit if it counts down to zero in the wait-for-SIPI
> > state. The timer is not decremented in C-states deeper than C2.
> > """
> 
> Thanks for the patch. In addition, does it also mean we should prevent
> host from entering deeper C-states than C2 even if w/o disable
> intercept stuffs?

The only thing that we should be worried about is the availability of 
LAPIC ARAT. If it is available, then even if the guest issues an MWAIT 
that goes into a C6 state, the LAPIC timer will still be ticking and 
will still cause a VMExit when it ticks to meet some host kernel timer 
deadline.

Ironically, I was about to say that we already do that for MWAIT 
passthrough, but when I went to paste the snippet of code that does
it .. I realized that when we upstreamed the MWAIT passthrough we
dropped this check by accident!

Anyway .. I sent this patch to fix it:
https://lkml.org/lkml/2018/4/11/194
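
For context, the guard that patch adds boils down to something like the
sketch below; the helper name is made up here, only the MWAIT/ARAT feature
checks are the point:

/*
 * Sketch: MWAIT may only be passed through to the guest when the host
 * LAPIC timer keeps running in deep C-states (Always Running APIC
 * Timer), otherwise a programmed deadline might simply never fire.
 */
static bool kvm_can_disable_mwait_exits(void)
{
	return boot_cpu_has(X86_FEATURE_MWAIT) &&
	       boot_cpu_has(X86_FEATURE_ARAT);
}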

> 
> Regards,
> Wanpeng Li
> 
> > 
> > 
> > Now once the guest issues the MWAIT with a c-state deeper than
> > C2 the preemption timer will never wake it up again since it stopped
> > ticking! Usually this is compensated by other activities in the system that
> > would wake the core from the deep C-state (and cause a VMExit). For
> > example, if the host itself is ticking or it received interrupts, etc!
> > 
> > So disable the VMX-preemption timer if MWAIT is exposed to the guest!
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: H. Peter Anvin 
> > Cc: x...@kernel.org
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v2 -> v3:
> > - return -EOPNOTSUPP before any other operation in vmx_set_hv_timer
> > 
> > v1 -> v2:
> > - Drop everything .. just return -EOPNOTSUPP (pbonzini@) :D
> > ---
> >  arch/x86/kvm/vmx.c | 14 ++
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index d2e54e7..31a4204 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -11903,10 +11903,16 @@ static inline int u64_shl_div_u64(u64 a, unsigned 
> > int shift,
> > 
> >  static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
> >  {
> > -   struct vcpu_vmx *vmx = to_vmx(vcpu);
> > -   u64 tscl = rdtsc();
> > -   u64 guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
> > -   u64 delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> > +   struct vcpu_vmx *vmx;
> > +   u64 tscl, guest_tscl, delta_tsc;
> > +
> > +   if (kvm_pause_in_guest(vcpu->kvm))
> > +   return -EOPNOTSUPP;
> > +
> > +   vmx = to_vmx(vcpu);
> > +   tscl = rdtsc();
> > +   guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
> > +   delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> > 
> > /* Convert to host delta tsc if tsc scaling is enabled */
> > if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
> > --
> > 2.7.4
> > 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v2] X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted

2018-04-10 Thread Raslan, KarimAllah
On Tue, 2018-04-10 at 13:07 +0200, Paolo Bonzini wrote:
> On 10/04/2018 12:08, KarimAllah Ahmed wrote:
> > 
> > @@ -11908,6 +11908,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, 
> > u64 guest_deadline_tsc)
> > u64 guest_tscl = kvm_read_l1_tsc(vcpu, tscl);
> > u64 delta_tsc = max(guest_deadline_tsc, guest_tscl) - guest_tscl;
> >  
> > +   if (kvm_pause_in_guest(vcpu->kvm))
> > +   return -EOPNOTSUPP;
> > +
> 
> This is still doing a relatively expensive kvm_read_l1_tsc, so move it
> even further up. :)

hehe .. done in v3 :)

> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted

2018-04-10 Thread Raslan, KarimAllah
On Tue, 2018-04-10 at 11:04 +0200, Paolo Bonzini wrote:
> On 10/04/2018 10:50, KarimAllah Ahmed wrote:
> > 
> > WARN_ON(preemptible());
> > -   if (!kvm_x86_ops->set_hv_timer)
> > +   if (!kvm_x86_ops->has_hv_timer ||
> > +   !kvm_x86_ops->has_hv_timer(apic->vcpu))
> > return false;
> >  
> > if (!apic_lvtt_period(apic) && atomic_read(>pending))
> 
> Why not just return -ENOTSUP from vmx_set_hv_timer?

hehe .. good point :)

I just sent v2!
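
Returning an error works because the lapic code treats any failure from
set_hv_timer as "no hardware timer available" and falls back to the
hrtimer-based emulation. A simplified sketch of that path (not the exact
lapic.c source):

static bool start_hv_timer(struct kvm_lapic *apic)
{
	if (!kvm_x86_ops->set_hv_timer)
		return false;

	/* e.g. -EOPNOTSUPP when MWAIT is not intercepted */
	if (kvm_x86_ops->set_hv_timer(apic->vcpu,
				      apic->lapic_timer.tscdeadline))
		return false;

	apic->lapic_timer.hv_timer_in_use = true;
	return true;
}

static void restart_apic_timer(struct kvm_lapic *apic)
{
	preempt_disable();
	if (!start_hv_timer(apic))
		start_sw_timer(apic);	/* fall back to the hrtimer */
	preempt_enable();
}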

> 
> Thanks,
> 
> Paolo
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH v2] kvm: nVMX: Introduce KVM_CAP_STATE

2018-04-10 Thread Raslan, KarimAllah
On Mon, 2018-04-09 at 13:26 +0200, David Hildenbrand wrote:
> On 09.04.2018 10:37, KarimAllah Ahmed wrote:
> > 
> > From: Jim Mattson 
> > 
> > For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> > this state can not be captured through the currently available IOCTLs. In
> > fact the state captured through all of these IOCTLs is usually a mix of L1
> > and L2 state. It is also dependent on whether the L2 guest was running at
> > the moment when the process was interrupted to save its state.
> > 
> > With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> > KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> > in VMX operation.
> > 
> 
> Very nice work!
> 
> > 
> >  
> > +static int get_vmcs_cache(struct kvm_vcpu *vcpu,
> > + struct kvm_state __user *user_kvm_state)
> > +{
> > +   struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +
> > +   /*
> > +* When running L2, the authoritative vmcs12 state is in the
> > +* vmcs02. When running L1, the authoritative vmcs12 state is
> > +* in the shadow vmcs linked to vmcs01, unless
> > +* sync_shadow_vmcs is set, in which case, the authoritative
> > +* vmcs12 state is in the vmcs12 already.
> > +*/
> > +   if (is_guest_mode(vcpu))
> > +   sync_vmcs12(vcpu, vmcs12);
> > +   else if (enable_shadow_vmcs && !vmx->nested.sync_shadow_vmcs)
> > +   copy_shadow_to_vmcs12(vmx);
> > +
> > +   if (copy_to_user(user_kvm_state->data, vmcs12, sizeof(*vmcs12)))
> > +   return -EFAULT;
> > +
> > +   /*
> > +* Force a nested exit that guarantees that any state capture
> > +* afterwards by any IOCTLs (MSRs, etc) will not capture a mix of L1
> > +* and L2 state.
> > +*
> 
> I totally understand why this is nice, I am worried about the
> implications. Let's assume migration fails and we want to continue
> running the guest on the source. We would now have a "bad" state.
> 
> How is this to be handled (e.g. is a SET_STATE necessary?)? I think this
> implication should be documented for KVM_GET_STATE.

Yup, a SET_STATE will be needed. That being said, I guess I will do 
what Jim mentioned and just fix the issue outlined here, and then I can 
remove this VMExit.

> > 
> > +* One example where that would lead to an issue is the TSC DEADLINE
> > +* MSR vs the guest TSC. If the L2 guest is running, the guest TSC will
> > +* be the L2 TSC while the TSC deadline MSR will contain the L1 TSC
> > +* deadline MSR. That would lead to a very large (and wrong) "expire"
> > +* diff when LAPIC is initialized during instance restore (i.e. the
> > +* instance will appear to have hanged!).
> > +*/
> > +   if (is_guest_mode(vcpu))
> > +   nested_vmx_vmexit(vcpu, -1, 0, 0);
> > +
> > +   return 0;
> > +}
> > +
> > +static int get_vmx_state(struct kvm_vcpu *vcpu,
> > +struct kvm_state __user *user_kvm_state)
> > +{
> > +   u32 user_data_size;
> > +   struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +   struct kvm_state kvm_state = {
> > +   .flags = 0,
> > +   .format = 0,
> > +   .size = sizeof(kvm_state),
> > +   .vmx.vmxon_pa = -1ull,
> > +   .vmx.vmcs_pa = -1ull,
> > +   };
> > +
> > +   if (copy_from_user(&user_data_size, &user_kvm_state->size,
> > +  sizeof(user_data_size)))
> > +   return -EFAULT;
> > +
> > +   if (nested_vmx_allowed(vcpu) && vmx->nested.vmxon) {
> > +   kvm_state.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
> > +   kvm_state.vmx.vmcs_pa = vmx->nested.current_vmptr;
> > +
> > +   if (vmx->nested.current_vmptr != -1ull)
> > +   kvm_state.size += VMCS12_SIZE;
> > +
> > +   if (is_guest_mode(vcpu)) {
> > +   kvm_state.flags |= KVM_STATE_GUEST_MODE;
> > +
> > +   if (vmx->nested.nested_run_pending)
> > +   kvm_state.flags |= KVM_STATE_RUN_PENDING;
> > +   }
> > +   }
> > +
> > +   if (user_data_size < kvm_state.size) {
> > +   if (copy_to_user(&user_kvm_state->size, &kvm_state.size,
> > +sizeof(kvm_state.size)))
> > +   return -EFAULT;
> > +   return -E2BIG;
> > +   }
> > +
> > +   if (copy_to_user(user_kvm_state, &kvm_state, sizeof(kvm_state)))
> > +   return -EFAULT;
> > +
> > +   if (vmx->nested.current_vmptr == -1ull)
> > +   return 0;
> > +
> > +   return get_vmcs_cache(vcpu, user_kvm_state);
> > +}
> > +
> > +static int set_vmcs_cache(struct kvm_vcpu *vcpu,
> > + struct kvm_state __user *user_kvm_state,
> > + struct kvm_state *kvm_state)
> > +
> > +{
> > +   struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +   u32 exit_qual;
> > +   int ret;
> > +
> > +   if ((kvm_state->size < (sizeof(*vmcs12) + sizeof(*kvm_state))) ||
> > 
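
The -E2BIG handling in get_vmx_state() above implies a two-call pattern from
userspace: probe for the required size, allocate, then fetch. A fragmentary
sketch, assuming the ioctl is exposed as KVM_GET_VMX_STATE and struct
kvm_state is laid out as in this patch:

struct kvm_state hdr = { .size = sizeof(hdr) };
struct kvm_state *state;

/* First call: buffer too small, the kernel writes back the required size. */
if (ioctl(vcpu_fd, KVM_GET_VMX_STATE, &hdr) < 0 && errno != E2BIG)
	err(1, "KVM_GET_VMX_STATE (probe)");

state = calloc(1, hdr.size);
state->size = hdr.size;
if (ioctl(vcpu_fd, KVM_GET_VMX_STATE, state) < 0)
	err(1, "KVM_GET_VMX_STATE");
/* state->data now holds the vmcs12 cache when a VMCS was loaded. */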

Re: [PATCH v2] kvm: nVMX: Introduce KVM_CAP_STATE

2018-04-10 Thread Raslan, KarimAllah
On Mon, 2018-04-09 at 12:24 -0700, Jim Mattson wrote:
> On Mon, Apr 9, 2018 at 1:37 AM, KarimAllah Ahmed  wrote:
> 
> > 
> > +   /*
> > +* Force a nested exit that guarantees that any state capture
> > +* afterwards by any IOCTLs (MSRs, etc) will not capture a mix of L1
> > +* and L2 state.
> > +*
> > +* One example where that would lead to an issue is the TSC DEADLINE
> > +* MSR vs the guest TSC. If the L2 guest is running, the guest TSC 
> > will
> > +* be the L2 TSC while the TSC deadline MSR will contain the L1 TSC
> > +* deadline MSR. That would lead to a very large (and wrong) 
> > "expire"
> > +* diff when LAPIC is initialized during instance restore (i.e. the
> > +* instance will appear to have hanged!).
> > +*/
> 
> This sounds like a bug in the virtualization of IA32_TSC_DEADLINE.
> Without involving save/restore, what happens if L2 sets
> IA32_TSC_DEADLINE (and L1 permits it via the MSR permission bitmap)?
> The IA32_TSC_DEADLINE MSR is always specified with respect to L1's
> time domain.

That makes sense! Let me look into that!

Thanks!
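
In other words, the two time domains relate roughly as sketched below, and a
TSC value observed while L2 is running has to be converted back into L1's
domain before it is compared against the deadline (helper name hypothetical):

/*
 *   tsc_l1 = host_tsc * tsc_scaling_ratio + vmcs01.TSC_OFFSET
 *   tsc_l2 = tsc_l1 + vmcs12->tsc_offset     (when L1 uses TSC offsetting)
 *
 * IA32_TSC_DEADLINE is defined against L1's domain.
 */
static u64 l2_to_l1_tsc(struct kvm_vcpu *vcpu, u64 tsc_l2)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

	if (is_guest_mode(vcpu) &&
	    nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING))
		return tsc_l2 - vmcs12->tsc_offset;

	return tsc_l2;
}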

> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] X86/KVM: Update the exit_qualification access bits while walking an address

2018-03-12 Thread Raslan, KarimAllah
On Mon, 2018-03-12 at 08:52 +, Raslan, KarimAllah wrote:
> On Sun, 2018-03-04 at 10:17 +0000, Raslan, KarimAllah wrote:
> > 
> > On Fri, 2018-03-02 at 18:41 +0100, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 28/02/2018 19:06, KarimAllah Ahmed wrote:
> > > > 
> > > > 
> > > > 
> > > > ... to avoid having a stale value when handling an EPT misconfig for 
> > > > MMIO
> > > > regions.
> > > > 
> > > > MMIO regions that are not passed-through to the guest are handled 
> > > > through
> > > > EPT misconfigs. The first time a certain MMIO page is touched it causes 
> > > > an
> > > > EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
> > > > instead. Any subsequent accesses to the entry will generate an EPT
> > > > misconfig.
> > > > 
> > > > Things get slightly complicated with nested guest handling for MMIO
> > > > regions that are not passed through from L0 (i.e. emulated by L0
> > > > user-space).
> > > > 
> > > > An EPT violation for one of these MMIO regions from L2, exits to L0
> > > > hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor 
> > > > and
> > > > realize it is not present (or not sufficient to serve the request). 
> > > > Then L0
> > > > injects an EPT violation to L1. L1 would then update its EPT mappings. 
> > > > The
> > > > EXIT_QUALIFICATION value for L1 would come from exit_qualification 
> > > > variable
> > > > in "struct vcpu". The problem is that this variable is only updated on 
> > > > EPT
> > > > violation and not on EPT misconfig. So if an EPT violation because of a
> > > > read happened first, then an EPT misconfig because of a write happened
> > > > afterwards. The L0 hypervisor will still contain exit_qualification 
> > > > value
> > > > from the previous read instead of the write and end up injecting an EPT
> > > > violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
> > > > 
> > > > The EPT violation that is injected from L0 to L1 needs to have the 
> > > > correct
> > > > EXIT_QUALIFICATION specially for the access bits because the individual
> > > > access bits for MMIO EPTs are updated only on actual access of this
> > > > specific type. So for the example above, the L1 hypervisor will keep
> > > > updating only the read bit in the EPT then resume the L2 guest. The L2
> > > > guest would end up causing another exit where the L0 *again* will inject
> > > > another EPT violation to L1 hypervisor with *again* an out of date
> > > > exit_qualification which indicates a read and not a write. Then this
> > > > ping-pong just keeps happening without making any forward progress.
> > > > 
> > > > The behavior of mapping MMIO regions changed in:
> > > > 
> > > >commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable 
> > > > (if possible)")
> > > > 
> > > > ... where an EPT violation for a read would also fixup the write bits to
> > > > avoid another EPT violation which by accident would fix the bug 
> > > > mentioned
> > > > above.
> > > > 
> > > > This commit fixes this situation and ensures that the access bits for 
> > > > the
> > > > exit_qualification is up to date. That ensures that even L1 hypervisor
> > > > running with a KVM version before the commit mentioned above would still
> > > > work.
> > > > 
> > > > ( The description above assumes EPT to be available and used by L1
> > > >   hypervisor + the L1 hypervisor is passing through the MMIO region to 
> > > > the L2
> > > >   guest while this MMIO region is emulated by the L0 user-space ).
> > > 
> > > This looks okay.  Would it be possible to add a kvm-unit-tests testcase
> > > for this?
> > 
> > Yup, makes sense. Just sent out a patch for kvm-unit-tests.
> 
> Was the kvm-unit-test that I posted sufficient?

Never mind, I just noticed that Radim already pulled this fix to the 
kvm/queue.

Thanks Radim :)

> 
> > 
> > 
> > Thanks.
> > 
> > > 
> > > 
> > > 
> > > Thanks,
> > > 
> > > Paolo
> > > 
> > > > 
> > > > 
> > &g

Re: [PATCH] X86/KVM: Update the exit_qualification access bits while walking an address

2018-03-12 Thread Raslan, KarimAllah
On Sun, 2018-03-04 at 10:17 +, Raslan, KarimAllah wrote:
> On Fri, 2018-03-02 at 18:41 +0100, Paolo Bonzini wrote:
> > 
> > On 28/02/2018 19:06, KarimAllah Ahmed wrote:
> > > 
> > > 
> > > ... to avoid having a stale value when handling an EPT misconfig for MMIO
> > > regions.
> > > 
> > > MMIO regions that are not passed-through to the guest are handled through
> > > EPT misconfigs. The first time a certain MMIO page is touched it causes an
> > > EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
> > > instead. Any subsequent accesses to the entry will generate an EPT
> > > misconfig.
> > > 
> > > Things get slightly complicated with nested guest handling for MMIO
> > > regions that are not passed through from L0 (i.e. emulated by L0
> > > user-space).
> > > 
> > > An EPT violation for one of these MMIO regions from L2, exits to L0
> > > hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and
> > > realize it is not present (or not sufficient to serve the request). Then 
> > > L0
> > > injects an EPT violation to L1. L1 would then update its EPT mappings. The
> > > EXIT_QUALIFICATION value for L1 would come from exit_qualification 
> > > variable
> > > in "struct vcpu". The problem is that this variable is only updated on EPT
> > > violation and not on EPT misconfig. So if an EPT violation because of a
> > > read happened first, then an EPT misconfig because of a write happened
> > > afterwards. The L0 hypervisor will still contain exit_qualification value
> > > from the previous read instead of the write and end up injecting an EPT
> > > violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
> > > 
> > > The EPT violation that is injected from L0 to L1 needs to have the correct
> > > EXIT_QUALIFICATION specially for the access bits because the individual
> > > access bits for MMIO EPTs are updated only on actual access of this
> > > specific type. So for the example above, the L1 hypervisor will keep
> > > updating only the read bit in the EPT then resume the L2 guest. The L2
> > > guest would end up causing another exit where the L0 *again* will inject
> > > another EPT violation to L1 hypervisor with *again* an out of date
> > > exit_qualification which indicates a read and not a write. Then this
> > > ping-pong just keeps happening without making any forward progress.
> > > 
> > > The behavior of mapping MMIO regions changed in:
> > > 
> > >commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable 
> > > (if possible)")
> > > 
> > > ... where an EPT violation for a read would also fixup the write bits to
> > > avoid another EPT violation which by accident would fix the bug mentioned
> > > above.
> > > 
> > > This commit fixes this situation and ensures that the access bits for the
> > > exit_qualification is up to date. That ensures that even L1 hypervisor
> > > running with a KVM version before the commit mentioned above would still
> > > work.
> > > 
> > > ( The description above assumes EPT to be available and used by L1
> > >   hypervisor + the L1 hypervisor is passing through the MMIO region to 
> > > the L2
> > >   guest while this MMIO region is emulated by the L0 user-space ).
> > 
> > This looks okay.  Would it be possible to add a kvm-unit-tests testcase
> > for this?
> 
> Yup, makes sense. Just sent out a patch for kvm-unit-tests.

Was the kvm-unit-test that I posted sufficient?

> 
> Thanks.
> 
> > 
> > 
> > Thanks,
> > 
> > Paolo
> > 
> > > 
> > > 
> > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > Cc: Radim Krčmář <rkrc...@redhat.com>
> > > Cc: Thomas Gleixner <t...@linutronix.de>
> > > Cc: Ingo Molnar <mi...@redhat.com>
> > > Cc: H. Peter Anvin <h...@zytor.com>
> > > Cc: x...@kernel.org
> > > Cc: k...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: KarimAllah Ahmed <karah...@amazon.de>
> > > ---
> > >  arch/x86/kvm/paging_tmpl.h | 11 +--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > index 5abae72..6288e9d 100644
> > > --- a/arch/x86/kvm/paging_tmpl.h
> > > +++ b/arch/x

Re: [PATCH v3 1/2] PCI/IOV: Store more data about VFs into the SRIOV struct

2018-03-06 Thread Raslan, KarimAllah
On Fri, 2018-03-02 at 15:36 -0600, Bjorn Helgaas wrote:
> On Thu, Mar 01, 2018 at 10:31:36PM +0100, KarimAllah Ahmed wrote:
> > 
> > Store more data about PCI VFs into the SRIOV to avoid reading them from the
> > config space of all the PCI VFs. This is an especially useful optimization
> > when bringing up thousands of VFs.
> > 
> > Cc: Bjorn Helgaas 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> 
> Applied to pci/virtualization for v4.17, thanks!
> 
> I removed the pci_sriov.device field, which seemed to be unused, and
> tweaked a few other things, so make sure I didn't break anything.

Yup, still looks good (and works) for me. Thanks.

> Here's what I have currently applied:
> 
> commit e17b7b429b095200f93ad37c4efeb7a99b6fce3b
> Author: KarimAllah Ahmed 
> Date:   Thu Mar 1 22:31:36 2018 +0100
> 
> PCI/IOV: Use VF0 cached config registers for other VFs
> 
> Cache some config data from VF0 and use it for all other VFs instead of
> reading it from the config space of each VF.  We assume these items are 
> the
> same across all associated VFs:
> 
>   Revision ID
>   Class Code
>   Subsystem Vendor ID
>   Subsystem ID
> 
> This is an optimization when enabling SR-IOV on a device with many VFs.
> 
> Signed-off-by: KarimAllah Ahmed 
> [bhelgaas: changelog, simplify comments, remove unused "device"]
> Signed-off-by: Bjorn Helgaas 
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924ae0350..30bf8f706ed9 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -114,6 +114,29 @@ resource_size_t pci_iov_resource_size(struct pci_dev 
> *dev, int resno)
>   return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
>  }
>  
> +static void pci_read_vf_config_common(struct pci_dev *virtfn)
> +{
> + struct pci_dev *physfn = virtfn->physfn;
> +
> + /*
> +  * Some config registers are the same across all associated VFs.
> +  * Read them once from VF0 so we can skip reading them from the
> +  * other VFs.
> +  *
> +  * PCIe r4.0, sec 9.3.4.1, technically doesn't require all VFs to
> +  * have the same Revision ID and Subsystem ID, but we assume they
> +  * do.
> +  */
> + pci_read_config_dword(virtfn, PCI_CLASS_REVISION,
> +   &physfn->sriov->class);
> + pci_read_config_byte(virtfn, PCI_HEADER_TYPE,
> +  &physfn->sriov->hdr_type);
> + pci_read_config_word(virtfn, PCI_SUBSYSTEM_VENDOR_ID,
> +  &physfn->sriov->subsystem_vendor);
> + pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
> +  &physfn->sriov->subsystem_device);
> +}
> +
>  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  {
>   int i;
> @@ -136,13 +159,17 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>   virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
>   virtfn->vendor = dev->vendor;
>   virtfn->device = iov->vf_device;
> + virtfn->is_virtfn = 1;
> + virtfn->physfn = pci_dev_get(dev);
> +
> + if (id == 0)
> + pci_read_vf_config_common(virtfn);
> +
>   rc = pci_setup_device(virtfn);
>   if (rc)
> - goto failed0;
> + goto failed1;
>  
>   virtfn->dev.parent = dev->dev.parent;
> - virtfn->physfn = pci_dev_get(dev);
> - virtfn->is_virtfn = 1;
>   virtfn->multifunction = 0;
>  
>   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> @@ -163,10 +190,10 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>   sprintf(buf, "virtfn%u", id);
>   rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
>   if (rc)
> - goto failed1;
> + goto failed2;
>   rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
>   if (rc)
> - goto failed2;
> + goto failed3;
>  
>   kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
>  
> @@ -174,11 +201,12 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>  
>   return 0;
>  
> -failed2:
> +failed3:
>   sysfs_remove_link(&dev->dev.kobj, buf);
> +failed2:
> + pci_stop_and_remove_bus_device(virtfn);
>  failed1:
>   pci_dev_put(dev);
> - pci_stop_and_remove_bus_device(virtfn);
>  failed0:
>   virtfn_remove_bus(dev->bus, bus);
>  failed:
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fcd81911b127..db76933be859 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -271,6 +271,10 @@ struct pci_sriov {
>   u16 driver_max_VFs; /* Max num VFs driver supports */
>   struct pci_dev  *dev;   /* Lowest numbered PF */
>   struct pci_dev  *self;  /* This PF */
> + u32 class;  /* VF class */
> + u8  hdr_type;   /* VF header type */
> + u16 subsystem_vendor; /* VF subsystem vendor */
> + u16  
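
The quoted hunk is cut off by the archive; the companion change (patch 2/2,
not shown here) is what consumes these cached fields during VF enumeration. A
hypothetical sketch of that consumer side, only to illustrate the config-space
reads that get saved for each VF:

/* Hypothetical helper: short-circuit config reads for SR-IOV VFs. */
static void pci_read_class_and_subsystem(struct pci_dev *dev)
{
	u32 class;

	if (dev->is_virtfn) {
		struct pci_sriov *iov = dev->physfn->sriov;

		dev->revision         = iov->class & 0xff;
		dev->class            = iov->class >> 8;
		dev->subsystem_vendor = iov->subsystem_vendor;
		dev->subsystem_device = iov->subsystem_device;
		return;
	}

	pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
	dev->revision = class & 0xff;
	dev->class = class >> 8;
	pci_read_config_word(dev, PCI_SUBSYSTEM_VENDOR_ID,
			     &dev->subsystem_vendor);
	pci_read_config_word(dev, PCI_SUBSYSTEM_ID, &dev->subsystem_device);
}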

Re: [PATCH] X86/KVM: Update the exit_qualification access bits while walking an address

2018-03-04 Thread Raslan, KarimAllah
On Fri, 2018-03-02 at 18:41 +0100, Paolo Bonzini wrote:
> On 28/02/2018 19:06, KarimAllah Ahmed wrote:
> > 
> > ... to avoid having a stale value when handling an EPT misconfig for MMIO
> > regions.
> > 
> > MMIO regions that are not passed-through to the guest are handled through
> > EPT misconfigs. The first time a certain MMIO page is touched it causes an
> > EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
> > instead. Any subsequent accesses to the entry will generate an EPT
> > misconfig.
> > 
> > Things get slightly complicated with nested guest handling for MMIO
> > regions that are not passed through from L0 (i.e. emulated by L0
> > user-space).
> > 
> > An EPT violation for one of these MMIO regions from L2, exits to L0
> > hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and
> > realize it is not present (or not sufficient to serve the request). Then L0
> > injects an EPT violation to L1. L1 would then update its EPT mappings. The
> > EXIT_QUALIFICATION value for L1 would come from exit_qualification variable
> > in "struct vcpu". The problem is that this variable is only updated on EPT
> > violation and not on EPT misconfig. So if an EPT violation because of a
> > read happened first, then an EPT misconfig because of a write happened
> > afterwards. The L0 hypervisor will still contain exit_qualification value
> > from the previous read instead of the write and end up injecting an EPT
> > violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
> > 
> > The EPT violation that is injected from L0 to L1 needs to have the correct
> > EXIT_QUALIFICATION specially for the access bits because the individual
> > access bits for MMIO EPTs are updated only on actual access of this
> > specific type. So for the example above, the L1 hypervisor will keep
> > updating only the read bit in the EPT then resume the L2 guest. The L2
> > guest would end up causing another exit where the L0 *again* will inject
> > another EPT violation to L1 hypervisor with *again* an out of date
> > exit_qualification which indicates a read and not a write. Then this
> > ping-pong just keeps happening without making any forward progress.
> > 
> > The behavior of mapping MMIO regions changed in:
> > 
> >commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable (if 
> > possible)")
> > 
> > ... where an EPT violation for a read would also fixup the write bits to
> > avoid another EPT violation which by accident would fix the bug mentioned
> > above.
> > 
> > This commit fixes this situation and ensures that the access bits for the
> > exit_qualification is up to date. That ensures that even L1 hypervisor
> > running with a KVM version before the commit mentioned above would still
> > work.
> > 
> > ( The description above assumes EPT to be available and used by L1
> >   hypervisor + the L1 hypervisor is passing through the MMIO region to the 
> > L2
> >   guest while this MMIO region is emulated by the L0 user-space ).
> 
> This looks okay.  Would it be possible to add a kvm-unit-tests testcase
> for this?

Yup, makes sense. Just sent out a patch for kvm-unit-tests.

Thanks.

> 
> Thanks,
> 
> Paolo
> 
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: H. Peter Anvin 
> > Cc: x...@kernel.org
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/x86/kvm/paging_tmpl.h | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 5abae72..6288e9d 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -452,14 +452,21 @@ static int FNAME(walk_addr_generic)(struct 
> > guest_walker *walker,
> >  * done by is_rsvd_bits_set() above.
> >  *
> >  * We set up the value of exit_qualification to inject:
> > -* [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
> > +* [2:0] - Derive from the access bits. The exit_qualification might be
> > +* out of date if it is serving an EPT misconfiguration.
> >  * [5:3] - Calculated by the page walk of the guest EPT page tables
> >  * [7:8] - Derived from [7:8] of real exit_qualification
> >  *
> >  * The other bits are set to 0.
> >  */
> > if (!(errcode & PFERR_RSVD_MASK)) {
> > -   vcpu->arch.exit_qualification &= 0x187;
> > +   vcpu->arch.exit_qualification &= 0x180;
> > +   if (write_fault)
> > +   vcpu->arch.exit_qualification |= 
> > EPT_VIOLATION_ACC_WRITE;
> > +   if (user_fault)
> > +   vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_READ;
> > +   if (fetch_fault)
> > +   vcpu->arch.exit_qualification |= 
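
The hunk above is truncated by the archive; reconstructed from the surrounding
context (not copied verbatim from the original mail), the complete access-bit
derivation it describes reads:

	if (!(errcode & PFERR_RSVD_MASK)) {
		vcpu->arch.exit_qualification &= 0x180;
		if (write_fault)
			vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_WRITE;
		if (user_fault)
			vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_READ;
		if (fetch_fault)
			vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_INSTR;
	}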

Re: [PATCH] X86/KVM: Update the exit_qualification access bits while walking an address

2018-03-04 Thread Raslan, KarimAllah
On Fri, 2018-03-02 at 18:41 +0100, Paolo Bonzini wrote:
> On 28/02/2018 19:06, KarimAllah Ahmed wrote:
> > 
> > ... to avoid having a stale value when handling an EPT misconfig for MMIO
> > regions.
> > 
> > MMIO regions that are not passed-through to the guest are handled through
> > EPT misconfigs. The first time a certain MMIO page is touched it causes an
> > EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
> > instead. Any subsequent accesses to the entry will generate an EPT
> > misconfig.
> > 
> > Things gets slightly complicated with nested guest handling for MMIO
> > regions that are not passed through from L0 (i.e. emulated by L0
> > user-space).
> > 
> > An EPT violation for one of these MMIO regions from L2, exits to L0
> > hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and
> > realize it is not present (or not sufficient to serve the request). Then L0
> > injects an EPT violation to L1. L1 would then update its EPT mappings. The
> > EXIT_QUALIFICATION value for L1 would come from exit_qualification variable
> > in "struct vcpu". The problem is that this variable is only updated on EPT
> > violation and not on EPT misconfig. So if an EPT violation because of a
> > read happened first, then an EPT misconfig because of a write happened
> > afterwards. The L0 hypervisor will still contain exit_qualification value
> > from the previous read instead of the write and end up injecting an EPT
> > violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
> > 
> > The EPT violation that is injected from L0 to L1 needs to have the correct
> > EXIT_QUALIFICATION specially for the access bits because the individual
> > access bits for MMIO EPTs are updated only on actual access of this
> > specific type. So for the example above, the L1 hypervisor will keep
> > updating only the read bit in the EPT then resume the L2 guest. The L2
> > guest would end up causing another exit where the L0 *again* will inject
> > another EPT violation to L1 hypervisor with *again* an out of date
> > exit_qualification which indicates a read and not a write. Then this
> > ping-pong just keeps happening without making any forward progress.
> > 
> > The behavior of mapping MMIO regions changed in:
> > 
> >commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable (if 
> > possible)")
> > 
> > ... where an EPT violation for a read would also fixup the write bits to
> > avoid another EPT violation which by accident would fix the bug mentioned
> > above.
> > 
> > This commit fixes this situation and ensures that the access bits for the
> > exit_qualification are up to date. That ensures that even an L1 hypervisor
> > running with a KVM version before the commit mentioned above would still
> > work.
> > 
> > ( The description above assumes EPT to be available and used by L1
> >   hypervisor + the L1 hypervisor is passing through the MMIO region to the 
> > L2
> >   guest while this MMIO region is emulated by the L0 user-space ).
> 
> This looks okay.  Would it be possible to add a kvm-unit-tests testcase
> for this?

Yup, makes sense. Just sent out a patch for kvm-unit-tests.

Thanks.

> 
> Thanks,
> 
> Paolo
> 
> > 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: H. Peter Anvin 
> > Cc: x...@kernel.org
> > Cc: k...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  arch/x86/kvm/paging_tmpl.h | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 5abae72..6288e9d 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -452,14 +452,21 @@ static int FNAME(walk_addr_generic)(struct 
> > guest_walker *walker,
> >  * done by is_rsvd_bits_set() above.
> >  *
> >  * We set up the value of exit_qualification to inject:
> > -* [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
> > +* [2:0] - Derive from the access bits. The exit_qualification might be
> > +* out of date if it is serving an EPT misconfiguration.
> >  * [5:3] - Calculated by the page walk of the guest EPT page tables
> >  * [7:8] - Derived from [7:8] of real exit_qualification
> >  *
> >  * The other bits are set to 0.
> >  */
> > if (!(errcode & PFERR_RSVD_MASK)) {
> > -   vcpu->arch.exit_qualification &= 0x187;
> > +   vcpu->arch.exit_qualification &= 0x180;
> > +   if (write_fault)
> > +   vcpu->arch.exit_qualification |= 
> > EPT_VIOLATION_ACC_WRITE;
> > +   if (user_fault)
> > +   vcpu->arch.exit_qualification |= EPT_VIOLATION_ACC_READ;
> > +   if (fetch_fault)
> > +   vcpu->arch.exit_qualification |= 
> > EPT_VIOLATION_ACC_INSTR;
> > vcpu->arch.exit_qualification |= (pte_access & 0x7) << 3;
> > }
> 
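
As a reading aid for the hunk above: the low three bits of the injected EXIT_QUALIFICATION are the EPT_VIOLATION_ACC_* bits from arch/x86/include/asm/vmx.h. A minimal sketch of the derivation (the helper name below is made up for illustration; in the patch the logic is open-coded in FNAME(walk_addr_generic)):

    #include <asm/vmx.h>        /* EPT_VIOLATION_ACC_{READ,WRITE,INSTR} */

    /* Sketch only: derive the access bits from the current fault type. */
    static u64 ept_fault_access_bits(bool write_fault, bool user_fault,
                                     bool fetch_fault)
    {
            u64 qual = 0;

            if (write_fault)
                    qual |= EPT_VIOLATION_ACC_WRITE;        /* bit 1 */
            if (user_fault)
                    qual |= EPT_VIOLATION_ACC_READ;         /* bit 0 */
            if (fetch_fault)
                    qual |= EPT_VIOLATION_ACC_INSTR;        /* bit 2 */

            return qual;
    }

The caller then keeps only bits 7 and 8 of the stale value (the & 0x180 above), ORs in these freshly derived bits, and adds the guest-permission bits in [5:3] from pte_access.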

Re: [PATCH v3 2/2] PCI/IOV: Use the cached VF BARs size instead of re-reading them

2018-03-02 Thread Raslan, KarimAllah
On Fri, 2018-03-02 at 15:48 -0600, Bjorn Helgaas wrote:
> On Thu, Mar 01, 2018 at 10:31:37PM +0100, KarimAllah Ahmed wrote:
> > 
> > Use the cached VF BARs size instead of re-reading them from the hardware.
> > That avoids doing unnecessary bus transactions, which is especially
> > noticeable when you have a PF with a large number of VFs.
> 
> Thanks a lot for breaking this out!  It seems trivial, but it did make it
> much easier for me to think about this one.
> 
> > 
> > Cc: Bjorn Helgaas 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> >  drivers/pci/probe.c | 24 ++--
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index a96837e..aeaa10a 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -180,6 +180,7 @@ static inline unsigned long decode_bar(struct pci_dev 
> > *dev, u32 bar)
> >  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> > struct resource *res, unsigned int pos)
> >  {
> > +   int bar = res - dev->resource;
> > u32 l = 0, sz = 0, mask;
> > u64 l64, sz64, mask64;
> > u16 orig_cmd;
> > @@ -199,9 +200,13 @@ int __pci_read_base(struct pci_dev *dev, enum 
> > pci_bar_type type,
> > res->name = pci_name(dev);
> >  
> > pci_read_config_dword(dev, pos, &l);
> > -   pci_write_config_dword(dev, pos, l | mask);
> > -   pci_read_config_dword(dev, pos, &sz);
> > -   pci_write_config_dword(dev, pos, l);
> > +   if (dev->is_virtfn) {
> > +   sz = dev->physfn->sriov->barsz[bar] & 0xffffffff;
> > +   } else {
> > +   pci_write_config_dword(dev, pos, l | mask);
> > +   pci_read_config_dword(dev, pos, &sz);
> > +   pci_write_config_dword(dev, pos, l);
> > +   }
> 
> I don't quite understand this.  This is reading the regular BARs (config
> offsets 0x10, 0x14, ..., 0x24).  Per sec 9.3.4.1.11, these are all RO Zero
> for VFs.  That should make them look like they're all unimplemented.
> 
> But this patch makes us use the size we discovered from the PF's VF BARn
> registers in its SR-IOV capability.  Won't that cause us to fill in the
> VF's dev->resource[n], when we didn't do it before?

Oh .. that is correct! I did not notice this part from the spec :)

> 
> > 
> > /*
> >  * All bits set in sz means the device isn't working properly.
> > @@ -241,9 +246,14 @@ int __pci_read_base(struct pci_dev *dev, enum 
> > pci_bar_type type,
> >  
> > if (res->flags & IORESOURCE_MEM_64) {
> > pci_read_config_dword(dev, pos + 4, &l);
> > -   pci_write_config_dword(dev, pos + 4, ~0);
> > -   pci_read_config_dword(dev, pos + 4, &sz);
> > -   pci_write_config_dword(dev, pos + 4, l);
> > +
> > +   if (dev->is_virtfn) {
> > +   sz = (dev->physfn->sriov->barsz[bar] >> 32) & 0xffffffff;
> > +   } else {
> > +   pci_write_config_dword(dev, pos + 4, ~0);
> > +   pci_read_config_dword(dev, pos + 4, &sz);
> > +   pci_write_config_dword(dev, pos + 4, l);
> > +   }
> >  
> > l64 |= ((u64)l << 32);
> > sz64 |= ((u64)sz << 32);
> > @@ -332,6 +342,8 @@ static void pci_read_bases(struct pci_dev *dev, 
> > unsigned int howmany, int rom)
> > for (pos = 0; pos < howmany; pos++) {
> > struct resource *res = &dev->resource[pos];
> > reg = PCI_BASE_ADDRESS_0 + (pos << 2);
> > +   if (dev->is_virtfn && dev->physfn->sriov->barsz[pos] == 0)
> > +   continue;
> 
> Since we know the VF BARs are all zero (the ones in the VF config space,
> not the ones in the PF SR-IOV capability), including the VF ROM BAR, it
> would make sense to me to totally skip this whole function, e.g.,
> 
>   if (dev->non_compliant_bars)
> return;
> 
>   if (dev->is_virtfn)
> return;
> 

Correct! Done.

> > 
> > pos += __pci_read_base(dev, pci_bar_unknown, res, reg);
> > }
> >  
> > -- 
> > 2.7.4
> > 
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
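
To make the suggestion concrete, the early-exit version of pci_read_bases() would look roughly like this (a sketch under the assumption that the rest of the function stays unchanged; it may not match what was eventually merged):

    static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
    {
            unsigned int pos, reg;

            if (dev->non_compliant_bars)
                    return;

            /*
             * Per SR-IOV, the BARs in a VF's own config space are read-only
             * zero; the real apertures come from the PF's VF BARn registers,
             * so there is nothing to size or probe here.
             */
            if (dev->is_virtfn)
                    return;

            for (pos = 0; pos < howmany; pos++) {
                    struct resource *res = &dev->resource[pos];

                    reg = PCI_BASE_ADDRESS_0 + (pos << 2);
                    pos += __pci_read_base(dev, pci_bar_unknown, res, reg);
            }

            /* ... ROM BAR handling unchanged ... */
    }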


Re: [PATCH v2] pci: Store more data about VFs into the SRIOV struct

2018-03-01 Thread Raslan, KarimAllah
On Thu, 2018-03-01 at 13:34 -0600, Bjorn Helgaas wrote:
> s|pci: Store|PCI/IOV: Store|
> 
> (run "git log --oneline drivers/pci/probe.c" to see why)
> 
> On Thu, Mar 01, 2018 at 02:26:04PM +0100, KarimAllah Ahmed wrote:
> > 
> > ... to avoid reading them from the config space of all the PCI VFs. This is
> > an especially useful optimization when bringing up thousands of VFs.
> 
> Please make the changelog complete in itself, so it doesn't have to be
> read in conjunction with the subject.  It's OK if you have to repeat
> the subject in the changelog.

ack.

> 
> > 
> > Cc: Bjorn Helgaas 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> > ---
> > v1 -> v2:
> > * Rebase on latest + remove dependency on a non-upstream patch.
> > 
> >  drivers/pci/iov.c   | 16 
> >  drivers/pci/pci.h   |  5 +
> >  drivers/pci/probe.c | 42 --
> >  3 files changed, 53 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 677924a..e1d2e3f 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -114,6 +114,19 @@ resource_size_t pci_iov_resource_size(struct pci_dev 
> > *dev, int resno)
> > return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
> >  }
> >  
> > +static void pci_read_vf_config_common(struct pci_bus *bus, struct pci_dev 
> > *dev)
> > +{
> > +   int devfn = pci_iov_virtfn_devfn(dev, 0);
> > +
> > +   pci_bus_read_config_dword(bus, devfn, PCI_CLASS_REVISION,
> > + &dev->sriov->class);
> > +   pci_bus_read_config_word(bus, devfn, PCI_SUBSYSTEM_ID,
> > +&dev->sriov->subsystem_device);
> > +   pci_bus_read_config_word(bus, devfn, PCI_SUBSYSTEM_VENDOR_ID,
> > +&dev->sriov->subsystem_vendor);
> > +   pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, 
> > &dev->sriov->hdr_type);
> 
> Can't you do this a little later, e.g., after pci_iov_add_virtfn()
> calls pci_setup_device(), and then use the standard
> pci_read_config_*() interfaces instead of the special
> pci_bus_read_config*() ones?

ack.

I moved it after "pci_iov_virtfn_devfn".

> 
> > 
> > +}
> > +
> >  int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >  {
> > int i;
> > @@ -133,6 +146,9 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> > if (!virtfn)
> > goto failed0;
> >  
> > +   if (id == 0)
> > +   pci_read_vf_config_common(bus, dev);
> > +
> > virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
> > virtfn->vendor = dev->vendor;
> > virtfn->device = iov->vf_device;
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index fcd8191..346daa5 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -271,6 +271,11 @@ struct pci_sriov {
> > u16 driver_max_VFs; /* Max num VFs driver supports */
> > struct pci_dev  *dev;   /* Lowest numbered PF */
> > struct pci_dev  *self;  /* This PF */
> > +   u8 hdr_type;/* VF header type */
> > +   u32 class;  /* VF device */
> > +   u16 device; /* VF device */
> > +   u16 subsystem_vendor;   /* VF subsystem vendor */
> > +   u16 subsystem_device;   /* VF subsystem device */
> 
> Please make the whitespace here match the existing code, i.e.,
> line up the structure element names and comments.

ack!

> 
> > 
> > resource_size_t barsz[PCI_SRIOV_NUM_BARS];  /* VF BAR size */
> > booldrivers_autoprobe; /* Auto probing of VFs by driver */
> >  };
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index ef53774..aeaa10a 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -180,6 +180,7 @@ static inline unsigned long decode_bar(struct pci_dev 
> > *dev, u32 bar)
> >  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> > struct resource *res, unsigned int pos)
> >  {
> > +   int bar = res - dev->resource;
> > u32 l = 0, sz = 0, mask;
> > u64 l64, sz64, mask64;
> > u16 orig_cmd;
> > @@ -199,9 +200,13 @@ int __pci_read_base(struct pci_dev *dev, enum 
> > pci_bar_type type,
> > res->name = pci_name(dev);
> >  
> > pci_read_config_dword(dev, pos, &l);
> > -   pci_write_config_dword(dev, pos, l | mask);
> > -   pci_read_config_dword(dev, pos, &sz);
> > -   pci_write_config_dword(dev, pos, l);
> > +   if (dev->is_virtfn) {
> > +   sz = dev->physfn->sriov->barsz[bar] & 0xffffffff;
> > +   } else {
> > +   pci_write_config_dword(dev, pos, l | mask);
> > +   pci_read_config_dword(dev, pos, &sz);
> > +   pci_write_config_dword(dev, pos, l);
> > +   }
> 
> This part is not like the others, i.e., the others are caching info
> from VF 0 in newly-added elements of struct pci_sriov.  This also uses
> information from struct pci_sriov, but it's qualitatively different,
> so it should be in a separate patch.

ack. Moved to a separate 
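
The hunks quoted above only show where the VF 0 values get cached; the other half of the change is that each subsequently created VF copies the cached values instead of issuing its own config reads. A rough sketch of that consumer side (the helper is hypothetical; field names follow the patch, and the class/revision split is an assumption about how the raw PCI_CLASS_REVISION dword would be consumed):

    /* Hypothetical helper: fill a new VF from the values cached off VF 0. */
    static void pci_setup_virtfn_from_cache(struct pci_dev *virtfn,
                                            struct pci_dev *pf)
    {
            struct pci_sriov *iov = pf->sriov;

            virtfn->hdr_type         = iov->hdr_type;
            virtfn->revision         = iov->class & 0xff;   /* low byte of PCI_CLASS_REVISION */
            virtfn->class            = iov->class >> 8;     /* 24-bit class code */
            virtfn->subsystem_vendor = iov->subsystem_vendor;
            virtfn->subsystem_device = iov->subsystem_device;
    }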

Re: [PATCH 00/10] KVM/X86: Handle guest memory that does not have a struct page

2018-03-01 Thread Raslan, KarimAllah
Jim/Paolo/Radim,

Any complaints about the current API? (introduced in 4/10)

I have more patches on top and I would like to ensure that this is 
agreed upon at least before sending more revisions/patches.

Also, patches 1, 2, and 3 should be fairly straightforward and do not use 
this API.

Thanks.

On Wed, 2018-02-21 at 18:47 +0100, KarimAllah Ahmed wrote:
> For the most part, KVM can handle guest memory that does not have a struct
> page (i.e. not directly managed by the kernel). However, there are a few 
> places
> in the code, especially in the nested code, that do not support that.
> 
> Patch 1, 2, and 3 avoid the mapping and unmapping all together and just
> directly use kvm_guest_read and kvm_guest_write.
> 
> Patch 4 introduces a new guest mapping interface that encapsulate all the
> bioler plate code that is needed to map and unmap guest memory. It also
> supports guest memory without "struct page".
> 
> Patch 5, 6, 7, 8, 9, and 10 switch most of the offending code in VMX and 
> hyperv
> to use the new guest mapping API.
> 
> This patch series is the first set of fixes. Handling SVM and APIC-access page
> will be handled in a different patch series.
> 
> KarimAllah Ahmed (10):
>   X86/nVMX: handle_vmon: Read 4 bytes from guest memory instead of
> map->read->unmap sequence
>   X86/nVMX: handle_vmptrld: Copy the VMCS12 directly from guest memory
> instead of map->copy->unmap sequence.
>   X86/nVMX: Update the PML table without mapping and unmapping the page
>   KVM: Introduce a new guest mapping API
>   KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
>   KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
>   KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt
> descriptor table
>   KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
>   KVM/X86: hyperv: Use kvm_vcpu_map in synic_clear_sint_msg_pending
>   KVM/X86: hyperv: Use kvm_vcpu_map in synic_deliver_msg
> 
>  arch/x86/kvm/hyperv.c|  28 -
>  arch/x86/kvm/vmx.c   | 144 
> +++
>  arch/x86/kvm/x86.c   |  13 ++---
>  include/linux/kvm_host.h |  15 +
>  virt/kvm/kvm_main.c  |  50 
>  5 files changed, 129 insertions(+), 121 deletions(-)
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
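
As background for the series, the calling pattern of the new API, going by the signatures that show up in the kbuild report further down in this archive (a sketch; the example function itself is hypothetical):

    /*
     * Sketch of the intended usage: map a guest frame regardless of whether
     * it is backed by a struct page, access it through the kernel mapping,
     * then unmap it.
     */
    static int example_touch_guest_page(struct kvm_vcpu *vcpu, gpa_t gpa)
    {
            struct kvm_host_map map = {};
            u32 *p;

            if (!kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
                    return -EFAULT;

            p = (u32 *)(map.kaddr + offset_in_page(gpa));
            *p = 0;                         /* write to guest memory */

            kvm_vcpu_unmap(&map);
            return 0;
    }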


Re: [PATCH] pci: Store more data about VFs into the SRIOV struct

2018-02-28 Thread Raslan, KarimAllah
On Wed, 2018-02-28 at 15:30 -0600, Bjorn Helgaas wrote:
> On Wed, Jan 17, 2018 at 06:44:23PM +0100, KarimAllah Ahmed wrote:
> > 
> > ... to avoid reading them from the config space of all the PCI VFs. This is
> > an especially useful optimization when bringing up thousands of VFs.
> > 
> > Cc: Bjorn Helgaas 
> > Cc: linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed 
> 
> What does this patch apply to?  It doesn't apply to v4.16-rc1 (my
> "master" branch).  I don't see anything in the history of
> drivers/pci/iov.c about pci_iov_wq_fn().

Ah, right! I had a few patches in my branch and I decided to only post
this one for now. The pci_iov_wq_fn was part of one of them.

Will shuffle the patches, rebase and repost.

Thanks.

> 
> > 
> > ---
> >  drivers/pci/iov.c   | 20 ++--
> >  drivers/pci/pci.h   |  6 +-
> >  drivers/pci/probe.c | 42 --
> >  3 files changed, 55 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 168328a..78e9595 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -129,7 +129,7 @@ resource_size_t pci_iov_resource_size(struct pci_dev 
> > *dev, int resno)
> > if (!dev->is_physfn)
> > return 0;
> >  
> > -   return dev->sriov->barsz[resno - PCI_IOV_RESOURCES];
> > +   return dev->sriov->vf_barsz[resno - PCI_IOV_RESOURCES];
> >  }
> >  
> >  int batch_pci_iov_add_virtfn(struct pci_dev *dev, struct pci_bus **bus,
> > @@ -325,6 +325,20 @@ static void pci_iov_wq_fn(struct work_struct *work)
> > kfree(req);
> >  }
> >  
> > +static void pci_read_vf_config_common(struct pci_bus *bus,
> > + struct pci_dev *dev)
> > +{
> > +   int devfn = pci_iov_virtfn_devfn(dev, 0);
> > +
> > +   pci_bus_read_config_dword(bus, devfn, PCI_CLASS_REVISION,
> > + &dev->sriov->vf_class);
> > +   pci_bus_read_config_word(bus, devfn, PCI_SUBSYSTEM_ID,
> > +&dev->sriov->vf_subsystem_device);
> > +   pci_bus_read_config_word(bus, devfn, PCI_SUBSYSTEM_VENDOR_ID,
> > +&dev->sriov->vf_subsystem_vendor);
> > +   pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, 
> > &dev->sriov->vf_hdr_type);
> > +}
> > +
> >  static struct workqueue_struct *pci_iov_wq;
> >  
> >  static int __init init_pci_iov_wq(void)
> > @@ -361,6 +375,8 @@ static int enable_vfs(struct pci_dev *dev, int nr_vfs)
> > goto add_bus_fail;
> > }
> >  
> > +   pci_read_vf_config_common(bus[0], dev);
> > +
> > while (remaining_vfs > 0) {
> > bool ret;
> > struct pci_iov_wq_item *req;
> > @@ -617,7 +633,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
> > rc = -EIO;
> > goto failed;
> > }
> > -   iov->barsz[i] = resource_size(res);
> > +   iov->vf_barsz[i] = resource_size(res);
> > res->end = res->start + resource_size(res) * total - 1;
> > dev_info(>dev, "VF(n) BAR%d space: %pR (contains BAR%d for 
> > %d VFs)\n",
> >  i, res, i, total);
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index f6b58b3..3264c9e 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -271,7 +271,11 @@ struct pci_sriov {
> > u16 driver_max_VFs; /* max num VFs driver supports */
> > struct pci_dev *dev;/* lowest numbered PF */
> > struct pci_dev *self;   /* this PF */
> > -   resource_size_t barsz[PCI_SRIOV_NUM_BARS];  /* VF BAR size */
> > +   u8 vf_hdr_type; /* VF header type */
> > +   u32 vf_class;   /* VF device */
> > +   u16 vf_subsystem_vendor;/* VF subsystem vendor */
> > +   u16 vf_subsystem_device;/* VF subsystem device */
> > +   resource_size_t vf_barsz[PCI_SRIOV_NUM_BARS];   /* VF BAR size */
> > bool drivers_autoprobe; /* auto probing of VFs by driver */
> >  };
> >  
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 14e0ea1..65099d0 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -175,6 +175,7 @@ static inline unsigned long decode_bar(struct pci_dev 
> > *dev, u32 bar)
> >  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> > struct resource *res, unsigned int pos)
> >  {
> > +   int bar = res - dev->resource;
> > u32 l = 0, sz = 0, mask;
> > u64 l64, sz64, mask64;
> > u16 orig_cmd;
> > @@ -194,9 +195,13 @@ int __pci_read_base(struct pci_dev *dev, enum 
> > pci_bar_type type,
> > res->name = pci_name(dev);
> >  
> > pci_read_config_dword(dev, pos, &l);
> > -   pci_write_config_dword(dev, pos, l | mask);
> > -   pci_read_config_dword(dev, pos, &sz);
> > -   pci_write_config_dword(dev, pos, l);
> > +   if (dev->is_virtfn) {
> > +   sz = dev->physfn->sriov->vf_barsz[bar] & 0xffffffff;
> > +   } else {
> > +   
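
The effect of caching vf_barsz is that a VF BAR size can be obtained from the owning PF's SR-IOV capability without any config cycles to the VF itself, e.g. (hypothetical caller):

    /* Hypothetical caller: VF BAR size as cached from the PF's SR-IOV capability. */
    static resource_size_t example_vf_bar_size(struct pci_dev *vf, int bar)
    {
            return pci_iov_resource_size(vf->physfn, PCI_IOV_RESOURCES + bar);
    }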

Re: [PATCH 04/10] KVM: Introduce a new guest mapping API

2018-02-23 Thread Raslan, KarimAllah
On Fri, 2018-02-23 at 09:37 +0800, kbuild test robot wrote:
> Hi KarimAllah,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on tip/auto-latest]
> [also build test ERROR on v4.16-rc2 next-20180222]
> [cannot apply to kvm/linux-next]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/KarimAllah-Ahmed/KVM-X86-Handle-guest-memory-that-does-not-have-a-struct-page/20180223-064826
> config: mips-malta_kvm_defconfig (attached as .config)
> compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=mips 
> 
> All error/warnings (new ones prefixed by >>):
> 
>In file included from include/linux/linkage.h:7:0,
> from include/linux/preempt.h:10,
> from include/linux/hardirq.h:5,
> from include/linux/kvm_host.h:10,
> from arch/mips/kvm/../../../virt/kvm/kvm_main.c:21:
> > 
> > > 
> > > arch/mips/kvm/../../../virt/kvm/kvm_main.c:1669:19: error: 
> > > 'kvm_vcpu_gfn_to_kaddr' undeclared here (not in a function); did you mean 
> > > 'kvm_vcpu_gfn_to_page'?
> EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_kaddr);
>   ^
>include/linux/export.h:65:16: note: in definition of macro 
> '___EXPORT_SYMBOL'
>  extern typeof(sym) sym;  \
>^~~
> > 
> > > 
> > > arch/mips/kvm/../../../virt/kvm/kvm_main.c:1669:1: note: in expansion of 
> > > macro 'EXPORT_SYMBOL_GPL'
> EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_kaddr);
> ^

Ooops! I will make sure I build KVM as a module as well before posting 
v2.

I will also drop "kvm_vcpu_map_valid" since it is no longer used.

> 
> vim +1669 arch/mips/kvm/../../../virt/kvm/kvm_main.c
> 
>   1634
>   1635bool kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct 
> kvm_host_map *map)
>   1636{
>   1637kvm_pfn_t pfn;
>   1638void *kaddr = NULL;
>   1639struct page *page = NULL;
>   1640
>   1641if (map->kaddr && map->gfn == gfn)
>   1642/* If the mapping is valid and guest memory is 
> already mapped */
>   1643return true;
>   1644else if (map->kaddr)
>   1645/* If the mapping is valid but trying to map a 
> different guest pfn */
>   1646kvm_vcpu_unmap(map);
>   1647
>   1648pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
>   1649if (is_error_pfn(pfn))
>   1650return false;
>   1651
>   1652if (pfn_valid(pfn)) {
>   1653page = pfn_to_page(pfn);
>   1654kaddr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>   1655} else {
>   1656kaddr = memremap(pfn_to_hpa(pfn), PAGE_SIZE, 
> MEMREMAP_WB);
>   1657}
>   1658
>   1659if (!kaddr)
>   1660return false;
>   1661
>   1662map->page = page;
>   1663map->kaddr = kaddr;
>   1664map->pfn = pfn;
>   1665map->gfn = gfn;
>   1666
>   1667return true;
>   1668}
> > 
> > 1669EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_kaddr);
>   1670
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
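
For completeness, the failure reported above is a stale symbol name: the function is now called kvm_vcpu_map() while the export still references the old kvm_vcpu_gfn_to_kaddr name. The fix for v2 is presumably just the following (a sketch, using the line numbers from the robot's listing):

    --- a/virt/kvm/kvm_main.c
    +++ b/virt/kvm/kvm_main.c
    @@ -1667,3 +1667,3 @@ bool kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
             return true;
     }
    -EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_kaddr);
    +EXPORT_SYMBOL_GPL(kvm_vcpu_map);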

