[PATCH] drivers/nvdimm/e820: turn off write cache by default

2022-09-29 Thread Roman Kagan
When regular DRAM is registered for use as PMEM via the "memmap" command
line parameter, there is no write cache in front of the backing store of
this PMEM (indeed, there is no backing store at all), so there is no
point in doing expensive cache flushes on sync etc.

Mark the regions being registered with e820 as ND_REGION_PERSIST_CACHE
so that write cache is off by default for the respective DAX devices.
This also matches the assumed behavior of the flag
ND_REGION_PERSIST_CACHE:

  Platform ensures entire CPU store data path is flushed to pmem on
  system power loss.

for the only use case where such regions actually (after a fashion)
persist the data -- across kexec.
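
For illustration (the size, offset, and device name below are examples,
not taken from the patch; the sysfs path is the one the pmem driver
exposes on recent kernels), such a region is carved out of DRAM with the
memmap parameter and the resulting write-cache default can be inspected
after boot:

```shell
# kernel command line: reserve 4G of DRAM at offset 16G as legacy PMEM
memmap=4G!16G

# after boot, the pmem block device exposes the write-cache setting;
# with this patch applied it should default to 0 (off)
cat /sys/block/pmem0/dax/write_cache
```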

Signed-off-by: Roman Kagan 
---
 drivers/nvdimm/e820.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 4cd18be9d0e9..3af63b7b6d23 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -28,6 +28,7 @@ static int e820_register_one(struct resource *res, void *data)
ndr_desc.numa_node = numa_map_to_online_node(nid);
ndr_desc.target_node = nid;
set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+   set_bit(ND_REGION_PERSIST_CACHE, &ndr_desc.flags);
if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
return -ENXIO;
return 0;
-- 
2.37.3




[tip: x86/urgent] x86/hyperv: Make vapic support x2apic mode

2019-10-15 Thread tip-bot2 for Roman Kagan
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: e211288b72f15259da86eed6eca680758dbe9e74
Gitweb:        https://git.kernel.org/tip/e211288b72f15259da86eed6eca680758dbe9e74
Author:        Roman Kagan
AuthorDate:    Thu, 10 Oct 2019 12:33:05
Committer: Thomas Gleixner 
CommitterDate: Tue, 15 Oct 2019 10:57:09 +02:00

x86/hyperv: Make vapic support x2apic mode

Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.

However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.

As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.

According to Michael Kelley, when in x2apic mode, the Hyper-V synthetic
apic MSRs behave exactly the same as the corresponding architectural
x2apic MSRs, so there's no need to override the apic accessors.  The
only exception is hv_apic_eoi_write, which benefits from lazy EOI when
available; however, its implementation works for both xapic and x2apic
modes.

Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Suggested-by: Michael Kelley 
Signed-off-by: Roman Kagan 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Vitaly Kuznetsov 
Reviewed-by: Michael Kelley 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20191010123258.16919-1-rka...@virtuozzo.com

---
 arch/x86/hyperv/hv_apic.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 5c056b8..e01078e 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -260,11 +260,21 @@ void __init hv_apic_init(void)
}
 
if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
-   pr_info("Hyper-V: Using MSR based APIC access\n");
+   pr_info("Hyper-V: Using enlightened APIC (%s mode)",
+   x2apic_enabled() ? "x2apic" : "xapic");
+   /*
+* With x2apic, architectural x2apic MSRs are equivalent to the
+* respective synthetic MSRs, so there's no need to override
+* the apic accessors.  The only exception is
+* hv_apic_eoi_write, because it benefits from lazy EOI when
+* available, but it works for both xapic and x2apic modes.
+*/
apic_set_eoi_write(hv_apic_eoi_write);
-   apic->read  = hv_apic_read;
-   apic->write = hv_apic_write;
-   apic->icr_write = hv_apic_icr_write;
-   apic->icr_read  = hv_apic_icr_read;
+   if (!x2apic_enabled()) {
+   apic->read  = hv_apic_read;
+   apic->write = hv_apic_write;
+   apic->icr_write = hv_apic_icr_write;
+   apic->icr_read  = hv_apic_icr_read;
+   }
}
 }



[PATCH v4] x86/hyperv: make vapic support x2apic mode

2019-10-10 Thread Roman Kagan
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.

However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.

As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.

According to Michael Kelley, when in x2apic mode, the Hyper-V synthetic
apic MSRs behave exactly the same as the corresponding architectural
x2apic MSRs, so there's no need to override the apic accessors.  The
only exception is hv_apic_eoi_write, which benefits from lazy EOI when
available; however, its implementation works for both xapic and x2apic
modes.

Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Cc: sta...@vger.kernel.org
Suggested-by: Michael Kelley 
Reviewed-by: Vitaly Kuznetsov 
Reviewed-by: Michael Kelley 
Signed-off-by: Roman Kagan 
---
v3 -> v4:
- adjust the log message [Vitaly, Michael]

v2 -> v3:
- do not introduce x2apic-capable hv_apic accessors; leave original
  x2apic accessors instead

v1 -> v2:
- add ifdefs to handle !CONFIG_X86_X2APIC

 arch/x86/hyperv/hv_apic.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 5c056b8aebef..e01078e93dd3 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -260,11 +260,21 @@ void __init hv_apic_init(void)
}
 
if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
-   pr_info("Hyper-V: Using MSR based APIC access\n");
+   pr_info("Hyper-V: Using enlightened APIC (%s mode)",
+   x2apic_enabled() ? "x2apic" : "xapic");
+   /*
+* With x2apic, architectural x2apic MSRs are equivalent to the
+* respective synthetic MSRs, so there's no need to override
+* the apic accessors.  The only exception is
+* hv_apic_eoi_write, because it benefits from lazy EOI when
+* available, but it works for both xapic and x2apic modes.
+*/
apic_set_eoi_write(hv_apic_eoi_write);
-   apic->read  = hv_apic_read;
-   apic->write = hv_apic_write;
-   apic->icr_write = hv_apic_icr_write;
-   apic->icr_read  = hv_apic_icr_read;
+   if (!x2apic_enabled()) {
+   apic->read  = hv_apic_read;
+   apic->write = hv_apic_write;
+   apic->icr_write = hv_apic_icr_write;
+   apic->icr_read  = hv_apic_icr_read;
+   }
}
 }
-- 
2.21.0



Re: [RFC v2 0/2] kvm: Use host timekeeping in guest.

2019-10-10 Thread Roman Kagan
On Thu, Oct 10, 2019 at 04:30:53PM +0900, Suleiman Souhlal wrote:
> This RFC is to try to solve the following problem:
> 
> We have some applications that are currently running in their
> own namespace, that still talk to other processes on the
> machine, using IPC, and expect to run on the same machine.
> 
> We want to move them into a virtual machine, for the usual
> benefits of virtualization.
> 
> However, some of these programs use CLOCK_MONOTONIC and
> CLOCK_BOOTTIME timestamps, as part of their protocol, when talking
> to the host.
> 
> Generally speaking, we have multiple event sources, for example
> sensors, input devices, display controller vsync, etc and we would
> like to rely on them in the guest for various scenarios.
> 
> As a specific example, we are trying to run some wayland clients
> (in the guest) who talk to the server (in the host), and the server
> gives input events based on host time. Additionally, there are also
> vsync events that the clients use for timing their rendering.
> 
> Another use case we have are timestamps from IIO sensors and cameras.
> There are applications that need to determine how the timestamps
> relate to the current time and the only way to get current time is
> clock_gettime(), which would return a value from a different time
> domain than the timestamps.
> 
> In this case, it is not feasible to change these programs, due to
> the number of the places we would have to change.
> 
> We spent some time thinking about this, and the best solution we
> could come up with was the following:
> 
> Make the guest kernel return the same CLOCK_MONOTONIC and
> CLOCK_GETTIME timestamps as the host.
> 
> To do that, I am changing kvmclock to request to the host to copy
> its timekeeping parameters (mult, base, cycle_last, etc), so that
> the guest timekeeper can use the same values, so that time can
> be synchronized between the guest and the host.

I wonder how feasible it is to map the host's vdso into the guest and
thus make the guest use the *same* (as opposed to "synchronized") clock
as the host's userspace?  Another benefit is that it's essentially an
ABI so is not changed as liberally as internal structures like
timekeeper, etc.  There is probably certain complication in handling the
syscall fallback in the vdso when used in the guest kernel, though.

You'll also need to ensure neither tsc scaling and nor offsetting is
applied to the VM once this clock is enabled.
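
For reference, the timekeeper parameters named in the quoted proposal
(mult, base, cycle_last, etc.) combine roughly as follows -- a sketch
with assumed field names, not the actual timekeeper/kvmclock structures:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Sketch of the pvclock-style conversion the proposal would share between
 * host and guest; the parameter names (cycle_last, mult, shift, base_ns)
 * are illustrative.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t cycle_last,
			     uint64_t base_ns, uint32_t mult, uint32_t shift)
{
	uint64_t delta = cycles - cycle_last;	/* TSC ticks since last update */

	/* fixed-point multiply: ns = base + delta * mult / 2^shift */
	return base_ns + ((delta * mult) >> shift);
}
```

Sharing identical values for these parameters is what makes the two
clocks advance in lockstep between guest and host.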

Roman.


[PATCH v3] x86/hyperv: make vapic support x2apic mode

2019-10-09 Thread Roman Kagan
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.

However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.

As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.

According to Michael Kelley, when in x2apic mode, the Hyper-V synthetic
apic MSRs behave exactly the same as the corresponding architectural
x2apic MSRs, so there's no need to override the apic accessors.  The
only exception is hv_apic_eoi_write, which benefits from lazy EOI when
available; however, its implementation works for both xapic and x2apic
modes.

Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Cc: sta...@vger.kernel.org
Suggested-by: Michael Kelley 
Signed-off-by: Roman Kagan 
---
v2 -> v3:
- do not introduce x2apic-capable hv_apic accessors; leave original
  x2apic accessors instead

v1 -> v2:
- add ifdefs to handle !CONFIG_X86_X2APIC

 arch/x86/hyperv/hv_apic.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 5c056b8aebef..26eeff5bd535 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -261,10 +261,19 @@ void __init hv_apic_init(void)
 
if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
pr_info("Hyper-V: Using MSR based APIC access\n");
+   /*
+* With x2apic, architectural x2apic MSRs are equivalent to the
+* respective synthetic MSRs, so there's no need to override
+* the apic accessors.  The only exception is
+* hv_apic_eoi_write, because it benefits from lazy EOI when
+* available, but it works for both xapic and x2apic modes.
+*/
apic_set_eoi_write(hv_apic_eoi_write);
-   apic->read  = hv_apic_read;
-   apic->write = hv_apic_write;
-   apic->icr_write = hv_apic_icr_write;
-   apic->icr_read  = hv_apic_icr_read;
+   if (!x2apic_enabled()) {
+   apic->read  = hv_apic_read;
+   apic->write = hv_apic_write;
+   apic->icr_write = hv_apic_icr_write;
+   apic->icr_read  = hv_apic_icr_read;
+   }
}
 }
-- 
2.21.0



Re: [PATCH v3 15/16] kvm: x86: ioapic: Lazy update IOAPIC EOI

2019-10-09 Thread Roman Kagan
On Wed, Oct 09, 2019 at 11:21:41AM +0200, Paolo Bonzini wrote:
> On 13/09/19 21:01, Suthikulpanit, Suravee wrote:
> > /*
> > +* In case APICv accelerate EOI write and do not trap,
> > +* in-kernel IOAPIC will not be able to receive the EOI.
> > +* In this case, we do lazy update of the pending EOI when
> > +* trying to set IOAPIC irq.
> > +*/
> > +   if (kvm_apicv_eoi_accelerate(ioapic->kvm, edge))
> > +   ioapic_lazy_update_eoi(ioapic, irq);
> > +
> 
> This is okay for the RTC, and in fact I suggest that you make it work
> like this even for Intel.  This will get rid of kvm_apicv_eoi_accelerate
> and be nicer overall.
> 
> However, it cannot work for the in-kernel PIT, because it is currently
> checking ps->irq_ack before kvm_set_irq.  Unfortunately, the in-kernel
> PIT is relying on the ack notifier to timely queue the pt->worker work
> item and reinject the missed tick.
> 
> Thus, you cannot enable APICv if ps->reinject is true.
> 
> Perhaps you can make kvm->arch.apicv_state a disabled counter?  Then
> Hyper-V can increment it when enabled, PIT can increment it when
> ps->reinject becomes true and decrement it when it becomes false;
> finally, svm.c can increment it when an SVM guest is launched and
> increment/decrement it around ExtINT handling?

This can benefit Hyper-V emulation too.  The point is that it's only
AutoEOI feature in SynIC that is incompatible with APICv.  So the VM can
use APICv until the guest configures its first AutoEOI SINT.  If the
hypervisor sets HV_DEPRECATING_AEOI_RECOMMENDED (bit 9) in
HYPERV_CPUID_ENLIGHTMENT_INFO (0x40000004) cpuid this may never happen
so we will not be pessimizing guests on modern hardware by merely
enabling SynIC.  I started looking into this recently and would be happy
to piggy-back on this series.
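
A guest-side check for that recommendation bit might look like this
sketch (the constants mirror the kernel's names; the cpuid plumbing
itself is omitted):

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Constants mirror the kernel's names; values per the discussion above. */
#define HYPERV_CPUID_ENLIGHTMENT_INFO	0x40000004
#define HV_DEPRECATING_AEOI_RECOMMENDED	(1u << 9)

/* true if the hypervisor recommends against the AutoEOI SynIC feature */
static bool hv_autoeoi_deprecated(uint32_t enlightenment_eax)
{
	return (enlightenment_eax & HV_DEPRECATING_AEOI_RECOMMENDED) != 0;
}
```

When this returns true, a guest never configures AutoEOI SINTs, so the
APICv-incompatible path is never taken.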

Roman.

> (This conflicts with some of the suggestions I made earlier, which
> implied the existence of apicv_state, but everything should if anything
> become easier).
> 
> Paolo


Re: [PATCH v2] x86/hyperv: make vapic support x2apic mode

2019-10-04 Thread Roman Kagan
On Fri, Oct 04, 2019 at 03:01:51AM +, Michael Kelley wrote:
> From: Roman Kagan  Sent: Thursday, October 3, 2019 5:53 AM
> > >
> > > AFAIU you're trying to mirror native_x2apic_icr_write() here but this is
> > > different from what hv_apic_icr_write() does
> > > (SET_APIC_DEST_FIELD(id)).
> > 
> > Right.  In xapic mode the ICR2 aka the high 4 bytes of ICR is programmed
> > with the destination id in the highest byte; in x2apic mode the whole
> > ICR2 is set to the 32bit destination id.
> > 
> > > Is it actually correct? (I think you've tested this and it is but)
> > 
> > As I wrote in the commit log, I haven't tested it in the sense that I
> > ran a Linux guest in a Hyper-V VM exposing x2apic to the guest, because
> > I didn't manage to configure it to do so.  OTOH I did run a Windows
> > guest in QEMU/KVM with hv_apic and x2apic enabled and saw it write
> > destination ids unshifted to the ICR2 part of ICR, so I assume it's
> > correct.
> > 
> > > Michael, could you please shed some light here?
> > 
> > Would be appreciated, indeed.
> > 
> 
> The newest version of Hyper-V provides an x2apic in a guest VM when the
> number of vCPUs in the VM is > 240.  This version of Hyper-V is beginning
> to be deployed in Azure to enable the M416v2 VM size, but the functionality
> is not yet available for the on-premises version of Hyper-V.  However, I can
> test this configuration internally with the above patch -- give me a few days.
> 
> An additional complication is that when running on Intel processors that offer
> vAPIC functionality, the Hyper-V "hints" value does *not* recommend using the
> MSR-based APIC accesses.  In this case, memory-mapped access to the x2apic
> registers is faster than the synthetic MSRs.

I guess you mean "using regular x2apic MSRs compared to the synthetic
MSRs".  Indeed they do essentially the same thing, and there's no reason
for one set of MSRs to be significantly faster than the other.  However,
hv_apic_eoi_write makes use of "apic assists" aka lazy EOI which is
certainly a win, and I'm not sure if it works without hv_apic.
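
The lazy-EOI ("apic assist") mechanism mentioned here can be illustrated
with a user-space sketch of the idea: the hypervisor sets a flag in a
shared page when it has already performed the EOI, letting the guest skip
the MSR write entirely (names are illustrative stand-ins, not the
kernel's hv_apic_eoi_write):

```c
#include <stdint.h>
#include <stdatomic.h>
#include <assert.h>

static _Atomic uint32_t apic_assist;	/* bit 0 set: EOI already done */
static int eoi_msr_writes;		/* stands in for wrmsr(HV_X64_MSR_EOI) */

static void eoi_write(void)
{
	/* consume the assist flag atomically; skip the MSR write if it was set */
	if (atomic_exchange(&apic_assist, 0) & 0x1)
		return;
	eoi_msr_writes++;
}
```

The same check works regardless of xapic/x2apic mode, which is why the
EOI accessor alone stays overridden in the patch under discussion.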

> I've already looked at a VM that has
> the x2apic, and indeed that is the case, so the above code wouldn't run
> anyway.  But I can temporarily code around that for testing purposes and see
> if everything works.

Thanks!
Roman.


Re: [PATCH v2] x86/hyperv: make vapic support x2apic mode

2019-10-03 Thread Roman Kagan
On Thu, Oct 03, 2019 at 12:54:03PM +0200, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
> > when supported by the vcpus.
> >
> > However, the apic access functions for Hyper-V enlightened apic assume
> > xapic mode only.
> >
> > As a result, Linux fails to bring up secondary cpus when run as a guest
> > in QEMU/KVM with both hv_apic and x2apic enabled.
> >
> > I didn't manage to make my instance of Hyper-V expose x2apic to the
> > guest; nor does Hyper-V spec document the expected behavior.  However,
> > a Windows guest running in QEMU/KVM with hv_apic and x2apic and a big
> > number of vcpus (so that it turns on x2apic mode) does use enlightened
> > apic MSRs passing unshifted 32bit destination id and falls back to the
> > regular x2apic MSRs for less frequently used apic fields.
> >
> > So implement the same behavior, by replacing enlightened apic access
> > functions (only those where it makes a difference) with their
> > x2apic-aware versions when x2apic is in use.
> >
> > Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
> > Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Roman Kagan 
> > ---
> > v1 -> v2:
> > - add ifdefs to handle !CONFIG_X86_X2APIC
> >
> >  arch/x86/hyperv/hv_apic.c | 54 ---
> >  1 file changed, 51 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> > index 5c056b8aebef..eb1434ae9e46 100644
> > --- a/arch/x86/hyperv/hv_apic.c
> > +++ b/arch/x86/hyperv/hv_apic.c
> > @@ -84,6 +84,44 @@ static void hv_apic_write(u32 reg, u32 val)
> > }
> >  }
> >  
> > +#ifdef CONFIG_X86_X2APIC
> > +static void hv_x2apic_icr_write(u32 low, u32 id)
> > +{
> > +   wrmsr(HV_X64_MSR_ICR, low, id);
> > +}
> 
> AFAIU you're trying to mirror native_x2apic_icr_write() here but this is
> different from what hv_apic_icr_write() does
> (SET_APIC_DEST_FIELD(id)).

Right.  In xapic mode the ICR2 aka the high 4 bytes of ICR is programmed
with the destination id in the highest byte; in x2apic mode the whole
ICR2 is set to the 32bit destination id.
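
The encoding difference described above can be sketched with user-space
helpers (illustrative, not kernel code; they only model how the
destination id is packed into the 64-bit ICR value in each mode):

```c
#include <stdint.h>
#include <assert.h>

/* xapic: destination id goes into the top byte of ICR2 (ICR bits 56-63) */
static uint64_t xapic_icr(uint32_t low, uint8_t dest)
{
	return ((uint64_t)dest << 56) | low;
}

/* x2apic: the whole upper 32 bits of the ICR are the destination id */
static uint64_t x2apic_icr(uint32_t low, uint32_t dest)
{
	return ((uint64_t)dest << 32) | low;
}
```

This is why hv_apic_icr_write() (which shifts the id into the top byte)
cannot be reused as-is once the apic is in x2apic mode.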

> Is it actually correct? (I think you've tested this and it is but)

As I wrote in the commit log, I haven't tested it in the sense that I
ran a Linux guest in a Hyper-V VM exposing x2apic to the guest, because
I didn't manage to configure it to do so.  OTOH I did run a Windows
guest in QEMU/KVM with hv_apic and x2apic enabled and saw it write
destination ids unshifted to the ICR2 part of ICR, so I assume it's
correct.

> Michael, could you please shed some light here?

Would be appreciated, indeed.

> > +static u32 hv_x2apic_read(u32 reg)
> > +{
> > +   u32 reg_val, hi;
> > +
> > +   switch (reg) {
> > +   case APIC_EOI:
> > +   rdmsr(HV_X64_MSR_EOI, reg_val, hi);
> > +   return reg_val;
> > +   case APIC_TASKPRI:
> > +   rdmsr(HV_X64_MSR_TPR, reg_val, hi);
> > +   return reg_val;
> > +
> > +   default:
> > +   return native_apic_msr_read(reg);
> > +   }
> > +}
> > +
> > +static void hv_x2apic_write(u32 reg, u32 val)
> > +{
> > +   switch (reg) {
> > +   case APIC_EOI:
> > +   wrmsr(HV_X64_MSR_EOI, val, 0);
> > +   break;
> > +   case APIC_TASKPRI:
> > +   wrmsr(HV_X64_MSR_TPR, val, 0);
> > +   break;
> > +   default:
> > +   native_apic_msr_write(reg, val);
> > +   }
> > +}
> > +#endif /* CONFIG_X86_X2APIC */
> > +
> >  static void hv_apic_eoi_write(u32 reg, u32 val)
> >  {
> > struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
> > @@ -262,9 +300,19 @@ void __init hv_apic_init(void)
> > if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> > pr_info("Hyper-V: Using MSR based APIC access\n");
> > apic_set_eoi_write(hv_apic_eoi_write);
> > -   apic->read  = hv_apic_read;
> > -   apic->write = hv_apic_write;
> > -   apic->icr_write = hv_apic_icr_write;
> > +#ifdef CONFIG_X86_X2APIC
> > +   if (x2apic_enabled()) {
> > +   apic->read  = hv_x2apic_read;
> > +   apic->write = hv_x2apic_write;
> > +   apic->icr_write = hv_x2apic_icr_write;
> > +   } else {
> > +#e

[PATCH v2] x86/hyperv: make vapic support x2apic mode

2019-10-02 Thread Roman Kagan
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.

However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.

As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.

I didn't manage to make my instance of Hyper-V expose x2apic to the
guest; nor does Hyper-V spec document the expected behavior.  However,
a Windows guest running in QEMU/KVM with hv_apic and x2apic and a big
number of vcpus (so that it turns on x2apic mode) does use enlightened
apic MSRs passing unshifted 32bit destination id and falls back to the
regular x2apic MSRs for less frequently used apic fields.

So implement the same behavior, by replacing enlightened apic access
functions (only those where it makes a difference) with their
x2apic-aware versions when x2apic is in use.

Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Cc: sta...@vger.kernel.org
Signed-off-by: Roman Kagan 
---
v1 -> v2:
- add ifdefs to handle !CONFIG_X86_X2APIC

 arch/x86/hyperv/hv_apic.c | 54 ---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 5c056b8aebef..eb1434ae9e46 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -84,6 +84,44 @@ static void hv_apic_write(u32 reg, u32 val)
}
 }
 
+#ifdef CONFIG_X86_X2APIC
+static void hv_x2apic_icr_write(u32 low, u32 id)
+{
+   wrmsr(HV_X64_MSR_ICR, low, id);
+}
+
+static u32 hv_x2apic_read(u32 reg)
+{
+   u32 reg_val, hi;
+
+   switch (reg) {
+   case APIC_EOI:
+   rdmsr(HV_X64_MSR_EOI, reg_val, hi);
+   return reg_val;
+   case APIC_TASKPRI:
+   rdmsr(HV_X64_MSR_TPR, reg_val, hi);
+   return reg_val;
+
+   default:
+   return native_apic_msr_read(reg);
+   }
+}
+
+static void hv_x2apic_write(u32 reg, u32 val)
+{
+   switch (reg) {
+   case APIC_EOI:
+   wrmsr(HV_X64_MSR_EOI, val, 0);
+   break;
+   case APIC_TASKPRI:
+   wrmsr(HV_X64_MSR_TPR, val, 0);
+   break;
+   default:
+   native_apic_msr_write(reg, val);
+   }
+}
+#endif /* CONFIG_X86_X2APIC */
+
 static void hv_apic_eoi_write(u32 reg, u32 val)
 {
struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
@@ -262,9 +300,19 @@ void __init hv_apic_init(void)
if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
pr_info("Hyper-V: Using MSR based APIC access\n");
apic_set_eoi_write(hv_apic_eoi_write);
-   apic->read  = hv_apic_read;
-   apic->write = hv_apic_write;
-   apic->icr_write = hv_apic_icr_write;
+#ifdef CONFIG_X86_X2APIC
+   if (x2apic_enabled()) {
+   apic->read  = hv_x2apic_read;
+   apic->write = hv_x2apic_write;
+   apic->icr_write = hv_x2apic_icr_write;
+   } else {
+#endif
+   apic->read  = hv_apic_read;
+   apic->write = hv_apic_write;
+   apic->icr_write = hv_apic_icr_write;
+#ifdef CONFIG_X86_X2APIC
+   }
+#endif
apic->icr_read  = hv_apic_icr_read;
}
 }
-- 
2.21.0



Re: [PATCH] x86/hyperv: make vapic support x2apic mode

2019-10-01 Thread Roman Kagan
On Tue, Oct 01, 2019 at 06:44:08AM +0800, kbuild test robot wrote:
> url:
> https://github.com/0day-ci/linux/commits/Roman-Kagan/x86-hyperv-make-vapic-support-x2apic-mode/20191001-044238
> config: x86_64-randconfig-e004-201939 (attached as .config)
> compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All errors (new ones prefixed by >>):
> 
>arch/x86/hyperv/hv_apic.c: In function 'hv_x2apic_read':
> >> arch/x86/hyperv/hv_apic.c:91:10: error: implicit declaration of function 
> >> 'native_apic_msr_read'; did you mean 'native_apic_icr_read'? 
> >> [-Werror=implicit-function-declaration]
>   return native_apic_msr_read(reg);
>  ^~~~
>  native_apic_icr_read
>arch/x86/hyperv/hv_apic.c: In function 'hv_x2apic_write':
> >> arch/x86/hyperv/hv_apic.c:119:3: error: implicit declaration of function 
> >> 'native_apic_msr_write'; did you mean 'native_apic_icr_write'? 
> >> [-Werror=implicit-function-declaration]
>   native_apic_msr_write(reg, val);
>   ^
>   native_apic_icr_write
>cc1: some warnings being treated as errors

Oops, !CONFIG_X86_X2APIC needs to be handled.  Will post v2.

Roman.


[PATCH] x86/hyperv: make vapic support x2apic mode

2019-09-30 Thread Roman Kagan
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.

However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.

As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.

I didn't manage to make my instance of Hyper-V expose x2apic to the
guest; nor does Hyper-V spec document the expected behavior.  However,
a Windows guest running in QEMU/KVM with hv_apic and x2apic and a big
number of vcpus (so that it turns on x2apic mode) does use enlightened
apic MSRs passing unshifted 32bit destination id and falls back to the
regular x2apic MSRs for less frequently used apic fields.

So implement the same behavior, by replacing enlightened apic access
functions (only those where it makes a difference) with their
x2apic-aware versions when x2apic is in use.

Fixes: 29217a474683 ("iommu/hyper-v: Add Hyper-V stub IOMMU driver")
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Cc: sta...@vger.kernel.org
Signed-off-by: Roman Kagan 
---
 arch/x86/hyperv/hv_apic.c | 48 ---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 5c056b8aebef..9564fec00375 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -53,6 +53,11 @@ static void hv_apic_icr_write(u32 low, u32 id)
wrmsrl(HV_X64_MSR_ICR, reg_val);
 }
 
+static void hv_x2apic_icr_write(u32 low, u32 id)
+{
+   wrmsr(HV_X64_MSR_ICR, low, id);
+}
+
 static u32 hv_apic_read(u32 reg)
 {
u32 reg_val, hi;
@@ -70,6 +75,23 @@ static u32 hv_apic_read(u32 reg)
}
 }
 
+static u32 hv_x2apic_read(u32 reg)
+{
+   u32 reg_val, hi;
+
+   switch (reg) {
+   case APIC_EOI:
+   rdmsr(HV_X64_MSR_EOI, reg_val, hi);
+   return reg_val;
+   case APIC_TASKPRI:
+   rdmsr(HV_X64_MSR_TPR, reg_val, hi);
+   return reg_val;
+
+   default:
+   return native_apic_msr_read(reg);
+   }
+}
+
 static void hv_apic_write(u32 reg, u32 val)
 {
switch (reg) {
@@ -84,6 +106,20 @@ static void hv_apic_write(u32 reg, u32 val)
}
 }
 
+static void hv_x2apic_write(u32 reg, u32 val)
+{
+   switch (reg) {
+   case APIC_EOI:
+   wrmsr(HV_X64_MSR_EOI, val, 0);
+   break;
+   case APIC_TASKPRI:
+   wrmsr(HV_X64_MSR_TPR, val, 0);
+   break;
+   default:
+   native_apic_msr_write(reg, val);
+   }
+}
+
 static void hv_apic_eoi_write(u32 reg, u32 val)
 {
struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
@@ -262,9 +298,15 @@ void __init hv_apic_init(void)
if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
pr_info("Hyper-V: Using MSR based APIC access\n");
apic_set_eoi_write(hv_apic_eoi_write);
-   apic->read  = hv_apic_read;
-   apic->write = hv_apic_write;
-   apic->icr_write = hv_apic_icr_write;
+   if (x2apic_enabled()) {
+   apic->read  = hv_x2apic_read;
+   apic->write = hv_x2apic_write;
+   apic->icr_write = hv_x2apic_icr_write;
+   } else {
+   apic->read  = hv_apic_read;
+   apic->write = hv_apic_write;
+   apic->icr_write = hv_apic_icr_write;
+   }
apic->icr_read  = hv_apic_icr_read;
}
 }
-- 
2.21.0



Re: [PATCH] [v3] x86: kvm: avoid -Wsometimes-uninitized warning

2019-07-12 Thread Roman Kagan
On Fri, Jul 12, 2019 at 04:13:09PM +0200, Arnd Bergmann wrote:
> Clang notices a code path in which some variables are never
> initialized, but fails to figure out that this can never happen
> on i386 because is_64_bit_mode() always returns false.
> 
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'ingpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> if (!longmode) {
> ^
> arch/x86/kvm/hyperv.c:1632:55: note: uninitialized use occurs here
> trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
>  ^
> arch/x86/kvm/hyperv.c:1610:2: note: remove the 'if' if its condition is 
> always true
> if (!longmode) {
> ^~~
> arch/x86/kvm/hyperv.c:1595:18: note: initialize the variable 'ingpa' to 
> silence this warning
> u64 param, ingpa, outgpa, ret = HV_STATUS_SUCCESS;
> ^
>  = 0
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'outgpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'param' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> 
> Flip the condition around to avoid the conditional execution on i386.
> 
> Signed-off-by: Arnd Bergmann 
> ---
> v3: reword commit log, simplify patch again
> v2: make the change inside of is_64_bit_mode().
> ---
>  arch/x86/kvm/hyperv.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH] [v2] x86: kvm: avoid -Wsometimes-uninitized warning

2019-07-12 Thread Roman Kagan
On Fri, Jul 12, 2019 at 03:32:43PM +0200, Arnd Bergmann wrote:
> clang points out that running a 64-bit guest on a 32-bit host
> would lead to uninitialized variables:
> 
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'ingpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> if (!longmode) {
> ^
> arch/x86/kvm/hyperv.c:1632:55: note: uninitialized use occurs here
> trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
>  ^
> arch/x86/kvm/hyperv.c:1610:2: note: remove the 'if' if its condition is 
> always true
> if (!longmode) {
> ^~~
> arch/x86/kvm/hyperv.c:1595:18: note: initialize the variable 'ingpa' to 
> silence this warning
> u64 param, ingpa, outgpa, ret = HV_STATUS_SUCCESS;
> ^
>  = 0
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'outgpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'param' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> 
> Since that combination is not supported anyway, change the condition
> to tell the compiler how the code is actually executed.
> 
> Signed-off-by: Arnd Bergmann 
> ---
> v2: make the change inside of is_64_bit_mode().
> ---
>  arch/x86/kvm/hyperv.c | 6 ++
>  arch/x86/kvm/x86.h| 4 
>  2 files changed, 6 insertions(+), 4 deletions(-)

Reviewed-by: Roman Kagan 

However I still think the log message could state it more explicitly
that it was the compiler's fault, and the patch is a workaround for it.

Otherwise later on someone may decide to restore the similarity of
is_64_bit_mode() to other inlines in this file, and will be extremely
unlikely to test clang 32-bit build...

Roman.


Re: [PATCH 1/2] x86: kvm: avoid -Wsometimes-uninitized warning

2019-07-12 Thread Roman Kagan
On Fri, Jul 12, 2019 at 11:12:29AM +0200, Arnd Bergmann wrote:
> clang points out that running a 64-bit guest on a 32-bit host
> would lead to uninitialized variables:
> 
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'ingpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> if (!longmode) {
> ^
> arch/x86/kvm/hyperv.c:1632:55: note: uninitialized use occurs here
> trace_kvm_hv_hypercall(code, fast, rep_cnt, rep_idx, ingpa, outgpa);
>  ^
> arch/x86/kvm/hyperv.c:1610:2: note: remove the 'if' if its condition is 
> always true
> if (!longmode) {
> ^~~
> arch/x86/kvm/hyperv.c:1595:18: note: initialize the variable 'ingpa' to 
> silence this warning
> u64 param, ingpa, outgpa, ret = HV_STATUS_SUCCESS;
> ^
>  = 0
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'outgpa' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> arch/x86/kvm/hyperv.c:1610:6: error: variable 'param' is used uninitialized 
> whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
> 
> Since that combination is not supported anyway, change the condition
> to tell the compiler how the code is actually executed.

Hmm, the compiler *is* told all it needs:


arch/x86/kvm/x86.h:
...
static inline int is_long_mode(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
return vcpu->arch.efer & EFER_LMA;
#else
return 0;
#endif
}

static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
{
int cs_db, cs_l;

if (!is_long_mode(vcpu))
return false;
kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l);
return cs_l;
}
...

so in !CONFIG_X86_64 case is_64_bit_mode() unconditionally returns
false, and the branch setting the values of the variables is always
taken.

> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/kvm/hyperv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index a39e38f13029..950436c502ba 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1607,7 +1607,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  
>   longmode = is_64_bit_mode(vcpu);
>  
> - if (!longmode) {
> + if (!IS_ENABLED(CONFIG_X86_64) || !longmode) {
>   param = ((u64)kvm_rdx_read(vcpu) << 32) |
> (kvm_rax_read(vcpu) & 0xffffffff);
>   ingpa = ((u64)kvm_rbx_read(vcpu) << 32) |

So this is rather a workaround for the compiler giving a false positive.
I suggest to at least rephrase the log message to indicate this.

Roman.


Re: [PATCH] x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init

2019-03-18 Thread Roman Kagan
On Wed, Mar 13, 2019 at 06:13:42PM +0100, Vitaly Kuznetsov wrote:
> When userspace initializes guest vCPUs it may want to zero all supported
> MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/
> HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5d89a ("kvm/x86: Update SynIC
> timers on guest entry only") we began doing stimer_mark_pending()
> unconditionally on every config change.
> 
> The issue I'm observing manifests itself as following:
> - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
>   pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
> - kvm_hv_has_stimer_pending() starts returning true;
> - kvm_vcpu_has_events() starts returning true;
> - kvm_arch_vcpu_runnable() starts returning true;
> - when kvm_arch_vcpu_ioctl_run() gets into
>   (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
>   - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
> immediately, avoiding normal wait path;
>   - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
> userspace to retry.
> 
> So instead of normal wait path we get a busy loop on all secondary vCPUs
> before they get INIT signal. This seems to be undesirable, especially given
> that this happens even when Hyper-V extensions are not used.
> 
> Generally, it seems to be pointless to mark an stimer as pending in
> stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
> kvm_hv_process_stimers() will do is clear the corresponding bit. We may
> just not mark disabled timers as pending instead.
> 
> Fixes: f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v2 1/2] x86/hyper-v: Stop caring about EOI for direct stimers

2018-12-13 Thread Roman Kagan
On Wed, Dec 12, 2018 at 05:50:16PM +0100, Vitaly Kuznetsov wrote:
> Turns out we over-engineered Direct Mode for stimers a bit: unlike
> traditional stimers where we may want to try to re-inject the message upon
> EOI, Direct Mode stimers just set the irq in APIC and kvm_apic_set_irq()
> fails only when APIC is disabled (see APIC_DM_FIXED case in
> __apic_accept_irq()). Remove the redundant part.
> 
> Suggested-by: Roman Kagan 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 36 +++-
>  1 file changed, 3 insertions(+), 33 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v2 2/2] x86/kvm/hyper-v: disallow setting illegal vectors for direct mode stimers

2018-12-13 Thread Roman Kagan
On Wed, Dec 12, 2018 at 05:50:17PM +0100, Vitaly Kuznetsov wrote:
> APIC vectors used for direct mode stimers should be valid for lAPIC and
> just like genuine Hyper-V we should #GP when an illegal one is specified.
> 
> Add the appropriate check to stimer_set_config()
> 
> Suggested-by: Roman Kagan 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Roman Kagan 


Re: [PATCH v2 4/7] x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID

2018-12-11 Thread Roman Kagan
On Tue, Dec 11, 2018 at 04:04:01PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Tue, Dec 11, 2018 at 02:28:14PM +0100, Vitaly Kuznetsov wrote:
> >> Roman Kagan  writes:
> >> 
> >> > On Mon, Dec 10, 2018 at 06:21:56PM +0100, Vitaly Kuznetsov wrote:
> >> 
> >> >> +
> >> >> +Currently, the following list of CPUID leaves are returned:
> >> >> + HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
> >> >> + HYPERV_CPUID_INTERFACE
> >> >> + HYPERV_CPUID_VERSION
> >> >> + HYPERV_CPUID_FEATURES
> >> >> + HYPERV_CPUID_ENLIGHTMENT_INFO
> >> >> + HYPERV_CPUID_IMPLEMENT_LIMITS
> >> >> + HYPERV_CPUID_NESTED_FEATURES
> >> >> +
> >> >> +HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened 
> >> >> VMCS was
> >> >> +enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
> >> >
> >> > IOW the output of ioctl(KVM_GET_SUPPORTED_HV_CPUID) depends on
> >> > whether ioctl(KVM_ENABLE_CAP, KVM_CAP_HYPERV_ENLIGHTENED_VMCS) has
> >> > already been called on that vcpu?  I wonder if this fits the intended
> >> > usage?
> >> 
> >> I added HYPERV_CPUID_NESTED_FEATURES in the list (and made the new ioctl
> >> per-cpu and not per-vm) for consistency. *In theory*
> >> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is also enabled per-vcpu so some
> >> hypothetical userspace can later check enabled eVMCS versions (which can
> >> differ across vCPUs!) with KVM_GET_SUPPORTED_HV_CPUID. We will also have
> >> direct tlb flush and other nested features there so to avoid addning new
> >> KVM_CAP_* for them we need the CPUID.
> >
> > This is different from how KVM_GET_SUPPORTED_CPUID is used: QEMU assumes
> > that its output doesn't change between calls, and even caches the result
> > calling the ioctl only once.
> >
> 
> Yes, I'm not sure if we have to have full consistency between
> KVM_GET_SUPPORTED_CPUID and KVM_GET_SUPPORTED_HV_CPUID.

Neither do I.  I just noticed the difference and thought it might
matter.

> >> Another thing I'm thinking about is something like 'hv_all' cpu flag for
> >> Qemu which would enable everything by setting guest CPUIDs to what
> >> KVM_GET_SUPPORTED_HV_CPUID returns. In that case it would also be
> >> convenient to have HYPERV_CPUID_NESTED_FEATURES properly filled (or not
> >> filled when eVMCS was not enabled).
> >
> > I think this is orthogonal to the way you obtain capability info from
> > the kernel.
> 
> Not necessarily. If very dumb userspace does 'host passthrough' for
> Hyper-V features without doing anything (e.g. not enabling Enlightened
> VMCS) it will just put the result of KVM_GET_SUPPORTED_HV_CPUID in guest
> facing CPUIDs and it will all work. In case eVMCS was previously enabled
> it again just copies everything and this still works.
> 
> We don't probably need this for Qemu though. If you think it would be
> better to have HYPERV_CPUID_NESTED_FEATURES returned regardless of eVMCS
> enablement I'm ready to budge)

I have no opinion on this.  I hope Paolo, who requested the feature, can
explain the desired semantics :)

Roman.


Re: [PATCH v2 4/7] x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID

2018-12-11 Thread Roman Kagan
On Tue, Dec 11, 2018 at 02:28:14PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Mon, Dec 10, 2018 at 06:21:56PM +0100, Vitaly Kuznetsov wrote:
> 
> >> +
> >> +Currently, the following list of CPUID leaves are returned:
> >> + HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
> >> + HYPERV_CPUID_INTERFACE
> >> + HYPERV_CPUID_VERSION
> >> + HYPERV_CPUID_FEATURES
> >> + HYPERV_CPUID_ENLIGHTMENT_INFO
> >> + HYPERV_CPUID_IMPLEMENT_LIMITS
> >> + HYPERV_CPUID_NESTED_FEATURES
> >> +
> >> +HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS 
> >> was
> >> +enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
> >
> > IOW the output of ioctl(KVM_GET_SUPPORTED_HV_CPUID) depends on
> > whether ioctl(KVM_ENABLE_CAP, KVM_CAP_HYPERV_ENLIGHTENED_VMCS) has
> > already been called on that vcpu?  I wonder if this fits the intended
> > usage?
> 
> I added HYPERV_CPUID_NESTED_FEATURES in the list (and made the new ioctl
> per-cpu and not per-vm) for consistency. *In theory*
> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is also enabled per-vcpu so some
> hypothetical userspace can later check enabled eVMCS versions (which can
> differ across vCPUs!) with KVM_GET_SUPPORTED_HV_CPUID. We will also have
> direct tlb flush and other nested features there so to avoid adding new
> KVM_CAP_* for them we need the CPUID.

This is different from how KVM_GET_SUPPORTED_CPUID is used: QEMU assumes
that its output doesn't change between calls, and even caches the result
calling the ioctl only once.

> Another thing I'm thinking about is something like 'hv_all' cpu flag for
> Qemu which would enable everything by setting guest CPUIDs to what
> KVM_GET_SUPPORTED_HV_CPUID returns. In that case it would also be
> convenient to have HYPERV_CPUID_NESTED_FEATURES properly filled (or not
> filled when eVMCS was not enabled).

I think this is orthogonal to the way you obtain capability info from
the kernel.

Roman.


Re: [PATCH v2 4/7] x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID

2018-12-11 Thread Roman Kagan
On Mon, Dec 10, 2018 at 06:21:56PM +0100, Vitaly Kuznetsov wrote:
> With every new Hyper-V Enlightenment we implement we're forced to add a
> KVM_CAP_HYPERV_* capability. While this approach works it is fairly
> inconvenient: the majority of the enlightenments we do have corresponding
> CPUID feature bit(s) and userspace has to know this anyways to be able to
> expose the feature to the guest.
> 
> Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
> cap to rule them all!") returning all Hyper-V CPUID feature leaves.
> 
> Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
> Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x4000,
> 0x4001) and we would probably confuse userspace in case we decide to
> return these twice.
> 
> KVM_CAP_HYPERV_CPUID's number is interim: we're intended to drop
> KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> Changes since v1:
> - Document KVM_GET_SUPPORTED_HV_CPUID and KVM_CAP_HYPERV_CPUID.
> - Fix error handling in kvm_vcpu_ioctl_get_hv_cpuid()
> ---
>  Documentation/virtual/kvm/api.txt |  57 ++
>  arch/x86/kvm/hyperv.c | 121 ++
>  arch/x86/kvm/hyperv.h |   2 +
>  arch/x86/kvm/x86.c|  20 +
>  include/uapi/linux/kvm.h  |   4 +
>  5 files changed, 204 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index cd209f7730af..5b72de0af24d 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3753,6 +3753,63 @@ Coalesced pio is based on coalesced mmio. There is 
> little difference
>  between coalesced mmio and pio except that coalesced pio records accesses
>  to I/O ports.
>  
> +4.117 KVM_GET_SUPPORTED_HV_CPUID
> +
> +Capability: KVM_CAP_HYPERV_CPUID
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_cpuid2 (in/out)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_cpuid2 {
> + __u32 nent;
> + __u32 padding;
> + struct kvm_cpuid_entry2 entries[0];
> +};
> +
> +struct kvm_cpuid_entry2 {
> + __u32 function;
> + __u32 index;
> + __u32 flags;
> + __u32 eax;
> + __u32 ebx;
> + __u32 ecx;
> + __u32 edx;
> + __u32 padding[3];
> +};
> +
> +This ioctl returns x86 cpuid features leaves related to Hyper-V emulation in
> +KVM.  Userspace can use the information returned by this ioctl to construct
> +cpuid information presented to guests consuming Hyper-V enlightenments (e.g.
> +Windows or Hyper-V guests).
> +
> +CPUID feature leaves returned by this ioctl are defined by Hyper-V Top Level
> +Functional Specification (TLFS). These leaves can't be obtained with
> +KVM_GET_SUPPORTED_CPUID ioctl because some of them intersect with KVM feature
> +leaves (0x4000, 0x4001).
> +
> +Currently, the following list of CPUID leaves are returned:
> + HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS
> + HYPERV_CPUID_INTERFACE
> + HYPERV_CPUID_VERSION
> + HYPERV_CPUID_FEATURES
> + HYPERV_CPUID_ENLIGHTMENT_INFO
> + HYPERV_CPUID_IMPLEMENT_LIMITS
> + HYPERV_CPUID_NESTED_FEATURES
> +
> +HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
> +enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).

IOW the output of ioctl(KVM_GET_SUPPORTED_HV_CPUID) depends on
whether ioctl(KVM_ENABLE_CAP, KVM_CAP_HYPERV_ENLIGHTENED_VMCS) has
already been called on that vcpu?  I wonder if this fits the intended
usage?

Roman.

> +
> +Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
> +with the 'nent' field indicating the number of entries in the variable-size
> +array 'entries'.  If the number of entries is too low to describe all Hyper-V
> +feature leaves, an error (E2BIG) is returned. If the number is more or equal
> +to the number of Hyper-V feature leaves, the 'nent' field is adjusted to the
> +number of valid entries in the 'entries' array, which is then filled.
> +
> +'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently 
> reserved,
> +userspace should not expect to get any particular value there.
> +
>  5. The kvm_run structure
>  
>  
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index cecf907e4c49..04c3cdf3389e 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1804,3 +1804,124 @@ int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct 
> kvm_hyperv_eventfd *args)
>   return kvm_hv_eventfd_deassign(kvm, args->conn_id);
>   return kvm_hv_eventfd_assign(kvm, args->conn_id, args->fd);
>  }
> +
> +int kvm_vcpu_ioctl_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 
> *cpuid,
> + struct kvm_cpuid_entry2 __user *entries)
> +{
> + uint16_t evmcs_ver = kvm_x86_ops->nested_get_evmcs_version(vcpu);
> + struct kvm_cpuid_entry2 cpuid_entries[] = {

Re: [PATCH] x86/hyper-v: Stop caring about EOI for direct stimers

2018-12-10 Thread Roman Kagan
[ Sorry, missed this one ]

On Wed, Dec 05, 2018 at 04:36:21PM +0100, Vitaly Kuznetsov wrote:
> Turns out we over-engineered Direct Mode for stimers a bit: unlike
> traditional stimers where we may want to try to re-inject the message upon
> EOI, Direct Mode stimers just set the irq in APIC and kvm_apic_set_irq()
> fails only when APIC is disabled (see APIC_DM_FIXED case in
> __apic_accept_irq()). Remove the redundant part.
> 
> Suggested-by: Roman Kagan 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 36 +++-
>  1 file changed, 3 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index e6a2a085644a..0a16a77e6ac3 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -56,21 +56,8 @@ static inline int synic_get_sint_vector(u64 sint_value)
>  static bool synic_has_vector_connected(struct kvm_vcpu_hv_synic *synic,
> int vector)
>  {
> - struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
> - struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu);
> - struct kvm_vcpu_hv_stimer *stimer;
>   int i;
>  
> - for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++) {
> - stimer = &hv_vcpu->stimer[i];
> - if (stimer->config.enable && stimer->config.direct_mode &&
> - stimer->config.apic_vector == vector)
> - return true;
> - }
> -
> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> - return false;
> -
>   for (i = 0; i < ARRAY_SIZE(synic->sint); i++) {
>   if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
>   return true;
> @@ -96,14 +83,14 @@ static bool synic_has_vector_auto_eoi(struct 
> kvm_vcpu_hv_synic *synic,
>  static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>   int vector)
>  {
> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> + return;
> +
>   if (synic_has_vector_connected(synic, vector))
>   __set_bit(vector, synic->vec_bitmap);
>   else
>   __clear_bit(vector, synic->vec_bitmap);
>  
> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> - return;
> -
>   if (synic_has_vector_auto_eoi(synic, vector))
>   __set_bit(vector, synic->auto_eoi_bitmap);
>   else
> @@ -382,9 +369,7 @@ int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vpidx, u32 
> sint)
>  
>  void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector)
>  {
> - struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu);
>   struct kvm_vcpu_hv_synic *synic = vcpu_to_synic(vcpu);
> - struct kvm_vcpu_hv_stimer *stimer;
>   int i;
>  
>   trace_kvm_hv_synic_send_eoi(vcpu->vcpu_id, vector);
> @@ -392,14 +377,6 @@ void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int 
> vector)
>   for (i = 0; i < ARRAY_SIZE(synic->sint); i++)
>   if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
>   kvm_hv_notify_acked_sint(vcpu, i);
> -
> - for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++) {
> - stimer = &hv_vcpu->stimer[i];
> - if (stimer->msg_pending && stimer->config.enable &&
> - stimer->config.direct_mode &&
> - stimer->config.apic_vector == vector)
> - stimer_mark_pending(stimer, false);
> - }
>  }
>  
>  static int kvm_hv_set_sint_gsi(struct kvm *kvm, u32 vpidx, u32 sint, int gsi)
> @@ -566,8 +543,6 @@ static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
>  static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer, u64 config,
>bool host)
>  {
> - struct kvm_vcpu *vcpu = stimer_to_vcpu(stimer);
> - struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu);
>   union hv_stimer_config new_config = {.as_uint64 = config},
>   old_config = {.as_uint64 = stimer->config.as_uint64};
>  
> @@ -580,11 +555,6 @@ static int stimer_set_config(struct kvm_vcpu_hv_stimer 
> *stimer, u64 config,
>   new_config.enable = 0;
>   stimer->config.as_uint64 = new_config.as_uint64;
>  
> - if (old_config.direct_mode)
> - synic_update_vector(&hv_vcpu->synic, old_config.apic_vector);
> - if (new_config.direct_mode)
> - synic_update_vector(&hv_vcpu->synic, new_config.apic_vector);
> -
>   stimer_mark_pending(stimer, false);
>   return 0;
>  }

As discussed in another thread, it seems worth while to make
stimer_set_config reject vectors 0..15.

Besides I'd rather squash this patch into the one that introduced direct
timers, before it reached Linus' tree.

Roman.


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-12-10 Thread Roman Kagan
On Mon, Dec 10, 2018 at 01:54:18PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> > Just noticed that the patch seems to assume that "direct" timers are
> > allowed to use any vectors including 0-15.  I guess this is incorrect,
> > and instead stimer_set_config should error out on direct mode with a
> > vector less than HV_SYNIC_FIRST_VALID_VECTOR.
> 
> The spec is really vague about this and I'm not sure that this has
> anything to do with HV_SYNIC_FIRST_VALID_VECTOR (as these are actually
> not "synic" vectors, I *think* that SynIC doesn't even need to be
> enabled to make them work).
> 
> I checked and Hyper-V 2016 uses vector '0xff', not sure if it proves
> your point :-)
> 
> Do you envision any issues in KVM if we keep allowing vectors <
> HV_SYNIC_FIRST_VALID_VECTOR?

It's actually lapic that treats vectors 0..15 as illegal.  Nothing
Hyper-V specific here.

Roman.


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-12-10 Thread Roman Kagan
On Mon, Nov 26, 2018 at 04:47:31PM +0100, Vitaly Kuznetsov wrote:
> Turns out Hyper-V on KVM (as of 2016) will only use synthetic timers
> if direct mode is available. With direct mode we notify the guest by
> asserting APIC irq instead of sending a SynIC message.
> 
> The implementation uses existing vec_bitmap for letting lapic code
> know that we're interested in the particular IRQ's EOI request. We assume
> that the same APIC irq won't be used by the guest for both direct mode
> stimer and as sint source (especially with AutoEOI semantics). It is
> unclear how things should be handled if that's not true.
> 
> Direct mode is also somewhat less expensive; in my testing
> stimer_send_msg() takes not less than 1500 cpu cycles and
> stimer_notify_direct() can usually be done in 300-400. WS2016 without
> Hyper-V, however, always sticks to non-direct version.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> - Changes since v1: avoid open-coding stimer_mark_pending() in
>   kvm_hv_synic_send_eoi() [Paolo Bonzini]
> ---
>  arch/x86/kvm/hyperv.c| 67 +++-
>  arch/x86/kvm/trace.h | 10 +++---
>  arch/x86/kvm/x86.c   |  1 +
>  include/uapi/linux/kvm.h |  1 +
>  4 files changed, 67 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index eaec15c738df..9533133be566 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -38,6 +38,9 @@
>  
>  #define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, 64)
>  
> +static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
> + bool vcpu_kick);
> +
>  static inline u64 synic_read_sint(struct kvm_vcpu_hv_synic *synic, int sint)
>  {
> return atomic64_read(&synic->sint[sint]);
> @@ -53,8 +56,21 @@ static inline int synic_get_sint_vector(u64 sint_value)
>  static bool synic_has_vector_connected(struct kvm_vcpu_hv_synic *synic,
> int vector)
>  {
> + struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
> + struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu);
> + struct kvm_vcpu_hv_stimer *stimer;
>   int i;
>  
> + for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++) {
> + stimer = &hv_vcpu->stimer[i];
> + if (stimer->config.enable && stimer->config.direct_mode &&
> + stimer->config.apic_vector == vector)
> + return true;
> + }
> +
> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> + return false;
> +
>   for (i = 0; i < ARRAY_SIZE(synic->sint); i++) {
>   if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
>   return true;
> @@ -80,14 +96,14 @@ static bool synic_has_vector_auto_eoi(struct 
> kvm_vcpu_hv_synic *synic,
>  static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>   int vector)
>  {
> - if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> - return;
> -
>   if (synic_has_vector_connected(synic, vector))
>   __set_bit(vector, synic->vec_bitmap);
>   else
>   __clear_bit(vector, synic->vec_bitmap);
>  
> + if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> + return;
> +

Just noticed that the patch seems to assume that "direct" timers are
allowed to use any vectors including 0-15.  I guess this is incorrect,
and instead stimer_set_config should error out on direct mode with a
vector less than HV_SYNIC_FIRST_VALID_VECTOR.

Roman.


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-12-03 Thread Roman Kagan
On Mon, Nov 26, 2018 at 04:47:31PM +0100, Vitaly Kuznetsov wrote:
> @@ -379,6 +398,14 @@ void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int 
> vector)
>   for (i = 0; i < ARRAY_SIZE(synic->sint); i++)
>   if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
>   kvm_hv_notify_acked_sint(vcpu, i);
> +
> + for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++) {
> + stimer = &hv_vcpu->stimer[i];
> + if (stimer->msg_pending && stimer->config.enable &&
> + stimer->config.direct_mode &&
> + stimer->config.apic_vector == vector)
> + stimer_mark_pending(stimer, false);
> + }
>  }

While debugging another issue with synic timers, it just occurred to me
that with direct timers no extra processing is necessary on EOI: unlike
traditional synic timers which may have failed to deliver a message and
want to be notified when they can retry, direct timers just set the irq
directly in the apic.

So this hunk shouldn't be needed, should it?

Roman.


Re: [PATCH v2] x86/hyper-v: Mark TLFS structures packed

2018-12-02 Thread Roman Kagan
On Mon, Dec 03, 2018 at 12:35:35AM +0100, Vitaly Kuznetsov wrote:
> Nadav Amit  writes:
> 
> [skip]
> 
> >
> > Having said that, something else is sort of strange in the TLFS definitions,
> > I think (I really know little about this whole protocol). Look at the
> > following definitions from hyperv-tlfs.h:
> >
> >> struct hv_vpset {
> >> u64 format;
> >> u64 valid_bank_mask;
> >> u64 bank_contents[];
> >> };
> >> 
> >> struct hv_tlb_flush_ex {
> >> u64 address_space;
> >> u64 flags;
> >> struct hv_vpset hv_vp_set;
> >> u64 gva_list[];
> >> };
> >
> > It seems you have two flexible array members at the end of hv_tlb_flush_ex.
> > This causes bank_contents[x] and gva_list[x] to overlap. So unless they have
> > the same meaning, this asks for trouble IMHO.
> >
> 
> This is weird but intentional :-) We're just following Hyper-V spec
> here.
> 
> E.g. HvFlushVirtualAddressListEx hypercall has the following input ABI:
> 
> [Fixed len head][[Fixed len VP set spec]Var len VP set][Var len addr List]
> 
> "Fixed len VP set spec" defines the true length of "Var len VP set" and
> "Address List" starts right after that. The length of the whole
> structure is also known.
> 
> So bank_contents[] and gva_list[] do overlap (and have different
> meaning). We take special precautions when forming the structure
> (e.g. fill_gva_list() takes 'offset').

This basically means that the argument of this hypercall can't be
represented by a C struct.  gva_list just can't be used.  So I'd rather
remove it from the struct (but leave a comment to that end perhaps), and
construct the message in place (as is done now anyway).

Roman.


Re: [PATCH] x86/hyper-v: define structures from TLFS as packed

2018-11-30 Thread Roman Kagan
On Fri, Nov 30, 2018 at 02:44:54PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Fri, Nov 30, 2018 at 01:15:11PM +0100, Vitaly Kuznetsov wrote:
> >> Without 'packed', the compiler is free to add padding for alignment
> >> and to re-order structure fields for randomization/optimization. And
> >> since structures from hyperv-tlfs.h are used for hypervisor-guest
> >> communication, we need to ultimately forbid such practices.
> >
> > Note that __packed also reduces the structure alignment to 1, which is
> > not necessarily what you want.
> >
> > E.g. some of these structures are passed by pointer to the hypercall,
> > which requires its arguments to be 8byte-aligned.
> 
> Hm,
> 
> I thought we always take precautions for Hyper-V hypercall arguments, in
> particular
> 
> PV IPI/TLB flush use pre-allocated hyperv_pcpu_input_arg,
> hv_post_message() uses pre-allocated message page, other call sites use
> fast hypercalls where we use registers.

Looks so indeed.

> I also checked this patch before sending out, WS2016 guest boots without
> issues. Any particular places you're worried about?

It's Linux guests on Hyper-V that need to be checked.

> >  I'm also not sure
> > that passing unaligned argument to [rw]msr is ok, need to
> > double-check.
> 
> My understanding is that rdmsr/wrmsr instructions are registers-only.

Indeed.

> We can, of course, just add __aligned(8) to some structures but I'd like
> to find the reason first.

I guess you're right, in the current code the relevant structures are
made sufficiently aligned by other means.  False alarm, sorry.

Roman.
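The side effect raised in this thread is easy to demonstrate in isolation: __attribute__((packed)) does not only forbid padding holes, it also drops the natural alignment of the whole type to 1. A standalone sketch with made-up struct names, not kernel code (in the kernel, as noted above, the arguments end up sufficiently aligned by other means, e.g. pre-allocated per-cpu pages):

```c
#include <assert.h>
#include <stdint.h>

struct hypercall_arg {			/* naturally laid out */
	uint64_t address_space;
	uint64_t flags;
};

struct hypercall_arg_packed {		/* same members, but packed */
	uint64_t address_space;
	uint64_t flags;
} __attribute__((packed));

/* No holes in either case, so the sizes match... */
_Static_assert(sizeof(struct hypercall_arg) ==
	       sizeof(struct hypercall_arg_packed), "same size");

/* ...but the packed type may now be placed at any byte address, which
 * is a problem for a hypercall that expects 8-byte-aligned arguments. */
_Static_assert(_Alignof(struct hypercall_arg) == 8, "natural alignment");
_Static_assert(_Alignof(struct hypercall_arg_packed) == 1, "packed alignment");
```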


Re: [PATCH] x86/hyper-v: define structures from TLFS as packed

2018-11-30 Thread Roman Kagan
On Fri, Nov 30, 2018 at 01:15:11PM +0100, Vitaly Kuznetsov wrote:
> Without 'packed', the compiler is free to add padding for alignment
> and to re-order structure fields for randomization/optimization. And
> since structures from hyperv-tlfs.h are used for hypervisor-guest
> communication, we need to ultimately forbid such practices.

Note that __packed also reduces the structure alignment to 1, which is
not necessarily what you want.

E.g. some of these structures are passed by pointer to the hypercall,
which requires its arguments to be 8byte-aligned.  I'm also not sure
that passing unaligned argument to [rw]msr is ok, need to double-check.

Roman.

> Suggested-by: Nadav Amit 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> - This is a follow-up to my "[PATCH v2 0/4] x86/kvm/hyper-v: Implement
>  Direct Mode for synthetic timers" series, as suggested by Thomas I'm
>  routing it to KVM tree to avoid merge conflicts.
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 50 +++---
>  1 file changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index ebfed56976d2..6a60fa17f6f2 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -271,7 +271,7 @@ union hv_x64_msr_hypercall_contents {
>   u64 enable:1;
>   u64 reserved:11;
>   u64 guest_physical_address:52;
> - };
> + } __packed;
>  };
>  
>  /*
> @@ -283,7 +283,7 @@ struct ms_hyperv_tsc_page {
>   volatile u64 tsc_scale;
>   volatile s64 tsc_offset;
>   u64 reserved2[509];
> -};
> +}  __packed;
>  
>  /*
>   * The guest OS needs to register the guest ID with the hypervisor.
> @@ -324,7 +324,7 @@ struct hv_reenlightenment_control {
>   __u64 enabled:1;
>   __u64 reserved2:15;
>   __u64 target_vp:32;
> -};
> +}  __packed;
>  
>  #define HV_X64_MSR_TSC_EMULATION_CONTROL 0x4107
>  #define HV_X64_MSR_TSC_EMULATION_STATUS  0x4108
> @@ -332,12 +332,12 @@ struct hv_reenlightenment_control {
>  struct hv_tsc_emulation_control {
>   __u64 enabled:1;
>   __u64 reserved:63;
> -};
> +} __packed;
>  
>  struct hv_tsc_emulation_status {
>   __u64 inprogress:1;
>   __u64 reserved:63;
> -};
> +} __packed;
>  
>  #define HV_X64_MSR_HYPERCALL_ENABLE  0x0001
>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT  12
> @@ -409,7 +409,7 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
>   __u32 res1;
>   __u64 tsc_scale;
>   __s64 tsc_offset;
> -} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
> +}  __packed HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
>  
>  /* Define the number of synthetic interrupt sources. */
>  #define HV_SYNIC_SINT_COUNT  (16)
> @@ -466,7 +466,7 @@ union hv_message_flags {
>   struct {
>   __u8 msg_pending:1;
>   __u8 reserved:7;
> - };
> + } __packed;
>  };
>  
>  /* Define port identifier type. */
> @@ -488,7 +488,7 @@ struct hv_message_header {
>   __u64 sender;
>   union hv_port_id port;
>   };
> -};
> +} __packed;
>  
>  /* Define synthetic interrupt controller message format. */
>  struct hv_message {
> @@ -496,12 +496,12 @@ struct hv_message {
>   union {
>   __u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
>   } u;
> -};
> +} __packed;
>  
>  /* Define the synthetic interrupt message page layout. */
>  struct hv_message_page {
>   struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
> -};
> +} __packed;
>  
>  /* Define timer message payload structure. */
>  struct hv_timer_message_payload {
> @@ -509,7 +509,7 @@ struct hv_timer_message_payload {
>   __u32 reserved;
>   __u64 expiration_time;  /* When the timer expired */
>   __u64 delivery_time;/* When the message was delivered */
> -};
> +} __packed;
>  
>  /* Define virtual processor assist page structure. */
>  struct hv_vp_assist_page {
> @@ -519,7 +519,7 @@ struct hv_vp_assist_page {
>   __u64 nested_enlightenments_control[2];
>   __u32 enlighten_vmentry;
>   __u64 current_nested_vmcs;
> -};
> +} __packed;
>  
>  struct hv_enlightened_vmcs {
>   u32 revision_id;
> @@ -693,7 +693,7 @@ struct hv_enlightened_vmcs {
>   u32 nested_flush_hypercall:1;
>   u32 msr_bitmap:1;
>   u32 reserved:30;
> - } hv_enlightenments_control;
> + }  __packed hv_enlightenments_control;
>   u32 hv_vp_id;
>  
>   u64 hv_vm_id;
> @@ -703,7 +703,7 @@ struct hv_enlightened_vmcs {
>   u64 padding64_5[7];
>   u64 xss_exit_bitmap;
>   u64 padding64_6[7];
> -};
> +} __packed;
>  
>  #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE  0
>  #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP BIT(0)
> @@ -744,7 +744,7 @@ union hv_stimer_config {
>   u64 reserved_z0:3;
>   u64 sintx:4;
>   u64 reserved_z1:44;
> - };
> + } 

Re: [PATCH v2 1/4] x86/hyper-v: move synic/stimer control structures definitions to hyperv-tlfs.h

2018-11-28 Thread Roman Kagan
On Wed, Nov 28, 2018 at 02:07:42PM +0100, Thomas Gleixner wrote:
> On Wed, 28 Nov 2018, Vitaly Kuznetsov wrote:
> 
> > Nadav Amit  writes:
> > 
> > >
> > > On a different note: how come all of the hyper-v structs are not marked
> > > with the “packed" attribute?
> > 
> > "packed" should not be needed with proper padding; I vaguely remember
> > someone (from x86@?) arguing _against_ "packed".
> 
> Packed needs to be used, when describing fixed format data structures in
> hardware or other ABIs, so the compiler cannot put alignment holes into
> them.
> 
> Using packed for generic data structures might result in suboptimal layouts
> and prevents layout randomization.

Sorry for my ignorance; I haven't followed this area closely, so I had
to google struct layout randomization.  Is my understanding correct
that one can no longer rely on the compiler to naturally align the
struct members with minimal padding?  My life will never be the same...

Roman.
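Both positions in this subthread can be illustrated side by side: with explicit manual padding the compiler has no holes to insert, so __packed does not change the field offsets; what it still rules out is field re-ordering under struct layout randomization. A standalone sketch with invented field names (not the actual hv_message_header layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Explicitly padded, so the compiler has no holes to insert... */
struct msg_header {
	uint32_t message_type;
	uint8_t  payload_size;
	uint8_t  message_flags;
	uint8_t  reserved[2];	/* manual padding up to an 8-byte boundary */
	uint64_t sender;
};

struct msg_header_packed {
	uint32_t message_type;
	uint8_t  payload_size;
	uint8_t  message_flags;
	uint8_t  reserved[2];
	uint64_t sender;
} __attribute__((packed));

/* ...and the field offsets are identical either way (absent struct
 * layout randomization, which packed additionally rules out). */
_Static_assert(offsetof(struct msg_header, sender) ==
	       offsetof(struct msg_header_packed, sender), "same layout");
_Static_assert(sizeof(struct msg_header) == 16, "no hidden padding");
```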


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-11-27 Thread Roman Kagan
On Tue, Nov 27, 2018 at 02:54:21PM +0100, Paolo Bonzini wrote:
> On 27/11/18 09:37, Roman Kagan wrote:
> > On Mon, Nov 26, 2018 at 05:44:24PM +0100, Paolo Bonzini wrote:
> >> On 26/11/18 16:47, Vitaly Kuznetsov wrote:
> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>> index 5cd5647120f2..b21b5ceb8d26 100644
> >>> --- a/arch/x86/kvm/x86.c
> >>> +++ b/arch/x86/kvm/x86.c
> >>> @@ -2997,6 +2997,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, 
> >>> long ext)
> >>>   case KVM_CAP_HYPERV_TLBFLUSH:
> >>>   case KVM_CAP_HYPERV_SEND_IPI:
> >>>   case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
> >>> + case KVM_CAP_HYPERV_STIMER_DIRECT:
> >>>   case KVM_CAP_PCI_SEGMENT:
> >>>   case KVM_CAP_DEBUGREGS:
> >>>   case KVM_CAP_X86_ROBUST_SINGLESTEP:
> >>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>> index 2b7a652c9fa4..b8da14cee8e5 100644
> >>> --- a/include/uapi/linux/kvm.h
> >>> +++ b/include/uapi/linux/kvm.h
> >>> @@ -975,6 +975,7 @@ struct kvm_ppc_resize_hpt {
> >>>  #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
> >>>  #define KVM_CAP_EXCEPTION_PAYLOAD 164
> >>>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
> >>> +#define KVM_CAP_HYPERV_STIMER_DIRECT 166
> >>
> >> I wonder if all these capabilities shouldn't be replaced by a single
> >> KVM_GET_HYPERV_SUPPORTED_CPUID ioctl, or something like that.
> > 
> > Hmm, why?  Are we running short of cap numbers?
> > 
> > Capabilities are a well-established and unambiguous negotiation
> > mechanism, why invent another one?  Besides, not all features map
> > conveniently onto cpuid bits, e.g. currently we have two versions of
> > SynIC support, which differ in the way the userspace deals with it, but
> > not in the cpuid bits we expose in the guest.  IMO such an ioctl would
> > bring more complexity rather than less.
> 
> Yes, in that case (for bugfixes basically) you'd have to use
> capabilities.  But if the feature is completely hidden to userspace
> except for the CPUID bit, it seems like a ioctl would be more consistent
> with the rest of the KVM API.

So there will be some features that are more equal than others?

I think that it's the current scheme which is more consistent: there are
features that are implemented in KVM, and there are caps for negotiating
them with userspace, and then -- separately -- there's a mix of bits to
show to the guest in response to CPUID, which only the userspace has to
bother with.

Thanks,
Roman.


Re: [PATCH v2 1/4] x86/hyper-v: move synic/stimer control structures definitions to hyperv-tlfs.h

2018-11-27 Thread Roman Kagan
On Tue, Nov 27, 2018 at 02:10:49PM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> > On Mon, Nov 26, 2018 at 04:47:29PM +0100, Vitaly Kuznetsov wrote:
> > I personally tend to prefer masks over bitfields, so I'd rather do the
> > consolidation in the opposite direction: use the definitions in
> > hyperv-tlfs.h and replace those unions/bitfields elsewhere.  (I vaguely
> > remember posting such a patchset a couple of years ago but I lacked the
> > motivation to complete it).
> 
> Are there any known advantages of using masks over bitfields or the
> resulting binary code is the same?

Strictly speaking, bitwise ops are portable while bitfields are not,
but I guess this is not much of an issue with gcc, which can be
depended on to produce the right thing.

I came to dislike bitfields for the false impression of atomicity that
assignment gives, while most of the time they are read-modify-write
operations.

And no, I don't feel strong about it, so if nobody backs me on this I
give up :)

Roman.
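The read-modify-write point becomes visible when the two styles are written next to each other: the mask version spells out the load/mask/store, while the bitfield assignment compiles to much the same sequence but looks deceptively atomic. A standalone sketch loosely modeled on the stimer config layout (simplified field list and invented macro names, not the kernel's definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Bitfield view, in the style of hyperv-tlfs.h (simplified) */
union stimer_config {
	uint64_t as_uint64;
	struct {
		uint64_t enable:1;
		uint64_t periodic:1;
		uint64_t lazy:1;
		uint64_t auto_enable:1;
		uint64_t apic_vector:8;
		uint64_t rest:52;
	};
};

/* Equivalent mask/shift definitions (invented names) */
#define STIMER_ENABLE		(1ULL << 0)
#define STIMER_VECTOR_SHIFT	4
#define STIMER_VECTOR_MASK	(0xffULL << STIMER_VECTOR_SHIFT)

uint64_t set_vector_masked(uint64_t cfg, uint8_t vec)
{
	/* The read-modify-write is explicit here: clear, then set */
	return (cfg & ~STIMER_VECTOR_MASK) |
	       ((uint64_t)vec << STIMER_VECTOR_SHIFT);
}

uint64_t set_vector_bitfield(uint64_t cfg, uint8_t vec)
{
	union stimer_config c = { .as_uint64 = cfg };
	c.apic_vector = vec;	/* same RMW, hidden behind an assignment */
	return c.as_uint64;
}
```

On a little-endian gcc/clang target both functions produce the same value, e.g. setting vector 0x51 on an enabled config yields 0x511.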


Re: [PATCH v2 4/4] x86/kvm/hyper-v: avoid open-coding stimer_mark_pending() in kvm_hv_notify_acked_sint()

2018-11-27 Thread Roman Kagan
On Mon, Nov 26, 2018 at 04:47:32PM +0100, Vitaly Kuznetsov wrote:
> stimers_pending optimization only helps us to avoid multiple
> kvm_make_request() calls. This doesn't happen very often and these
> calls are very cheap in the first place, remove open-coded version of
> stimer_mark_pending() from kvm_hv_notify_acked_sint().

Frankly speaking, I've yet to see a guest that configures more than one
SynIC timer.  So it was indeed a bit of overengineering in the first
place.

> Suggested-by: Paolo Bonzini 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 12 +++-
>  1 file changed, 3 insertions(+), 9 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-11-27 Thread Roman Kagan
On Mon, Nov 26, 2018 at 05:44:24PM +0100, Paolo Bonzini wrote:
> On 26/11/18 16:47, Vitaly Kuznetsov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5cd5647120f2..b21b5ceb8d26 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2997,6 +2997,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, 
> > long ext)
> > case KVM_CAP_HYPERV_TLBFLUSH:
> > case KVM_CAP_HYPERV_SEND_IPI:
> > case KVM_CAP_HYPERV_ENLIGHTENED_VMCS:
> > +   case KVM_CAP_HYPERV_STIMER_DIRECT:
> > case KVM_CAP_PCI_SEGMENT:
> > case KVM_CAP_DEBUGREGS:
> > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 2b7a652c9fa4..b8da14cee8e5 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -975,6 +975,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
> >  #define KVM_CAP_EXCEPTION_PAYLOAD 164
> >  #define KVM_CAP_ARM_VM_IPA_SIZE 165
> > +#define KVM_CAP_HYPERV_STIMER_DIRECT 166
> 
> I wonder if all these capabilities shouldn't be replaced by a single
> KVM_GET_HYPERV_SUPPORTED_CPUID ioctl, or something like that.

Hmm, why?  Are we running short of cap numbers?

Capabilities are a well-established and unambiguous negotiation
mechanism, why invent another one?  Besides, not all features map
conveniently onto cpuid bits, e.g. currently we have two versions of
SynIC support, which differ in the way the userspace deals with it, but
not in the cpuid bits we expose in the guest.  IMO such an ioctl would
bring more complexity rather than less.

Roman.


Re: [PATCH v2 3/4] x86/kvm/hyper-v: direct mode for synthetic timers

2018-11-27 Thread Roman Kagan
On Mon, Nov 26, 2018 at 04:47:31PM +0100, Vitaly Kuznetsov wrote:
> Turns out Hyper-V on KVM (as of 2016) will only use synthetic timers
> if direct mode is available. With direct mode we notify the guest by
> asserting APIC irq instead of sending a SynIC message.
> 
> The implementation uses existing vec_bitmap for letting lapic code
> know that we're interested in the particular IRQ's EOI request. We assume
> that the same APIC irq won't be used by the guest for both direct mode
> stimer and as sint source (especially with AutoEOI semantics). It is
> unclear how things should be handled if that's not true.
> 
> Direct mode is also somewhat less expensive; in my testing
> stimer_send_msg() takes not less than 1500 cpu cycles and
> stimer_notify_direct() can usually be done in 300-400. WS2016 without
> Hyper-V, however, always sticks to non-direct version.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> - Changes since v1: avoid open-coding stimer_mark_pending() in
>   kvm_hv_synic_send_eoi() [Paolo Bonzini]
> ---
>  arch/x86/kvm/hyperv.c| 67 +++-
>  arch/x86/kvm/trace.h | 10 +++---
>  arch/x86/kvm/x86.c   |  1 +
>  include/uapi/linux/kvm.h |  1 +
>  4 files changed, 67 insertions(+), 12 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v2 1/4] x86/hyper-v: move synic/stimer control structures definitions to hyperv-tlfs.h

2018-11-26 Thread Roman Kagan
[ Sorry for having missed v1 ]

On Mon, Nov 26, 2018 at 04:47:29PM +0100, Vitaly Kuznetsov wrote:
> We implement Hyper-V SynIC and synthetic timers in KVM too so there's some
> room for code sharing.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 69 ++
>  drivers/hv/hv.c|  2 +-
>  drivers/hv/hyperv_vmbus.h  | 68 -
>  3 files changed, 70 insertions(+), 69 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 

Re: [PATCH v2 1/4] x86/hyper-v: move synic/stimer control structures definitions to hyperv-tlfs.h

2018-11-26 Thread Roman Kagan
[ Sorry for having missed v1 ]

On Mon, Nov 26, 2018 at 04:47:29PM +0100, Vitaly Kuznetsov wrote:
> We implement Hyper-V SynIC and synthetic timers in KVM too so there's some
> room for code sharing.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 69 ++
>  drivers/hv/hv.c|  2 +-
>  drivers/hv/hyperv_vmbus.h  | 68 -
>  3 files changed, 70 insertions(+), 69 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index 4139f7650fe5..b032c05cd3ee 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -731,6 +731,75 @@ struct hv_enlightened_vmcs {
>  #define HV_STIMER_AUTOENABLE (1ULL << 3)
>  #define HV_STIMER_SINT(config)   (__u8)(((config) >> 16) & 0x0F)
>  
> +/* Define synthetic interrupt controller flag constants. */
> +#define HV_EVENT_FLAGS_COUNT (256 * 8)
> +#define HV_EVENT_FLAGS_LONG_COUNT (256 / sizeof(unsigned long))
> +
> +/*
> + * Synthetic timer configuration.
> + */
> +union hv_stimer_config {
> + u64 as_uint64;
> + struct {
> + u64 enable:1;
> + u64 periodic:1;
> + u64 lazy:1;
> + u64 auto_enable:1;
> + u64 apic_vector:8;
> + u64 direct_mode:1;
> + u64 reserved_z0:3;
> + u64 sintx:4;
> + u64 reserved_z1:44;
> + };
> +};
> +
> +
> +/* Define the synthetic interrupt controller event flags format. */
> +union hv_synic_event_flags {
> + unsigned long flags[HV_EVENT_FLAGS_LONG_COUNT];
> +};
> +
> +/* Define SynIC control register. */
> +union hv_synic_scontrol {
> + u64 as_uint64;
> + struct {
> + u64 enable:1;
> + u64 reserved:63;
> + };
> +};
> +
> +/* Define synthetic interrupt source. */
> +union hv_synic_sint {
> + u64 as_uint64;
> + struct {
> + u64 vector:8;
> + u64 reserved1:8;
> + u64 masked:1;
> + u64 auto_eoi:1;
> + u64 reserved2:46;
> + };
> +};
> +
> +/* Define the format of the SIMP register */
> +union hv_synic_simp {
> + u64 as_uint64;
> + struct {
> + u64 simp_enabled:1;
> + u64 preserved:11;
> + u64 base_simp_gpa:52;
> + };
> +};
> +
> +/* Define the format of the SIEFP register */
> +union hv_synic_siefp {
> + u64 as_uint64;
> + struct {
> + u64 siefp_enabled:1;
> + u64 preserved:11;
> + u64 base_siefp_gpa:52;
> + };
> +};
> +
>  struct hv_vpset {
>   u64 format;
>   u64 valid_bank_mask;

I personally tend to prefer masks over bitfields, so I'd rather do the
consolidation in the opposite direction: use the definitions in
hyperv-tlfs.h and replace those unions/bitfields elsewhere.  (I vaguely
remember posting such a patchset a couple of years ago but I lacked the
motivation to complete it).
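As an illustration of that preference, a mask-based encoding of the synthetic timer configuration could look like the sketch below. This is hypothetical code, not Roman's actual patchset: the bit positions are taken from the `union hv_stimer_config` layout quoted above (enable at bit 0, apic_vector at bits 4-11, direct_mode at bit 12, sintx at bits 16-19), and all macro and helper names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical mask-based equivalent of union hv_stimer_config. */
#define HV_STIMER_CFG_ENABLE       (1ULL << 0)
#define HV_STIMER_CFG_PERIODIC     (1ULL << 1)
#define HV_STIMER_CFG_LAZY         (1ULL << 2)
#define HV_STIMER_CFG_AUTOENABLE   (1ULL << 3)
#define HV_STIMER_CFG_DIRECT_MODE  (1ULL << 12)
#define HV_STIMER_CFG_SINTX_SHIFT  16
#define HV_STIMER_CFG_SINTX_MASK   (0xfULL << HV_STIMER_CFG_SINTX_SHIFT)

/* Replace the sintx field in a config word, leaving the other bits alone. */
static inline uint64_t hv_stimer_cfg_set_sintx(uint64_t config,
					       unsigned int sintx)
{
	return (config & ~HV_STIMER_CFG_SINTX_MASK) |
	       (((uint64_t)sintx << HV_STIMER_CFG_SINTX_SHIFT) &
		HV_STIMER_CFG_SINTX_MASK);
}

/* Extract the sintx field, matching the HV_STIMER_SINT() macro semantics. */
static inline unsigned int hv_stimer_cfg_sintx(uint64_t config)
{
	return (config >> HV_STIMER_CFG_SINTX_SHIFT) & 0xf;
}
```

The upside of this style is that readers never have to reason about compiler-specific bitfield layout; the downside is the shift/mask boilerplate the unions were avoiding.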

Roman.


Re: [PATCH v6 4/7] KVM: x86: hyperv: keep track of mismatched VP indexes

2018-10-01 Thread Roman Kagan
On Mon, Oct 01, 2018 at 03:54:26PM +, Roman Kagan wrote:
> On Mon, Oct 01, 2018 at 05:48:54PM +0200, Paolo Bonzini wrote:
> > On 27/09/2018 11:17, Vitaly Kuznetsov wrote:
> > > Roman Kagan  writes:
> > > 
> > >> On Wed, Sep 26, 2018 at 07:02:56PM +0200, Vitaly Kuznetsov wrote:
> > >>> In most common cases VP index of a vcpu matches its vcpu index. 
> > >>> Userspace
> > >>> is, however, free to set any mapping it wishes and we need to account 
> > >>> for
> > >>> that when we need to find a vCPU with a particular VP index. To keep 
> > >>> search
> > >>> algorithms optimal in both cases introduce 'num_mismatched_vp_indexes'
> > >>> counter showing how many vCPUs with mismatching VP index we have. In 
> > >>> case
> > >>> the counter is zero we can assume vp_index == vcpu_idx.
> > >>>
> > >>> Signed-off-by: Vitaly Kuznetsov 
> > >>> ---
> > >>>  arch/x86/include/asm/kvm_host.h |  3 +++
> > >>>  arch/x86/kvm/hyperv.c   | 26 +++---
> > >>>  2 files changed, 26 insertions(+), 3 deletions(-)
> > >>>
> > >>> diff --git a/arch/x86/include/asm/kvm_host.h 
> > >>> b/arch/x86/include/asm/kvm_host.h
> > >>> index 09b2e3e2cf1b..711f79f1b5e6 100644
> > >>> --- a/arch/x86/include/asm/kvm_host.h
> > >>> +++ b/arch/x86/include/asm/kvm_host.h
> > >>> @@ -781,6 +781,9 @@ struct kvm_hv {
> > >>> u64 hv_reenlightenment_control;
> > >>> u64 hv_tsc_emulation_control;
> > >>> u64 hv_tsc_emulation_status;
> > >>> +
> > >>> +   /* How many vCPUs have VP index != vCPU index */
> > >>> +   atomic_t num_mismatched_vp_indexes;
> > >>>  };
> > >>>  
> > >>>  enum kvm_irqchip_mode {
> > >>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > >>> index c8764faf783b..6a19c8e3c432 100644
> > >>> --- a/arch/x86/kvm/hyperv.c
> > >>> +++ b/arch/x86/kvm/hyperv.c
> > >>> @@ -1045,11 +1045,31 @@ static int kvm_hv_set_msr(struct kvm_vcpu 
> > >>> *vcpu, u32 msr, u64 data, bool host)
> > >>> struct kvm_vcpu_hv *hv_vcpu = &vcpu->arch.hyperv;
> > >>>  
> > >>> switch (msr) {
> > >>> -   case HV_X64_MSR_VP_INDEX:
> > >>> -   if (!host || (u32)data >= KVM_MAX_VCPUS)
> > >>> +   case HV_X64_MSR_VP_INDEX: {
> > >>> +   struct kvm_hv *hv = &vcpu->kvm->arch.hyperv;
> > >>> +   int vcpu_idx = kvm_vcpu_get_idx(vcpu);
> > >>> +   u32 new_vp_index = (u32)data;
> > >>> +
> > >>> +   if (!host || new_vp_index >= KVM_MAX_VCPUS)
> > >>> return 1;
> > >>> -   hv_vcpu->vp_index = (u32)data;
> > >>> +
> > >>> +   if (new_vp_index == hv_vcpu->vp_index)
> > >>> +   return 0;
> > >>> +
> > >>> +   /*
> > >>> +* VP index is changing, increment 
> > >>> num_mismatched_vp_indexes in
> > >>> +* case it was equal to vcpu_idx before; on the other 
> > >>> hand, if
> > >>> +* the new VP index matches vcpu_idx 
> > >>> num_mismatched_vp_indexes
> > >>> +* needs to be decremented.
> > >>
> > >> It may be worth mentioning that the initial balance is provided by
> > >> kvm_hv_vcpu_postcreate setting vp_index = vcpu_idx.
> > >>
> > > 
> > > Of course, yes, will update the comment in case I'll be re-submitting.
> > 
> > /*
> >  * VP index is initialized to hv_vcpu->vp_index by
  ^
  vcpu_idx

> >  * kvm_hv_vcpu_postcreate so they initially match.  Now the
> >  * VP index is changing, adjust num_mismatched_vp_indexes if
> >  * it now matches or no longer matches vcpu_idx.
> >  */
> > 
> > ?
> 
> To my taste - perfect :)

Well, almost :)

Roman.
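The invariant under discussion — num_mismatched_vp_indexes always equals the number of vcpus whose vp_index differs from their vcpu index — can be modeled in a few lines of standalone C. This is a toy sketch with hypothetical names, not the KVM code itself:

```c
#include <stddef.h>

#define NVCPUS 4

/* kvm_hv_vcpu_postcreate sets vp_index = vcpu_idx, so start balanced. */
static unsigned int vp_index[NVCPUS] = { 0, 1, 2, 3 };
static int num_mismatched;

/* Model of the HV_X64_MSR_VP_INDEX write path discussed above. */
static void set_vp_index(unsigned int vcpu_idx, unsigned int new_vp_index)
{
	if (new_vp_index == vp_index[vcpu_idx])
		return;
	if (vp_index[vcpu_idx] == vcpu_idx)
		num_mismatched++;	/* was matching, now diverges */
	else if (new_vp_index == vcpu_idx)
		num_mismatched--;	/* returns to matching */
	vp_index[vcpu_idx] = new_vp_index;
}
```

A mismatched-to-mismatched change deliberately leaves the counter untouched, which is why only the two transitions in the comment need handling.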


Re: [PATCH v6 4/7] KVM: x86: hyperv: keep track of mismatched VP indexes

2018-10-01 Thread Roman Kagan
On Mon, Oct 01, 2018 at 05:48:54PM +0200, Paolo Bonzini wrote:
> On 27/09/2018 11:17, Vitaly Kuznetsov wrote:
> > Roman Kagan  writes:
> > 
> >> On Wed, Sep 26, 2018 at 07:02:56PM +0200, Vitaly Kuznetsov wrote:
> >>> In most common cases VP index of a vcpu matches its vcpu index. Userspace
> >>> is, however, free to set any mapping it wishes and we need to account for
> >>> that when we need to find a vCPU with a particular VP index. To keep 
> >>> search
> >>> algorithms optimal in both cases introduce 'num_mismatched_vp_indexes'
> >>> counter showing how many vCPUs with mismatching VP index we have. In case
> >>> the counter is zero we can assume vp_index == vcpu_idx.
> >>>
> >>> Signed-off-by: Vitaly Kuznetsov 
> >>> ---
> >>>  arch/x86/include/asm/kvm_host.h |  3 +++
> >>>  arch/x86/kvm/hyperv.c   | 26 +++---
> >>>  2 files changed, 26 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/arch/x86/include/asm/kvm_host.h 
> >>> b/arch/x86/include/asm/kvm_host.h
> >>> index 09b2e3e2cf1b..711f79f1b5e6 100644
> >>> --- a/arch/x86/include/asm/kvm_host.h
> >>> +++ b/arch/x86/include/asm/kvm_host.h
> >>> @@ -781,6 +781,9 @@ struct kvm_hv {
> >>>   u64 hv_reenlightenment_control;
> >>>   u64 hv_tsc_emulation_control;
> >>>   u64 hv_tsc_emulation_status;
> >>> +
> >>> + /* How many vCPUs have VP index != vCPU index */
> >>> + atomic_t num_mismatched_vp_indexes;
> >>>  };
> >>>  
> >>>  enum kvm_irqchip_mode {
> >>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> >>> index c8764faf783b..6a19c8e3c432 100644
> >>> --- a/arch/x86/kvm/hyperv.c
> >>> +++ b/arch/x86/kvm/hyperv.c
> >>> @@ -1045,11 +1045,31 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, 
> >>> u32 msr, u64 data, bool host)
> >>>   struct kvm_vcpu_hv *hv_vcpu = &vcpu->arch.hyperv;
> >>>  
> >>>   switch (msr) {
> >>> - case HV_X64_MSR_VP_INDEX:
> >>> - if (!host || (u32)data >= KVM_MAX_VCPUS)
> >>> + case HV_X64_MSR_VP_INDEX: {
> >>> + struct kvm_hv *hv = &vcpu->kvm->arch.hyperv;
> >>> + int vcpu_idx = kvm_vcpu_get_idx(vcpu);
> >>> + u32 new_vp_index = (u32)data;
> >>> +
> >>> + if (!host || new_vp_index >= KVM_MAX_VCPUS)
> >>>   return 1;
> >>> - hv_vcpu->vp_index = (u32)data;
> >>> +
> >>> + if (new_vp_index == hv_vcpu->vp_index)
> >>> + return 0;
> >>> +
> >>> + /*
> >>> +  * VP index is changing, increment num_mismatched_vp_indexes in
> >>> +  * case it was equal to vcpu_idx before; on the other hand, if
> >>> +  * the new VP index matches vcpu_idx num_mismatched_vp_indexes
> >>> +  * needs to be decremented.
> >>
> >> It may be worth mentioning that the initial balance is provided by
> >> kvm_hv_vcpu_postcreate setting vp_index = vcpu_idx.
> >>
> > 
> > Of course, yes, will update the comment in case I'll be re-submitting.
> 
>   /*
>* VP index is initialized to hv_vcpu->vp_index by
>* kvm_hv_vcpu_postcreate so they initially match.  Now the
>* VP index is changing, adjust num_mismatched_vp_indexes if
>* it now matches or no longer matches vcpu_idx.
>*/
> 
> ?

To my taste - perfect :)

Roman.


Re: [PATCH v6 7/7] KVM: x86: hyperv: implement PV IPI send hypercalls

2018-09-27 Thread Roman Kagan
On Wed, Sep 26, 2018 at 07:02:59PM +0200, Vitaly Kuznetsov wrote:
> Using hypercall for sending IPIs is faster because this allows to specify
> any number of vCPUs (even > 64 with sparse CPU set), the whole procedure
> will take only one VMEXIT.
> 
> Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
> hypercall can't be 'fast' (passing parameters through registers) but
> apparently this is not true, Windows always uses it as 'fast' so we need
> to support that.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  Documentation/virtual/kvm/api.txt |   7 ++
>  arch/x86/kvm/hyperv.c | 115 ++
>  arch/x86/kvm/trace.h  |  42 +++
>  arch/x86/kvm/x86.c|   1 +
>  include/uapi/linux/kvm.h  |   1 +
>  5 files changed, 166 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 647f94128a85..1659b75d577d 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4772,3 +4772,10 @@ CPU when the exception is taken. If this virtual 
> SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>  
>  See KVM_CAP_VCPU_EVENTS for more details.
> +8.20 KVM_CAP_HYPERV_SEND_IPI
> +
> +Architectures: x86
> +
> +This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> +hypercalls:
> +HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index cc0535a078f7..4b4a6d015ade 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1405,6 +1405,107 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>  }
>  
> +static u64 kvm_hv_send_ipi(struct kvm_vcpu *current_vcpu, u64 ingpa, u64 
> outgpa,
> +bool ex, bool fast)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct kvm_hv *hv = &kvm->arch.hyperv;
> + struct hv_send_ipi_ex send_ipi_ex;
> + struct hv_send_ipi send_ipi;
> + struct kvm_vcpu *vcpu;
> + unsigned long valid_bank_mask;
> + u64 sparse_banks[64];
> + int sparse_banks_len, bank, i, sbank;
> + struct kvm_lapic_irq irq = {.delivery_mode = APIC_DM_FIXED};
> + bool all_cpus;
> +
> + if (!ex) {
> + if (!fast) {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi,
> + sizeof(send_ipi
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = send_ipi.cpu_mask;
> + irq.vector = send_ipi.vector;
> + } else {
> + /* 'reserved' part of hv_send_ipi should be 0 */
> + if (unlikely(ingpa >> 32 != 0))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = outgpa;
> + irq.vector = (u32)ingpa;
> + }
> + all_cpus = false;
> + valid_bank_mask = BIT_ULL(0);
> +
> + trace_kvm_hv_send_ipi(irq.vector, sparse_banks[0]);
> + } else {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi_ex,
> + sizeof(send_ipi_ex
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
> +  send_ipi_ex.vp_set.format,
> +  send_ipi_ex.vp_set.valid_bank_mask);
> +
> + irq.vector = send_ipi_ex.vector;
> + valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
> + sparse_banks_len = bitmap_weight(&valid_bank_mask, 64) *
> + sizeof(sparse_banks[0]);
> +
> + all_cpus = send_ipi_ex.vp_set.format == HV_GENERIC_SET_ALL;
> +
> + if (!sparse_banks_len)
> + goto ret_success;
> +
> + if (!all_cpus &&
> + kvm_read_guest(kvm,
> +ingpa + offsetof(struct hv_send_ipi_ex,
> + vp_set.bank_contents),
> +sparse_banks,
> +sparse_banks_len))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> +
> + if ((irq.vector < HV_IPI_LOW_VECTOR) ||
> + (irq.vector > HV_IPI_HIGH_VECTOR))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + if (all_cpus || atomic_read(&hv->num_mismatched_vp_indexes)) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (all_cpus || hv_vcpu_in_sparse_set(
> + &vcpu->arch.hyperv, sparse_banks,
> + valid_bank_mask)) {
> + /* We fail 
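The non-ex 'fast' variant quoted above packs the entire request into the two hypercall parameter registers: the vector in the low 32 bits of the first, the CPU mask in the second, with the upper half of the first register reserved. A standalone sketch of that decoding follows; the helper name is hypothetical, though the status values match the Hyper-V TLFS.

```c
#include <stdint.h>

#define HV_STATUS_SUCCESS			0
#define HV_STATUS_INVALID_HYPERCALL_INPUT	3

/* Fast HvCallSendSyntheticClusterIpi: vector in the low 32 bits of the
 * first parameter register (ingpa), CPU mask in the second (outgpa);
 * the 'reserved' upper half of ingpa must be zero. */
static uint64_t decode_fast_send_ipi(uint64_t ingpa, uint64_t outgpa,
				     uint32_t *vector, uint64_t *cpu_mask)
{
	if (ingpa >> 32)
		return HV_STATUS_INVALID_HYPERCALL_INPUT;
	*vector = (uint32_t)ingpa;
	*cpu_mask = outgpa;
	return HV_STATUS_SUCCESS;
}
```

This mirrors why the reserved-bits check in the patch sits only in the fast branch: in the memory-based variant the same fields arrive via kvm_read_guest() from a struct hv_send_ipi instead.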

Re: [PATCH v6 6/7] KVM: x86: hyperv: optimize kvm_hv_flush_tlb() for vp_index == vcpu_idx case

2018-09-27 Thread Roman Kagan
On Wed, Sep 26, 2018 at 07:02:58PM +0200, Vitaly Kuznetsov wrote:
> VP index almost always matches the vcpu index, and when it does it's faster to walk
> the sparse set instead of all vcpus.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 96 +++
>  1 file changed, 52 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index eeb12eacd525..cc0535a078f7 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1277,32 +1277,37 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 
> msr, u64 *pdata, bool host)
>   return kvm_hv_get_msr(vcpu, msr, pdata, host);
>  }
>  
> -static __always_inline int get_sparse_bank_no(u64 valid_bank_mask, int 
> bank_no)
> +static __always_inline bool hv_vcpu_in_sparse_set(struct kvm_vcpu_hv 
> *hv_vcpu,
> +   u64 sparse_banks[],
> +   u64 valid_bank_mask)
>  {
> - int i = 0, j;
> + int bank = hv_vcpu->vp_index / 64, sbank;
>  
> - if (!(valid_bank_mask & BIT_ULL(bank_no)))
> - return -1;
> + if (bank >= 64)
> + return false;
>  
> - for (j = 0; j < bank_no; j++)
> - if (valid_bank_mask & BIT_ULL(j))
> - i++;
> + if (!(valid_bank_mask & BIT_ULL(bank)))
> + return false;
>  
> - return i;
> + /* Sparse bank number equals to the number of set bits before it */
> + sbank = bitmap_weight((unsigned long *)&valid_bank_mask, bank);
> +
> + return !!(sparse_banks[sbank] & BIT_ULL(hv_vcpu->vp_index % 64));
>  }
>  
>  static u64 kvm_hv_flush_tlb(struct kvm_vcpu *current_vcpu, u64 ingpa,
>   u16 rep_cnt, bool ex)
>  {
>   struct kvm *kvm = current_vcpu->kvm;
> - struct kvm_vcpu_hv *hv_current = &current_vcpu->arch.hyperv;
> + struct kvm_hv *hv = &kvm->arch.hyperv;
> + struct kvm_vcpu_hv *hv_vcpu = &current_vcpu->arch.hyperv;
>   struct hv_tlb_flush_ex flush_ex;
>   struct hv_tlb_flush flush;
>   struct kvm_vcpu *vcpu;
>   unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)] = {0};
> - u64 valid_bank_mask = 0;
> + u64 valid_bank_mask;
>   u64 sparse_banks[64];
> - int sparse_banks_len, i;
> + int sparse_banks_len, i, bank, sbank;
>   bool all_cpus;
>  
>   if (!ex) {
> @@ -1312,6 +1317,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   trace_kvm_hv_flush_tlb(flush.processor_mask,
>  flush.address_space, flush.flags);
>  
> + valid_bank_mask = BIT_ULL(0);
>   sparse_banks[0] = flush.processor_mask;
>   all_cpus = flush.flags & HV_FLUSH_ALL_PROCESSORS;
>   } else {
> @@ -1344,52 +1350,54 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   return HV_STATUS_INVALID_HYPERCALL_INPUT;
>   }
>  
> - cpumask_clear(&hv_current->tlb_lush);
> + /*
> +  * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we can't
> +  * analyze it here, flush TLB regardless of the specified address space.
> +  */
> + cpumask_clear(&hv_vcpu->tlb_lush);

Maybe squash this hv_current -> hv_vcpu renaming into patch 3?
(And yes this "lush" is funny, too ;)

>  
>   if (all_cpus) {
>   kvm_make_vcpus_request_mask(kvm,
>   KVM_REQ_TLB_FLUSH | KVM_REQUEST_NO_WAKEUP,
> - NULL, &hv_current->tlb_lush);
> + NULL, &hv_vcpu->tlb_lush);
>   goto ret_success;
>   }
>  
> - kvm_for_each_vcpu(i, vcpu, kvm) {
> - struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
> - int bank = hv->vp_index / 64, sbank = 0;
> -
> - /* Banks >64 can't be represented */
> - if (bank >= 64)
> - continue;
> -
> - /* Non-ex hypercalls can only address first 64 vCPUs */
> - if (!ex && bank)
> - continue;
> -
> - if (ex) {
> - /*
> -  * Check is the bank of this vCPU is in sparse
> -  * set and get the sparse bank number.
> -  */
> - sbank = get_sparse_bank_no(valid_bank_mask, bank);
> -
> - if (sbank < 0)
> - continue;
> + if (atomic_read(&hv->num_mismatched_vp_indexes)) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (hv_vcpu_in_sparse_set(&vcpu->arch.hyperv,
> +   sparse_banks,
> +   valid_bank_mask))
> + __set_bit(i, vcpu_bitmap);
>   }
> + goto flush_request;
> + }
>  
> - if (!(sparse_banks[sbank] & BIT_ULL(hv->vp_index % 64)))
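The sparse VP-set walk being introduced above can be sketched standalone: a VP index belongs to the set iff its bank's bit is set in valid_bank_mask, and its bit within that bank is set in the sparse_banks[] entry whose position equals the number of valid banks preceding it. The names below are hypothetical, mirroring the semantics of hv_vcpu_in_sparse_set() rather than reproducing the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of set bits in mask strictly below position 'bit' (bit < 64). */
static int popcount64_below(uint64_t mask, int bit)
{
	return __builtin_popcountll(mask & ((1ULL << bit) - 1));
}

static bool vp_in_sparse_set(uint32_t vp_index, uint64_t valid_bank_mask,
			     const uint64_t *sparse_banks)
{
	int bank = vp_index / 64;

	/* Banks >= 64 can't be represented in the 64-bit valid mask. */
	if (bank >= 64 || !(valid_bank_mask & (1ULL << bank)))
		return false;
	/* Sparse bank number == number of valid banks preceding this one. */
	return sparse_banks[popcount64_below(valid_bank_mask, bank)] &
	       (1ULL << (vp_index % 64));
}
```

The bitmap_weight() call in the patch plays the role of popcount64_below() here, which is what makes the per-vcpu membership test O(1) in the set size instead of a scan.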

Re: [PATCH v6 5/7] KVM: x86: hyperv: valid_bank_mask should be 'u64'

2018-09-27 Thread Roman Kagan
On Wed, Sep 26, 2018 at 07:02:57PM +0200, Vitaly Kuznetsov wrote:
> This probably doesn't matter much (KVM_MAX_VCPUS is much lower nowadays)
> but valid_bank_mask is really u64 and not unsigned long.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v6 4/7] KVM: x86: hyperv: keep track of mismatched VP indexes

2018-09-27 Thread Roman Kagan
On Wed, Sep 26, 2018 at 07:02:56PM +0200, Vitaly Kuznetsov wrote:
> In most common cases VP index of a vcpu matches its vcpu index. Userspace
> is, however, free to set any mapping it wishes and we need to account for
> that when we need to find a vCPU with a particular VP index. To keep search
> algorithms optimal in both cases introduce 'num_mismatched_vp_indexes'
> counter showing how many vCPUs with mismatching VP index we have. In case
> the counter is zero we can assume vp_index == vcpu_idx.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/hyperv.c   | 26 +++---
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b2e3e2cf1b..711f79f1b5e6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -781,6 +781,9 @@ struct kvm_hv {
>   u64 hv_reenlightenment_control;
>   u64 hv_tsc_emulation_control;
>   u64 hv_tsc_emulation_status;
> +
> + /* How many vCPUs have VP index != vCPU index */
> + atomic_t num_mismatched_vp_indexes;
>  };
>  
>  enum kvm_irqchip_mode {
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index c8764faf783b..6a19c8e3c432 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1045,11 +1045,31 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 
> msr, u64 data, bool host)
>   struct kvm_vcpu_hv *hv_vcpu = &vcpu->arch.hyperv;
>  
>   switch (msr) {
> - case HV_X64_MSR_VP_INDEX:
> - if (!host || (u32)data >= KVM_MAX_VCPUS)
> + case HV_X64_MSR_VP_INDEX: {
> +		struct kvm_hv *hv = &vcpu->kvm->arch.hyperv;
> + int vcpu_idx = kvm_vcpu_get_idx(vcpu);
> + u32 new_vp_index = (u32)data;
> +
> + if (!host || new_vp_index >= KVM_MAX_VCPUS)
>   return 1;
> - hv_vcpu->vp_index = (u32)data;
> +
> + if (new_vp_index == hv_vcpu->vp_index)
> + return 0;
> +
> + /*
> +  * VP index is changing, increment num_mismatched_vp_indexes in
> +  * case it was equal to vcpu_idx before; on the other hand, if
> +  * the new VP index matches vcpu_idx num_mismatched_vp_indexes
> +  * needs to be decremented.

It may be worth mentioning that the initial balance is provided by
kvm_hv_vcpu_postcreate setting vp_index = vcpu_idx.

> +  */
> + if (hv_vcpu->vp_index == vcpu_idx)
> +			atomic_inc(&hv->num_mismatched_vp_indexes);
> + else if (new_vp_index == vcpu_idx)
> +			atomic_dec(&hv->num_mismatched_vp_indexes);

> +
> + hv_vcpu->vp_index = new_vp_index;
>   break;
> + }
>   case HV_X64_MSR_VP_ASSIST_PAGE: {
>   u64 gfn;
>   unsigned long addr;

Reviewed-by: Roman Kagan 


Re: [PATCH v6 3/7] KVM: x86: hyperv: consistently use 'hv_vcpu' for 'struct kvm_vcpu_hv' variables

2018-09-27 Thread Roman Kagan
On Wed, Sep 26, 2018 at 07:02:55PM +0200, Vitaly Kuznetsov wrote:
> Rename 'hv' to 'hv_vcpu' in kvm_hv_set_msr/kvm_hv_get_msr(); 'hv' is
> 'reserved' for 'struct kvm_hv' variables across the file.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v4 RESEND 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-09-25 Thread Roman Kagan
On Tue, Sep 25, 2018 at 11:29:57AM +0200, Paolo Bonzini wrote:
> On 25/09/2018 10:57, Roman Kagan wrote:
> > If we can assume that in all relevant cases vp_index coincides with the
> > cpu index (which I think we can) then Vitaly's approach is the most
> > efficient.
> > 
> > If, on the opposite, we want to optimize for random mapping between
> > vp_index and cpu index, then it's probably better instead to iterate
> > over vcpus and test if their vp_index belongs to the requested mask.
> 
> Yes, that would work too.  Perhaps we can do both?  You can have a
> kvm->num_mismatched_vp_indexes count to choose between the two.

Makes sense to me.

Roman.


Re: [PATCH v4 RESEND 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-09-25 Thread Roman Kagan
On Mon, Sep 24, 2018 at 06:55:28PM +0200, Paolo Bonzini wrote:
> On 24/09/2018 18:24, Paolo Bonzini wrote:
> > Hi Paolo,
> > 
> > could you please clarify what needs to be done to get this merged? In
> > particular, are you OK with my comment above or do we need to do
> > something with it (e.g. get back to the 'logarythmic search' from v2)?
> > 
> > In kvm/queue I can see only 'x86/hyper-v: rename ipi_arg_{ex,non_ex}
> > structures' patch from this series applied.
> 
> Hi,
> 
> my plan was to apply only 1/2/5 for now.  I singled out the rename patch
> because that one could be included in 4.19-rc kernels as a cleanup.

Is this supposed to mean you're not happy with the approach taken in
Vitaly's patch?  Can you explain why?  I take my part of guilt for it so
I'd like to know, too.

Speaking of the options we have, the choice depends on the assumptions
we take. (And I guess when you spoke of quadratic complexity you
referred to the algorithm to convert the vp_index mask into the KVM cpu
mask.)

If we can assume that in all relevant cases vp_index coincides with the
cpu index (which I think we can) then Vitaly's approach is the most
efficient.

If, on the opposite, we want to optimize for random mapping between
vp_index and cpu index, then it's probably better instead to iterate
over vcpus and test if their vp_index belongs to the requested mask.

Neither of the above is quadratic.

Dunno if we need to specifically consider intermediate situations.

Anyway using a heavier vp_index -> cpu index translation looks like an
overkill to me.

What do you think?

Thanks,
Roman.


Re: [PATCH v5 5/5] KVM: x86: hyperv: implement PV IPI send hypercalls

2018-08-28 Thread Roman Kagan
 if (all_cpus) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + /* We fail only when APIC is disabled */
> +			if (!kvm_apic_set_irq(vcpu, &irq, NULL))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> + goto ret_success;
> + }
> +
> +	for_each_set_bit(bank, (unsigned long *)&valid_bank_mask, 64) {
> +		for_each_set_bit(i, (unsigned long *)&sparse_banks[bank], 64) {
> + u32 vp_index = bank * 64 + i;
> + struct kvm_vcpu *vcpu =
> + get_vcpu_by_vpidx(kvm, vp_index);
> +
> + /* Unknown vCPU specified */
> + if (!vcpu)
> + continue;
> +
> + /* We fail only when APIC is disabled */
> +			kvm_apic_set_irq(vcpu, &irq, NULL);
> + }
> + }
> +
> +ret_success:
> + return HV_STATUS_SUCCESS;
> +}
> +

I still think that splitting kvm_hv_send_ipi into three functions would
make it more readable, but that's a matter of taste of course, so I'm OK
if Radim insists otherwise.

Reviewed-by: Roman Kagan 


Re: [PATCH v4 RESEND 5/5] KVM: x86: hyperv: implement PV IPI send hypercalls

2018-08-23 Thread Roman Kagan
On Wed, Aug 22, 2018 at 12:18:32PM +0200, Vitaly Kuznetsov wrote:
> Using hypercall for sending IPIs is faster because this allows to specify
> any number of vCPUs (even > 64 with sparse CPU set), the whole procedure
> will take only one VMEXIT.
> 
> Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
> hypercall can't be 'fast' (passing parameters through registers) but
> apparently this is not true, Windows always uses it as 'fast' so we need
> to support that.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  Documentation/virtual/kvm/api.txt |   8 +++
>  arch/x86/kvm/hyperv.c | 109 
> ++
>  arch/x86/kvm/trace.h  |  42 +++
>  arch/x86/kvm/x86.c|   1 +
>  include/uapi/linux/kvm.h  |   1 +
>  5 files changed, 161 insertions(+)

Looks like I forgot to respond to this one, sorry.

> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 7b83b176c662..832ea72d43c1 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4690,3 +4690,11 @@ This capability indicates that KVM supports 
> paravirtualized Hyper-V TLB Flush
>  hypercalls:
>  HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
>  HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.
> +
> +8.19 KVM_CAP_HYPERV_SEND_IPI
> +
> +Architectures: x86
> +
> +This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> +hypercalls:
> +HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index d1a911132b59..3183cf9bcb63 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1360,6 +1360,101 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu 
> *current_vcpu, u64 ingpa,
>   ((u64)rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>  }
>  
> +static u64 kvm_hv_send_ipi(struct kvm_vcpu *current_vcpu, u64 ingpa, u64 
> outgpa,
> +bool ex, bool fast)
> +{
> + struct kvm *kvm = current_vcpu->kvm;
> + struct hv_send_ipi_ex send_ipi_ex;
> + struct hv_send_ipi send_ipi;
> + struct kvm_vcpu *vcpu;
> + unsigned long valid_bank_mask;
> + u64 sparse_banks[64];
> + int sparse_banks_len, bank, i;
> + struct kvm_lapic_irq irq = {.delivery_mode = APIC_DM_FIXED};
> + bool all_cpus;
> +
> + if (!ex) {
> + if (!fast) {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi,
> + sizeof(send_ipi))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = send_ipi.cpu_mask;
> + irq.vector = send_ipi.vector;
> + } else {
> + /* 'reserved' part of hv_send_ipi should be 0 */
> + if (unlikely(ingpa >> 32 != 0))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + sparse_banks[0] = outgpa;
> + irq.vector = (u32)ingpa;
> + }
> + all_cpus = false;
> + valid_bank_mask = BIT_ULL(0);
> +
> + trace_kvm_hv_send_ipi(irq.vector, sparse_banks[0]);
> + } else {
> + if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi_ex,
> + sizeof(send_ipi_ex))))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
> +  send_ipi_ex.vp_set.format,
> +  send_ipi_ex.vp_set.valid_bank_mask);
> +
> + irq.vector = send_ipi_ex.vector;
> + valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
> + sparse_banks_len = bitmap_weight(&valid_bank_mask, 64) *
> + sizeof(sparse_banks[0]);
> +
> + all_cpus = send_ipi_ex.vp_set.format == HV_GENERIC_SET_ALL;
> +
> + if (!sparse_banks_len)
> + goto ret_success;
> +
> + if (!all_cpus &&
> + kvm_read_guest(kvm,
> +ingpa + offsetof(struct hv_send_ipi_ex,
> + vp_set.bank_contents),
> +sparse_banks,
> +sparse_banks_len))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> +
> + if ((irq.vector < HV_IPI_LOW_VECTOR) ||
> + (irq.vector > HV_IPI_HIGH_VECTOR))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> +
> + if (all_cpus) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + /* We fail only when APIC is disabled */
> + if (!kvm_apic_set_irq(vcpu, &irq, NULL))
> + return HV_STATUS_INVALID_HYPERCALL_INPUT;
> + }
> +   

Re: [PATCH v4 4/5] x86/hyper-v: rename ipi_arg_{ex,non_ex} structures

2018-07-09 Thread Roman Kagan
On Tue, Jul 03, 2018 at 03:42:05PM +0200, Vitaly Kuznetsov wrote:
> These structures are going to be used from KVM code so let's make
> their names reflect their Hyper-V origin.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/hv_apic.c  | 12 ++--
>  arch/x86/include/asm/hyperv-tlfs.h | 16 +---
>  2 files changed, 15 insertions(+), 13 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v4 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-07-09 Thread Roman Kagan
On Tue, Jul 03, 2018 at 03:42:04PM +0200, Vitaly Kuznetsov wrote:
> VP_INDEX almost always matches VCPU id and get_vcpu_by_vpidx() is fast,
> use it instead of traversing full vCPU list every time.
> 
> To support the change split off get_vcpu_idx_by_vpidx() from
> get_vcpu_by_vpidx().
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 78 
> ---
>  1 file changed, 31 insertions(+), 47 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v4 2/5] KVM: x86: hyperv: optimize 'all cpus' case in kvm_hv_flush_tlb()

2018-07-09 Thread Roman Kagan
On Tue, Jul 03, 2018 at 03:42:03PM +0200, Vitaly Kuznetsov wrote:
> We can use 'NULL' to represent 'all cpus' case in
> kvm_make_vcpus_request_mask() and avoid building vCPU mask with
> all vCPUs.
> 
> Suggested-by: Radim Krčmář 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 42 +++---
>  virt/kvm/kvm_main.c   |  6 ++
>  2 files changed, 25 insertions(+), 23 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v4 1/5] KVM: x86: hyperv: enforce vp_index < KVM_MAX_VCPUS

2018-07-09 Thread Roman Kagan
On Tue, Jul 03, 2018 at 03:42:02PM +0200, Vitaly Kuznetsov wrote:
> Hyper-V TLFS (5.0b) states:
> 
> > Virtual processors are identified by using an index (VP index). The
> > maximum number of virtual processors per partition supported by the
> > current implementation of the hypervisor can be obtained through CPUID
> > leaf 0x4005. A virtual processor index must be less than the
> > maximum number of virtual processors per partition.
> 
> Forbid userspace to set VP_INDEX above KVM_MAX_VCPUS. get_vcpu_by_vpidx()
> can now be optimized to bail early when supplied vpidx is >= KVM_MAX_VCPUS.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/hyperv.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)

Reviewed-by: Roman Kagan 


Re: [PATCH v3 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 07:26:36PM +0300, Roman Kagan wrote:
> On Fri, Jun 29, 2018 at 05:21:47PM +0200, Vitaly Kuznetsov wrote:
> > Roman Kagan  writes:
> > 
> > > On Fri, Jun 29, 2018 at 04:14:53PM +0200, Vitaly Kuznetsov wrote:
> > >> VP_INDEX almost always matches VCPU id and get_vcpu_by_vpidx() is fast,
> > >> use it instead of traversing full vCPU list every time.
> > >> 
> > >> To support the change switch kvm_make_vcpus_request_mask() to checking
> > >> vcpu_id instead of vcpu index,
> > >
> > > I'm afraid you can't do this: vcpu_id (== apic id) can be sparse, i.e.
> > > it's not very well suited for bitmaps and can exceed the max number of
> > > vcpus.
> > 
> > True. The bitmap should be of KVM_MAX_VCPU_ID size, not
> > KVM_MAX_VCPUS.
> > 
> > Unfortunately there's no convenient way to get VCPU idx from VCPU
> > id, kvm_vcpu_get_idx() just walks the whole list :-( I see two possible
> > options:
> > 1) Add vcpu_idx fields to struct kvm_vcpu
> > 2) Keep the change expecting masks of KVM_MAX_VCPU_ID in
> > kvm_make_vcpus_request_mask(). KVM_MAX_VCPU_ID is currently 1023 so our
> > bitmaps will be 16 longs long. Not sure if it's too much.
> 
> 3) rework get_vcpu_by_vpidx into get_vcpu_idx_by_vpidx followed by
> get_cpu, and use the former for your purposes

s/get_cpu/kvm_get_vcpu/ of course

> 4) duplicate get_vcpu_by_vpidx logic in get_vcpu_idx_by_vpidx
> 
> Roman.
> 
> P.S. I'm starting to wonder how safe this get_vcpu_* thing is WRT vcpu
> removal, but that's a different story anyway.


Re: [PATCH v3 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 05:21:47PM +0200, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Fri, Jun 29, 2018 at 04:14:53PM +0200, Vitaly Kuznetsov wrote:
> >> VP_INDEX almost always matches VCPU id and get_vcpu_by_vpidx() is fast,
> >> use it instead of traversing full vCPU list every time.
> >> 
> >> To support the change switch kvm_make_vcpus_request_mask() to checking
> >> vcpu_id instead of vcpu index,
> >
> > I'm afraid you can't do this: vcpu_id (== apic id) can be sparse, i.e.
> > it's not very well suited for bitmaps and can exceed the max number of
> > vcpus.
> 
> True. The bitmap should be of KVM_MAX_VCPU_ID size, not
> KVM_MAX_VCPUS.
> 
> Unfortunately there's no convenient way to get VCPU idx from VCPU
> id, kvm_vcpu_get_idx() just walks the whole list :-( I see two possible
> options:
> 1) Add vcpu_idx fields to struct kvm_vcpu
> 2) Keep the change expecting masks of KVM_MAX_VCPU_ID in
> kvm_make_vcpus_request_mask(). KVM_MAX_VCPU_ID is currently 1023 so our
> bitmaps will be 16 longs long. Not sure if it's too much.

3) rework get_vcpu_by_vpidx into get_vcpu_idx_by_vpidx followed by
get_cpu, and use the former for your purposes
4) duplicate get_vcpu_by_vpidx logic in get_vcpu_idx_by_vpidx

Roman.

P.S. I'm starting to wonder how safe this get_vcpu_* thing is WRT vcpu
removal, but that's a different story anyway.


Re: [PATCH v2 2/5] KVM: x86: hyperv: introduce vp_index_to_vcpu_idx mapping

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 05:25:56PM +0200, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Fri, Jun 29, 2018 at 03:10:14PM +0200, Vitaly Kuznetsov wrote:
> >> Roman Kagan  writes:
> >> 
> >> > On Fri, Jun 29, 2018 at 01:37:44PM +0200, Vitaly Kuznetsov wrote:
> >> >> The problem we're trying to solve here is: with PV TLB flush and IPI we
> >> >> need to walk through the supplied list of VP_INDEXes and get VCPU
> >> >> ids. Usually they match. But in case they don't [...]
> >> >
> >> > Why wouldn't they *in practice*?  Only if the userspace wanted to be
> >> > funny and assigned VP_INDEXes randomly?  I'm not sure we need to
> >> > optimize for this case.
> >> 
> >> Can someone please remind me why we allow userspace to change it in the
> >> first place?
> >
> > I can ;)
> >
> > We used not to, and reported KVM's vcpu index as the VP_INDEX.  However,
> > later we realized that VP_INDEX needed to be persistent across
> > migrations and otherwise also known to userspace.  Relying on the kernel
> > to always initialize its indices in the same order was unacceptable, and
> > we came up with no better way of synchronizing VP_INDEX between the
> > userspace and the kernel than to let the former set it explicitly.
> >
> > However, this is basically a future-proofing feature; in practice, both
> > QEMU and KVM initialize their indices in the same order.
> 
> 
> Thanks!
> 
> But in the theoretical case when these indices start to differ after
> migration, users will notice a slowdown which will be hard to explain,
> right?

That's exactly why I suggested a warning on VP_INDEX != vcpu index in
kvm_hv_set_msr.

> Does it justify the need for vp_idx_to_vcpu_idx?

I'd personally prefer being pointed at a scenario where this becomes
relevant first.

Roman.


Re: [PATCH v3 3/5] KVM: x86: hyperv: use get_vcpu_by_vpidx() in kvm_hv_flush_tlb()

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 04:14:53PM +0200, Vitaly Kuznetsov wrote:
> VP_INDEX almost always matches VCPU id and get_vcpu_by_vpidx() is fast,
> use it instead of traversing full vCPU list every time.
> 
> To support the change switch kvm_make_vcpus_request_mask() to checking
> vcpu_id instead of vcpu index,

I'm afraid you can't do this: vcpu_id (== apic id) can be sparse, i.e.
it's not very well suited for bitmaps and can exceed the max number of
vcpus.

Roman.


Re: [PATCH v2 2/5] KVM: x86: hyperv: introduce vp_index_to_vcpu_idx mapping

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 03:10:14PM +0200, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Fri, Jun 29, 2018 at 01:37:44PM +0200, Vitaly Kuznetsov wrote:
> >> The problem we're trying to solve here is: with PV TLB flush and IPI we
> >> need to walk through the supplied list of VP_INDEXes and get VCPU
> >> ids. Usually they match. But in case they don't [...]
> >
> > Why wouldn't they *in practice*?  Only if the userspace wanted to be
> > funny and assigned VP_INDEXes randomly?  I'm not sure we need to
> > optimize for this case.
> 
> Can someone please remind me why we allow userspace to change it in the
> first place?

I can ;)

We used not to, and reported KVM's vcpu index as the VP_INDEX.  However,
later we realized that VP_INDEX needed to be persistent across
migrations and otherwise also known to userspace.  Relying on the kernel
to always initialize its indices in the same order was unacceptable, and
we came up with no better way of synchronizing VP_INDEX between the
userspace and the kernel than to let the former set it explicitly.

However, this is basically a future-proofing feature; in practice, both
QEMU and KVM initialize their indices in the same order.

Roman.


Re: [PATCH v2 2/5] KVM: x86: hyperv: introduce vp_index_to_vcpu_idx mapping

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 01:37:44PM +0200, Vitaly Kuznetsov wrote:
> The problem we're trying to solve here is: with PV TLB flush and IPI we
> need to walk through the supplied list of VP_INDEXes and get VCPU
> ids. Usually they match. But in case they don't [...]

Why wouldn't they *in practice*?  Only if the userspace wanted to be
funny and assigned VP_INDEXes randomly?  I'm not sure we need to
optimize for this case.

Note that the userspace can actually do nasty things with these
VP_INDEXes, like, say, have them non-unique.  We need to be resilient to
it, but don't need to optimize for it.

I think I'd rather have a warning in kvm_hv_set_msr if the VP_INDEX
being assigned is not equal to the vcpu index, and start worrying about
optimization only if this warning starts being triggered by real
hypervisor applications.

Anyway I don't see an urgent need to bloat this patchset with optimizing
this translation; it can be done separately, if needed.

Roman.


Re: [PATCH v2 2/5] KVM: x86: hyperv: introduce vp_index_to_vcpu_idx mapping

2018-06-29 Thread Roman Kagan
On Fri, Jun 29, 2018 at 12:26:23PM +0200, Vitaly Kuznetsov wrote:
> Roman Kagan  writes:
> 
> > On Thu, Jun 28, 2018 at 03:53:10PM +0200, Vitaly Kuznetsov wrote:
> >> While it is easy to get VP index from vCPU index the reverse task is hard.
> >> Basically, to solve it we have to walk all vCPUs checking if their VP index
> >> matches. For hypercalls like HvFlushVirtualAddress{List,Space}* and the
> >> upcoming HvSendSyntheticClusterIpi* where a single CPU may be specified in
> >> the whole set this is obviously sub-optimal.
> >> 
> >> As VP index can be set to anything <= U32_MAX by userspace using plain
> >> [0..MAX_VP_INDEX] array is not a viable option. Use condensed sorted
> >> array with logarithmic search complexity instead. Use RCU to make read
> >> access as fast as possible and maintain atomicity of updates.
> >
> > Quoting TLFS 5.0C section 7.8.1:
> >
> >> Virtual processors are identified by using an index (VP index). The
> >> maximum number of virtual processors per partition supported by the
> >> current implementation of the hypervisor can be obtained through CPUID
> >> leaf 0x40000005. A virtual processor index must be less than the
> >> maximum number of virtual processors per partition.
> >
> > so this is a dense index, and VP_INDEX >= KVM_MAX_VCPUS is invalid.  I
> > think we're better off enforcing this in kvm_hv_set_msr and keep the
> > translation simple.  If the algorithm in get_vcpu_by_vpidx is not good
> > enough (and yes it can be made to return NULL early on vpidx >=
> > KVM_MAX_VCPUS instead of taking the slow path) then a simple index array
> > of KVM_MAX_VCPUS entries should certainly do.
> 
> Sure, we can use pre-allocated [0..KVM_MAX_VCPUS] array instead and put
> limits on what userspace can assign VP_INDEX to. However, while thinking
> about it I decided to go with the more complex condensed array approach
> because the tendency is for KVM_MAX_VCPUS to grow and we will be
> pre-allocating more and more memory for no particular reason (so I think
> even 'struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]' in 'struct kvm' will need
> to be converted to something else eventually). 

We're talking of kilobytes here.  I guess this is going to be the least
of the scalability problems.

> Anyway, I'm flexible and if you think we should go this way now I'll do
> this in v3. We can re-think this when we later decide to raise
> KVM_MAX_VCPUS significantly.

Although there's no strict requirement for that I think every sensible
userspace will allocate VP_INDEX linearly resulting in it being equal to
KVM's vcpu index.  So we've yet to see a case where get_vcpu_by_vpidx
doesn't take the fast path.  If it ever starts appearing in the profiles
we may consider optimizing it but ATM I don't even think introducing
the translation array is justified.

Roman.


Re: [PATCH v2 2/5] KVM: x86: hyperv: introduce vp_index_to_vcpu_idx mapping

2018-06-29 Thread Roman Kagan
On Thu, Jun 28, 2018 at 03:53:10PM +0200, Vitaly Kuznetsov wrote:
> While it is easy to get VP index from vCPU index the reverse task is hard.
> Basically, to solve it we have to walk all vCPUs checking if their VP index
> matches. For hypercalls like HvFlushVirtualAddress{List,Space}* and the
> upcoming HvSendSyntheticClusterIpi* where a single CPU may be specified in
> the whole set this is obviously sub-optimal.
> 
> As VP index can be set to anything <= U32_MAX by userspace using plain
> [0..MAX_VP_INDEX] array is not a viable option. Use condensed sorted
> array with logarithmic search complexity instead. Use RCU to make read
> access as fast as possible and maintain atomicity of updates.

Quoting TLFS 5.0C section 7.8.1:

> Virtual processors are identified by using an index (VP index). The
> maximum number of virtual processors per partition supported by the
> current implementation of the hypervisor can be obtained through CPUID
> leaf 0x40000005. A virtual processor index must be less than the
> maximum number of virtual processors per partition.

so this is a dense index, and VP_INDEX >= KVM_MAX_VCPUS is invalid.  I
think we're better off enforcing this in kvm_hv_set_msr and keep the
translation simple.  If the algorithm in get_vcpu_by_vpidx is not good
enough (and yes it can be made to return NULL early on vpidx >=
KVM_MAX_VCPUS instead of taking the slow path) then a simple index array
of KVM_MAX_VCPUS entries should certainly do.

Roman.


Re: [PATCH] idr: fix invalid ptr dereference on item delete

2018-05-19 Thread Roman Kagan
On Fri, May 18, 2018 at 03:31:38PM -0700, Andrew Morton wrote:
> On Fri, 18 May 2018 10:50:25 -0700 Matthew Wilcox <wi...@infradead.org> wrote:
> 
> > If the radix tree underlying the IDR happens to be full and we attempt
> > to remove an id which is larger than any id in the IDR, we will call
> > __radix_tree_delete() with an uninitialised 'slot' pointer, at which
> > point anything could happen.  This was easiest to hit with a single entry
> > at id 0 and attempting to remove a non-0 id, but it could have happened
> > with 64 entries and attempting to remove an id >= 64.
> > 
> > Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
> > Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
> > Debugged-by: Roman Kagan <rka...@virtuozzo.com>
> > Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>
> 
> Neither of the changelogs I'm seeing attempt to describe the end-user
> impact of the bug.  People like to know that so they can decide which
> kernel version(s) need patching, so please always remember it.

That's my fault, Matthew may not have seen the original discussion among
the KVM folks.

> Looking at the syzbot report, the impact is at least "privileged user
> can trigger a WARN", but I assume there could be worse,

Unfortunately it is worse: the syzcaller test boils down to opening
/dev/kvm, creating an eventfd, and calling a couple of KVM ioctls.  None
of this requires superuser.  And the result is dereferencing an
uninitialized pointer which is likely a crash.

> as-yet-undiscovered impacts.  So I'm thinking a cc:stable is needed,
> yes?

Well the specific path caught by syzbot is via KVM_HYPERV_EVENTD ioctl
which is new in 4.17.  But I guess there are other user-triggerable
paths, so cc:stable is probably justified.

Thanks,
Roman.


Re: [PATCH] idr: fix invalid ptr dereference on item delete

2018-05-18 Thread Roman Kagan
On Thu, May 10, 2018 at 10:16:34PM +0300, Roman Kagan wrote:
> If an IDR contains a single entry at index==0, the underlying radix tree
> has a single item in its root node, in which case
> __radix_tree_lookup(index!=0) doesn't set its *@nodep argument (in
> addition to returning NULL).
> 
> However, the tree itself is not empty, i.e. the tree root doesn't have
> IDR_FREE tag.
> 
> As a result, on an attempt to remove an index!=0 entry from such an IDR,
> radix_tree_delete_item doesn't return early and calls
> __radix_tree_delete with invalid parameters which are then dereferenced.
> 
> Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
> Signed-off-by: Roman Kagan <rka...@virtuozzo.com>
> ---
>  lib/radix-tree.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..10ff1bfae952 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -2040,8 +2040,9 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
>  	void *entry;
>  
>  	entry = __radix_tree_lookup(root, index, &node, &slot);
> -	if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
> -					get_slot_offset(node, slot))))
> +	if (!entry && (!is_idr(root) || !node ||
> +		       node_tag_get(root, node, IDR_FREE,
> +				    get_slot_offset(node, slot))))
>  		return NULL;
>  
>  	if (item && entry != item)

Turned out Matthew didn't receive my messages; now that he's found this
patch elsewhere he's responded with a correct fix:

- Forwarded message from Matthew Wilcox <wi...@infradead.org> -

Date: Fri, 18 May 2018 10:50:25 -0700
From: Matthew Wilcox <wi...@infradead.org>
To: Roman Kagan <rka...@virtuozzo.com>
Cc: Andrew Morton <a...@linux-foundation.org>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] idr: fix invalid ptr dereference on item delete

It'd be nice if you cc'd the person who wrote the code you're patching.
You'd get a response a lot quicker than waiting until I happened to
notice the email in a different forum.

Thanks for finding the situation that leads to the bug.  Your fix is
incorrect; it's legitimate to store a NULL value at offset 0, and
your patch makes it impossible to delete.  Fortunately, the test-suite
covers that case ;-)

Andrew, can you take this through your tree for extra testing?

--- >8 ---

From: Matthew Wilcox <mawil...@microsoft.com>

If the radix tree underlying the IDR happens to be full and we attempt
to remove an id which is larger than any id in the IDR, we will call
__radix_tree_delete() with an uninitialised 'slot' pointer, at which
point anything could happen.  This was easiest to hit with a single entry
at id 0 and attempting to remove a non-0 id, but it could have happened
with 64 entries and attempting to remove an id >= 64.

Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
Debugged-by: Roman Kagan <rka...@virtuozzo.com>
Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index da9e10c827df..4dd4fbc7279c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2036,10 +2036,12 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
-	void __rcu **slot;
+	void __rcu **slot = NULL;
 	void *entry;
 
 	entry = __radix_tree_lookup(root, index, &node, &slot);
+	if (!slot)
+		return NULL;
 	if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
 						get_slot_offset(node, slot))))
 		return NULL;
diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
index 1c18617951dd..410ca58bbe9c 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -254,6 +254,13 @@ void idr_checks(void)
 	idr_remove(&idr, 0xfedcba98U);
 	idr_remove(&idr, 0);
 
+	assert(idr_alloc(&idr, DUMMY_PTR, 0, 0, GFP_KERNEL) == 0);
+	idr_remove(&idr, 1);
+	for (i = 1; i < RADIX_TREE_MAP_SIZE; i++)
+		assert(idr_alloc(&idr, DUMMY_PTR, 0, 0, GFP_KERNEL) == i);
+	idr_remove(&idr, 1 << 30);
+	idr_destroy(&idr);
+
 	for (i = INT_MAX - 3UL; i < INT_MAX + 1UL; i++) {
 		struct item *item = item_create(i, 0);
 		assert(idr_alloc(&idr, item, i, i + 10, GFP_KERNEL) == i);

--- original email ---

If an IDR contains a single entry at index==0, the underlying radix tree
has a single item in its root node, in which case
__radix_tree_looku

Re: [PATCH] idr: fix invalid ptr dereference on item delete

2018-05-18 Thread Roman Kagan
On Thu, May 10, 2018 at 10:16:34PM +0300, Roman Kagan wrote:
> If an IDR contains a single entry at index==0, the underlying radix tree
> has a single item in its root node, in which case
> __radix_tree_lookup(index!=0) doesn't set its *@nodep argument (in
> addition to returning NULL).
> 
> However, the tree itself is not empty, i.e. the tree root doesn't have
> IDR_FREE tag.
> 
> As a result, on an attempt to remove an index!=0 entry from such an IDR,
> radix_tree_delete_item doesn't return early and calls
> __radix_tree_delete with invalid parameters which are then dereferenced.
> 
> Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
> Signed-off-by: Roman Kagan 
> ---
>  lib/radix-tree.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..10ff1bfae952 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -2040,8 +2040,9 @@ void *radix_tree_delete_item(struct radix_tree_root 
> *root,
>   void *entry;
>  
>   entry = __radix_tree_lookup(root, index, , );
> - if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
> - get_slot_offset(node, slot
> + if (!entry && (!is_idr(root) || !node ||
> +node_tag_get(root, node, IDR_FREE,
> + get_slot_offset(node, slot
>   return NULL;
>  
>   if (item && entry != item)

Turned out Matthew didn't receive my messages; now that he's found this
patch elsewhere he's responded with a correct fix:

- Forwarded message from Matthew Wilcox  -

Date: Fri, 18 May 2018 10:50:25 -0700
From: Matthew Wilcox 
To: Roman Kagan 
Cc: Andrew Morton , linux-kernel@vger.kernel.org
Subject: Re: [PATCH] idr: fix invalid ptr dereference on item delete

It'd be nice if you cc'd the person who wrote the code you're patching.
You'd get a response a lot quicker than waiting until I happened to
notice the email in a different forum.

Thanks for finding the situation that leads to the bug.  Your fix is
incorrect; it's legitimate to store a NULL value at offset 0, and
your patch makes it impossible to delete.  Fortunately, the test-suite
covers that case ;-)

Andrew, can you take this through your tree for extra testing?

--- >8 ---

From: Matthew Wilcox 

If the radix tree underlying the IDR happens to be full and we attempt
to remove an id which is larger than any id in the IDR, we will call
__radix_tree_delete() with an uninitialised 'slot' pointer, at which
point anything could happen.  This was easiest to hit with a single entry
at id 0 and attempting to remove a non-0 id, but it could have happened
with 64 entries and attempting to remove an id >= 64.

Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
Debugged-by: Roman Kagan 
Signed-off-by: Matthew Wilcox 

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index da9e10c827df..4dd4fbc7279c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2036,10 +2036,12 @@ void *radix_tree_delete_item(struct radix_tree_root 
*root,
 unsigned long index, void *item)
 {
struct radix_tree_node *node = NULL;
-   void __rcu **slot;
+   void __rcu **slot = NULL;
void *entry;
 
entry = __radix_tree_lookup(root, index, , );
+   if (!slot)
+   return NULL;
if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
get_slot_offset(node, slot
return NULL;
diff --git a/tools/testing/radix-tree/idr-test.c 
b/tools/testing/radix-tree/idr-test.c
index 1c18617951dd..410ca58bbe9c 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -254,6 +254,13 @@ void idr_checks(void)
	idr_remove(&idr, 0xfedcba98U);
	idr_remove(&idr, 0);
 
+	assert(idr_alloc(&idr, DUMMY_PTR, 0, 0, GFP_KERNEL) == 0);
+	idr_remove(&idr, 1);
+	for (i = 1; i < RADIX_TREE_MAP_SIZE; i++)
+		assert(idr_alloc(&idr, DUMMY_PTR, 0, 0, GFP_KERNEL) == i);
+	idr_remove(&idr, 1 << 30);
+	idr_destroy(&idr);
+
for (i = INT_MAX - 3UL; i < INT_MAX + 1UL; i++) {
struct item *item = item_create(i, 0);
		assert(idr_alloc(&idr, item, i, i + 10, GFP_KERNEL) == i);
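The crash path and the two-line fix can be modelled in self-contained userspace C. `toy_lookup` and `toy_delete` are hypothetical stand-ins that only mimic the out-parameter contract of `__radix_tree_lookup()`, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_SIZE 64		/* stands in for RADIX_TREE_MAP_SIZE */

/* Same contract as the kernel lookup: when the index lies beyond the
 * tree, return NULL WITHOUT writing *slotp. */
static void *toy_lookup(void **slots, unsigned long index, void ***slotp)
{
	if (index >= TOY_MAP_SIZE)
		return NULL;		/* *slotp left as the caller set it */
	*slotp = &slots[index];
	return slots[index];
}

/* Delete in the shape of the patched radix_tree_delete_item(): without
 * the "= NULL" initialiser and the !slot check, an out-of-range index
 * would hand uninitialised stack garbage to the store below. */
static void *toy_delete(void **slots, unsigned long index)
{
	void **slot = NULL;		/* fix, part 1: known-bad value */
	void *entry = toy_lookup(slots, index, &slot);

	if (!slot)			/* fix, part 2: bail out on a miss */
		return NULL;
	*slot = NULL;			/* safe: slot was set by the lookup */
	return entry;
}
```

With only `void **slot;` and no early return, the `*slot = NULL` store would write through whatever happened to be on the stack, which is the invalid dereference syzbot reported.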

--- original email ---

If an IDR contains a single entry at index==0, the underlying radix tree
has a single item in its root node, in which case
__radix_tree_lookup(index!=0) doesn't set its *@nodep argument (in
addition to returning NULL).

However, the tree itself is not empty, i.e. the tree root doesn't have
IDR_FREE tag.

As a result, on an attempt to remove an index!=0 entry from such an IDR,
radix_tree_delete_item doesn't return early and calls
__radix_tree_delete with invalid parameters which are then dereferenced.

Re: [PATCH] idr: fix invalid ptr dereference on item delete

2018-05-18 Thread Roman Kagan
On Fri, May 18, 2018 at 10:50:25AM -0700, Matthew Wilcox wrote:
> It'd be nice if you cc'd the person who wrote the code you're patching.
> You'd get a response a lot quicker than waiting until I happened to
> notice the email in a different forum.

I sent it to someone called "Matthew Wilcox ".
Also I cc'd that guy when I only started to point the finger at IDR as
the suspected culprit in that syzkaller report.  I thought it was him
who wrote the code...

> Thanks for finding the situation that leads to the bug.  Your fix is
> incorrect; it's legitimate to store a NULL value at offset 0, and
> your patch makes it impossible to delete.  Fortunately, the test-suite
> covers that case ;-)

How do you build it?  I wish I had it when debugging but I got linking
errors due to missing spin_lock_init, so I decided it wasn't up-to-date.

Thanks,
Roman.

P.S. I'll forward your message to all the recipients of my patch, to let
them know it's wrong and you have a better one.


Re: [PATCH] idr: fix invalid ptr dereference on item delete

2018-05-10 Thread Roman Kagan
On Fri, May 11, 2018 at 07:40:26AM +0200, Dmitry Vyukov wrote:
> On Fri, May 11, 2018 at 1:54 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> > On 10/05/2018 21:16, Roman Kagan wrote:
> >> If an IDR contains a single entry at index==0, the underlying radix tree
> >> has a single item in its root node, in which case
> >> __radix_tree_lookup(index!=0) doesn't set its *@nodep argument (in
> >> addition to returning NULL).
> >>
> >> However, the tree itself is not empty, i.e. the tree root doesn't have
> >> IDR_FREE tag.
> >>
> >> As a result, on an attempt to remove an index!=0 entry from such an IDR,
> >> radix_tree_delete_item doesn't return early and calls
> >> __radix_tree_delete with invalid parameters which are then dereferenced.
> >>
> >> Reported-by: syzbot+35666cba7f0a337e2...@syzkaller.appspotmail.com
> >> Signed-off-by: Roman Kagan <rka...@virtuozzo.com>
> >> ---
> >>  lib/radix-tree.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> >> index da9e10c827df..10ff1bfae952 100644
> >> --- a/lib/radix-tree.c
> >> +++ b/lib/radix-tree.c
> >> @@ -2040,8 +2040,9 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
> >>   void *entry;
> >>
> >>   entry = __radix_tree_lookup(root, index, &node, &slot);
> >> -	if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
> >> -			get_slot_offset(node, slot))))
> >> +	if (!entry && (!is_idr(root) || !node ||
> >> +		       node_tag_get(root, node, IDR_FREE,
> >> +				    get_slot_offset(node, slot))))
> >>   return NULL;
> >>
> >>   if (item && entry != item)
> >>
> >
> > I cannot really vouch for the patch, but if it is correct it's
> > definitely stuff for stable.  The KVM testcase is only for 4.17-rc but
> > this is a really nasty bug in a core data structure.
> >
> > Cc: sta...@vger.kernel.org
> >
> > Should radix-tree be compilable in userspace, so that we can add unit
> > tests for it?...
> 
> Good point.
> 
> For my education, what/where are the tests that run as user-space code?

Actually there are userspace tests for it under tools/testing/radix-tree,
but I didn't manage to get them to build.  Looks like the recent
introduction of a spin_lock in the radix_tree structure (for XArray
work?) broke them.

Roman.

