[RFC PATCH] mm: fork: Prevent a NULL deref by getting mm only if the refcount isn't 0

2021-03-10 Thread Filippo Sironi
We've seen a number of crashes with the following signature:

BUG: kernel NULL pointer dereference, address: 
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
...
Oops:  [#1] SMP PTI
...
RIP: 0010:__rb_erase_color+0xc2/0x260
...
Call Trace:
 unlink_file_vma+0x36/0x50
 free_pgtables+0x62/0x110
 exit_mmap+0xd5/0x160
 ? put_dec+0x3a/0x90
 ? num_to_str+0xa8/0xc0
 mmput+0x11/0xb0
 do_task_stat+0x940/0xc80
 proc_single_show+0x49/0x80
 ? __check_object_size+0xcc/0x1a0
 seq_read+0xd3/0x400
 vfs_read+0x72/0xb0
 ksys_read+0x9c/0xd0
 do_syscall_64+0x69/0x400
 ? schedule+0x2a/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
...

This happens when one process reads another task's stats from procfs
(do_task_stat in the trace above) while that task is exiting.  It looks
like a race where the exiting process drops the last reference on the mm
(with mmput) while the reader takes a new one (with mmget).  By taking
the reference only if the count isn't already 0 (with mmget_not_zero),
we prevent this from happening.

Signed-off-by: Filippo Sironi 
---
 kernel/fork.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8e88e5..a7541a85e5a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1209,10 +1209,8 @@ struct mm_struct *get_task_mm(struct task_struct *task)
task_lock(task);
mm = task->mm;
if (mm) {
-   if (task->flags & PF_KTHREAD)
+   if (task->flags & PF_KTHREAD || !mmget_not_zero(mm))
mm = NULL;
-   else
-   mmget(mm);
}
task_unlock(task);
return mm;
-- 
2.17.1
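
The difference between an unconditional get and a "get unless zero" can be sketched in userspace with C11 atomics. This is only an illustration of the idea behind mmget_not_zero(), not the kernel implementation; struct sketch_mm and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the user count kept in struct mm_struct. */
struct sketch_mm {
	atomic_int users;
};

/*
 * Take a reference only if the count is still non-zero.  Returns false
 * when the last reference is already gone, in which case the caller
 * must not touch the object.
 */
bool sketch_get_not_zero(struct sketch_mm *mm)
{
	int old;

	assert(mm != NULL);
	old = atomic_load(&mm->users);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool sketch_put(struct sketch_mm *mm)
{
	assert(mm != NULL);
	return atomic_fetch_sub(&mm->users, 1) == 1;
}
```

An unconditional increment would turn a count of 0 back into 1 and hand out an object that is already being torn down; the compare-and-exchange loop refuses to do that.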




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





Re: [PATCH v2] nvme: Add 48-bit DMA address quirk for Amazon NVMe controllers

2021-02-10 Thread Filippo Sironi

On 2/10/21 8:37 AM, Christoph Hellwig wrote:
> On Wed, Feb 10, 2021 at 01:39:42AM +0100, Filippo Sironi wrote:
>> Amazon NVMe controllers do not support 64-bit DMA addresses; they are
>> limited to 48-bit DMA addresses.  Let's add a quirk to ensure that we
>> make use of 48-bit DMA addresses to avoid misbehavior.
>
> This should probably say some, and mention that they do not follow
> the spec.  But I can fix this up when applying the patch.



Thanks!

Filippo







[PATCH v2] nvme: Add 48-bit DMA address quirk for Amazon NVMe controllers

2021-02-09 Thread Filippo Sironi
Amazon NVMe controllers do not support 64-bit DMA addresses; they are
limited to 48-bit DMA addresses.  Let's add a quirk to ensure that we
make use of 48-bit DMA addresses to avoid misbehavior.

This affects all Amazon NVMe controllers that expose EBS volumes
(0x0061, 0x0065, 0x8061) and local instance storage (0xcd00, 0xcd01,
0xcd02).

Signed-off-by: Filippo Sironi 
---
 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 17 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 88a6b97247f5..dae747b4ac35 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -144,6 +144,11 @@ enum nvme_quirks {
 * NVMe 1.3 compliance.
 */
NVME_QUIRK_NO_NS_DESC_LIST  = (1 << 15),
+
+   /*
+* The controller supports up to 48-bit DMA address.
+*/
+   NVME_QUIRK_DMA_ADDRESS_BITS_48  = (1 << 16),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6bad4d4dcdf0..e7001f5ed6e4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2362,13 +2362,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
+   int dma_address_bits = 64;
 
if (pci_enable_device_mem(pdev))
return result;
 
pci_set_master(pdev);
 
-   if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)))
+   if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+   dma_address_bits = 48;
+   if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
goto disable;
 
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3263,6 +3266,18 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
{ PCI_DEVICE(0x2646, 0x2263),   /* KINGSTON A2000 NVMe SSD  */
.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0061),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0065),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x8061),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd00),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd01),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+   { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd02),
+   .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
.driver_data = NVME_QUIRK_SINGLE_VECTOR },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.17.1
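
To make the effect of the quirk concrete, here is a small userspace sketch of how the mask is chosen and which addresses it allows. The mask helper follows the shape of the kernel's DMA_BIT_MASK() macro; all sketch_* names and the SKETCH_QUIRK_DMA_ADDRESS_BITS_48 constant are stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the bit position used for the new quirk in the patch above. */
#define SKETCH_QUIRK_DMA_ADDRESS_BITS_48 (1UL << 16)

/*
 * Build a mask with the low n bits set.  n == 64 is special-cased
 * because shifting a 64-bit value by 64 is undefined in C.
 */
uint64_t sketch_dma_bit_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	return n == 64 ? ~0ULL : (1ULL << n) - 1;
}

/* Pick 48 bits for quirked controllers, 64 bits otherwise. */
uint64_t sketch_dma_mask_for(unsigned long quirks)
{
	unsigned int bits = 64;

	if (quirks & SKETCH_QUIRK_DMA_ADDRESS_BITS_48)
		bits = 48;
	return sketch_dma_bit_mask(bits);
}

/* An address is usable by the device if no bit above the mask is set. */
bool sketch_addr_reachable(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

With the quirk set, the first address the device cannot reach is 1ULL << 48; without it, every 64-bit address is considered reachable.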









Re: [PATCH] nvme: Add 48-bit DMA address quirk

2021-02-03 Thread Filippo Sironi


On 2/3/21 12:15 PM, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 12:12:31PM +0100, Filippo Sironi wrote:
>> I don't disagree on the first part of your sentence, this is a big
>> oversight.
>
> But it is not what your commit log suggests.

I can definitely rephrase the commit.

>> On the other hand, those controllers are out there and are in use by a lot
>> of customers.  We can keep relying on luck, hoping that customers don't run
>> into trouble, or we can merge a few lines of code :)
>
> Your patch does not just quirk a few controllers out there, but all
> current and future controllers with an Amazon vendor ID.  We could
> probably talk about quirking an existing vendor ID or two as long as
> this doesn't happen for future hardware.

I know that the hardware team is working on this but I don't know the
timelines, and there are a few upcoming controllers - of which I don't
know the device IDs yet - that have the same issue.

To avoid issues, it is easier to apply the quirk to all Amazon NVMe
controllers for now, until the new line of controllers with the fix comes
out.  At that point, we'll be able to restrict the application to the
known bad controllers.








Re: [PATCH] nvme: Add 48-bit DMA address quirk

2021-02-03 Thread Filippo Sironi


On 2/3/21 10:51 AM, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 10:43:38AM +0100, Filippo Sironi wrote:
>> Certain NVMe controllers don't support 64-bit DMA addresses.  Instead,
>> they are limited to 48-bit DMA addresses.  Let's add a quirk to use them
>> properly.
>
> WTF?  This is such a grave NVMe spec compliance bug that I do not think
> we should support this buggy mess in Linux.

I don't disagree on the first part of your sentence, this is a big
oversight.

On the other hand, those controllers are out there and are in use by a
lot of customers.  We can keep relying on luck, hoping that customers
don't run into trouble, or we can merge a few lines of code :)








[PATCH] nvme: Add 48-bit DMA address quirk

2021-02-03 Thread Filippo Sironi
Certain NVMe controllers don't support 64-bit DMA addresses.  Instead,
they are limited to 48-bit DMA addresses.  Let's add a quirk to use them
properly.

Signed-off-by: Filippo Sironi 
---
 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 12 +++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 88a6b97247f5..dae747b4ac35 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -144,6 +144,11 @@ enum nvme_quirks {
 * NVMe 1.3 compliance.
 */
NVME_QUIRK_NO_NS_DESC_LIST  = (1 << 15),
+
+   /*
+* The controller supports up to 48-bit DMA address.
+*/
+   NVME_QUIRK_DMA_ADDRESS_BITS_48  = (1 << 16),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 81e6389b2042..5716ae16c7a7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2362,13 +2362,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
+   int dma_address_bits = 64;
 
if (pci_enable_device_mem(pdev))
return result;
 
pci_set_master(pdev);
 
-   if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)))
+   if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+   dma_address_bits = 48;
+   if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
goto disable;
 
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3259,6 +3262,13 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
{ PCI_DEVICE(0x1d97, 0x2263),   /* SPCC */
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+   { .vendor = PCI_VENDOR_ID_AMAZON,
+ .device = PCI_ANY_ID,
+ .subvendor = PCI_ANY_ID,
+ .subdevice = PCI_ANY_ID,
+ .class = PCI_CLASS_STORAGE_EXPRESS,
+ .class_mask = 0xff,
+ .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48 },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
.driver_data = NVME_QUIRK_SINGLE_VECTOR },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.17.1
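
The review objection to this version comes down to how ID-table matching treats a wildcard device ID. The sketch below models that with a simplified two-field entry; the real struct pci_device_id also carries subvendor, subdevice and class fields, which are left out here. The sketch_* names are hypothetical, 0x1d0f is assumed to be the value of PCI_VENDOR_ID_AMAZON, and 0x1234 is a made-up device ID standing in for future hardware. The explicit device IDs are the ones listed in the v2 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ANY_ID        0xffffffffu /* wildcard, like PCI_ANY_ID */
#define SKETCH_VENDOR_AMAZON 0x1d0fu     /* assumed PCI_VENDOR_ID_AMAZON */

/* Simplified ID-table entry: vendor and device only. */
struct sketch_id {
	uint32_t vendor;
	uint32_t device;
};

/* An entry matches if each field is a wildcard or equal to the device's. */
bool sketch_id_matches(const struct sketch_id *id, uint32_t vendor,
		       uint32_t device)
{
	assert(id != NULL);
	return (id->vendor == SKETCH_ANY_ID || id->vendor == vendor) &&
	       (id->device == SKETCH_ANY_ID || id->device == device);
}

bool sketch_table_matches(const struct sketch_id *table, size_t n,
			  uint32_t vendor, uint32_t device)
{
	for (size_t i = 0; i < n; i++)
		if (sketch_id_matches(&table[i], vendor, device))
			return true;
	return false;
}

/* v1 style: one wildcard entry covers every device of the vendor. */
const struct sketch_id sketch_wildcard_table[] = {
	{ SKETCH_VENDOR_AMAZON, SKETCH_ANY_ID },
};

/* v2 style: only the known affected device IDs are listed. */
const struct sketch_id sketch_explicit_table[] = {
	{ SKETCH_VENDOR_AMAZON, 0x0061 }, { SKETCH_VENDOR_AMAZON, 0x0065 },
	{ SKETCH_VENDOR_AMAZON, 0x8061 }, { SKETCH_VENDOR_AMAZON, 0xcd00 },
	{ SKETCH_VENDOR_AMAZON, 0xcd01 }, { SKETCH_VENDOR_AMAZON, 0xcd02 },
};
```

The wildcard table also matches the made-up 0x1234 device, so a future controller that has the fix would still be limited to 48 bits; the explicit table leaves unknown devices unquirked.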









Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2020-11-27 Thread Filippo Sironi
CPU is able to
+* deliver the interrupt.
+*/
+   if (kvm_cpu_has_extint(vcpu))
+   return false;
+
+   /* Acknowledging ExtINT does not happen if LINT0 is masked.  */
+   return !(lapic_in_kernel(vcpu) && !kvm_apic_accept_pic_intr(vcpu));
  }

-/*
- * if userspace requested an interrupt window, check that the
- * interrupt window is open.
- *
- * No need to exit to userspace if we already have an interrupt queued.
- */
  static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
  {
 return kvm_arch_interrupt_allowed(vcpu) &&
-   !kvm_cpu_has_interrupt(vcpu) &&
-   !kvm_event_needs_reinjection(vcpu) &&
 kvm_cpu_accept_dm_intr(vcpu);
  }

--
2.28.0



Reviewed-by: Filippo Sironi 







Re: [PATCH 1/2] KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint

2020-11-27 Thread Filippo Sironi
in_kernel(v))
-   return v->arch.interrupt.nr;
-
-   vector = kvm_cpu_get_extint(v);
-
+   int vector = kvm_cpu_get_extint(v);
 if (vector != -1)
 return vector;  /* PIC */

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 105e7859d1f2..bb5ff761d5e2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2465,7 +2465,7 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 struct kvm_lapic *apic = vcpu->arch.apic;
 u32 ppr;

-   if (!kvm_apic_hw_enabled(apic))
+   if (!kvm_apic_present(vcpu))
     return -1;

 __apic_update_ppr(apic, &ppr);
--
2.28.0




Reviewed-by: Filippo Sironi 







[PATCH v2 2/2] KVM: x86: Implement the arch-specific hook to report the VM UUID

2019-05-14 Thread Filippo Sironi
On x86, we report the UUID in DMI System Information (i.e., DMI Type 1)
as VM UUID.

Signed-off-by: Filippo Sironi 
---
 arch/x86/kernel/kvm.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5c93a65ee1e5..441cab08a09d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -694,6 +695,12 @@ bool kvm_para_available(void)
 }
 EXPORT_SYMBOL_GPL(kvm_para_available);
 
+const char *kvm_para_get_uuid(void)
+{
+   return dmi_get_system_info(DMI_PRODUCT_UUID);
+}
+EXPORT_SYMBOL_GPL(kvm_para_get_uuid);
+
 unsigned int kvm_arch_para_features(void)
 {
return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
-- 
2.7.4



[PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-14 Thread Filippo Sironi
Start populating /sys/hypervisor with KVM entries when we're running on
KVM. This is to replicate functionality that's available when we're
running on Xen.

Start with /sys/hypervisor/uuid, which users prefer over
/sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
machine, since it's also available when running on Xen HVM and on Xen PV
and, on top of that, doesn't require root privileges by default.
Let's create arch-specific hooks so that different architectures can
provide different implementations.

Signed-off-by: Filippo Sironi 
---
v2:
* move the retrieval of the VM UUID out of uuid_show and into
  kvm_para_get_uuid, which is a weak function that can be overridden

 drivers/Kconfig  |  2 ++
 drivers/Makefile |  2 ++
 drivers/kvm/Kconfig  | 14 ++
 drivers/kvm/Makefile |  1 +
 drivers/kvm/sys-hypervisor.c | 30 ++
 5 files changed, 49 insertions(+)
 create mode 100644 drivers/kvm/Kconfig
 create mode 100644 drivers/kvm/Makefile
 create mode 100644 drivers/kvm/sys-hypervisor.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 45f9decb9848..90eb835fe951 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -146,6 +146,8 @@ source "drivers/hv/Kconfig"
 
 source "drivers/xen/Kconfig"
 
+source "drivers/kvm/Kconfig"
+
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index c61cde554340..79cc92a3f6bf 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -44,6 +44,8 @@ obj-y += soc/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_XEN)  += xen/
 
+obj-$(CONFIG_KVM_GUEST)+= kvm/
+
 # regulators early, since some subsystems rely on them to initialize
 obj-$(CONFIG_REGULATOR)+= regulator/
 
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
new file mode 100644
index ..3fc041df7c11
--- /dev/null
+++ b/drivers/kvm/Kconfig
@@ -0,0 +1,14 @@
+menu "KVM driver support"
+depends on KVM_GUEST
+
+config KVM_SYS_HYPERVISOR
+bool "Create KVM entries under /sys/hypervisor"
+depends on SYSFS
+select SYS_HYPERVISOR
+default y
+help
+  Create KVM entries under /sys/hypervisor (e.g., uuid). When running
+  native or on another hypervisor, /sys/hypervisor may still be
+  present, but it will have no KVM entries.
+
+endmenu
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
new file mode 100644
index ..73a43fc994b9
--- /dev/null
+++ b/drivers/kvm/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o
diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c
new file mode 100644
index ..43b1d1a09807
--- /dev/null
+++ b/drivers/kvm/sys-hypervisor.c
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include 
+
+#include 
+#include 
+
+__weak const char *kvm_para_get_uuid(void)
+{
+   return NULL;
+}
+
+static ssize_t uuid_show(struct kobject *obj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   const char *uuid = kvm_para_get_uuid();
+   return sprintf(buf, "%s\n", uuid);
+}
+
+static struct kobj_attribute uuid = __ATTR_RO(uuid);
+
+static int __init uuid_init(void)
+{
+   if (!kvm_para_available())
+   return 0;
+   return sysfs_create_file(hypervisor_kobj, &uuid.attr);
+}
+
+device_initcall(uuid_init);
-- 
2.7.4
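
The arch-specific hook relies on weak symbols: the generic code ships a default that returns NULL, and an architecture can link in a strong definition that takes precedence. A minimal userspace sketch of the pattern, assuming GCC or Clang (the kernel's __weak expands to the same attribute); the sketch_* names are hypothetical, and the NULL check in the show helper is an addition of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Generic default.  If another object file defines a non-weak
 * sketch_get_uuid(), the linker picks that one instead.
 */
__attribute__((weak)) const char *sketch_get_uuid(void)
{
	return NULL;
}

/*
 * Format the UUID into buf, followed by a newline.  Returns the number
 * of characters that the formatted string needs, or 0 if no UUID is
 * available.
 */
int sketch_uuid_show(char *buf, size_t len)
{
	const char *uuid = sketch_get_uuid();

	assert(buf != NULL && len > 0);
	if (!uuid) {
		buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len, "%s\n", uuid);
}
```

Built on its own, only the weak default exists, so the helper reports an empty value; adding a file with a regular definition of sketch_get_uuid() changes the result without touching this code.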



KVM: Start populating /sys/hypervisor with KVM entries

2019-05-14 Thread Filippo Sironi
Long-time Xen HVM and Xen PV users are missing /sys/hypervisor entries when
moving to KVM.  One report is about getting the VM UUID.  The VM UUID can
already be retrieved using /sys/devices/virtual/dmi/id/product_uuid.  This has
two downsides: (1) it requires root privileges and (2) it is only available on
KVM and Xen HVM.

By exposing /sys/hypervisor/uuid when running on KVM as well, we provide an
interface that's functional for KVM, Xen HVM, and Xen PV.  Let's do so by
providing arch-specific hooks so that different architectures can implement the
hooks in different ways.

Further work can be done by consolidating the creation of the basic
/sys/hypervisor across hypervisors.

Filippo Sironi (2):
  KVM: Start populating /sys/hypervisor with KVM entries
  KVM: x86: Implement the arch-specific hook to report the VM UUID



[PATCH] KVM: Start populating /sys/hypervisor with KVM entries

2018-10-09 Thread Filippo Sironi
Start populating /sys/hypervisor with KVM entries when we're running on
KVM. This is to replicate functionality that's available when we're
running on Xen.

Let's start with /sys/hypervisor/uuid, which users prefer over
/sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
machine, since it's also available when running on Xen HVM and on Xen PV
and, on top of that, doesn't require root privileges by default.

Signed-off-by: Filippo Sironi 
---
 drivers/Kconfig  |  2 ++
 drivers/Makefile |  2 ++
 drivers/kvm/Kconfig  | 14 ++
 drivers/kvm/Makefile |  1 +
 drivers/kvm/sys-hypervisor.c | 26 ++
 5 files changed, 45 insertions(+)
 create mode 100644 drivers/kvm/Kconfig
 create mode 100644 drivers/kvm/Makefile
 create mode 100644 drivers/kvm/sys-hypervisor.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index afc942c54814..597519c5f7c8 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -135,6 +135,8 @@ source "drivers/hv/Kconfig"
 
 source "drivers/xen/Kconfig"
 
+source "drivers/kvm/Kconfig"
+
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 1056f9699192..727205e287fc 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -47,6 +47,8 @@ obj-y += soc/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_XEN)  += xen/
 
+obj-$(CONFIG_KVM_GUEST)+= kvm/
+
 # regulators early, since some subsystems rely on them to initialize
 obj-$(CONFIG_REGULATOR)+= regulator/
 
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
new file mode 100644
index ..3fc041df7c11
--- /dev/null
+++ b/drivers/kvm/Kconfig
@@ -0,0 +1,14 @@
+menu "KVM driver support"
+depends on KVM_GUEST
+
+config KVM_SYS_HYPERVISOR
+bool "Create KVM entries under /sys/hypervisor"
+depends on SYSFS
+select SYS_HYPERVISOR
+default y
+help
+  Create KVM entries under /sys/hypervisor (e.g., uuid). When running
+  native or on another hypervisor, /sys/hypervisor may still be
+  present, but it will have no KVM entries.
+
+endmenu
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
new file mode 100644
index ..73a43fc994b9
--- /dev/null
+++ b/drivers/kvm/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o
diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c
new file mode 100644
index ..ef04ca65cf1a
--- /dev/null
+++ b/drivers/kvm/sys-hypervisor.c
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static ssize_t uuid_show(struct kobject *obj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   const char *uuid = dmi_get_system_info(DMI_PRODUCT_UUID);
+   return sprintf(buf, "%s\n", uuid);
+}
+
+static struct kobj_attribute uuid = __ATTR_RO(uuid);
+
+static int __init uuid_init(void)
+{
+   if (!kvm_para_available())
+   return 0;
+   return sysfs_create_file(hypervisor_kobj, &uuid.attr);
+}
+
+device_initcall(uuid_init);
-- 
2.7.4



[tip:x86/urgent] x86/microcode: Update the new microcode revision unconditionally

2018-09-02 Thread tip-bot for Filippo Sironi
Commit-ID:  8da38ebaad23fe1b0c4a205438676f6356607cfc
Gitweb: https://git.kernel.org/tip/8da38ebaad23fe1b0c4a205438676f6356607cfc
Author: Filippo Sironi 
AuthorDate: Tue, 31 Jul 2018 17:29:30 +0200
Committer:  Thomas Gleixner 
CommitDate: Sun, 2 Sep 2018 14:10:54 +0200

x86/microcode: Update the new microcode revision unconditionally

Handle the case where microcode gets loaded on the BSP's hyperthread
sibling first and the boot_cpu_data's microcode revision doesn't get
updated because of early exit due to the siblings sharing a microcode
engine.

For that, simply write the updated revision on all CPUs unconditionally.

Signed-off-by: Filippo Sironi 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Cc: pra...@redhat.com
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/1533050970-14385-1-git-send-email-sir...@amazon.de
---
 arch/x86/kernel/cpu/microcode/amd.c   | 22 +-
 arch/x86/kernel/cpu/microcode/intel.c | 13 -
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index 602f17134103..07b5fc00b188 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -504,6 +504,7 @@ static enum ucode_state apply_microcode_amd(int cpu)
struct microcode_amd *mc_amd;
struct ucode_cpu_info *uci;
struct ucode_patch *p;
+   enum ucode_state ret;
u32 rev, dummy;
 
BUG_ON(raw_smp_processor_id() != cpu);
@@ -521,9 +522,8 @@ static enum ucode_state apply_microcode_amd(int cpu)
 
/* need to apply patch? */
if (rev >= mc_amd->hdr.patch_id) {
-   c->microcode = rev;
-   uci->cpu_sig.rev = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
if (__apply_microcode_amd(mc_amd)) {
@@ -531,17 +531,21 @@ static enum ucode_state apply_microcode_amd(int cpu)
cpu, mc_amd->hdr.patch_id);
return UCODE_ERROR;
}
-   pr_info("CPU%d: new patch_level=0x%08x\n", cpu,
-   mc_amd->hdr.patch_id);
 
-   uci->cpu_sig.rev = mc_amd->hdr.patch_id;
-   c->microcode = mc_amd->hdr.patch_id;
+   rev = mc_amd->hdr.patch_id;
+   ret = UCODE_UPDATED;
+
+   pr_info("CPU%d: new patch_level=0x%08x\n", cpu, rev);
+
+out:
+   uci->cpu_sig.rev = rev;
+   c->microcode = rev;
 
/* Update boot_cpu_data's revision too, if we're on the BSP: */
if (c->cpu_index == boot_cpu_data.cpu_index)
-   boot_cpu_data.microcode = mc_amd->hdr.patch_id;
+   boot_cpu_data.microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static int install_equiv_cpu_table(const u8 *buf)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 256d336cbc04..16936a24795c 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -795,6 +795,7 @@ static enum ucode_state apply_microcode_intel(int cpu)
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
struct cpuinfo_x86 *c = &cpu_data(cpu);
struct microcode_intel *mc;
+   enum ucode_state ret;
static int prev_rev;
u32 rev;
 
@@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu)
 */
rev = intel_get_microcode_revision();
if (rev >= mc->hdr.rev) {
-   uci->cpu_sig.rev = rev;
-   c->microcode = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
/*
@@ -848,14 +848,17 @@ static enum ucode_state apply_microcode_intel(int cpu)
prev_rev = rev;
}
 
+   ret = UCODE_UPDATED;
+
+out:
uci->cpu_sig.rev = rev;
-   c->microcode = rev;
+   c->microcode = rev;
 
/* Update boot_cpu_data's revision too, if we're on the BSP: */
if (c->cpu_index == boot_cpu_data.cpu_index)
boot_cpu_data.microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static enum ucode_state generic_load_microcode(int cpu, void *data, size_t 
size,
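
The control flow this commit introduces can be modelled in a few lines: both the "already up to date" path and the "just updated" path fall through a single label that records the revision, including the boot CPU's copy. This is a simplified sketch, not the kernel code; every sketch_* name is hypothetical, and the hardware revision is passed in as a parameter instead of being read from the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_state { SKETCH_OK, SKETCH_UPDATED };

struct sketch_cpu {
	uint32_t rev;   /* per-CPU record of the microcode revision */
	bool is_bsp;    /* true for the boot CPU */
};

/* Stands in for boot_cpu_data.microcode. */
uint32_t sketch_boot_rev;

/*
 * hw_rev is the revision currently reported by the hardware and
 * patch_rev the revision of the patch being loaded.
 */
enum sketch_state sketch_apply(struct sketch_cpu *c, uint32_t hw_rev,
			       uint32_t patch_rev)
{
	enum sketch_state ret;
	uint32_t rev = hw_rev;

	assert(c != NULL);

	/* Already current, e.g. because a sibling thread loaded it. */
	if (rev >= patch_rev) {
		ret = SKETCH_OK;
		goto out;
	}

	rev = patch_rev; /* pretend the update succeeded */
	ret = SKETCH_UPDATED;

out:
	/* Recorded on both paths, which is the point of the change. */
	c->rev = rev;
	if (c->is_bsp)
		sketch_boot_rev = rev;
	return ret;
}
```

If the sibling of the boot CPU is updated first, the boot CPU later takes the early-exit path, and the shared tail still brings the boot copy up to date.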


[PATCH] x86/microcode: Don't duplicate code to update ucode cpu info and cpu info

2018-07-31 Thread Filippo Sironi
... on late microcode loading when handling a CPU that's already been
updated and a CPU that's yet to be updated.

Signed-off-by: Filippo Sironi 
---
 arch/x86/kernel/cpu/microcode/amd.c   | 15 +--
 arch/x86/kernel/cpu/microcode/intel.c | 10 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index 0624957aa068..77758e10f16f 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -505,6 +505,7 @@ static enum ucode_state apply_microcode_amd(int cpu)
struct ucode_cpu_info *uci;
struct ucode_patch *p;
u32 rev, dummy;
+   enum ucode_state ret;
 
BUG_ON(raw_smp_processor_id() != cpu);
 
@@ -521,9 +522,8 @@ static enum ucode_state apply_microcode_amd(int cpu)
 
/* need to apply patch? */
if (rev >= mc_amd->hdr.patch_id) {
-   c->microcode = rev;
-   uci->cpu_sig.rev = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
if (__apply_microcode_amd(mc_amd)) {
@@ -534,10 +534,13 @@ static enum ucode_state apply_microcode_amd(int cpu)
pr_info("CPU%d: new patch_level=0x%08x\n", cpu,
mc_amd->hdr.patch_id);
 
-   uci->cpu_sig.rev = mc_amd->hdr.patch_id;
-   c->microcode = mc_amd->hdr.patch_id;
+   ret = UCODE_UPDATED;
+   rev = mc_amd->hdr.patch_id;
+out:
+   uci->cpu_sig.rev = rev;
+   c->microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static int install_equiv_cpu_table(const u8 *buf)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 97ccf4c3b45b..4bc869e829eb 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -797,6 +797,7 @@ static enum ucode_state apply_microcode_intel(int cpu)
struct microcode_intel *mc;
static int prev_rev;
u32 rev;
+   enum ucode_state ret;
 
/* We should bind the task to the CPU */
if (WARN_ON(raw_smp_processor_id() != cpu))
@@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu)
 */
rev = intel_get_microcode_revision();
if (rev >= mc->hdr.rev) {
-   uci->cpu_sig.rev = rev;
-   c->microcode = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
/*
@@ -848,10 +848,12 @@ static enum ucode_state apply_microcode_intel(int cpu)
prev_rev = rev;
}
 
+   ret = UCODE_UPDATED;
+out:
uci->cpu_sig.rev = rev;
c->microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static enum ucode_state generic_load_microcode(int cpu, void *data, size_t 
size,
-- 
2.7.4



[PATCH] x86/MCE: Get microcode revision from cpu_info instead of boot_cpu_data

2018-06-01 Thread Filippo Sironi
Commit fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check
records") extended MCE entries to report the microcode revision taken
from boot_cpu_data. Unfortunately, boot_cpu_data isn't updated on late
microcode loading, thus making MCE entries slightly incorrect.
Use cpu_info instead, which is updated on late microcode loading.

Fixes: fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")
Signed-off-by: Filippo Sironi 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..4be323f9b390 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -134,7 +134,7 @@ void mce_setup(struct mce *m)
if (this_cpu_has(X86_FEATURE_INTEL_PPIN))
rdmsrl(MSR_PPIN, m->ppin);
 
-   m->microcode = boot_cpu_data.microcode;
+   m->microcode = cpu_data(m->extcpu).microcode;
 }
 
 DEFINE_PER_CPU(struct mce, injectm);
-- 
2.7.4
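
The staleness described in the commit message can be illustrated with a boot-time snapshot next to a per-CPU table. This is a toy model under the assumption stated in the message, namely that late loading updates only the per-CPU records; the sketch_* names and the CPU count are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4

struct sketch_cpuinfo {
	uint32_t microcode;
};

/* Snapshot taken at boot; not touched by late loading in this model. */
struct sketch_cpuinfo sketch_boot_cpu_data;

/* Per-CPU records; late loading updates the entry of the loading CPU. */
struct sketch_cpuinfo sketch_cpu_info[SKETCH_NR_CPUS];

void sketch_late_load(unsigned int cpu, uint32_t rev)
{
	assert(cpu < SKETCH_NR_CPUS);
	sketch_cpu_info[cpu].microcode = rev;
}

/* Before the patch: every record reports the boot-time revision. */
uint32_t sketch_mce_rev_before(void)
{
	return sketch_boot_cpu_data.microcode;
}

/* After the patch: report the revision of the CPU named in the record. */
uint32_t sketch_mce_rev_after(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	return sketch_cpu_info[cpu].microcode;
}
```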



[PATCH] vfio/type1: Search for a fitting iommu_domain before attaching the iommu_group

2018-03-05 Thread Filippo Sironi
... to avoid an unnecessary attach/detach of the iommu_group to the
newly created iommu_domain.  This also saves us a context-cache and an
IOTLB flush.

This is possible because allocating an iommu_domain for the iommu_group
we're attaching is enough to understand whether a fitting iommu_domain
already exists.

Signed-off-by: Filippo Sironi 
Cc: Alex Williamson 
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vfio/vfio_iommu_type1.c | 32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45657e2b1ff7..88359b4993f3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1279,15 +1279,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
goto out_domain;
}
 
-   ret = iommu_attach_group(domain->domain, iommu_group);
-   if (ret)
-   goto out_domain;
-
resv_msi = vfio_iommu_has_sw_msi(iommu_group, &resv_msi_base);
 
-   INIT_LIST_HEAD(&domain->group_list);
-   list_add(&group->next, &domain->group_list);
-
msi_remap = irq_domain_check_msi_remap() ||
iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
 
@@ -1295,7 +1288,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
   __func__);
ret = -EPERM;
-   goto out_detach;
+   goto out_domain;
}
 
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
@@ -1311,21 +1304,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
list_for_each_entry(d, &iommu->domain_list, next) {
if (d->domain->ops == domain->domain->ops &&
d->prot == domain->prot) {
-   iommu_detach_group(domain->domain, iommu_group);
-   if (!iommu_attach_group(d->domain, iommu_group)) {
-   list_add(&group->next, &d->group_list);
-   iommu_domain_free(domain->domain);
-   kfree(domain);
-   mutex_unlock(&iommu->lock);
-   return 0;
-   }
-
-   ret = iommu_attach_group(domain->domain, iommu_group);
+   ret = iommu_attach_group(d->domain, iommu_group);
if (ret)
goto out_domain;
+   list_add(&group->next, &d->group_list);
+   iommu_domain_free(domain->domain);
+   kfree(domain);
+   mutex_unlock(&iommu->lock);
+   return 0;
}
}
 
+   ret = iommu_attach_group(domain->domain, iommu_group);
+   if (ret)
+   goto out_domain;
+
+   INIT_LIST_HEAD(&domain->group_list);
+   list_add(&group->next, &domain->group_list);
+
vfio_test_domain_fgsp(domain);
 
/* replay mappings on new domains */
-- 
2.7.4
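
The saving claimed in the commit message can be seen by counting attach and detach operations in the two orderings. The sketch below is a toy model: domains are reduced to an (ops, prot) pair, attach is assumed to always succeed (so the old code's fallback re-attach is not modelled), and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_domain {
	int ops;  /* stands in for domain->ops */
	int prot; /* stands in for the protection flags */
};

struct sketch_stats {
	int attaches;
	int detaches;
};

static int sketch_fits(const struct sketch_domain *a,
		       const struct sketch_domain *b)
{
	return a->ops == b->ops && a->prot == b->prot;
}

/*
 * Old order: attach to the fresh domain first, then search.  Returns
 * the index of the existing domain that was reused, or -1 if the fresh
 * domain is kept.
 */
int sketch_attach_old(const struct sketch_domain *existing, size_t n,
		      const struct sketch_domain *fresh,
		      struct sketch_stats *st)
{
	assert(fresh != NULL && st != NULL);
	st->attaches++; /* attach to the fresh domain up front */
	for (size_t i = 0; i < n; i++) {
		if (sketch_fits(&existing[i], fresh)) {
			st->detaches++; /* undo the first attach */
			st->attaches++; /* attach to the fitting domain */
			return (int)i;
		}
	}
	return -1;
}

/* New order: search first, attach exactly once. */
int sketch_attach_new(const struct sketch_domain *existing, size_t n,
		      const struct sketch_domain *fresh,
		      struct sketch_stats *st)
{
	assert(fresh != NULL && st != NULL);
	for (size_t i = 0; i < n; i++) {
		if (sketch_fits(&existing[i], fresh)) {
			st->attaches++;
			return (int)i;
		}
	}
	st->attaches++; /* no fit: attach to the fresh domain */
	return -1;
}
```

When a fitting domain exists, the old order performs two attaches and one detach while the new order performs a single attach; when none exists, both orders attach once.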



[PATCH] sched/fair: Prevent a division by 0 in scale_rt_capacity()

2017-12-09 Thread Filippo Sironi
... since total = sched_avg_period() + delta can yield 0x100000000,
which truncates to 0 when passed as a u32 and results in a division by 0,
given that div_u64() takes a u32 divisor.  Use div64_u64() instead.

divide error:  [#1] SMP
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.58 #1
Hardware name: ...
task: 8800a24e2800 task.stack: c974c000
RIP: 0010:[] [] update_group_capacity+0x16e/0x1c0
RSP: 0018:8800a74e3c18 EFLAGS: 00010246
RAX: 00445ced RBX: 0007 RCX: 024d
RDX:  RSI:  RDI: 000160c0
RBP: 8800a74e3c38 R08: 8800a17d5ac0 R09: 8800a74e
R10:  R11:  R12: 8800a297e400
R13: 8800a17d5ac0 R14:  R15: 8800a17d5ac0
FS: () GS:8800a74e() knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 006f3580 CR3: 01607000 CR4: 007426e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Stack:
8800a17d5180 8800a74e3e00 8800a17d5a01 8800a74e3c68
8800a74e3d90 810d37e6 fff8 002300010c40
0040 8800a17d5ad8  
Call Trace:
 [162553.008569] [] find_busiest_group+0xe6/0x950
[] load_balance+0x188/0xa70
[] ? update_rq_clock.part.88+0x13/0x30
[] rebalance_domains+0x210/0x290
[] run_rebalance_domains+0x1b0/0x1d0
[] __do_softirq+0x89/0x2b0
[] irq_exit+0xab/0xb0
[] smp_reschedule_interrupt+0x2e/0x30
[] reschedule_interrupt+0x84/0x90
 [162553.008603] [] ? cpuidle_enter_state+0x12f/0x2c0
[] cpuidle_enter+0x12/0x20
[] cpu_startup_entry+0x1a2/0x1f0
[] start_secondary+0x12d/0x140
Code: 0f 00 4c 8b 96 48 09 00 00 48 8b 86 40 09 00 00 48 8b b6 b0 08 00 00 48 
d1 ea 4c 29 d6 41 ba 00 00 00 00 49 0f 48 f2 01 d6 31 d2 <48> f7 f6 ba 00 04 00 
00 48 29 c2 48 3d ff 03 00 00 b8 01 00 00
RIP [] update_group_capacity+0x16e/0x1c0
RSP 

Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Filippo Sironi 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..04b6f847a241 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7517,7 +7517,7 @@ static unsigned long scale_rt_capacity(int cpu)
 
total = sched_avg_period() + delta;
 
-   used = div_u64(avg, total);
+   used = div64_u64(avg, total);
 
if (likely(used < SCHED_CAPACITY_SCALE))
return SCHED_CAPACITY_SCALE - used;
-- 
2.7.4
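
The narrowing that the commit message above describes can be reproduced in
userspace.  The sketch below is not kernel code: divisor_seen_by_div_u64()
and model_div64_u64() are hypothetical stand-ins that only mirror the
parameter widths of div_u64() (u32 divisor) and div64_u64() (u64 divisor),
and 0x100000000 is used as a representative total whose low 32 bits are zero.

```c
#include <assert.h>
#include <stdint.h>

/* What a u32 divisor parameter receives when it is handed a 64-bit total. */
static uint32_t divisor_seen_by_div_u64(uint64_t total)
{
	return (uint32_t)total;	/* the implicit narrowing, made explicit */
}

/* Stand-in for div64_u64(): the divisor keeps all 64 bits. */
static uint64_t model_div64_u64(uint64_t dividend, uint64_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}
```

With total = 0x100000000 the first helper returns 0, which is the divisor a
u32 parameter would end up with; the second helper divides by the full value.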



[PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-11-26 Thread Filippo Sironi
... that the guest should see.
Guest operating systems may check the microcode version to decide whether
to disable certain features that are known to be buggy up to certain
microcode versions.  Address the issue by making the microcode version
that the guest should see settable.
The rationale for having userspace specify the microcode version, rather
than having the kernel pick it, is to ensure consistency for live-migrated
instances; we don't want them to see a microcode version increase without a
reset.

Signed-off-by: Filippo Sironi 
---
 arch/x86/kvm/x86.c   | 23 +++
 include/uapi/linux/kvm.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 925c3e29cad3..741588f27ebc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
} u;
 
switch (ioctl) {
+   case KVM_GET_MICROCODE_VERSION: {
+   r = -EFAULT;
+   if (copy_to_user(argp,
+&kvm->arch.microcode_version,
+sizeof(kvm->arch.microcode_version)))
+   goto out;
+   break;
+   }
+   case KVM_SET_MICROCODE_VERSION: {
+   u32 microcode_version;
+
+   r = -EFAULT;
+   if (copy_from_user(&microcode_version,
+  argp,
+  sizeof(microcode_version)))
+   goto out;
+   r = -EINVAL;
+   if (!microcode_version)
+   goto out;
+   kvm->arch.microcode_version = microcode_version;
+   r = 0;
+   break;
+   }
case KVM_SET_TSS_ADDR:
r = kvm_vm_ioctl_set_tss_addr(kvm, arg);
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 282d7613fce8..e11887758e29 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1192,6 +1192,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_UCAS_UNMAP  _IOW(KVMIO, 0x51, struct kvm_s390_ucas_mapping)
 #define KVM_S390_VCPU_FAULT _IOW(KVMIO, 0x52, unsigned long)
 
+#define KVM_GET_MICROCODE_VERSION _IOR(KVMIO, 0x5e, __u32)
+#define KVM_SET_MICROCODE_VERSION _IOW(KVMIO, 0x5f, __u32)
+
 /* Device model IOC */
 #define KVM_CREATE_IRQCHIP _IO(KVMIO,   0x60)
 #define KVM_IRQ_LINE  _IOW(KVMIO,  0x61, struct kvm_irq_level)
-- 
2.7.4
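
The semantics of the proposed KVM_SET_MICROCODE_VERSION handler can be
modeled outside the kernel.  This is only a sketch under stated assumptions:
struct vm_model and set_microcode_version() are hypothetical names, and the
user copy (copy_from_user()) is left out; only the validation and store from
the hunk above are mirrored.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one kvm->arch field the ioctl touches. */
struct vm_model {
	uint32_t microcode_version;
};

/* Mirrors the handler: a version of 0 is rejected, anything else is stored. */
static int set_microcode_version(struct vm_model *vm, uint32_t version)
{
	assert(vm != NULL);
	if (!version)
		return -EINVAL;
	vm->microcode_version = version;
	return 0;
}
```

A rejected call leaves the previously stored version untouched, so a guest
keeps seeing the old value until userspace supplies a valid one.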



[PATCH 1/2] KVM: x86: Store the microcode version in struct kvm_arch

2017-11-26 Thread Filippo Sironi
... and read it from there when emulating accesses to
MSR_IA32_UCODE_REV.  This is the first step toward letting userspace define
the microcode version that the guest should see.

Signed-off-by: Filippo Sironi 
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/x86.c  | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1bfb99770c34..84b20139f4f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -776,6 +776,8 @@ struct kvm_arch {
struct mutex apic_map_lock;
struct kvm_apic_map *apic_map;
 
+   u32 microcode_version;
+
unsigned int tss_addr;
bool apic_access_page_done;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34c85aa2e2d1..925c3e29cad3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2447,7 +2447,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = 0;
break;
case MSR_IA32_UCODE_REV:
-   msr_info->data = 0x100000000ULL;
+   msr_info->data = (u64)vcpu->kvm->arch.microcode_version << 32;
break;
case MSR_MTRRcap:
case 0x200 ... 0x2ff:
@@ -8121,6 +8121,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (type)
return -EINVAL;
 
+   kvm->arch.microcode_version = 0x1;
+
INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
-- 
2.7.4
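
The shift by 32 in the hunk above places the stored u32 in the upper half of
the 64-bit value returned for MSR_IA32_UCODE_REV.  A minimal userspace sketch
of that packing follows; the helper names are hypothetical, and 0x1 is the
default that the patch sets in kvm_arch_init_vm().

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 32-bit microcode revision the way the patch fills msr_info->data. */
static uint64_t ucode_rev_to_msr(uint32_t microcode_version)
{
	return (uint64_t)microcode_version << 32;
}

/* Recover the revision from the upper half of the MSR value. */
static uint32_t msr_to_ucode_rev(uint64_t msr_value)
{
	return (uint32_t)(msr_value >> 32);
}

/* Self-check: the default revision 0x1 corresponds to 0x100000000. */
static void check_default_ucode_rev(void)
{
	assert(ucode_rev_to_msr(0x1) == 0x100000000ULL);
	assert(msr_to_ucode_rev(0x100000000ULL) == 0x1);
}
```

The cast to a 64-bit type before shifting matters: shifting a 32-bit value
left by 32 is undefined in C.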



[PATCH v2] pci: Expose offset, stride, and VF device ID via sysfs

2017-10-08 Thread Filippo Sironi
... to make it easier for userspace applications to consume them.

Signed-off-by: Filippo Sironi 
Cc: Bjorn Helgaas 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v2:
* follow up with the rename of vf_did to vf_device

 drivers/pci/pci-sysfs.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2f3780b50723..e6f4133f8992 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -648,6 +648,33 @@ static ssize_t sriov_numvfs_store(struct device *dev,
return count;
 }
 
+static ssize_t sriov_offset_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%u\n", pdev->sriov->offset);
+}
+
+static ssize_t sriov_stride_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%u\n", pdev->sriov->stride);
+}
+
+static ssize_t sriov_vf_device_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%x\n", pdev->sriov->vf_device);
+}
+
 static ssize_t sriov_drivers_autoprobe_show(struct device *dev,
struct device_attribute *attr,
char *buf)
@@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs);
 static struct device_attribute sriov_numvfs_attr =
__ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
   sriov_numvfs_show, sriov_numvfs_store);
+static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset);
+static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride);
+static struct device_attribute sriov_vf_device_attr = __ATTR_RO(sriov_vf_device);
 static struct device_attribute sriov_drivers_autoprobe_attr =
__ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP),
    sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store);
@@ -1744,6 +1774,9 @@ static struct attribute_group pci_dev_hp_attr_group = {
 static struct attribute *sriov_dev_attrs[] = {
&sriov_totalvfs_attr.attr,
&sriov_numvfs_attr.attr,
+   &sriov_offset_attr.attr,
+   &sriov_stride_attr.attr,
+   &sriov_vf_device_attr.attr,
&sriov_drivers_autoprobe_attr.attr,
NULL,
 };
-- 
2.7.4



[PATCH v2] pci: Expose offset, stride, and VF device ID via sysfs

2017-10-08 Thread Filippo Sironi
Testing done:

$ ls -l /sys/bus/pci/devices/0000\:03\:00.0/
total 0
-rw-r--r-- 1 root root    4096 Oct  9 00:48 broken_parity_status
-r--r--r-- 1 root root    4096 Oct  9 00:48 class
-rw-r--r-- 1 root root    4096 Oct  9 00:46 config
-r--r--r-- 1 root root    4096 Oct  9 00:48 consistent_dma_mask_bits
-r--r--r-- 1 root root    4096 Oct  9 00:48 current_link_speed
-r--r--r-- 1 root root    4096 Oct  9 00:48 current_link_width
-rw-r--r-- 1 root root    4096 Oct  9 00:48 d3cold_allowed
-r--r--r-- 1 root root    4096 Oct  9 00:46 device
-r--r--r-- 1 root root    4096 Oct  9 00:48 dma_mask_bits
lrwxrwxrwx 1 root root       0 Oct  9 00:46 driver -> ../../../../bus/pci/drivers/igb
-rw-r--r-- 1 root root    4096 Oct  9 00:48 driver_override
-rw-r--r-- 1 root root    4096 Oct  9 00:48 enable
lrwxrwxrwx 1 root root       0 Oct  9 00:48 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c
-r--r--r-- 1 root root    4096 Oct  9 00:46 irq
-r--r--r-- 1 root root    4096 Oct  9 00:48 local_cpulist
-r--r--r-- 1 root root    4096 Oct  9 00:48 local_cpus
-r--r--r-- 1 root root    4096 Oct  9 00:48 max_link_speed
-r--r--r-- 1 root root    4096 Oct  9 00:48 max_link_width
-r--r--r-- 1 root root    4096 Oct  9 00:48 modalias
-rw-r--r-- 1 root root    4096 Oct  9 00:48 msi_bus
drwxr-xr-x 2 root root       0 Oct  9 00:48 msi_irqs
drwxr-xr-x 3 root root       0 Oct  9 00:46 net
-rw-r--r-- 1 root root    4096 Oct  9 00:48 numa_node
drwxr-xr-x 2 root root       0 Oct  9 00:48 power
drwxr-xr-x 3 root root       0 Oct  9 00:46 ptp
--w--w---- 1 root root    4096 Oct  9 00:48 remove
--w--w---- 1 root root    4096 Oct  9 00:48 rescan
--w------- 1 root root    4096 Oct  9 00:48 reset
-r--r--r-- 1 root root    4096 Oct  9 00:46 resource
-rw------- 1 root root  131072 Oct  9 00:48 resource0
-rw------- 1 root root 4194304 Oct  9 00:48 resource1
-rw------- 1 root root      32 Oct  9 00:48 resource2
-rw------- 1 root root   16384 Oct  9 00:48 resource3
-r--r--r-- 1 root root    4096 Oct  9 00:48 revision
-rw-rw-r-- 1 root root    4096 Oct  9 00:48 sriov_drivers_autoprobe
-rw-rw-r-- 1 root root    4096 Oct  9 00:48 sriov_numvfs
-r--r--r-- 1 root root    4096 Oct  9 00:48 sriov_offset
-r--r--r-- 1 root root    4096 Oct  9 00:48 sriov_stride
-r--r--r-- 1 root root    4096 Oct  9 00:48 sriov_totalvfs
-r--r--r-- 1 root root    4096 Oct  9 00:48 sriov_vf_device
lrwxrwxrwx 1 root root       0 Oct  9 00:46 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root    4096 Oct  9 00:48 subsystem_device
-r--r--r-- 1 root root    4096 Oct  9 00:48 subsystem_vendor
-rw-r--r-- 1 root root    4096 Oct  9 00:46 uevent
-r--r--r-- 1 root root    4096 Oct  9 00:46 vendor
$ cat /sys/bus/pci/devices/0000\:03\:00.0/sriov_offset
128
$ cat /sys/bus/pci/devices/0000\:03\:00.0/sriov_stride
2
$ cat /sys/bus/pci/devices/0000\:03\:00.0/sriov_vf_device
10ca

Filippo Sironi (1):
  pci: Expose offset, stride, and VF device ID via sysfs

 drivers/pci/pci-sysfs.c | 33 +
 1 file changed, 33 insertions(+)

-- 
2.7.4
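
One way userspace might consume these attributes is to compute the routing
ID of each VF.  The sketch below assumes the SR-IOV rule that VF n (counting
from 0) sits at the PF's routing ID plus First VF Offset plus n times VF
Stride; make_rid() and vf_rid() are hypothetical helper names, and the
numbers in the note that follows are the ones from the test output above
(PF 03:00.0, offset 128, stride 2).

```c
#include <assert.h>
#include <stdint.h>

/* Routing ID layout: bus in bits 15:8, device in bits 7:3, function in 2:0. */
static uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Routing ID of VF number vf_index (0-based) behind the given PF. */
static uint16_t vf_rid(uint16_t pf_rid, uint16_t offset, uint16_t stride,
		       unsigned int vf_index)
{
	return (uint16_t)(pf_rid + offset + stride * vf_index);
}
```

For PF 03:00.0 with offset 128 and stride 2, VF 0 lands on 03:10.0 and VF 1
on 03:10.2.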



[PATCH v2] iommu/vt-d: Don't be too aggressive when clearing one context entry

2017-08-31 Thread Filippo Sironi
Previously, we were invalidating the context cache and IOTLB globally when
clearing one context entry.  This is a tad too aggressive.
Invalidate only the context-cache entry of the affected device and the
IOTLB entries of its domain.

Signed-off-by: Filippo Sironi 
Cc: David Woodhouse 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: Jacob Pan 
Cc: io...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/iommu/intel-iommu.c | 42 --
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3e8636f1220e..1aa4ad7974b9 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -974,20 +974,6 @@ static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
return ret;
 }
 
-static void clear_context_table(struct intel_iommu *iommu, u8 bus, u8 devfn)
-{
-   struct context_entry *context;
-   unsigned long flags;
-
-   spin_lock_irqsave(&iommu->lock, flags);
-   context = iommu_context_addr(iommu, bus, devfn, 0);
-   if (context) {
-   context_clear_entry(context);
-   __iommu_flush_cache(iommu, context, sizeof(*context));
-   }
-   spin_unlock_irqrestore(&iommu->lock, flags);
-}
-
 static void free_context_table(struct intel_iommu *iommu)
 {
int i;
@@ -2351,13 +2337,33 @@ static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long i
 
 static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
+   unsigned long flags;
+   struct context_entry *context;
+   u16 did_old;
+
if (!iommu)
return;
 
-   clear_context_table(iommu, bus, devfn);
-   iommu->flush.flush_context(iommu, 0, 0, 0,
-  DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
+   spin_lock_irqsave(&iommu->lock, flags);
+   context = iommu_context_addr(iommu, bus, devfn, 0);
+   if (!context) {
+   spin_unlock_irqrestore(&iommu->lock, flags);
+   return;
+   }
+   did_old = context_domain_id(context);
+   context_clear_entry(context);
+   __iommu_flush_cache(iommu, context, sizeof(*context));
+   spin_unlock_irqrestore(&iommu->lock, flags);
+   iommu->flush.flush_context(iommu,
+  did_old,
+  (((u16)bus) << 8) | devfn,
+  DMA_CCMD_MASK_NOBIT,
+  DMA_CCMD_DEVICE_INVL);
+   iommu->flush.flush_iotlb(iommu,
+did_old,
+0,
+0,
+DMA_TLB_DSI_FLUSH);
 }
 
 static inline void unlink_domain_info(struct device_domain_info *info)
-- 
2.7.4



[PATCH] intel-iommu: Don't be too aggressive when clearing one context entry

2017-08-28 Thread Filippo Sironi
Previously, we were invalidating the context cache and IOTLB globally when
clearing one context entry.  This is a tad too aggressive.
Invalidate only the context-cache entry of the affected device and the
IOTLB entries of its domain.

Signed-off-by: Filippo Sironi 
Cc: David Woodhouse 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: io...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/iommu/intel-iommu.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3e8636f1220e..4bf3e59b0929 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2351,13 +2351,32 @@ static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long i
 
 static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
+   unsigned long flags;
+   struct context_entry *context;
+   u16 did_old;
+
if (!iommu)
return;
 
+   spin_lock_irqsave(&iommu->lock, flags);
+   context = iommu_context_addr(iommu, bus, devfn, 0);
+   if (!context) {
+   spin_unlock_irqrestore(&iommu->lock, flags);
+   return;
+   }
+   did_old = context_domain_id(context);
+   spin_unlock_irqrestore(&iommu->lock, flags);
clear_context_table(iommu, bus, devfn);
-   iommu->flush.flush_context(iommu, 0, 0, 0,
-  DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
+   iommu->flush.flush_context(iommu,
+  did_old,
+  (((u16)bus) << 8) | devfn,
+  DMA_CCMD_MASK_NOBIT,
+  DMA_CCMD_DEVICE_INVL);
+   iommu->flush.flush_iotlb(iommu,
+did_old,
+0,
+0,
+DMA_TLB_DSI_FLUSH);
 }
 
 static inline void unlink_domain_info(struct device_domain_info *info)
-- 
2.7.4



[PATCH 1/2] pci: Cache the VF device ID in the SR-IOV structure

2017-08-28 Thread Filippo Sironi
... and use it instead of reading it over and over from the PF config
space capability.

Signed-off-by: Filippo Sironi 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/pci/iov.c | 5 +++--
 drivers/pci/pci.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 120485d6f352..e8f7eafaba6a 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -134,7 +134,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id, int reset)
 
virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
virtfn->vendor = dev->vendor;
-   pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
+   virtfn->device = iov->vf_did;
rc = pci_setup_device(virtfn);
if (rc)
goto failed0;
@@ -448,6 +448,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
iov->nres = nres;
iov->ctrl = ctrl;
iov->total_VFs = total;
+   pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_did);
iov->pgsz = pgsz;
iov->self = dev;
iov->drivers_autoprobe = true;
@@ -723,7 +724,7 @@ int pci_vfs_assigned(struct pci_dev *dev)
 * determine the device ID for the VFs, the vendor ID will be the
 * same as the PF so there is no need to check for that one
 */
-   pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
+   dev_id = dev->sriov->vf_did;
 
/* loop through all the VFs to see if we own any that are assigned */
vfdev = pci_get_device(dev->vendor, dev_id, NULL);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 22e061738c6f..a7270e11e1ef 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -262,6 +262,7 @@ struct pci_sriov {
u16 num_VFs;/* number of VFs available */
u16 offset; /* first VF Routing ID offset */
u16 stride; /* following VF stride */
+   u16 vf_did; /* VF device ID */
u32 pgsz;   /* page size for BAR alignment */
u8 link;/* Function Dependency Link */
u8 max_VF_buses;/* max buses consumed by VFs */
-- 
2.7.4



[PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs

2017-08-28 Thread Filippo Sironi
... to make it easier for userspace applications to consume them.

Signed-off-by: Filippo Sironi 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/pci/pci-sysfs.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2f3780b50723..f920afe7cff3 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -648,6 +648,33 @@ static ssize_t sriov_numvfs_store(struct device *dev,
return count;
 }
 
+static ssize_t sriov_offset_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%u\n", pdev->sriov->offset);
+}
+
+static ssize_t sriov_stride_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%u\n", pdev->sriov->stride);
+}
+
+static ssize_t sriov_vf_did_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return sprintf(buf, "%x\n", pdev->sriov->vf_did);
+}
+
 static ssize_t sriov_drivers_autoprobe_show(struct device *dev,
struct device_attribute *attr,
char *buf)
@@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs);
 static struct device_attribute sriov_numvfs_attr =
__ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
   sriov_numvfs_show, sriov_numvfs_store);
+static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset);
+static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride);
+static struct device_attribute sriov_vf_did_attr = __ATTR_RO(sriov_vf_did);
 static struct device_attribute sriov_drivers_autoprobe_attr =
__ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP),
    sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store);
@@ -1744,6 +1774,9 @@ static struct attribute_group pci_dev_hp_attr_group = {
 static struct attribute *sriov_dev_attrs[] = {
&sriov_totalvfs_attr.attr,
&sriov_numvfs_attr.attr,
+   &sriov_offset_attr.attr,
+   &sriov_stride_attr.attr,
+   &sriov_vf_did_attr.attr,
&sriov_drivers_autoprobe_attr.attr,
NULL,
 };
-- 
2.7.4



[PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

2017-04-13 Thread Filippo Sironi
cmpxchg_gpte() calls get_user_pages_fast() to pin the page that holds the
guest PTE and to obtain its struct page, which it then maps into the kernel
virtual address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g.,
when limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).

If get_user_pages_fast() fails, look up the VMA that backs the userspace
virtual address, compute the PFN and the physical address, and map it in
the kernel virtual address space with memremap().

Signed-off-by: Filippo Sironi 
Cc: Anthony Liguori 
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kvm/paging_tmpl.h | 42 +-
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a01105485315..b3d7a117179d 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -147,15 +147,39 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
struct page *page;
 
npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
-   /* Check if the user is doing something meaningless. */
-   if (unlikely(npages != 1))
-   return -EFAULT;
-
-   table = kmap_atomic(page);
-   ret = CMPXCHG(&table[index], orig_pte, new_pte);
-   kunmap_atomic(table);
-
-   kvm_release_page_dirty(page);
+   if (likely(npages == 1)) {
+   table = kmap_atomic(page);
+   ret = CMPXCHG(&table[index], orig_pte, new_pte);
+   kunmap_atomic(table);
+
+   kvm_release_page_dirty(page);
+   } else {
+   struct vm_area_struct *vma;
+   unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
+   unsigned long pfn;
+   unsigned long paddr;
+
+   down_read(&current->mm->mmap_sem);
+   /*
+* find_vma() may return a vma that starts above vaddr; check
+* vm_start so that the vma is known to cover ptep_user
+*/
+   vma = find_vma(current->mm, vaddr);
+   if (!vma || vma->vm_start > vaddr || !(vma->vm_flags & VM_PFNMAP)) {
+   up_read(&current->mm->mmap_sem);
+   return -EFAULT;
+   }
+   pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+   paddr = pfn << PAGE_SHIFT;
+   table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
+   if (!table) {
+   up_read(&current->mm->mmap_sem);
+   return -EFAULT;
+   }
+   ret = CMPXCHG(&table[index], orig_pte, new_pte);
+   memunmap(table);
+   up_read(&current->mm->mmap_sem);
+   }
 
return (ret != orig_pte);
 }
-- 
2.7.4
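
The address arithmetic of the fallback path can be checked in isolation.
The following is a userspace sketch, not the kernel implementation: struct
vma_model and pfnmap_paddr() are hypothetical names, 4 KiB pages are
assumed, and vm_pgoff is taken to hold the PFN of the first mapped page, as
it does for a /dev/mem mapping.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages, as on x86 */
#define MODEL_PAGE_MASK  (~((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1))

/* The vm_area_struct fields that the computation depends on. */
struct vma_model {
	uint64_t vm_start;	/* first byte of the mapping */
	uint64_t vm_end;	/* one past the last byte of the mapping */
	uint64_t vm_pgoff;	/* offset in pages; here the first PFN */
};

/* Physical address of the page that backs the user address uaddr. */
static uint64_t pfnmap_paddr(const struct vma_model *vma, uint64_t uaddr)
{
	uint64_t vaddr = uaddr & MODEL_PAGE_MASK;
	uint64_t pfn;

	/* The VMA must actually cover the page-aligned address. */
	assert(vaddr >= vma->vm_start && vaddr < vma->vm_end);
	pfn = ((vaddr - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
	return pfn << MODEL_PAGE_SHIFT;
}
```

For a mapping that starts at PFN 0x100000, a pointer three pages into the
VMA resolves to physical address 0x100003000, regardless of its offset
within the page.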



[PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

2017-04-05 Thread Filippo Sironi
cmpxchg_gpte() calls get_user_pages_fast() to pin the page that holds the
guest PTE and to obtain its struct page, which it then maps into the kernel
virtual address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g.,
when limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).

If get_user_pages_fast() fails, look up the VMA that backs the userspace
virtual address, compute the PFN and the physical address, and map it in
the kernel virtual address space with memremap().

Signed-off-by: Filippo Sironi 
Cc: Anthony Liguori 
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kvm/paging_tmpl.h | 39 ++-
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a01105485315..ab4d6617238c 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
struct page *page;
 
npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
-   /* Check if the user is doing something meaningless. */
-   if (unlikely(npages != 1))
-   return -EFAULT;
-
-   table = kmap_atomic(page);
-   ret = CMPXCHG(&table[index], orig_pte, new_pte);
-   kunmap_atomic(table);
-
-   kvm_release_page_dirty(page);
+   if (likely(npages == 1)) {
+   table = kmap_atomic(page);
+   ret = CMPXCHG(&table[index], orig_pte, new_pte);
+   kunmap_atomic(table);
+
+   kvm_release_page_dirty(page);
+   } else {
+   struct vm_area_struct *vma;
+   unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
+   unsigned long pfn;
+   unsigned long paddr;
+
+   down_read(&current->mm->mmap_sem);
+   vma = find_vma_intersection(current->mm, vaddr,
+   vaddr + PAGE_SIZE);
+   if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
+   up_read(&current->mm->mmap_sem);
+   return -EFAULT;
+   }
+   pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+   paddr = pfn << PAGE_SHIFT;
+   table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
+   if (!table) {
+   up_read(&current->mm->mmap_sem);
+   return -EFAULT;
+   }
+   ret = CMPXCHG(&table[index], orig_pte, new_pte);
+   memunmap(table);
+   up_read(&current->mm->mmap_sem);
+   }
 
return (ret != orig_pte);
 }
-- 
2.7.4