[RFC PATCH] mm: fork: Prevent a NULL deref by getting mm only if the refcount isn't 0
We've seen a number of crashes with the following signature:

  BUG: kernel NULL pointer dereference, address:
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  ...
  Oops: [#1] SMP PTI
  ...
  RIP: 0010:__rb_erase_color+0xc2/0x260
  ...
  Call Trace:
   unlink_file_vma+0x36/0x50
   free_pgtables+0x62/0x110
   exit_mmap+0xd5/0x160
   ? put_dec+0x3a/0x90
   ? num_to_str+0xa8/0xc0
   mmput+0x11/0xb0
   do_task_stat+0x940/0xc80
   proc_single_show+0x49/0x80
   ? __check_object_size+0xcc/0x1a0
   seq_read+0xd3/0x400
   vfs_read+0x72/0xb0
   ksys_read+0x9c/0xd0
   do_syscall_64+0x69/0x400
   ? schedule+0x2a/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ...

This happens when a process goes through the tasks stats in procfs while
another is exiting. This looks like a race where the process that's
exiting drops the last reference on the mm (with mmput) while the other
increases it (with mmget). By only increasing when the reference isn't 0
to begin with, we prevent this from happening.

Signed-off-by: Filippo Sironi
---
 kernel/fork.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8e88e5..a7541a85e5a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1209,10 +1209,8 @@ struct mm_struct *get_task_mm(struct task_struct *task)
 	task_lock(task);
 	mm = task->mm;
 	if (mm) {
-		if (task->flags & PF_KTHREAD)
+		if (task->flags & PF_KTHREAD || !mmget_not_zero(mm))
 			mm = NULL;
-		else
-			mmget(mm);
 	}
 	task_unlock(task);
 	return mm;
-- 
2.17.1

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Re: [PATCH v2] nvme: Add 48-bit DMA address quirk for Amazon NVMe controllers
On 2/10/21 8:37 AM, Christoph Hellwig wrote:
> On Wed, Feb 10, 2021 at 01:39:42AM +0100, Filippo Sironi wrote:
>> Amazon NVMe controllers do not support 64-bit DMA addresses; they are
>> limited to 48-bit DMA addresses. Let's add a quirk to ensure that we
>> make use of 48-bit DMA addresses to avoid misbehavior.
>
> This should probably say some, and mention that they do not follow
> the spec. But I can fix this up when applying the patch.

Thanks!

Filippo
[PATCH v2] nvme: Add 48-bit DMA address quirk for Amazon NVMe controllers
Amazon NVMe controllers do not support 64-bit DMA addresses; they are
limited to 48-bit DMA addresses. Let's add a quirk to ensure that we
make use of 48-bit DMA addresses to avoid misbehavior.

This affects all Amazon NVMe controllers that expose EBS volumes
(0x0061, 0x0065, 0x8061) and local instance storage (0xcd00, 0xcd01,
0xcd02).

Signed-off-by: Filippo Sironi
---
 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 17 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 88a6b97247f5..dae747b4ac35 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -144,6 +144,11 @@ enum nvme_quirks {
 	 * NVMe 1.3 compliance.
 	 */
 	NVME_QUIRK_NO_NS_DESC_LIST		= (1 << 15),
+
+	/*
+	 * The controller supports up to 48-bit DMA address.
+	 */
+	NVME_QUIRK_DMA_ADDRESS_BITS_48		= (1 << 16),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6bad4d4dcdf0..e7001f5ed6e4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2362,13 +2362,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int dma_address_bits = 64;
 
 	if (pci_enable_device_mem(pdev))
 		return result;
 
 	pci_set_master(pdev);
 
-	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)))
+	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+		dma_address_bits = 48;
+	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
 		goto disable;
 
 	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3263,6 +3266,18 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
 	{ PCI_DEVICE(0x2646, 0x2263),   /* KINGSTON A2000 NVMe SSD  */
 		.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0061),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x0065),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0x8061),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd00),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd01),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMAZON, 0xcd02),
+		.driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
 		.driver_data = NVME_QUIRK_SINGLE_VECTOR },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.17.1
Re: [PATCH] nvme: Add 48-bit DMA address quirk
On 2/3/21 12:15 PM, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 12:12:31PM +0100, Filippo Sironi wrote:
>> I don't disagree on the first part of your sentence, this is a big
>> oversight.
>
> But it is not what your commit log suggests.

I can definitely rephrase the commit.

>> On the other hand, those controllers are out there and are in use by
>> a lot of customers. We can keep relying on luck, hoping that customers
>> don't run into troubles or we can merge a few lines of code :)
>
> Your patch does not just quirk a few controllers out there, but all
> current and future controllers with an Amazon vendor ID. We could
> probably talk about quirking an existing vendor ID or two as long as
> this doesn't happen for future hardware.

I know that the hardware team is working on this but I don't know the
timelines and there are a few upcoming controllers - of which I don't
know the device ids yet - that have the same issue.

To avoid issues, it is easier to apply the quirk to all Amazon NVMe
controllers for now, until the new line of controllers with the fix
comes out. At that point, we'll be able to restrict the application to
the known bad controllers.
Re: [PATCH] nvme: Add 48-bit DMA address quirk
On 2/3/21 10:51 AM, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 10:43:38AM +0100, Filippo Sironi wrote:
>> Certain NVMe controllers don't support 64-bit DMA addresses. Instead,
>> they are limited to 48-bit DMA addresses. Let's add a quirk to use them
>> properly.
>
> WTF? This is such a grave NVMe spec compliance bug that I do not think
> we should support this buggy mess in Linux.

I don't disagree on the first part of your sentence, this is a big
oversight.

On the other hand, those controllers are out there and are in use by a
lot of customers. We can keep relying on luck, hoping that customers
don't run into troubles or we can merge a few lines of code :)
[PATCH] nvme: Add 48-bit DMA address quirk
Certain NVMe controllers don't support 64-bit DMA addresses. Instead,
they are limited to 48-bit DMA addresses. Let's add a quirk to use them
properly.

Signed-off-by: Filippo Sironi
---
 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 12 +++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 88a6b97247f5..dae747b4ac35 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -144,6 +144,11 @@ enum nvme_quirks {
 	 * NVMe 1.3 compliance.
 	 */
 	NVME_QUIRK_NO_NS_DESC_LIST		= (1 << 15),
+
+	/*
+	 * The controller supports up to 48-bit DMA address.
+	 */
+	NVME_QUIRK_DMA_ADDRESS_BITS_48		= (1 << 16),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 81e6389b2042..5716ae16c7a7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2362,13 +2362,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 {
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int dma_address_bits = 64;
 
 	if (pci_enable_device_mem(pdev))
 		return result;
 
 	pci_set_master(pdev);
 
-	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(64)))
+	if (dev->ctrl.quirks & NVME_QUIRK_DMA_ADDRESS_BITS_48)
+		dma_address_bits = 48;
+	if (dma_set_mask_and_coherent(dev->dev, DMA_BIT_MASK(dma_address_bits)))
 		goto disable;
 
 	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3259,6 +3262,13 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
 	{ PCI_DEVICE(0x1d97, 0x2263),   /* SPCC */
 		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ .vendor = PCI_VENDOR_ID_AMAZON,
+	  .device = PCI_ANY_ID,
+	  .subvendor = PCI_ANY_ID,
+	  .subdevice = PCI_ANY_ID,
+	  .class = PCI_CLASS_STORAGE_EXPRESS,
+	  .class_mask = 0xff,
+	  .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
 		.driver_data = NVME_QUIRK_SINGLE_VECTOR },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.17.1
Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request
CPU is able to
+	 * deliver the interrupt.
+	 */
+	if (kvm_cpu_has_extint(vcpu))
+		return false;
+
+	/* Acknowledging ExtINT does not happen if LINT0 is masked. */
+	return !(lapic_in_kernel(vcpu) && !kvm_apic_accept_pic_intr(vcpu));
 }
 
-/*
- * if userspace requested an interrupt window, check that the
- * interrupt window is open.
- *
- * No need to exit to userspace if we already have an interrupt queued.
- */
 static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
 {
 	return kvm_arch_interrupt_allowed(vcpu) &&
-		!kvm_cpu_has_interrupt(vcpu) &&
-		!kvm_event_needs_reinjection(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu);
 }
-- 
2.28.0

Reviewed-by: Filippo Sironi
Re: [PATCH 1/2] KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint
in_kernel(v))
-		return v->arch.interrupt.nr;
-
-	vector = kvm_cpu_get_extint(v);
-
+	int vector = kvm_cpu_get_extint(v);
 	if (vector != -1)
 		return vector;			/* PIC */
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 105e7859d1f2..bb5ff761d5e2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2465,7 +2465,7 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	struct kvm_lapic *apic = vcpu->arch.apic;
 	u32 ppr;
 
-	if (!kvm_apic_hw_enabled(apic))
+	if (!kvm_apic_present(vcpu))
 		return -1;
 
 	__apic_update_ppr(apic, &ppr);
-- 
2.28.0

Reviewed-by: Filippo Sironi
[PATCH v2 2/2] KVM: x86: Implement the arch-specific hook to report the VM UUID
On x86, we report the UUID in DMI System Information (i.e., DMI Type 1)
as VM UUID.

Signed-off-by: Filippo Sironi
---
 arch/x86/kernel/kvm.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5c93a65ee1e5..441cab08a09d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -694,6 +695,12 @@ bool kvm_para_available(void)
 }
 EXPORT_SYMBOL_GPL(kvm_para_available);
 
+const char *kvm_para_get_uuid(void)
+{
+	return dmi_get_system_info(DMI_PRODUCT_UUID);
+}
+EXPORT_SYMBOL_GPL(kvm_para_get_uuid);
+
 unsigned int kvm_arch_para_features(void)
 {
 	return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
-- 
2.7.4
[PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries
Start populating /sys/hypervisor with KVM entries when we're running on KVM. This is to replicate functionality that's available when we're running on Xen. Start with /sys/hypervisor/uuid, which users prefer over /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual machine, since it's also available when running on Xen HVM and on Xen PV and, on top of that doesn't require root privileges by default. Let's create arch-specific hooks so that different architectures can provide different implementations. Signed-off-by: Filippo Sironi --- v2: * move the retrieval of the VM UUID out of uuid_show and into kvm_para_get_uuid, which is a weak function that can be overwritten drivers/Kconfig | 2 ++ drivers/Makefile | 2 ++ drivers/kvm/Kconfig | 14 ++ drivers/kvm/Makefile | 1 + drivers/kvm/sys-hypervisor.c | 30 ++ 5 files changed, 49 insertions(+) create mode 100644 drivers/kvm/Kconfig create mode 100644 drivers/kvm/Makefile create mode 100644 drivers/kvm/sys-hypervisor.c diff --git a/drivers/Kconfig b/drivers/Kconfig index 45f9decb9848..90eb835fe951 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -146,6 +146,8 @@ source "drivers/hv/Kconfig" source "drivers/xen/Kconfig" +source "drivers/kvm/Kconfig" + source "drivers/staging/Kconfig" source "drivers/platform/Kconfig" diff --git a/drivers/Makefile b/drivers/Makefile index c61cde554340..79cc92a3f6bf 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -44,6 +44,8 @@ obj-y += soc/ obj-$(CONFIG_VIRTIO) += virtio/ obj-$(CONFIG_XEN) += xen/ +obj-$(CONFIG_KVM_GUEST)+= kvm/ + # regulators early, since some subsystems rely on them to initialize obj-$(CONFIG_REGULATOR)+= regulator/ diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig new file mode 100644 index ..3fc041df7c11 --- /dev/null +++ b/drivers/kvm/Kconfig @@ -0,0 +1,14 @@ +menu "KVM driver support" +depends on KVM_GUEST + +config KVM_SYS_HYPERVISOR +bool "Create KVM entries under /sys/hypervisor" +depends on SYSFS +select SYS_HYPERVISOR +default y 
+help + Create KVM entries under /sys/hypervisor (e.g., uuid). When running + native or on another hypervisor, /sys/hypervisor may still be + present, but it will have no KVM entries. + +endmenu diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile new file mode 100644 index ..73a43fc994b9 --- /dev/null +++ b/drivers/kvm/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c new file mode 100644 index ..43b1d1a09807 --- /dev/null +++ b/drivers/kvm/sys-hypervisor.c @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include + +#include +#include + +__weak const char *kvm_para_get_uuid(void) +{ + return NULL; +} + +static ssize_t uuid_show(struct kobject *obj, +struct kobj_attribute *attr, +char *buf) +{ + const char *uuid = kvm_para_get_uuid(); + return sprintf(buf, "%s\n", uuid); +} + +static struct kobj_attribute uuid = __ATTR_RO(uuid); + +static int __init uuid_init(void) +{ + if (!kvm_para_available()) + return 0; + return sysfs_create_file(hypervisor_kobj, &uuid.attr); +} + +device_initcall(uuid_init); -- 2.7.4
KVM: Start populating /sys/hypervisor with KVM entries
Long-time Xen HVM and Xen PV users are missing /sys/hypervisor entries
when moving to KVM. One report is about getting the VM UUID.

The VM UUID can already be retrieved using
/sys/devices/virtual/dmi/id/product_uuid. This has two downsides:
(1) it requires root privileges and (2) it is only available on KVM and
Xen HVM.

By exposing /sys/hypervisor/uuid when running on KVM as well, we provide
an interface that's functional for KVM, Xen HVM, and Xen PV. Let's do so
by providing arch-specific hooks so that different architectures can
implement the hooks in different ways.

Further work can be done by consolidating the creation of the basic
/sys/hypervisor across hypervisors.

Filippo Sironi (2):
  KVM: Start populating /sys/hypervisor with KVM entries
  KVM: x86: Implement the arch-specific hook to report the VM UUID
[PATCH] KVM: Start populating /sys/hypervisor with KVM entries
Start populating /sys/hypervisor with KVM entries when we're running on KVM. This is to replicate functionality that's available when we're running on Xen. Let's start with /sys/hypervisor/uuid, which users prefer over /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual machine, since it's also available when running on Xen HVM and on Xen PV and, on top of that doesn't require root privileges by default. Signed-off-by: Filippo Sironi --- drivers/Kconfig | 2 ++ drivers/Makefile | 2 ++ drivers/kvm/Kconfig | 14 ++ drivers/kvm/Makefile | 1 + drivers/kvm/sys-hypervisor.c | 26 ++ 5 files changed, 45 insertions(+) create mode 100644 drivers/kvm/Kconfig create mode 100644 drivers/kvm/Makefile create mode 100644 drivers/kvm/sys-hypervisor.c diff --git a/drivers/Kconfig b/drivers/Kconfig index afc942c54814..597519c5f7c8 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -135,6 +135,8 @@ source "drivers/hv/Kconfig" source "drivers/xen/Kconfig" +source "drivers/kvm/Kconfig" + source "drivers/staging/Kconfig" source "drivers/platform/Kconfig" diff --git a/drivers/Makefile b/drivers/Makefile index 1056f9699192..727205e287fc 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -47,6 +47,8 @@ obj-y += soc/ obj-$(CONFIG_VIRTIO) += virtio/ obj-$(CONFIG_XEN) += xen/ +obj-$(CONFIG_KVM_GUEST)+= kvm/ + # regulators early, since some subsystems rely on them to initialize obj-$(CONFIG_REGULATOR)+= regulator/ diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig new file mode 100644 index ..3fc041df7c11 --- /dev/null +++ b/drivers/kvm/Kconfig @@ -0,0 +1,14 @@ +menu "KVM driver support" +depends on KVM_GUEST + +config KVM_SYS_HYPERVISOR +bool "Create KVM entries under /sys/hypervisor" +depends on SYSFS +select SYS_HYPERVISOR +default y +help + Create KVM entries under /sys/hypervisor (e.g., uuid). When running + native or on another hypervisor, /sys/hypervisor may still be + present, but it will have no KVM entries. 
+ +endmenu diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile new file mode 100644 index ..73a43fc994b9 --- /dev/null +++ b/drivers/kvm/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c new file mode 100644 index ..ef04ca65cf1a --- /dev/null +++ b/drivers/kvm/sys-hypervisor.c @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include + +#include +#include +#include + +static ssize_t uuid_show(struct kobject *obj, +struct kobj_attribute *attr, +char *buf) +{ + const char *uuid = dmi_get_system_info(DMI_PRODUCT_UUID); + return sprintf(buf, "%s\n", uuid); +} + +static struct kobj_attribute uuid = __ATTR_RO(uuid); + +static int __init uuid_init(void) +{ + if (!kvm_para_available()) + return 0; + return sysfs_create_file(hypervisor_kobj, &uuid.attr); +} + +device_initcall(uuid_init); -- 2.7.4
[tip:x86/urgent] x86/microcode: Update the new microcode revision unconditionally
Commit-ID: 8da38ebaad23fe1b0c4a205438676f6356607cfc Gitweb: https://git.kernel.org/tip/8da38ebaad23fe1b0c4a205438676f6356607cfc Author: Filippo Sironi AuthorDate: Tue, 31 Jul 2018 17:29:30 +0200 Committer: Thomas Gleixner CommitDate: Sun, 2 Sep 2018 14:10:54 +0200 x86/microcode: Update the new microcode revision unconditionally Handle the case where microcode gets loaded on the BSP's hyperthread sibling first and the boot_cpu_data's microcode revision doesn't get updated because of early exit due to the siblings sharing a microcode engine. For that, simply write the updated revision on all CPUs unconditionally. Signed-off-by: Filippo Sironi Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Cc: pra...@redhat.com Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1533050970-14385-1-git-send-email-sir...@amazon.de --- arch/x86/kernel/cpu/microcode/amd.c | 22 +- arch/x86/kernel/cpu/microcode/intel.c | 13 - 2 files changed, 21 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index 602f17134103..07b5fc00b188 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -504,6 +504,7 @@ static enum ucode_state apply_microcode_amd(int cpu) struct microcode_amd *mc_amd; struct ucode_cpu_info *uci; struct ucode_patch *p; + enum ucode_state ret; u32 rev, dummy; BUG_ON(raw_smp_processor_id() != cpu); @@ -521,9 +522,8 @@ static enum ucode_state apply_microcode_amd(int cpu) /* need to apply patch? 
*/ if (rev >= mc_amd->hdr.patch_id) { - c->microcode = rev; - uci->cpu_sig.rev = rev; - return UCODE_OK; + ret = UCODE_OK; + goto out; } if (__apply_microcode_amd(mc_amd)) { @@ -531,17 +531,21 @@ static enum ucode_state apply_microcode_amd(int cpu) cpu, mc_amd->hdr.patch_id); return UCODE_ERROR; } - pr_info("CPU%d: new patch_level=0x%08x\n", cpu, - mc_amd->hdr.patch_id); - uci->cpu_sig.rev = mc_amd->hdr.patch_id; - c->microcode = mc_amd->hdr.patch_id; + rev = mc_amd->hdr.patch_id; + ret = UCODE_UPDATED; + + pr_info("CPU%d: new patch_level=0x%08x\n", cpu, rev); + +out: + uci->cpu_sig.rev = rev; + c->microcode = rev; /* Update boot_cpu_data's revision too, if we're on the BSP: */ if (c->cpu_index == boot_cpu_data.cpu_index) - boot_cpu_data.microcode = mc_amd->hdr.patch_id; + boot_cpu_data.microcode = rev; - return UCODE_UPDATED; + return ret; } static int install_equiv_cpu_table(const u8 *buf) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 256d336cbc04..16936a24795c 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -795,6 +795,7 @@ static enum ucode_state apply_microcode_intel(int cpu) struct ucode_cpu_info *uci = ucode_cpu_info + cpu; struct cpuinfo_x86 *c = &cpu_data(cpu); struct microcode_intel *mc; + enum ucode_state ret; static int prev_rev; u32 rev; @@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu) */ rev = intel_get_microcode_revision(); if (rev >= mc->hdr.rev) { - uci->cpu_sig.rev = rev; - c->microcode = rev; - return UCODE_OK; + ret = UCODE_OK; + goto out; } /* @@ -848,14 +848,17 @@ static enum ucode_state apply_microcode_intel(int cpu) prev_rev = rev; } + ret = UCODE_UPDATED; + +out: uci->cpu_sig.rev = rev; - c->microcode = rev; + c->microcode = rev; /* Update boot_cpu_data's revision too, if we're on the BSP: */ if (c->cpu_index == boot_cpu_data.cpu_index) boot_cpu_data.microcode = rev; - return UCODE_UPDATED; + return ret; } static 
enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
[PATCH] x86/microcode: Don't duplicate code to update ucode cpu info and cpu info
... on late microcode loading when handling a CPU that's already been updated and a CPU that's yet to be updated. Signed-off-by: Filippo Sironi --- arch/x86/kernel/cpu/microcode/amd.c | 15 +-- arch/x86/kernel/cpu/microcode/intel.c | 10 ++ 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index 0624957aa068..77758e10f16f 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -505,6 +505,7 @@ static enum ucode_state apply_microcode_amd(int cpu) struct ucode_cpu_info *uci; struct ucode_patch *p; u32 rev, dummy; + enum ucode_state ret; BUG_ON(raw_smp_processor_id() != cpu); @@ -521,9 +522,8 @@ static enum ucode_state apply_microcode_amd(int cpu) /* need to apply patch? */ if (rev >= mc_amd->hdr.patch_id) { - c->microcode = rev; - uci->cpu_sig.rev = rev; - return UCODE_OK; + ret = UCODE_OK; + goto out; } if (__apply_microcode_amd(mc_amd)) { @@ -534,10 +534,13 @@ static enum ucode_state apply_microcode_amd(int cpu) pr_info("CPU%d: new patch_level=0x%08x\n", cpu, mc_amd->hdr.patch_id); - uci->cpu_sig.rev = mc_amd->hdr.patch_id; - c->microcode = mc_amd->hdr.patch_id; + ret = UCODE_UPDATED; + rev = mc_amd->hdr.patch_id; +out: + uci->cpu_sig.rev = rev; + c->microcode = rev; - return UCODE_UPDATED; + return ret; } static int install_equiv_cpu_table(const u8 *buf) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 97ccf4c3b45b..4bc869e829eb 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -797,6 +797,7 @@ static enum ucode_state apply_microcode_intel(int cpu) struct microcode_intel *mc; static int prev_rev; u32 rev; + enum ucode_state ret; /* We should bind the task to the CPU */ if (WARN_ON(raw_smp_processor_id() != cpu)) @@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu) */ rev = intel_get_microcode_revision(); if (rev >= 
mc->hdr.rev) { - uci->cpu_sig.rev = rev; - c->microcode = rev; - return UCODE_OK; + ret = UCODE_OK; + goto out; } /* @@ -848,10 +848,12 @@ static enum ucode_state apply_microcode_intel(int cpu) prev_rev = rev; } + ret = UCODE_UPDATED; +out: uci->cpu_sig.rev = rev; c->microcode = rev; - return UCODE_UPDATED; + return ret; } static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size, -- 2.7.4
[PATCH] x86/MCE: Get microcode revision from cpu_info instead of boot_cpu_data
Commit fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check
records") extended MCE entries to report the microcode revision taken
from boot_cpu_data. Unfortunately, boot_cpu_data isn't updated on late
microcode loading, thus making MCE entries slightly incorrect. Use
cpu_info instead, which is updated on late microcode loading.

Fixes: fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")
Signed-off-by: Filippo Sironi
Cc: Tony Luck
Cc: Borislav Petkov
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..4be323f9b390 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -134,7 +134,7 @@ void mce_setup(struct mce *m)
 	if (this_cpu_has(X86_FEATURE_INTEL_PPIN))
 		rdmsrl(MSR_PPIN, m->ppin);
 
-	m->microcode = boot_cpu_data.microcode;
+	m->microcode = cpu_data(m->extcpu).microcode;
 }
 
 DEFINE_PER_CPU(struct mce, injectm);
-- 
2.7.4
[PATCH] vfio/type1: Search for a fitting iommu_domain before attaching the iommu_group
... to avoid an unnecessary attach/detach of the iommu_group to the
newly created iommu_domain. This also saves us a context-cache and an
IOTLB flush.

This is possible because allocating an iommu_domain for the iommu_group
we're attaching is enough to understand whether a fitting iommu_domain
already exists.

Signed-off-by: Filippo Sironi
Cc: Alex Williamson
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/vfio/vfio_iommu_type1.c | 32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45657e2b1ff7..88359b4993f3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1279,15 +1279,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		goto out_domain;
 	}
 
-	ret = iommu_attach_group(domain->domain, iommu_group);
-	if (ret)
-		goto out_domain;
-
 	resv_msi = vfio_iommu_has_sw_msi(iommu_group, &resv_msi_base);
 
-	INIT_LIST_HEAD(&domain->group_list);
-	list_add(&group->next, &domain->group_list);
-
 	msi_remap = irq_domain_check_msi_remap() ||
 		    iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
 
@@ -1295,7 +1288,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
 		       __func__);
 		ret = -EPERM;
-		goto out_detach;
+		goto out_domain;
 	}
 
 	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
@@ -1311,21 +1304,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		if (d->domain->ops == domain->domain->ops &&
 		    d->prot == domain->prot) {
-			iommu_detach_group(domain->domain, iommu_group);
-			if (!iommu_attach_group(d->domain, iommu_group)) {
-				list_add(&group->next, &d->group_list);
-				iommu_domain_free(domain->domain);
-				kfree(domain);
-				mutex_unlock(&iommu->lock);
-				return 0;
-			}
-
-			ret = iommu_attach_group(domain->domain, iommu_group);
+			ret = iommu_attach_group(d->domain, iommu_group);
 			if (ret)
 				goto out_domain;
+			list_add(&group->next, &d->group_list);
+			iommu_domain_free(domain->domain);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
 		}
 	}
 
+	ret = iommu_attach_group(domain->domain, iommu_group);
+	if (ret)
+		goto out_domain;
+
+	INIT_LIST_HEAD(&domain->group_list);
+	list_add(&group->next, &domain->group_list);
+
 	vfio_test_domain_fgsp(domain);
 
 	/* replay mappings on new domains */
-- 
2.7.4
[PATCH] sched/fair: Prevent a division by 0 in scale_rt_capacity()
... since total = sched_avg_period() + delta can yield 0x100000000,
which results in a division by 0, given that div_u64() takes a u32
divisor. Use div64_u64() instead.

  divide error: [#1] SMP
  CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.58 #1
  Hardware name: ...
  task: 8800a24e2800 task.stack: c974c000
  RIP: 0010:[] [] update_group_capacity+0x16e/0x1c0
  RSP: 0018:8800a74e3c18 EFLAGS: 00010246
  RAX: 00445ced RBX: 0007 RCX: 024d
  RDX: RSI: RDI: 000160c0
  RBP: 8800a74e3c38 R08: 8800a17d5ac0 R09: 8800a74e
  R10: R11: R12: 8800a297e400
  R13: 8800a17d5ac0 R14: R15: 8800a17d5ac0
  FS: () GS:8800a74e() knlGS:
  CS: 0010 DS: ES: CR0: 80050033
  CR2: 006f3580 CR3: 01607000 CR4: 007426e0
  DR0: DR1: DR2:
  DR3: DR6: fffe0ff0 DR7: 0400
  PKRU: 5554
  Stack:
   8800a17d5180 8800a74e3e00 8800a17d5a01 8800a74e3c68 8800a74e3d90 810d37e6 fff8 002300010c40 0040 8800a17d5ad8
  Call Trace:
   [162553.008569] [] find_busiest_group+0xe6/0x950
   [] load_balance+0x188/0xa70
   [] ? update_rq_clock.part.88+0x13/0x30
   [] rebalance_domains+0x210/0x290
   [] run_rebalance_domains+0x1b0/0x1d0
   [] __do_softirq+0x89/0x2b0
   [] irq_exit+0xab/0xb0
   [] smp_reschedule_interrupt+0x2e/0x30
   [] reschedule_interrupt+0x84/0x90
   [162553.008603] [] ? cpuidle_enter_state+0x12f/0x2c0
   [] cpuidle_enter+0x12/0x20
   [] cpu_startup_entry+0x1a2/0x1f0
   [] start_secondary+0x12d/0x140
  Code: 0f 00 4c 8b 96 48 09 00 00 48 8b 86 40 09 00 00 48 8b b6 b0 08 00 00 48 d1 ea 4c 29 d6 41 ba 00 00 00 00 49 0f 48 f2 01 d6 31 d2 <48> f7 f6 ba 00 04 00 00 48 29 c2 48 3d ff 03 00 00 b8 01 00 00
  RIP [] update_group_capacity+0x16e/0x1c0
  RSP

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Filippo Sironi
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..04b6f847a241 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7517,7 +7517,7 @@ static unsigned long scale_rt_capacity(int cpu)
 
 	total = sched_avg_period() + delta;
 
-	used = div_u64(avg, total);
+	used = div64_u64(avg, total);
 
 	if (likely(used < SCHED_CAPACITY_SCALE))
 		return SCHED_CAPACITY_SCALE - used;
-- 
2.7.4
[PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version
... that the guest should see.

Guest operating systems may check the microcode version to decide
whether to disable certain features that are known to be buggy up to
certain microcode versions. Address the issue by making the microcode
version that the guest should see settable.

The rationale for having userspace specifying the microcode version,
rather than having the kernel picking it, is to ensure consistency for
live-migrated instances; we don't want them to see a microcode version
increase without a reset.

Signed-off-by: Filippo Sironi
---
 arch/x86/kvm/x86.c       | 23 +++
 include/uapi/linux/kvm.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 925c3e29cad3..741588f27ebc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	} u;
 
 	switch (ioctl) {
+	case KVM_GET_MICROCODE_VERSION: {
+		r = -EFAULT;
+		if (copy_to_user(argp,
+				 &kvm->arch.microcode_version,
+				 sizeof(kvm->arch.microcode_version)))
+			goto out;
+		break;
+	}
+	case KVM_SET_MICROCODE_VERSION: {
+		u32 microcode_version;
+
+		r = -EFAULT;
+		if (copy_from_user(&microcode_version,
+				   argp,
+				   sizeof(microcode_version)))
+			goto out;
+		r = -EINVAL;
+		if (!microcode_version)
+			goto out;
+		kvm->arch.microcode_version = microcode_version;
+		r = 0;
+		break;
+	}
 	case KVM_SET_TSS_ADDR:
 		r = kvm_vm_ioctl_set_tss_addr(kvm, arg);
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 282d7613fce8..e11887758e29 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1192,6 +1192,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_UCAS_UNMAP _IOW(KVMIO, 0x51, struct kvm_s390_ucas_mapping)
 #define KVM_S390_VCPU_FAULT _IOW(KVMIO, 0x52, unsigned long)
 
+#define KVM_GET_MICROCODE_VERSION _IOR(KVMIO, 0x5e, __u32)
+#define KVM_SET_MICROCODE_VERSION _IOW(KVMIO, 0x5f, __u32)
+
 /* Device model IOC */
 #define KVM_CREATE_IRQCHIP _IO(KVMIO, 0x60)
 #define KVM_IRQ_LINE _IOW(KVMIO, 0x61, struct kvm_irq_level)
-- 
2.7.4
[PATCH 1/2] KVM: x86: Store the microcode version in struct kvm_arch
... and read it from there when emulating accesses to MSR_IA32_UCODE_REV. This is the first step towards letting userspace define which microcode version the guest should see. Signed-off-by: Filippo Sironi --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 4 +++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1bfb99770c34..84b20139f4f1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -776,6 +776,8 @@ struct kvm_arch { struct mutex apic_map_lock; struct kvm_apic_map *apic_map; + u32 microcode_version; + unsigned int tss_addr; bool apic_access_page_done; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 34c85aa2e2d1..925c3e29cad3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2447,7 +2447,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = 0; break; case MSR_IA32_UCODE_REV: - msr_info->data = 0x1ULL; + msr_info->data = (u64)vcpu->kvm->arch.microcode_version << 32; break; case MSR_MTRRcap: case 0x200 ... 0x2ff: @@ -8121,6 +8121,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) if (type) return -EINVAL; + kvm->arch.microcode_version = 0x1; + INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages); -- 2.7.4
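For readers unfamiliar with the MSR layout, here is a minimal userspace sketch of the shift used in the hunk above: the 32-bit version is placed in the upper half of the 64-bit MSR value and the lower half is left at zero. The function names ucode_rev_msr() and ucode_rev_from_msr() are made up for this illustration and are not part of the patch or of KVM.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the 64-bit value returned for a read of
 * MSR_IA32_UCODE_REV from a 32-bit microcode version, mirroring the
 * "(u64)microcode_version << 32" expression in the patch.
 */
uint64_t ucode_rev_msr(uint32_t microcode_version)
{
	return (uint64_t)microcode_version << 32;
}

/* Hypothetical helper: extract the version again, as a guest would. */
uint32_t ucode_rev_from_msr(uint64_t msr)
{
	return (uint32_t)(msr >> 32);
}

/* Small usage check: the default of 0x1 lands in bit 32. */
void ucode_rev_example(void)
{
	assert(ucode_rev_msr(0x1) == 0x100000000ULL);
	assert(ucode_rev_from_msr(ucode_rev_msr(0x1)) == 0x1);
}
```

The cast to a 64-bit type before shifting matters: shifting a 32-bit value by 32 would be undefined behaviour in C.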
[PATCH v2] pci: Expose offset, stride, and VF device ID via sysfs
... to make it easier for userspace applications to consume them. Signed-off-by: Filippo Sironi Cc: Bjorn Helgaas Cc: linux-...@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- v2: * follow up with the rename of vf_did to vf_device drivers/pci/pci-sysfs.c | 33 + 1 file changed, 33 insertions(+) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 2f3780b50723..e6f4133f8992 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -648,6 +648,33 @@ static ssize_t sriov_numvfs_store(struct device *dev, return count; } +static ssize_t sriov_offset_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%u\n", pdev->sriov->offset); +} + +static ssize_t sriov_stride_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%u\n", pdev->sriov->stride); +} + +static ssize_t sriov_vf_device_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%x\n", pdev->sriov->vf_device); +} + static ssize_t sriov_drivers_autoprobe_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs); static struct device_attribute sriov_numvfs_attr = __ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP), sriov_numvfs_show, sriov_numvfs_store); +static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset); +static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride); +static struct device_attribute sriov_vf_device_attr = __ATTR_RO(sriov_vf_device); static struct device_attribute sriov_drivers_autoprobe_attr = __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store); @@ -1744,6 +1774,9 @@ static struct attribute_group 
pci_dev_hp_attr_group = { static struct attribute *sriov_dev_attrs[] = { &sriov_totalvfs_attr.attr, &sriov_numvfs_attr.attr, + &sriov_offset_attr.attr, + &sriov_stride_attr.attr, + &sriov_vf_device_attr.attr, &sriov_drivers_autoprobe_attr.attr, NULL, }; -- 2.7.4
[PATCH v2] pci: Expose offset, stride, and VF device ID via sysfs
Testing done: $ ls -l /sys/bus/pci/devices/\:03\:00.0/ total 0 -rw-r--r-- 1 root root4096 Oct 9 00:48 broken_parity_status -r--r--r-- 1 root root4096 Oct 9 00:48 class -rw-r--r-- 1 root root4096 Oct 9 00:46 config -r--r--r-- 1 root root4096 Oct 9 00:48 consistent_dma_mask_bits -r--r--r-- 1 root root4096 Oct 9 00:48 current_link_speed -r--r--r-- 1 root root4096 Oct 9 00:48 current_link_width -rw-r--r-- 1 root root4096 Oct 9 00:48 d3cold_allowed -r--r--r-- 1 root root4096 Oct 9 00:46 device -r--r--r-- 1 root root4096 Oct 9 00:48 dma_mask_bits lrwxrwxrwx 1 root root 0 Oct 9 00:46 driver -> ../../../../bus/pci/drivers/igb -rw-r--r-- 1 root root4096 Oct 9 00:48 driver_override -rw-r--r-- 1 root root4096 Oct 9 00:48 enable lrwxrwxrwx 1 root root 0 Oct 9 00:48 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c -r--r--r-- 1 root root4096 Oct 9 00:46 irq -r--r--r-- 1 root root4096 Oct 9 00:48 local_cpulist -r--r--r-- 1 root root4096 Oct 9 00:48 local_cpus -r--r--r-- 1 root root4096 Oct 9 00:48 max_link_speed -r--r--r-- 1 root root4096 Oct 9 00:48 max_link_width -r--r--r-- 1 root root4096 Oct 9 00:48 modalias -rw-r--r-- 1 root root4096 Oct 9 00:48 msi_bus drwxr-xr-x 2 root root 0 Oct 9 00:48 msi_irqs drwxr-xr-x 3 root root 0 Oct 9 00:46 net -rw-r--r-- 1 root root4096 Oct 9 00:48 numa_node drwxr-xr-x 2 root root 0 Oct 9 00:48 power drwxr-xr-x 3 root root 0 Oct 9 00:46 ptp --w--w 1 root root4096 Oct 9 00:48 remove --w--w 1 root root4096 Oct 9 00:48 rescan --w--- 1 root root4096 Oct 9 00:48 reset -r--r--r-- 1 root root4096 Oct 9 00:46 resource -rw--- 1 root root 131072 Oct 9 00:48 resource0 -rw--- 1 root root 4194304 Oct 9 00:48 resource1 -rw--- 1 root root 32 Oct 9 00:48 resource2 -rw--- 1 root root 16384 Oct 9 00:48 resource3 -r--r--r-- 1 root root4096 Oct 9 00:48 revision -rw-rw-r-- 1 root root4096 Oct 9 00:48 sriov_drivers_autoprobe -rw-rw-r-- 1 root root4096 Oct 9 00:48 sriov_numvfs -r--r--r-- 1 root root4096 Oct 9 00:48 sriov_offset 
-r--r--r-- 1 root root4096 Oct 9 00:48 sriov_stride -r--r--r-- 1 root root4096 Oct 9 00:48 sriov_totalvfs -r--r--r-- 1 root root4096 Oct 9 00:48 sriov_vf_device lrwxrwxrwx 1 root root 0 Oct 9 00:46 subsystem -> ../../../../bus/pci -r--r--r-- 1 root root4096 Oct 9 00:48 subsystem_device -r--r--r-- 1 root root4096 Oct 9 00:48 subsystem_vendor -rw-r--r-- 1 root root4096 Oct 9 00:46 uevent -r--r--r-- 1 root root4096 Oct 9 00:46 vendor $ cat /sys/bus/pci/devices/\:03\:00.0/sriov_offset 128 $ cat /sys/bus/pci/devices/\:03\:00.0/sriov_stride 2 $ cat /sys/bus/pci/devices/\:03\:00.0/sriov_vf_device 10ca Filippo Sironi (1): pci: Expose offset, stride, and VF device ID via sysfs drivers/pci/pci-sysfs.c | 33 + 1 file changed, 33 insertions(+) -- 2.7.4
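As an illustration of what userspace might do with these attributes, the sketch below derives the bus and devfn of each VF from the PF address plus the offset and stride values. It follows the usual SR-IOV arithmetic (VF n sits at PF routing ID + offset + n * stride, with n starting at 0); struct vf_addr and vf_address() are names invented for this example. It assumes the PF from the test output is at bus 0x03, devfn 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: bus number and devfn of one VF. */
struct vf_addr {
	uint8_t bus;
	uint8_t devfn;
};

/*
 * Hypothetical helper: compute the address of VF number vf_index
 * (0-based) from the PF address and the sriov_offset / sriov_stride
 * values. A carry out of the 8-bit devfn moves the VF to a later bus.
 */
struct vf_addr vf_address(uint8_t pf_bus, uint8_t pf_devfn,
			  uint16_t offset, uint16_t stride,
			  unsigned int vf_index)
{
	unsigned int rid = pf_devfn + offset + stride * vf_index;
	struct vf_addr addr = {
		.bus = (uint8_t)(pf_bus + (rid >> 8)),
		.devfn = (uint8_t)(rid & 0xff),
	};

	return addr;
}

/* Usage check with the values shown above: offset 128, stride 2. */
void vf_address_example(void)
{
	struct vf_addr vf0 = vf_address(0x03, 0x00, 128, 2, 0);

	/* devfn 0x80 is slot 16, function 0. */
	assert(vf0.bus == 0x03 && vf0.devfn == 0x80);
}
```

With these inputs the second VF lands at devfn 0x82 (slot 16, function 2) on the same bus.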
[PATCH v2] iommu/vt-d: Don't be too aggressive when clearing one context entry
Previously, we were invalidating context cache and IOTLB globally when clearing one context entry. This is a tad too aggressive. Invalidate the context cache and IOTLB for the interested device only. Signed-off-by: Filippo Sironi Cc: David Woodhouse Cc: David Woodhouse Cc: Joerg Roedel Cc: Jacob Pan Cc: io...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org --- drivers/iommu/intel-iommu.c | 42 -- 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 3e8636f1220e..1aa4ad7974b9 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -974,20 +974,6 @@ static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn) return ret; } -static void clear_context_table(struct intel_iommu *iommu, u8 bus, u8 devfn) -{ - struct context_entry *context; - unsigned long flags; - - spin_lock_irqsave(&iommu->lock, flags); - context = iommu_context_addr(iommu, bus, devfn, 0); - if (context) { - context_clear_entry(context); - __iommu_flush_cache(iommu, context, sizeof(*context)); - } - spin_unlock_irqrestore(&iommu->lock, flags); -} - static void free_context_table(struct intel_iommu *iommu) { int i; @@ -2351,13 +2337,33 @@ static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long i static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn) { + unsigned long flags; + struct context_entry *context; + u16 did_old; + if (!iommu) return; - clear_context_table(iommu, bus, devfn); - iommu->flush.flush_context(iommu, 0, 0, 0, - DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); + spin_lock_irqsave(&iommu->lock, flags); + context = iommu_context_addr(iommu, bus, devfn, 0); + if (!context) { + spin_unlock_irqrestore(&iommu->lock, flags); + return; + } + did_old = context_domain_id(context); + context_clear_entry(context); + __iommu_flush_cache(iommu, context, sizeof(*context)); + 
spin_unlock_irqrestore(&iommu->lock, flags); + iommu->flush.flush_context(iommu, + did_old, + (((u16)bus) << 8) | devfn, + DMA_CCMD_MASK_NOBIT, + DMA_CCMD_DEVICE_INVL); + iommu->flush.flush_iotlb(iommu, +did_old, +0, +0, +DMA_TLB_DSI_FLUSH); } static inline void unlink_domain_info(struct device_domain_info *info) -- 2.7.4
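The device-selective context invalidation above identifies the device by a 16-bit value built from bus and devfn. A minimal standalone sketch of that composition follows; source_id() and make_devfn() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack slot (5 bits) and function (3 bits). */
uint8_t make_devfn(uint8_t slot, uint8_t func)
{
	return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/*
 * Hypothetical helper: same expression as in the patch,
 * (((u16)bus) << 8) | devfn, i.e. bus in the high byte and
 * devfn in the low byte.
 */
uint16_t source_id(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}

/* Usage check: device 03:10.0 maps to 0x0380. */
void source_id_example(void)
{
	assert(source_id(0x03, make_devfn(16, 0)) == 0x0380);
}
```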
[PATCH] intel-iommu: Don't be too aggressive when clearing one context entry
Previously, we were invalidating context cache and IOTLB globally when clearing one context entry. This is a tad too aggressive. Invalidate the context cache and IOTLB for the interested device only. Signed-off-by: Filippo Sironi Cc: David Woodhouse Cc: David Woodhouse Cc: Joerg Roedel Cc: io...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org --- drivers/iommu/intel-iommu.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 3e8636f1220e..4bf3e59b0929 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -2351,13 +2351,32 @@ static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long i static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn) { + unsigned long flags; + struct context_entry *context; + u16 did_old; + if (!iommu) return; + spin_lock_irqsave(&iommu->lock, flags); + context = iommu_context_addr(iommu, bus, devfn, 0); + if (!context) { + spin_unlock_irqrestore(&iommu->lock, flags); + return; + } + did_old = context_domain_id(context); + spin_unlock_irqrestore(&iommu->lock, flags); clear_context_table(iommu, bus, devfn); - iommu->flush.flush_context(iommu, 0, 0, 0, - DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); + iommu->flush.flush_context(iommu, + did_old, + (((u16)bus) << 8) | devfn, + DMA_CCMD_MASK_NOBIT, + DMA_CCMD_DEVICE_INVL); + iommu->flush.flush_iotlb(iommu, +did_old, +0, +0, +DMA_TLB_DSI_FLUSH); } static inline void unlink_domain_info(struct device_domain_info *info) -- 2.7.4
[PATCH 1/2] pci: Cache the VF device ID in the SR-IOV structure
... and use it instead of reading it over and over from the PF config space capability. Signed-off-by: Filippo Sironi Cc: linux-...@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- drivers/pci/iov.c | 5 +++-- drivers/pci/pci.h | 1 + 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 120485d6f352..e8f7eafaba6a 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -134,7 +134,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id, int reset) virtfn->devfn = pci_iov_virtfn_devfn(dev, id); virtfn->vendor = dev->vendor; - pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device); + virtfn->device = iov->vf_did; rc = pci_setup_device(virtfn); if (rc) goto failed0; @@ -448,6 +448,7 @@ static int sriov_init(struct pci_dev *dev, int pos) iov->nres = nres; iov->ctrl = ctrl; iov->total_VFs = total; + pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_did); iov->pgsz = pgsz; iov->self = dev; iov->drivers_autoprobe = true; @@ -723,7 +724,7 @@ int pci_vfs_assigned(struct pci_dev *dev) * determine the device ID for the VFs, the vendor ID will be the * same as the PF so there is no need to check for that one */ - pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id); + dev_id = dev->sriov->vf_did; /* loop through all the VFs to see if we own any that are assigned */ vfdev = pci_get_device(dev->vendor, dev_id, NULL); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 22e061738c6f..a7270e11e1ef 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -262,6 +262,7 @@ struct pci_sriov { u16 num_VFs;/* number of VFs available */ u16 offset; /* first VF Routing ID offset */ u16 stride; /* following VF stride */ + u16 vf_did; /* VF device ID */ u32 pgsz; /* page size for BAR alignment */ u8 link;/* Function Dependency Link */ u8 max_VF_buses;/* max buses consumed by VFs */ -- 2.7.4
[PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs
... to make it easier for userspace applications to consume them. Signed-off-by: Filippo Sironi Cc: linux-...@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- drivers/pci/pci-sysfs.c | 33 + 1 file changed, 33 insertions(+) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 2f3780b50723..f920afe7cff3 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -648,6 +648,33 @@ static ssize_t sriov_numvfs_store(struct device *dev, return count; } +static ssize_t sriov_offset_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%u\n", pdev->sriov->offset); +} + +static ssize_t sriov_stride_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%u\n", pdev->sriov->stride); +} + +static ssize_t sriov_vf_did_show(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return sprintf(buf, "%x\n", pdev->sriov->vf_did); +} + static ssize_t sriov_drivers_autoprobe_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs); static struct device_attribute sriov_numvfs_attr = __ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP), sriov_numvfs_show, sriov_numvfs_store); +static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset); +static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride); +static struct device_attribute sriov_vf_did_attr = __ATTR_RO(sriov_vf_did); static struct device_attribute sriov_drivers_autoprobe_attr = __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP), sriov_drivers_autoprobe_show, sriov_drivers_autoprobe_store); @@ -1744,6 +1774,9 @@ static struct attribute_group pci_dev_hp_attr_group = { static struct attribute *sriov_dev_attrs[] = { &sriov_totalvfs_attr.attr,
&sriov_numvfs_attr.attr, + &sriov_offset_attr.attr, + &sriov_stride_attr.attr, + &sriov_vf_did_attr.attr, &sriov_drivers_autoprobe_attr.attr, NULL, }; -- 2.7.4
[PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs
cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct pages for mapping in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Cc: Anthony Liguori Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- arch/x86/kvm/paging_tmpl.h | 42 +- 1 file changed, 33 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index a01105485315..b3d7a117179d 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -147,15 +147,39 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. 
*/ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + /* +* vaddr is page-aligned, if a vma exists, it must cover +* ptep_user +*/ + vma = find_vma(current->mm, vaddr); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4
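The address arithmetic in the fallback path can be tried out in isolation. The following is a userspace sketch only, assuming 4 KiB pages and a mapping whose vm_pgoff holds the PFN of its first page (as the patch assumes for VM_PFNMAP mappings); struct ex_vma, ex_vaddr_to_paddr() and the EX_* macros are hypothetical stand-ins for the kernel definitions. Unlike the hunk above, the sketch checks explicitly that the address falls inside the mapping.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define EX_PAGE_SIZE  ((uint64_t)1 << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (~(EX_PAGE_SIZE - 1))

/* Hypothetical stand-in for the three vm_area_struct fields used. */
struct ex_vma {
	uint64_t vm_start;	/* first address of the mapping */
	uint64_t vm_end;	/* first address past the mapping */
	uint64_t vm_pgoff;	/* PFN backing vm_start (assumed) */
};

/*
 * Translate a user address to the physical address of its page.
 * Returns 0 on success, -1 if the page lies outside the mapping.
 */
int ex_vaddr_to_paddr(const struct ex_vma *vma, uint64_t uaddr,
		      uint64_t *paddr)
{
	uint64_t vaddr = uaddr & EX_PAGE_MASK;
	uint64_t pfn;

	if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
		return -1;

	/* Page index within the mapping, plus the PFN of its first page. */
	pfn = ((vaddr - vma->vm_start) >> EX_PAGE_SHIFT) + vma->vm_pgoff;
	*paddr = pfn << EX_PAGE_SHIFT;
	return 0;
}

/* Usage check: third page of a mapping that starts at PFN 0x100000. */
void ex_vaddr_to_paddr_example(void)
{
	struct ex_vma vma = { 0x7f0000000000ULL, 0x7f0000004000ULL, 0x100000ULL };
	uint64_t paddr = 0;

	assert(ex_vaddr_to_paddr(&vma, 0x7f0000002008ULL, &paddr) == 0);
	assert(paddr == 0x100002000ULL);
}
```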
[PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs
cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct pages for mapping in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). Signed-off-by: Filippo Sironi Cc: Anthony Liguori Cc: k...@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- arch/x86/kvm/paging_tmpl.h | 39 ++- 1 file changed, 30 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index a01105485315..ab4d6617238c 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct page *page; npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page); - /* Check if the user is doing something meaningless. 
*/ - if (unlikely(npages != 1)) - return -EFAULT; - - table = kmap_atomic(page); - ret = CMPXCHG(&table[index], orig_pte, new_pte); - kunmap_atomic(table); - - kvm_release_page_dirty(page); + if (likely(npages == 1)) { + table = kmap_atomic(page); + ret = CMPXCHG(&table[index], orig_pte, new_pte); + kunmap_atomic(table); + + kvm_release_page_dirty(page); + } else { + struct vm_area_struct *vma; + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; + unsigned long pfn; + unsigned long paddr; + + down_read(&current->mm->mmap_sem); + vma = find_vma_intersection(current->mm, vaddr, + vaddr + PAGE_SIZE); + if (!vma || !(vma->vm_flags & VM_PFNMAP)) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + paddr = pfn << PAGE_SHIFT; + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); + if (!table) { + up_read(&current->mm->mmap_sem); + return -EFAULT; + } + ret = CMPXCHG(&table[index], orig_pte, new_pte); + memunmap(table); + up_read(&current->mm->mmap_sem); + } return (ret != orig_pte); } -- 2.7.4