[tip: x86/pasid] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

2020-09-18 Thread tip-bot2 for Ashok Raj
The following commit has been merged into the x86/pasid branch of tip:

Commit-ID: 4e7b11567d946ebe14a3d10b697b078971a9da89
Gitweb: https://git.kernel.org/tip/4e7b11567d946ebe14a3d10b697b078971a9da89
Author: Ashok Raj
AuthorDate: Tue, 15 Sep 2020 09:30:07 -07:00
Committer: Borislav Petkov
CommitterDate: Thu, 17 Sep 2020 19:29:42 +02:00

Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

ENQCMD and the Data Streaming Accelerator (DSA), together with all of
their associated features, form a complicated stack with lots of
interconnected pieces. This documentation provides a big-picture
overview of all of the features.

Signed-off-by: Ashok Raj 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Borislav Petkov 
Reviewed-by: Tony Luck 
Link: https://lkml.kernel.org/r/1600187413-163670-4-git-send-email-fenghua...@intel.com
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/sva.rst   | 257 ++++++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 Documentation/x86/sva.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 265d9e9..e5d5ff0 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -30,3 +30,4 @@ x86-specific Documentation
    usb-legacy-support
    i386/index
    x86_64/index
+   sva
diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
new file mode 100644
index 0000000..076efd5
--- /dev/null
+++ b/Documentation/x86/sva.rst
@@ -0,0 +1,257 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Shared Virtual Addressing (SVA) with ENQCMD
+===========================================
+
+Background
+==========
+
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses, avoiding the need for software to translate virtual
+addresses to physical addresses. SVA is what PCIe calls Shared Virtual
+Memory (SVM).
+
+In addition to the convenience of the device using application virtual
+addresses directly, SVA also removes the need to pin pages for DMA.
+PCIe Address Translation Services (ATS) along with Page Request Interface
+(PRI) allow devices to function much the same way as the CPU handles
+application page-faults. For more information, please refer to the PCIe
+specification Chapter 10: ATS Specification.
+
+Use of SVA requires IOMMU support in the platform. The IOMMU must also
+support the PCIe features ATS and PRI. ATS allows devices
+to cache translations for virtual addresses. The IOMMU driver uses the
+mmu_notifier() support to keep the device TLB cache and the CPU cache in
+sync. When an ATS lookup fails for a virtual address, the device should
+use the PRI in order to request the virtual address to be paged into the
+CPU page tables. The device must use ATS again in order to fetch the
+translation before use.
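+
+As a minimal sketch of that synchronization (assuming hypothetical
+my_flush_dev_tlb() and mn_to_my_dev() helpers rather than any real
+driver API), a driver can hook the process's mmu_notifier chain and
+shoot down device-cached translations whenever the CPU page tables
+change::
+
+    #include <linux/mmu_notifier.h>
+
+    /* Called when [start, end) is invalidated in the CPU page tables. */
+    static void my_invalidate_range(struct mmu_notifier *mn,
+                                    struct mm_struct *mm,
+                                    unsigned long start, unsigned long end)
+    {
+        /* Tell the device to drop its cached ATS translations. */
+        my_flush_dev_tlb(mn_to_my_dev(mn), start, end);
+    }
+
+    static const struct mmu_notifier_ops my_mn_ops = {
+        .invalidate_range = my_invalidate_range,
+    };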
+
+Shared Hardware Workqueues
+==========================
+
+Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
+the use of Shared Work Queues (SWQ) by both applications and Virtual
+Machines (VMs). This allows better hardware utilization compared to hard
+partitioning of resources, which could result in underutilization. In order
+to allow the hardware to distinguish the context for which work is being
+executed in the hardware via the SWQ interface, SIOV uses a Process Address
+Space ID (PASID), which is a 20-bit number defined by the PCIe SIG.
+
+The PASID value is encoded in all transactions from the device. This allows
+the IOMMU to track I/O on a per-PASID granularity in addition to using the
+PCIe Resource Identifier (RID), which is the Bus/Device/Function.
+
+
+ENQCMD
+======
+
+ENQCMD is a new instruction on Intel platforms that atomically submits a
+work descriptor to a device. The descriptor includes the operation to be
+performed, virtual addresses of all parameters, virtual address of a completion
+record, and the PASID (process address space ID) of the current process.
+
+ENQCMD works with non-posted semantics and carries a status back
+indicating whether the command was accepted by hardware. This allows the
+submitter to know whether the submission needs to be retried, or whether
+other device-specific mechanisms to implement fairness or ensure forward
+progress should be provided.
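+
+For illustration, a user-space submission loop might look like the
+following sketch (the _enqcmd() compiler intrinsic requires -menqcmd;
+the portal mapping and the 64-byte descriptor layout are device
+specific and assumed here)::
+
+    #include <immintrin.h>
+
+    /* desc points to a 64-byte, device-defined work descriptor. */
+    static int submit(void *portal, const void *desc)
+    {
+        int retries = 1000;
+
+        /*
+         * _enqcmd() returns non-zero (EFLAGS.ZF set) when the device
+         * did not accept the descriptor, so retry a bounded number of
+         * times before giving up.
+         */
+        while (_enqcmd(portal, desc)) {
+            if (--retries == 0)
+                return -1;  /* device busy */
+        }
+        return 0;
+    }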
+
+ENQCMD is the glue that ensures applications can directly submit commands
+to the hardware and also permits hardware to be aware of application context
+to perform I/O operations via use of PASID.
+
+Process Address Space Tagging
+=============================
+
+A new thread-scoped MSR (IA32_PASID) provides the connection between
+user processes and the rest of the hardware. When an application first
+accesses an SVA-capable device, this MSR is initialized with a newly
+allocated PASID. The driver for the device calls an IOMMU-specific API
+that sets up the routing for DMA and page-requests.
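+
+Conceptually (a sketch only; the MSR constants here are assumptions,
+not necessarily the kernel's names), making the allocated PASID visible
+to ENQCMD for the current task amounts to::
+
+    /* IA32_PASID: bits 19:0 hold the PASID, bit 31 marks it valid. */
+    wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);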
+
+For example, the Intel Data Streaming Accelerator (DSA) uses
+iommu_sva_bind_device(), which will 

[tip: x86/urgent] x86/hotplug: Silence APIC only after all interrupts are migrated

2020-08-27 Thread tip-bot2 for Ashok Raj
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 52d6b926aabc47643cd910c85edb262b7f44c168
Gitweb: https://git.kernel.org/tip/52d6b926aabc47643cd910c85edb262b7f44c168
Author: Ashok Raj
AuthorDate: Wed, 26 Aug 2020 21:12:10 -07:00
Committer: Thomas Gleixner
CommitterDate: Thu, 27 Aug 2020 09:29:23 +02:00

x86/hotplug: Silence APIC only after all interrupts are migrated

There is a race when taking a CPU offline. Current code looks like this:

native_cpu_disable()
{
    ...
    apic_soft_disable();
    /*
     * Any existing set bits for pending interrupt to
     * this CPU are preserved and will be sent via IPI
     * to another CPU by fixup_irqs().
     */
    cpu_disable_common();
    {
        /*
         * Race window happens here. Once local APIC has been
         * disabled any new interrupts from the device to
         * the old CPU are lost
         */
        fixup_irqs(); // Too late to capture anything in IRR.
        ...
    }
}

The fix is to disable the APIC *after* cpu_disable_common().
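
Schematically, the fixed ordering (implemented in the hunk below) is:

native_cpu_disable()
{
    ...
    cpu_disable_common();  // fixup_irqs() can still observe set IRR bits
    apic_soft_disable();   // safe now: IRR has already been scanned
    return 0;
}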

Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.

Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green 
Signed-off-by: Ashok Raj 
Signed-off-by: Thomas Gleixner 
Tested-by: Mathias Nyman 
Tested-by: Evan Green 
Reviewed-by: Evan Green 
Cc: sta...@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok@intel.com

---
 arch/x86/kernel/smpboot.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 27aa04a..f5ef689 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1594,14 +1594,28 @@ int native_cpu_disable(void)
    if (ret)
        return ret;
 
-   /*
-    * Disable the local APIC. Otherwise IPI broadcasts will reach
-    * it. It still responds normally to INIT, NMI, SMI, and SIPI
-    * messages.
-    */
-   apic_soft_disable();
    cpu_disable_common();
 
+   /*
+    * Disable the local APIC. Otherwise IPI broadcasts will reach
+    * it. It still responds normally to INIT, NMI, SMI, and SIPI
+    * messages.
+    *
+    * Disabling the APIC must happen after cpu_disable_common()
+    * which invokes fixup_irqs().
+    *
+    * Disabling the APIC preserves already set bits in IRR, but
+    * an interrupt arriving after disabling the local APIC does not
+    * set the corresponding IRR bit.
+    *
+    * fixup_irqs() scans IRR for set bits so it can raise a not
+    * yet handled interrupt on the new destination CPU via an IPI
+    * but obviously it can't do so for IRR bits which are not set.
+    * IOW, interrupts arriving after disabling the local APIC will
+    * be lost.
+    */
+   apic_soft_disable();
+
    return 0;
 }
 


[tip: x86/microcode] x86/microcode: Update late microcode in parallel

2019-10-01 Thread tip-bot2 for Ashok Raj
The following commit has been merged into the x86/microcode branch of tip:

Commit-ID: 93946a33b5693a6bbcf917a170198ff4afaa7a31
Gitweb: https://git.kernel.org/tip/93946a33b5693a6bbcf917a170198ff4afaa7a31
Author: Ashok Raj
AuthorDate: Thu, 22 Aug 2019 23:43:47 +03:00
Committer: Borislav Petkov
CommitterDate: Tue, 01 Oct 2019 15:58:54 +02:00

x86/microcode: Update late microcode in parallel

Microcode update was changed to be serialized due to restrictions added
in the wake of Spectre. Updating serially on a large multi-socket system
can be painful since it is done on one CPU at a time.

Cloud customers have expressed discontent as services disappear for
a prolonged time. The restriction is that only one core (or only one
thread of a core in the case of an SMT system) goes through the update
while other cores (or respectively, SMT threads) are quiesced.

Do the microcode update only on the first thread of each core while
other siblings simply wait for this to complete.
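
For reference, the __wait_for_cpus() rendezvous used by __reload_late()
(not shown in the hunk below) follows roughly this pattern - a sketch,
not the exact code:

    /*
     * Every CPU announces its arrival and spins until all online CPUs
     * have arrived or the timeout budget is exhausted.
     */
    static int __wait_for_cpus(atomic_t *t, long long timeout)
    {
        int all_cpus = num_online_cpus();

        atomic_inc(t);

        while (atomic_read(t) < all_cpus) {
            if (timeout < SPINUNIT)
                return 1;   /* some CPU never showed up */

            ndelay(SPINUNIT);   /* SPINUNIT: a short delay, e.g. 100 ns */
            timeout -= SPINUNIT;

            touch_nmi_watchdog();
        }
        return 0;
    }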

 [ bp: Simplify, massage, cleanup comments. ]

Signed-off-by: Ashok Raj 
Signed-off-by: Mihai Carabas 
Signed-off-by: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jon Grimm 
Cc: kanth.ghatr...@oracle.com
Cc: konrad.w...@oracle.com
Cc: patrick.c...@oracle.com
Cc: Thomas Gleixner 
Cc: Tom Lendacky 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/1566506627-16536-2-git-send-email-mihai.cara...@oracle.com
---
 arch/x86/kernel/cpu/microcode/core.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index cb0fdca..7019d4b 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -63,11 +63,6 @@ LIST_HEAD(microcode_cache);
  */
 static DEFINE_MUTEX(microcode_mutex);
 
-/*
- * Serialize late loading so that CPUs get updated one-by-one.
- */
-static DEFINE_RAW_SPINLOCK(update_lock);
-
 struct ucode_cpu_info  ucode_cpu_info[NR_CPUS];
 
 struct cpu_info_ctx {
@@ -566,11 +561,18 @@ static int __reload_late(void *info)
    if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC))
        return -1;
 
-   raw_spin_lock(&update_lock);
-   apply_microcode_local(&err);
-   raw_spin_unlock(&update_lock);
+   /*
+    * On an SMT system, it suffices to load the microcode on one sibling of
+    * the core because the microcode engine is shared between the threads.
+    * Synchronization still needs to take place so that no concurrent
+    * loading attempts happen on multiple threads of an SMT core. See
+    * below.
+    */
+   if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
+       apply_microcode_local(&err);
+   else
+       goto wait_for_siblings;
 
-   /* siblings return UCODE_OK because their engine got updated already */
    if (err > UCODE_NFOUND) {
        pr_warn("Error reloading microcode on CPU %d\n", cpu);
        ret = -1;
@@ -578,14 +580,18 @@ static int __reload_late(void *info)
        ret = 1;
    }
 
+wait_for_siblings:
+   if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC))
+       panic("Timeout during microcode update!\n");
+
    /*
-    * Increase the wait timeout to a safe value here since we're
-    * serializing the microcode update and that could take a while on a
-    * large number of CPUs. And that is fine as the *actual* timeout will
-    * be determined by the last CPU finished updating and thus cut short.
+    * At least one thread has completed the update on each core.
+    * For the others, simply call the update to make sure the
+    * per-cpu cpuinfo can be updated with the right microcode
+    * revision.
     */
-   if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC * num_online_cpus()))
-       panic("Timeout during microcode update!\n");
+   if (cpumask_first(topology_sibling_cpumask(cpu)) != cpu)
+       apply_microcode_local(&err);
 
    return ret;
 }