Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
On 4/20/21 4:53 PM, Dan Williams wrote:
> On Tue, Apr 20, 2021 at 4:12 PM Kuppuswamy, Sathyanarayanan wrote:
> [..]
>>>> Also, do you *REALLY* need to do this from assembly? Can't it be done in the C wrapper?
>>> It's common for all use cases of TDVMCALL (vendor specific, in/out, etc.), so I added it here.
>
> Can I ask a favor? Please put a line break between quoted lines and your reply.

Will do.

>>> That's not a good reason. You could just as easily have a C wrapper which all uses of TDVMCALL go through.
>
> ...because this runs together when reading otherwise.
>
>> Any reason for not preferring it in assembly code? Also, using a wrapper will add more complication for the in/out instruction substitution use case. Please check the use case in the following patch:
>> https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543
>
> This commit still has open coded assembly for the TDVMCALL? I thought we talked about it being unified with the common definition, or has this patch not been reworked with that feedback yet?
>
> I expect there is no performance reason why in/out need to get their own custom coded TDVMCALL implementation. It should also be the case that failure behaves the same as native in/out failure, i.e. all ones on read failure, and silent drops on write failure.

That link is for an older version. My next version addresses your review comments (it reuses the TDVMCALL() function). Although the patch is ready, I am waiting to fix other review comments before sending the next version. I shared that link only to explain the use case.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
On 4/20/21 12:59 PM, Dave Hansen wrote:
> On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>> approach is, it adds a few extra instructions for every TDCALL use case when compared to distributed checks. Although it's a bit less efficient, it's worth it to make the code more readable.
>>> What's a "distributed check"?
>> It should be "distributed TDVMCALL/TDCALL inline assembly calls".
> It's still not clear what that refers to.

I am just comparing the performance cost of the generic TDCALL()/TDVMCALL() function implementation against "usage specific" (GetQuote, MapGPA, HLT, etc.) custom TDCALL()/TDVMCALL() inline assembly implementations.

>>> This also doesn't talk at all about why this approach was chosen versus inline assembly. You're going to be asked "why not use inline asm?"
>> "To make the code more readable and less error prone." I have added this info in the above paragraph. Do you think we need more argument to justify our approach?
> Yes, you need much more justification. That's pretty generic and non-specific.

Readability is one of the main motivations for not choosing inline assembly. Since the number of lines of instructions (with comments) is over 70, using inline assembly would make it hard to read. Another reason is that, since we use many registers (R8-R15, R[A-D]X) in the TDVMCALL/TDCALL operation, we are not sure whether some older compilers can follow our specified inline assembly constraints.

>> +	/*
>> +	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
>> +	 * defined in the GHCI. Note, RAX and RCX are consumed, but only by
>> +	 * TDX-Module and so don't need to be listed in the mask.
>> +	 */
> "GHCI" is out of the blue here. So is "TDX-Module". There needs to be more context.

OK, will add it. Do you want a GHCI spec reference with section ID here?

> Absolutely not. I dislike all of the section references as-is. Doesn't a comment like this say what you said above and also add context?
>
>	Expose every register currently used in the guest-to-host
>	communication interface (GHCI).

OK.
> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/* Panic if TDCALL reports failure. */
> +	test %rax, %rax
> +	jnz 2f
>
>> Why panic? As per spec, TDCALL should never fail. Note that it has nothing to do with TDVMCALL function specific failure (which is reported via R10).
> You've introduced two concepts here, without differentiating them. You need to work to differentiate these two kinds of failure somewhere. You can't simply refer to both as "failure".

Will clarify it. I had assumed that once the user reads the spec, it would be easier to understand.

> Also, do you *REALLY* need to do this from assembly? Can't it be done in the C wrapper?

It's common for all use cases of TDVMCALL (vendor specific, in/out, etc.), so I added it here.

> That's not a good reason. You could just as easily have a C wrapper which all uses of TDVMCALL go through.

Any reason for not preferring it in assembly code? Also, using a wrapper will add more complication for the in/out instruction substitution use case. Please check the use case in the following patch:
https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543

> +	/* Propagate TDVMCALL success/failure to return value. */
> +	mov %r10, %rax
>
>> You just said it panics on failure. How can this propagate failure?

We use panic for TDCALL failure. But the R10 content is used to identify whether the given TDVMCALL function operation was successful or not.

> As I said above, please endeavor to differentiate the two classes of failures.
>
> Also, if the spec is violated, do you *REALLY* want to blow up the whole world with a panic? I guess you can argue that it could have been something security-related that failed. But, either way, you owe a description of why panicking is a good idea, not just blindly deferring to "the spec says this can't happen".

OK, will add some comments justifying our case.
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +	xor %r12d, %r12d
> +	xor %r13d, %r13d
> +	xor %r14d, %r14d
> +	xor %r15d, %r15d
> +
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +2:
> +	ud2
> +.endm
> +
> +SYM_FUNC_START(__tdvmcall)
> +	xor %r10, %r10
> +	tdvmcall_core
> +SYM_FUNC_END(__tdvmcall)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 0d00dd50a6ff..1147e7e765d6 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -3,6 +3,36 @@
>  #include
>
> +/*
> + * Wrapper for the common case with standard output value (R10).
> + */
>
> ... and oddly enough there is no explicit mention of R10 anywhere. Why?

For a guest-to-host call, R10 holds the TDCALL function ID (which is 0 for TDVMCALL), so we don't need a special argument. After TDVMCALL execution, the R10 value is returned via RAX.

> OK... so this is how it works. But why mention R10 in the comment? Why is *THAT* worth mentioning?

It is not needed.
Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
On 4/20/21 10:36 AM, Dave Hansen wrote:
> On 3/26/21 4:38 PM, Kuppuswamy Sathyanarayanan wrote:
>> Implement common helper functions to communicate with the TDX Module and VMM (using the TDCALL instruction).
> This is missing any kind of background. I'd say:
>
>	Guests communicate with VMMs with hypercalls. Historically, these
>	are implemented using instructions that are known to cause VMEXITs
>	like ... However, with TDX, VMEXITs no longer expose guest state
>	from the host. This prevents the old hypercall mechanisms from
>	working...
>
> ... and then go on to talk about what you are introducing, why there are two of them and so forth.

OK. I will add it.

>> __tdvmcall() function can be used to request services from VMM.
> ^ "from a VMM" or "from the VMM", please

Will use "from the VMM".

>> __tdcall() function can be used to communicate with the TDX Module. Using common helper functions makes the code more readable and less error prone compared to distributed and use case specific inline assembly code. Only downside in using this
> ^ "The only downside..."

Will fix it.

>> approach is, it adds a few extra instructions for every TDCALL use case when compared to distributed checks. Although it's a bit less efficient, it's worth it to make the code more readable.
> What's a "distributed check"?

It should be "distributed TDVMCALL/TDCALL inline assembly calls".

> This also doesn't talk at all about why this approach was chosen versus inline assembly. You're going to be asked "why not use inline asm?"

"To make the code more readable and less error prone." I have added this info in the above paragraph. Do you think we need more argument to justify our approach?
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,35 @@
>  #ifdef CONFIG_INTEL_TDX_GUEST
>
>  #include
> +#include
> +
> +struct tdcall_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};
> +
> +struct tdvmcall_output {
> +	u64 r11;
> +	u64 r12;
> +	u64 r13;
> +	u64 r14;
> +	u64 r15;
> +};
>
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>
>  void __init tdx_early_init(void);
>
> +u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
> +
> +u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> +	       struct tdvmcall_output *out);
>
> Some one-liner comments about what these do would be nice.

will do.

>  #else // !CONFIG_INTEL_TDX_GUEST
>  static inline bool is_tdx_guest(void)
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@
>  obj-$(CONFIG_PARAVIRT_CLOCK)		+= pvclock.o
>  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE)	+= pmem.o
>  obj-$(CONFIG_JAILHOUSE_GUEST)		+= jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST)		+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST)		+= tdcall.o tdx.o
>  obj-$(CONFIG_EISA)			+= eisa.o
>  obj-$(CONFIG_PCSPKR_PLATFORM)		+= pcspeaker.o
>
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..72de0b49467e 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
>  #include
>  #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include "asm-offsets_32.c"
>  #else
> @@ -75,6 +79,24 @@ static void __used common(void)
>  	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>  #endif
>
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	BLANK();
> +	/* Offset for fields in tdcall_output */
> +	OFFSET(TDCALL_rcx, tdcall_output, rcx);
> +	OFFSET(TDCALL_rdx, tdcall_output, rdx);
> +	OFFSET(TDCALL_r8, tdcall_output, r8);
> +	OFFSET(TDCALL_r9, tdcall_output, r9);
> +	OFFSET(TDCALL_r10, tdcall_output, r10);
> +	OFFSET(TDCALL_r11, tdcall_output, r11);
>               ^ vertically align this

will fix it.

> +	/* Offset for fields in tdvmcall_output */
> +	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
> +	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
> +	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
> +	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
> +	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
> +#endif
> +
>  	BLANK();
>  	OFFSET(BP_scratch, boot_params, scratch);
>  	OFFSET(BP_secure_boot, boot_params, secure_boot);
>
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index ..a73b67c0b407
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,163 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include
> +#include
> +#include
> +#include
> +
> +#include
> +
> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00

This looks like an undocumented magic number.

> +/*
> + * TDCALL instruction is newly added in TDX architecture,
Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
On 4/1/21 2:19 PM, Dave Hansen wrote:
> On 4/1/21 2:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>> On 4/1/21 2:08 PM, Dave Hansen wrote:
>>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>>> +bool is_tdx_guest(void)
>>>> +{
>>>> +	return static_cpu_has(X86_FEATURE_TDX_GUEST);
>>>> +}
>>> Why do you need is_tdx_guest() as opposed to calling cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?
>> is_tdx_guest() is also implemented/used in compressed code (which uses native_cpuid calls). I don't think we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in compressed code, right? Also, is_tdx_guest() looks easy to read and use.
> OK, but how many of the is_tdx_guest() uses are in the compressed code? Why has its use spread beyond that?

It is only used in handling in/out instructions in compressed code. But this code is shared with the in/out handling in non-compressed code.

#define __out(bwl, bw)						\
do {								\
	if (is_tdx_guest()) {					\
		asm volatile("call tdg_out" #bwl : :		\
				"a"(value), "d"(port));		\
	} else {						\
		asm volatile("out" #bwl " %" #bw "0, %w1" : :	\
				"a"(value), "Nd"(port));	\
	}							\
} while (0)

#define __in(bwl, bw)						\
do {								\
	if (is_tdx_guest()) {					\
		asm volatile("call tdg_in" #bwl :		\
				"=a"(value) : "d"(port));	\
	} else {						\
		asm volatile("in" #bwl " %w1, %" #bw "0" :	\
				"=a"(value) : "Nd"(port));	\
	}							\
} while (0)

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
On 4/1/21 2:08 PM, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>> +bool is_tdx_guest(void)
>> +{
>> +	return static_cpu_has(X86_FEATURE_TDX_GUEST);
>> +}
> Why do you need is_tdx_guest() as opposed to calling cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?

is_tdx_guest() is also implemented/used in compressed code (which uses native_cpuid calls). I don't think we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in compressed code, right? Also, is_tdx_guest() looks easy to read and use.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
[PATCH v5 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
When running as a TDX guest, there are a number of existing, privileged instructions that do not work. If the guest kernel uses these instructions, the hardware generates a #VE. The list of unsupported instructions can be found in the Intel Trust Domain Extensions (Intel TDX) Module specification, sec 9.2.2, and in the Guest-Host Communication Interface (GHCI) Specification for Intel TDX, sec 2.4.1.

To prevent the TD guest from using these unsupported instructions, the following measures are adopted:

1. For MWAIT/MONITOR instructions, support is already disabled by the TDX module (SEAM), so the CPUID flags for these instructions should be in the disabled state. Also, just to be sure that these instructions are disabled, forcefully unset the X86_FEATURE_MWAIT CPU cap in the OS.

2. For the WBINVD instruction, use audit to find the code that uses this instruction and disable it for TD.

After the above mentioned preventive measures, if TD guests still execute these instructions, add appropriate warning messages in the #VE handler.

Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Andi Kleen
---
Changes since v4:
 * Fixed commit log and comments as per Dave's comments.
 * Used WARN_ONCE for MWAIT/MONITOR #VE.
 * Removed X86_FEATURE_MWAIT suppression code.
Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fixed the commit log and comments to address review comments from Dave & Sean.
Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.
Changes since v1:
 * Added WARN() for MWAIT #VE exception.
Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.
 arch/x86/kernel/tdx.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..9bc84caf4096 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -362,6 +362,22 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is not supported inside TDX guests. All in-
+		 * kernel uses should have been disabled.
+		 */
+		pr_err("TD Guest used unsupported WBINVD instruction\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * Something in the kernel used MONITOR or MWAIT despite
+		 * X86_FEATURE_MWAIT being cleared for TDX guests.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
--
2.25.1
Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/31/21 2:49 PM, Dave Hansen wrote:
> On 3/31/21 2:09 PM, Kuppuswamy Sathyanarayanan wrote:
>> As per the Guest-Host Communication Interface (GHCI) Specification for Intel TDX, sec 2.4.1, the TDX architecture does not support the MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode, if MWAIT/MONITOR instructions are executed with CPL != 0 they will trigger #UD, and for the CPL = 0 case, a virtual exception (#VE) is triggered. WBINVD instruction behavior is similar to MWAIT/MONITOR, but for the CPL != 0 case, it will trigger #GP instead of #UD.
> Could we give it a go to try this in plain English before jumping in and quoting the exact spec section? Also, the CPL language is nice and precise for talking inside Intel, but it's generally easier for me to read kernel descriptions when we just talk about the kernel.
>
>	When running as a TDX guest, there are a number of existing,
>	privileged instructions that do not work. If the guest kernel
>	uses these instructions, the hardware generates a #VE.

I will fix it in the next version.

> Which reminds me... The SDM says: MWAIT will "#UD ... If CPUID.01H:ECX.MONITOR[bit 3] = 0". So, is this an architectural change? The guest is *supposed* to see that CPUID bit as 0, so shouldn't it also get a #UD? Or is this all so that if SEAM *forgets* to clear the CPUID bit, the guest gets #VE?

AFAIK, we are only concerned about the case where the instruction support is not disabled by SEAM. For the disabled case, it should get #UD. Sean, can you confirm it?

> What are we *actually* mitigating here?

We add support for #VE when executing an unsupported instruction in the TD guest kernel.

> Also, FWIW, MWAIT/MONITOR and WBINVD are pretty different beasts. I think this would all have been a lot more clear if this would have been two patches instead of shoehorning them into one.

Since all of them are unsupported instructions, I have grouped them together. Even if we split it, there would be some duplication in the commit log (since the handling is similar). But let me know if this is the desired approach. I can split it into two patches.

>> To prevent the TD guest from using these unsupported instructions, the following measures are adopted:
>> 1. For MWAIT/MONITOR instructions, support is already disabled by the TDX module (SEAM). So the CPUID flags for these instructions should be in the disabled state. Also, just to be sure that these instructions are disabled, forcefully unset the X86_FEATURE_MWAIT CPU cap in the OS.
>> 2. For the WBINVD instruction, we use audit to find the code that uses this instruction and disable it for TD.
> Really? Where are those patches?

For MWAIT/MONITOR, the change is included in the same patch. For WBINVD, we will have some patches included in the next series.

>> +static inline bool cpuid_has_mwait(void)
>> +{
>> +	if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  bool is_tdx_guest(void)
>>  {
>>  	return static_cpu_has(X86_FEATURE_TDX_GUEST);
>> @@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>>  	return insn.length;
>>  }
>>
>> +/* Initialize TDX specific CPU capabilities */
>> +static void __init tdx_cpu_cap_init(void)
>> +{
>> +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +
>> +	if (cpuid_has_mwait()) {
>> +		WARN(1, "TDX Module failed to disable MWAIT\n");
> WARN(1, "TDX guest enumerated support for MWAIT, disabling it").

Will fix it in the next version.

>> +		/* MWAIT is not supported in TDX platform, so suppress it */
>> +		setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> +	}
>> +
>> +}
> Extra newline.
>
>>  void __init tdx_early_init(void)
>>  {
>>  	if (!cpuid_has_tdx_guest())
>>  		return;
>>
>> -	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +	tdx_cpu_cap_init();
>>
>>  	tdg_get_info();
>>
>> @@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>>  	case EXIT_REASON_EPT_VIOLATION:
>>  		ve->instr_len = tdg_handle_mmio(regs, ve);
>>  		break;
>> +	case EXIT_REASON_WBINVD:
>> +		/*
>> +		 * TDX architecture does not support WBINVD instruction.
>> +		 * Currently, usage of this instruction is prevented by
>> +		 * disabling the drivers which use it. So if we still
>> +		 * reach here, it needs user attention.
>> +		 */
> This comment is awfully vague. "TDX architecture..." what? Any CPUs supporting the TDX architecture? TDX VMMs? TDX guests? Let's also not waste bytes on stating the obvious. If it didn't need attention we wouldn't be warning about it, eh? So, let's halve the size of the comment and say:
>
>	/*
>	 * WBINVD is not supported inside TDX guests. All in-
>	 * kernel uses should have been disabled.
>	 */

OK, will fix it in the next version.

>> +		pr_err("TD Guest used unsupported WBINVD instruction\n");
Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/31/21 3:11 PM, Dave Hansen wrote:
> On 3/31/21 3:06 PM, Sean Christopherson wrote:
>> I've no objection to a nice message in the #VE handler. What I'm objecting to is sanity checking the CPUID model provided by the TDX module. If we don't trust the TDX module to honor the spec, then there are a huge pile of things that are far higher priority than MONITOR/MWAIT.
> In other words: Don't muck with CPUID or the X86_FEATURE at all. Don't check it to comply with the spec. If something doesn't comply, we'll get a #VE at *SOME* point. We don't need to do belt-and-suspenders programming here.

That sounds sane to me. But I think there are cases (like MCE) where SEAM does not disable them because there will be future support for it. We should at least suppress such features in the kernel.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [RFC v1 00/26] Add TDX Guest Support
Hi All,

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Hi All,
>
> NOTE: This series is not ready for wide public review. It is being specifically posted so that Peter Z and other experts on the entry code can look for problems with the new exception handler (#VE). That's also why x86@ is not being spammed.

We are currently working on a solution to fix the issues raised in the "Add #VE support for TDX guest" patch. While we fix that issue, I would like to know if there are issues in the other patches in this series. So, if possible, can you please review the other patches in the series and let us know your comments? If you want me to rebase the series on top of a v5.12-rcX kernel and repost it, please let me know.

> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious hosts and some physical attacks. This series adds the bare-minimum support to run a TDX guest. The host-side support will be submitted separately. Support for advanced TD guest features like attestation or debug-mode will also be submitted separately. Also, at this point it is not secure, with some known holes in drivers, and it hasn't been fully audited and fuzzed yet.
>
> TDX has a lot of similarities to SEV. It enhances confidentiality of guest memory and state (like registers) and includes a new exception (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged yet), TDX limits the host's ability to effect changes in the guest physical address space. In contrast to the SEV code in the kernel, TDX guest memory is integrity protected and isolated; the host is prevented from accessing guest memory (even ciphertext).
>
> The TDX architecture also includes a new CPU mode called Secure-Arbitration Mode (SEAM). The software (TDX module) running in this mode arbitrates interactions between host and guest and implements many of the guarantees of the TDX architecture.
>
> Some of the key differences between a TD and a regular VM are:
>
> 1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
> 2. A new #VE exception handler is added. The TDX module injects the #VE exception into the guest TD in cases of instructions that need to be emulated, disallowed MSR accesses, a subset of CPUID leaves, etc.
> 3. By default memory is marked as private, and the TD will selectively share it with the VMM based on need.
> 4. Remote attestation is supported to enable a third party (either the owner of the workload or a user of the services provided by the workload) to establish that the workload is running on an Intel-TDX-enabled platform located within a TD prior to providing that workload data.
>
> You can find TDX related documents at the following link:
> https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
>
> This RFC series has been reviewed by Dave Hansen.
>
> Kirill A. Shutemov (16):
>   x86/paravirt: Introduce CONFIG_PARAVIRT_XL
>   x86/tdx: Get TD execution environment information via TDINFO
>   x86/traps: Add #VE support for TDX guest
>   x86/tdx: Add HLT support for TDX guest
>   x86/tdx: Wire up KVM hypercalls
>   x86/tdx: Add MSR support for TDX guest
>   x86/tdx: Handle CPUID via #VE
>   x86/io: Allow to override inX() and outX() implementation
>   x86/tdx: Handle port I/O
>   x86/tdx: Handle in-kernel MMIO
>   x86/mm: Move force_dma_unencrypted() to common code
>   x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
>   x86/tdx: Make pages shared in ioremap()
>   x86/tdx: Add helper to do MapGPA TDVMCALL
>   x86/tdx: Make DMA pages shared
>   x86/kvm: Use bounce buffers for TD guest
>
> Kuppuswamy Sathyanarayanan (6):
>   x86/cpufeatures: Add TDX Guest CPU feature
>   x86/cpufeatures: Add is_tdx_guest() interface
>   x86/tdx: Handle MWAIT, MONITOR and WBINVD
>   ACPI: tables: Add multiprocessor wake-up support
>   x86/topology: Disable CPU hotplug support for TDX platforms
>   x86/tdx: Introduce INTEL_TDX_GUEST config option
>
> Sean Christopherson (4):
>   x86/boot: Add a trampoline for APs booting in 64-bit mode
>   x86/boot: Avoid #VE during compressed boot for TDX platforms
>   x86/boot: Avoid unnecessary #VE during boot process
>   x86/tdx: Forcefully disable legacy PIC for TDX guests
>
>  arch/x86/Kconfig                      | 28 +-
>  arch/x86/boot/compressed/Makefile     | 2 +
>  arch/x86/boot/compressed/head_64.S    | 10 +-
>  arch/x86/boot/compressed/misc.h       | 1 +
>  arch/x86/boot/compressed/pgtable.h    | 2 +-
>  arch/x86/boot/compressed/tdx.c        | 32 ++
>  arch/x86/boot/compressed/tdx_io.S     | 9 +
>  arch/x86/include/asm/apic.h           | 3 +
>  arch/x86/include/asm/asm-prototypes.h | 1 +
>  arch/x86/include/asm/cpufeatures.h    | 1 +
>  arch/x86/include/asm/idtentry.h       | 4 +
>  arch/x86/include/asm/io.h             | 25 +-
>  arch/x86/include/asm/irqflags.h       | 42 +-
>  arch/x86/include/asm/kvm_para.h       | 21 +
>  arch/x86/include/asm/paravirt.h       | 22 +-
>  arch/x86/include/asm/paravirt_types.h | 3 +-
>  arch/x
[PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
As per the Guest-Host Communication Interface (GHCI) Specification for Intel TDX, sec 2.4.1, the TDX architecture does not support the MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode, if MWAIT/MONITOR instructions are executed with CPL != 0 they will trigger #UD, and for the CPL = 0 case, a virtual exception (#VE) is triggered. WBINVD instruction behavior is similar to MWAIT/MONITOR, but for the CPL != 0 case, it will trigger #GP instead of #UD.

To prevent the TD guest from using these unsupported instructions, the following measures are adopted:

1. For MWAIT/MONITOR instructions, support is already disabled by the TDX module (SEAM), so the CPUID flags for these instructions should be in the disabled state. Also, just to be sure that these instructions are disabled, forcefully unset the X86_FEATURE_MWAIT CPU cap in the OS.

2. For the WBINVD instruction, use audit to find the code that uses this instruction and disable it for TD.

After the above mentioned preventive measures, if the TD guest still executes these instructions, add appropriate warning messages in the #VE handler.

Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Andi Kleen
---
Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fixed the commit log and comments to address review comments from Dave & Sean.
Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.
Changes since v1:
 * Added WARN() for MWAIT #VE exception.
Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.
 arch/x86/kernel/tdx.c | 44 ++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..82b411b828a5 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -63,6 +63,14 @@ static inline bool cpuid_has_tdx_guest(void)
 	return true;
 }
 
+static inline bool cpuid_has_mwait(void)
+{
+	if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
+		return true;
+
+	return false;
+}
+
 bool is_tdx_guest(void)
 {
 	return static_cpu_has(X86_FEATURE_TDX_GUEST);
@@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+/* Initialize TDX specific CPU capabilities */
+static void __init tdx_cpu_cap_init(void)
+{
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	if (cpuid_has_mwait()) {
+		WARN(1, "TDX Module failed to disable MWAIT\n");
+		/* MWAIT is not supported in TDX platform, so suppress it */
+		setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+	}
+
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
 		return;
 
-	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+	tdx_cpu_cap_init();
 
 	tdg_get_info();
 
@@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * TDX architecture does not support WBINVD instruction.
+		 * Currently, usage of this instruction is prevented by
+		 * disabling the drivers which use it. So if we still
+		 * reach here, it needs user attention.
+		 */
+		pr_err("TD Guest used unsupported WBINVD instruction\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
+		 * and also re-suppressed in kernel by clearing
+		 * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
+		 * if TD guest still executes MWAIT/MONITOR instruction with
+		 * above suppression, it needs user attention.
+		 */
+		WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
--
2.25.1
Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/30/21 8:10 AM, Dave Hansen wrote:
> On 3/30/21 8:00 AM, Andi Kleen wrote:
>>>> +	/* MWAIT is not supported in TDX platform, so suppress it */
>>>> +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> In fact, the MWAIT bit returned by the CPUID instruction is zero for a TD guest. This is enforced by the SEAM module.
>
> Good point. Do we still need to safeguard it by setup_clear_cpu_cap() here? I guess it doesn't hurt to do it explicitly.
>
> If this MWAIT behavior (clearing the CPUID bit) is part of the guest architecture, then it would also be appropriate to WARN() rather than silently clearing the X86_FEATURE bit.

Makes sense. It will be useful to find the problem with the TDX Module.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/29/21 4:23 PM, Andy Lutomirski wrote:
>> On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan wrote:
>> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions appropriately.
> Is there something I missed elsewhere in the code that checks CPL?

We don't check for CPL explicitly. But if we are reaching here, then we are executing these instructions with the wrong CPL.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
[PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
In non-root TDX guest mode, the MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions appropriately. Since the impact of executing the WBINVD instruction in non ring-0 mode can be heavy, use BUG() to report it. For the others, raise a WARNING message.

Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Andi Kleen
---
Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.
Changes since v1:
 * Added WARN() for MWAIT #VE exception.
Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..4c6336a055a3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	/* MWAIT is not supported in TDX platform, so suppress it */
+	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
 	tdg_get_info();
 
 	pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,32 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is a privileged instruction; it can only be executed
+		 * in ring 0. Since we reached here, the kernel is in a buggy
+		 * state.
+		 */
+		pr_err("WBINVD #VE Exception\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		/*
+		 * MONITOR is a privileged instruction; it can only be executed
+		 * in ring 0. So we are not supposed to reach here. Raise a
+		 * warning message.
+		 */
+		WARN(1, "MONITOR unexpected #VE Exception\n");
+		break;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * MWAIT feature is suppressed in firmware and in
+		 * tdx_early_init() by clearing X86_FEATURE_MWAIT CPU feature
+		 * flag. Since we are not supposed to reach here, raise a
+		 * warning message and return -EFAULT.
+		 */
+		WARN(1, "MWAIT unexpected #VE Exception\n");
+		return -EFAULT;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
--
2.25.1
Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/29/21 3:12 PM, Dave Hansen wrote: On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote: + case EXIT_REASON_MWAIT_INSTRUCTION: + /* MWAIT is supressed, not supposed to reach here. */ + WARN(1, "MWAIT unexpected #VE Exception\n"); + return -EFAULT; How is MWAIT "supppressed"? I am clearing the MWAIT feature flag in early init code. We should also disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT); I'd be more explicit about that. Maybe even reference the code that clears the X86_FEATURE. This change is part of the same patch. Right, but if someone goes and looks at the switch() statement in 10 years, is it going to be obvious how MWAIT was "suppressed"? Ok. I can add a comment line for it. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/29/21 3:02 PM, Dave Hansen wrote: On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote: MONITOR is a privileged instruction, right? So we can only end up in here if the kernel screws up and isn't reading CPUID correctly, right? That doesn't seem to me like something we want to suppress. This needs a warning, at least. I assume that having a MONITOR instruction immediately return doesn't do any harm. Agree. Since we are not supposed to come here, I will use BUG. "This is unexpected" is a WARN()able offense. "This is unexpected and might be corrupting data" is where we want to use BUG(). Does broken MONITOR risk data corruption? We will be reaching this point only if something is buggy in the kernel. I am not sure about the impact of this buggy state. But the MONITOR instruction by itself should not cause data corruption. + case EXIT_REASON_MWAIT_INSTRUCTION: + /* MWAIT is supressed, not supposed to reach here. */ + WARN(1, "MWAIT unexpected #VE Exception\n"); + return -EFAULT; How is MWAIT "supppressed"? I am clearing the MWAIT feature flag in early init code. We should also disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT); I'd be more explicit about that. Maybe even reference the code that clears the X86_FEATURE. This change is part of the same patch. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/29/21 10:14 AM, Dave Hansen wrote: On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote: + /* +* Per Guest-Host-Communication Interface (GHCI) for Intel Trust +* Domain Extensions (Intel TDX) specification, sec 2.4, +* some instructions that unconditionally cause #VE (such as WBINVD, +* MONITOR, MWAIT) do not have corresponding TDCALL +* [TDG.VP.VMCALL ] leaves, since the TD has been designed +* with no deterministic way to confirm the result of those operations +* performed by the host VMM. In those cases, the goal is for the TD +* #VE handler to increment the RIP appropriately based on the VE +* information provided via TDCALL. +*/ That's an awfully big comment. Could you pare it down, please? Maybe focus on the fact that we should never get here and why, rather than talking about some silly spec? I will remove this and add individual one-line comments for WBINVD and MONITOR instructions. Something like "Privileged instruction, can only be executed in ring 0, so raise a BUG()." + case EXIT_REASON_WBINVD: + pr_warn_once("WBINVD #VE Exception\n"); I actually think WBINVD in here should oops. We use it for some really important things. If it can't be executed, and we're depending on it, the kernel is in deep, deep trouble. Agree. I will call BUG(). I think a noop here is dangerous. + case EXIT_REASON_MONITOR_INSTRUCTION: + /* Handle as nops. */ + break; MONITOR is a privileged instruction, right? So we can only end up in here if the kernel screws up and isn't reading CPUID correctly, right? That doesn't seem to me like something we want to suppress. This needs a warning, at least. I assume that having a MONITOR instruction immediately return doesn't do any harm. Agree. Since we are not supposed to come here, I will use BUG. + case EXIT_REASON_MWAIT_INSTRUCTION: + /* MWAIT is supressed, not supposed to reach here. */ + WARN(1, "MWAIT unexpected #VE Exception\n"); + return -EFAULT; How is MWAIT "supppressed"? I am clearing the MWAIT feature flag in early init code. 
We should also disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
On 3/16/21 9:13 PM, Lukas Wunner wrote: On Fri, Mar 12, 2021 at 07:32:08PM -0800, sathyanarayanan.kuppusw...@linux.intel.com wrote: + if ((events == PCI_EXP_SLTSTA_DLLSC) && is_dpc_reset_active(pdev)) { + ctrl_info(ctrl, "Slot(%s): DLLSC event(DPC), skipped\n", + slot_name(ctrl)); + ret = IRQ_HANDLED; + goto out; + } Two problems here: (1) If recovery fails, the link will *remain* down, so there'll be no Link Up event. You've filtered the Link Down event, thus the slot will remain in ON_STATE even though the device in the slot is no longer accessible. That's not good; the slot should be brought down in this case. (2) If recovery succeeds, there's a race where pciehp may call is_dpc_reset_active() *after* dpc_reset_link() has finished. So both the DPC Trigger Status bit as well as pdev->dpc_reset_active will be cleared. Thus, the Link Up event is not filtered by pciehp and the slot is brought down and back up even though DPC recovery was successful, which seems undesirable. The only viable solution I see is to wait until DPC has completed. Sinan (+cc) proposed something along those lines a couple years back: https://patchwork.ozlabs.org/project/linux-pci/patch/20180818065126.77912-1-ok...@kernel.org/ Included below please find my suggestion for how to fix this. I've sort of combined yours and Sinan's approach, but I'm using a waitqueue (Sinan used polling) and I'm using atomic bitops on pdev->priv_flags (you're using an atomic_t instead, which needs additional space in struct pci_dev). Note: It's compile-tested only, I don't have any DPC-capable hardware at my disposal. Would this work for you? If so, I can add a commit message to the patch and submit it properly. Let me know what you think. Thanks! 
-- >8 -- Subject: [PATCH] PCI: pciehp: Ignore Link Down/Up caused by DPC Signed-off-by: Lukas Wunner --- drivers/pci/hotplug/pciehp_hpc.c | 11 + drivers/pci/pci.h| 4 drivers/pci/pcie/dpc.c | 51 ++-- 3 files changed, 64 insertions(+), 2 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index fb3840e..bcc018e 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -707,6 +707,17 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) } /* +* Ignore Link Down/Up caused by Downstream Port Containment +* if recovery from the error succeeded. +*/ + if ((events & PCI_EXP_SLTSTA_DLLSC) && pci_dpc_recovered(pdev) && + ctrl->state == ON_STATE) { + atomic_and(~PCI_EXP_SLTSTA_DLLSC, &ctrl->pending_events); Why modify pending_events here? It should already be zero, right? + if (pciehp_check_link_active(ctrl) > 0) + events &= ~PCI_EXP_SLTSTA_DLLSC; + } + + /* * Disable requests have higher priority than Presence Detect Changed * or Data Link Layer State Changed events. 
*/ diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 9684b46..e5ae5e8 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -392,6 +392,8 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev) /* pci_dev priv_flags */ #define PCI_DEV_ADDED 0 +#define PCI_DPC_RECOVERED 1 +#define PCI_DPC_RECOVERING 2 static inline void pci_dev_assign_added(struct pci_dev *dev, bool added) { @@ -446,10 +448,12 @@ struct rcec_ea { void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_dpc_recovered(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCIEPORTBUS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba8..7328d9c4 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,44 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +static bool dpc_completed(struct pci_dev *pdev) +{ + u16 status; + + pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, &status); + if (status & PCI_EXP_DPC_STATUS_TRIGGER) + return false; + + if (test_bit(PCI_DPC_RECOVERING, &pdev->priv_flags)) + return false; + + return true; +} + +static DECLARE_WAIT_QUEUE_HEAD(dpc_completed_waitqueue); + +bool pci_dpc_recovered(struct pci_dev *pdev) +{ + struct pci_host_bridge *host; + + if (!pdev->dpc_cap) + return false; + + /* +*
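The waitqueue-based "wait until DPC has completed" idea in the patch above can be modeled in userspace with a condition variable. This is only a sketch — the function names mirror the patch, but the kernel version uses a waitqueue (wait_event()/wake_up()) and atomic bitops on pdev->priv_flags rather than a mutex/condvar pair:

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace model of pciehp blocking until DPC recovery finishes. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static bool recovering;
static bool recovered;

void dpc_begin_recovery(void)
{
	pthread_mutex_lock(&lock);
	recovering = true;
	pthread_mutex_unlock(&lock);
}

void dpc_end_recovery(bool success)
{
	pthread_mutex_lock(&lock);
	recovering = false;
	recovered = success;
	pthread_cond_broadcast(&done);	/* wake any waiting pciehp IRQ thread */
	pthread_mutex_unlock(&lock);
}

/* Blocks while recovery is in flight; returns true on success. */
bool pci_dpc_recovered_model(void)
{
	bool ok;

	pthread_mutex_lock(&lock);
	while (recovering)
		pthread_cond_wait(&done, &lock);
	ok = recovered;
	pthread_mutex_unlock(&lock);
	return ok;
}
```

This captures why the approach closes the race Lukas describes: pciehp cannot observe the "recovered" flag until the DPC handler has finished clearing the trigger status.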
[PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions as no ops. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- Changes since v1: * Added WARN() for MWAIT #VE exception. Changes since previous series: * Suppressed MWAIT feature as per Andi's comment. * Added warning debug log for MWAIT #VE exception. arch/x86/kernel/tdx.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e936b2f88bf6..3c6158a53c17 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -308,6 +308,9 @@ void __init tdx_early_init(void) setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); + /* MWAIT is not supported in TDX platform, so suppress it */ + setup_clear_cpu_cap(X86_FEATURE_MWAIT); + tdg_get_info(); pv_ops.irq.safe_halt = tdg_safe_halt; @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs, case EXIT_REASON_EPT_VIOLATION: ve->instr_len = tdg_handle_mmio(regs, ve); break; + /* +* Per Guest-Host-Communication Interface (GHCI) for Intel Trust +* Domain Extensions (Intel TDX) specification, sec 2.4, +* some instructions that unconditionally cause #VE (such as WBINVD, +* MONITOR, MWAIT) do not have corresponding TDCALL +* [TDG.VP.VMCALL ] leaves, since the TD has been designed +* with no deterministic way to confirm the result of those operations +* performed by the host VMM. In those cases, the goal is for the TD +* #VE handler to increment the RIP appropriately based on the VE +* information provided via TDCALL. +*/ + case EXIT_REASON_WBINVD: + pr_warn_once("WBINVD #VE Exception\n"); + case EXIT_REASON_MONITOR_INSTRUCTION: + /* Handle as nops. */ + break; + case EXIT_REASON_MWAIT_INSTRUCTION: + /* MWAIT is supressed, not supposed to reach here. */ + WARN(1, "MWAIT unexpected #VE Exception\n"); + return -EFAULT; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; -- 2.25.1
Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On 3/26/21 7:40 PM, Andy Lutomirski wrote: On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan wrote: In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions as no ops. These should at least be WARN. I will change it to WARN. Does TDX send #UD if these instructions have the wrong CPL? No, TDX does not trigger #UD for these instructions. If the #VE came from user mode, we should send an appropriate signal instead. It will be mapped into a #GP(0) fault. This should be enough notification, right? Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- Changes since previous series: * Suppressed MWAIT feature as per Andi's comment. * Added warning debug log for MWAIT #VE exception. arch/x86/kernel/tdx.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e936b2f88bf6..fb7d22b846fc 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -308,6 +308,9 @@ void __init tdx_early_init(void) setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); +/* MWAIT is not supported in TDX platform, so suppress it */ +setup_clear_cpu_cap(X86_FEATURE_MWAIT); + tdg_get_info(); pv_ops.irq.safe_halt = tdg_safe_halt; @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs, case EXIT_REASON_EPT_VIOLATION: ve->instr_len = tdg_handle_mmio(regs, ve); break; +/* + * Per Guest-Host-Communication Interface (GHCI) for Intel Trust + * Domain Extensions (Intel TDX) specification, sec 2.4, + * some instructions that unconditionally cause #VE (such as WBINVD, + * MONITOR, MWAIT) do not have corresponding TDCALL + * [TDG.VP.VMCALL ] leaves, since the TD has been designed + * with no deterministic way to confirm the result of those operations + * performed by the host VMM. In those cases, the goal is for the TD + * #VE handler to increment the RIP appropriately based on the VE + * information provided via TDCALL. 
+ */ +case EXIT_REASON_WBINVD: +pr_warn_once("WBINVD #VE Exception\n"); +case EXIT_REASON_MONITOR_INSTRUCTION: +/* Handle as nops. */ +break; +case EXIT_REASON_MWAIT_INSTRUCTION: +/* MWAIT is supressed, not supposed to reach here. */ +pr_warn("MWAIT unexpected #VE Exception\n"); +return -EFAULT; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; -- 2.25.1 -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions as no ops. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- Changes since previous series: * Suppressed MWAIT feature as per Andi's comment. * Added warning debug log for MWAIT #VE exception. arch/x86/kernel/tdx.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e936b2f88bf6..fb7d22b846fc 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -308,6 +308,9 @@ void __init tdx_early_init(void) setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); + /* MWAIT is not supported in TDX platform, so suppress it */ + setup_clear_cpu_cap(X86_FEATURE_MWAIT); + tdg_get_info(); pv_ops.irq.safe_halt = tdg_safe_halt; @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs, case EXIT_REASON_EPT_VIOLATION: ve->instr_len = tdg_handle_mmio(regs, ve); break; + /* +* Per Guest-Host-Communication Interface (GHCI) for Intel Trust +* Domain Extensions (Intel TDX) specification, sec 2.4, +* some instructions that unconditionally cause #VE (such as WBINVD, +* MONITOR, MWAIT) do not have corresponding TDCALL +* [TDG.VP.VMCALL ] leaves, since the TD has been designed +* with no deterministic way to confirm the result of those operations +* performed by the host VMM. In those cases, the goal is for the TD +* #VE handler to increment the RIP appropriately based on the VE +* information provided via TDCALL. +*/ + case EXIT_REASON_WBINVD: + pr_warn_once("WBINVD #VE Exception\n"); + case EXIT_REASON_MONITOR_INSTRUCTION: + /* Handle as nops. */ + break; + case EXIT_REASON_MWAIT_INSTRUCTION: + /* MWAIT is supressed, not supposed to reach here. */ + pr_warn("MWAIT unexpected #VE Exception\n"); + return -EFAULT; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; -- 2.25.1
[PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
Implement common helper functions to communicate with the TDX Module and VMM (using TDCALL instruction). __tdvmcall() function can be used to request services from VMM. __tdcall() function can be used to communicate with the TDX Module. Using common helper functions makes the code more readable and less error prone compared to distributed and use case specific inline assembly code. Only downside in using this approach is, it adds a few extra instructions for every TDCALL use case when compared to distributed checks. Although it's a bit less efficient, it's worth it to make the code more readable. Originally-by: Sean Christopherson Signed-off-by: Kuppuswamy Sathyanarayanan --- Hi All, Please let me know your review comments. If you agree with this patch and want to see the use of these APIs in rest of the patches, I will re-send the patch series with updated code. Please let me know. Changes since v1: * Implemented tdvmcall and tdcall helper functions as assembly code. * Followed suggestion provided by Sean & Dave. arch/x86/include/asm/tdx.h| 23 + arch/x86/kernel/Makefile | 2 +- arch/x86/kernel/asm-offsets.c | 22 + arch/x86/kernel/tdcall.S | 163 ++ arch/x86/kernel/tdx.c | 30 +++ 5 files changed, 239 insertions(+), 1 deletion(-) create mode 100644 arch/x86/kernel/tdcall.S diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 69af72d08d3d..ce6212ce5f45 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -8,12 +8,35 @@ #ifdef CONFIG_INTEL_TDX_GUEST #include +#include + +struct tdcall_output { + u64 rcx; + u64 rdx; + u64 r8; + u64 r9; + u64 r10; + u64 r11; +}; + +struct tdvmcall_output { + u64 r11; + u64 r12; + u64 r13; + u64 r14; + u64 r15; +}; /* Common API to check TDX support in decompression and common kernel code. 
*/ bool is_tdx_guest(void); void __init tdx_early_init(void); +u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out); + +u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15, + struct tdvmcall_output *out); + #else // !CONFIG_INTEL_TDX_GUEST static inline bool is_tdx_guest(void) diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index ea111bf50691..7966c10ea8d1 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)+= pvclock.o obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o -obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o +obj-$(CONFIG_INTEL_TDX_GUEST) += tdcall.o tdx.o obj-$(CONFIG_EISA) += eisa.o obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 60b9f42ce3c1..72de0b49467e 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -23,6 +23,10 @@ #include #endif +#ifdef CONFIG_INTEL_TDX_GUEST +#include +#endif + #ifdef CONFIG_X86_32 # include "asm-offsets_32.c" #else @@ -75,6 +79,24 @@ static void __used common(void) OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2); #endif +#ifdef CONFIG_INTEL_TDX_GUEST + BLANK(); + /* Offset for fields in tdcall_output */ + OFFSET(TDCALL_rcx, tdcall_output, rcx); + OFFSET(TDCALL_rdx, tdcall_output, rdx); + OFFSET(TDCALL_r8, tdcall_output, r8); + OFFSET(TDCALL_r9, tdcall_output, r9); + OFFSET(TDCALL_r10, tdcall_output, r10); + OFFSET(TDCALL_r11, tdcall_output, r11); + + /* Offset for fields in tdvmcall_output */ + OFFSET(TDVMCALL_r11, tdvmcall_output, r11); + OFFSET(TDVMCALL_r12, tdvmcall_output, r12); + OFFSET(TDVMCALL_r13, tdvmcall_output, r13); + OFFSET(TDVMCALL_r14, tdvmcall_output, r14); + OFFSET(TDVMCALL_r15, tdvmcall_output, r15); +#endif + BLANK(); OFFSET(BP_scratch, boot_params, scratch); OFFSET(BP_secure_boot, boot_params, secure_boot); diff --git a/arch/x86/kernel/tdcall.S 
b/arch/x86/kernel/tdcall.S new file mode 100644 index ..a73b67c0b407 --- /dev/null +++ b/arch/x86/kernel/tdcall.S @@ -0,0 +1,163 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include +#include +#include + +#include + +#define TDVMCALL_EXPOSE_REGS_MASK 0xfc00 + +/* + * TDCALL instruction is newly added in TDX architecture, + * used by TD for requesting the host VMM to provide + * (untrusted) services. Supported in Binutils >= 2.36 + */ +#define tdcall .byte 0x66,0x0f,0x01,0xcc + +/* Only for non TDVMCALL use cases */ +SYM_FUNC_START(__tdcall) + FRAME_BEGIN + + /* Save/restore non-volatile GPRs that are exposed to the VMM. */ + push %r15 + push %r14 + push %r13 + push %r12 + + /* +* RDI => RAX = TDCALL leaf +* RSI => RCX = input param 1 +* RDX
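For reference, TDVMCALL_EXPOSE_REGS_MASK above is a bitmap passed in RCX where, per the GHCI convention referenced in this thread, bit n exposes GPR n to the VMM. Assuming that convention, the masks seen in this series fall out of a one-line helper (the helper name is illustrative, not kernel code):

```c
#include <stdint.h>

/*
 * Build an RCX exposure mask covering GPRs lo..hi inclusive,
 * assuming bit n of the mask corresponds to GPR n.
 */
static uint64_t gpr_expose_mask(unsigned int lo, unsigned int hi)
{
	return ((1ULL << (hi - lo + 1)) - 1) << lo;
}
```

With this reading, 0xfc00 exposes R10-R15 (the registers the GHCI defines for TDVMCALL arguments), and the v1 patch's 0xff00 exposed the wider R8-R15 range.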
Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
On 3/19/21 11:22 AM, Dave Hansen wrote: On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote: @@ -4,6 +4,58 @@ #include #include +void tdcall(u64 leafid, struct tdcall_regs *regs) +{ + asm volatile( + /* RAX = leafid (TDCALL LEAF ID) */ + " movq %0, %%rax;" + /* Move regs->r[*] data to regs r[a-c]x, r8-r5 */ + " movq 8(%1), %%rcx;" I am super duper opposed to using inline asm. Large blocks are hard to read, I think this point is arguable. Based on the review comments I received so far, people prefer inline assembly compared to asm sub-functions. It's arguable, but Sean makes a pretty compelling case. I actually think inline assembly is a monstrosity. It's insanely arcane and, as I hope you have noted, does not scale nicely beyond one or two instructions. and even harder to maintain. E.g. the %1 usage falls apart if an output constraint is added; that can be avoided by defining a local const/imm (I forget what they're called), but it doesn't help readability. we can use OFFSET() calls to improve the readability and avoid this issue. Also IMO, anyone adding constraints should know how this would affect the asm code. This is about *maintainability*. How _easily_ can someone change this code in the future? Sean's arguing that it's *hard* to correctly add a constraint. Unfortunately, our supply of omnipotent kernel developers is a bit short. + " movq 16(%1), %%rdx;" + " movq 24(%1), %%r8;" + " movq 32(%1), %%r9;" + " movq 40(%1), %%r10;" + " movq 48(%1), %%r11;" + " movq 56(%1), %%r12;" + " movq 64(%1), %%r13;" + " movq 72(%1), %%r14;" + " movq 80(%1), %%r15;" This is extremely unsafe, and wasteful. Putting the onus on the caller to zero out unused registers, with no mechanism to enforce/encourage doing so, For encouragement, we can add a comment to this function about the caller's responsibility. makes it likely that the kernel will leak information to the VMM, e.g. in the form of stack data due to a partially initialized "regs". 
Unless you create sub-functions for each use case, callers cannot avoid this responsibility. I don't think we're quite at the point where we throw up our hands. It would be pretty simple to have an initializer that zeros the registers out, or looks at the argument mask and does it more precisely. Surely we can do *something*. /* Offset for fields in tdvmcall_output */ OFFSET(TDVMCALL_r12, tdvmcall_output, r12); OFFSET(TDVMCALL_r13, tdvmcall_output, r13); OFFSET(TDVMCALL_r14, tdvmcall_output, r14); OFFSET(TDVMCALL_r15, tdvmcall_output, r15); SYM_FUNC_START(__tdvmcall) FRAME_BEGIN /* Save/restore non-volatile GPRs that are exposed to the VMM. */ push %r15 push %r14 push %r13 push %r12 I might have some tweaks for the assembly once someone puts a real patch together. But, that looks a lot more sane than the inline assembly to me. Ok. Let me use Sean's proposal and submit a tested version of this patch. Also, any thoughts on whether you want to use a single function for tdcall and tdvmcall? -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
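The "initializer that zeros the registers out" suggested above could be as simple as a designated initializer or a memset() helper against the register struct from the v1 patch. A hedged sketch — TDCALL_REGS_INIT and tdcall_regs_clear() are illustrative names, not from any posted patch:

```c
#include <string.h>
#include <stdint.h>

typedef uint64_t u64;

/* Register file layout from the v1 tdcall() patch in this thread. */
struct tdcall_regs {
	u64 rax, rcx, rdx;
	u64 r8, r9, r10, r11, r12, r13, r14, r15;
};

/*
 * One way to address the information-leak concern: make callers start
 * from an all-zero register file, so uninitialized stack data can
 * never reach the VMM through unused GPRs.
 */
#define TDCALL_REGS_INIT { 0 }

static void tdcall_regs_clear(struct tdcall_regs *regs)
{
	memset(regs, 0, sizeof(*regs));
}
```

A mask-aware variant, as Dave suggests, could instead zero only the registers the exposure mask will hand to the VMM.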
Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
Hi Sean, Thanks for the review. On 3/19/21 9:55 AM, Sean Christopherson wrote: On Thu, Mar 18, 2021, Kuppuswamy Sathyanarayanan wrote: diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e44e55d1e519..7ae1d25e272b 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -4,6 +4,58 @@ #include #include +void tdcall(u64 leafid, struct tdcall_regs *regs) +{ + asm volatile( + /* RAX = leafid (TDCALL LEAF ID) */ + " movq %0, %%rax;" + /* Move regs->r[*] data to regs r[a-c]x, r8-r5 */ + " movq 8(%1), %%rcx;" I am super duper opposed to using inline asm. Large blocks are hard to read, I think this point is arguable. Based on the review comments I received so far, people prefer inline assembly compared to asm sub-functions. and even harder to maintain. E.g. the %1 usage falls apart if an output constraint is added; that can be avoided by defining a local const/imm (I forget what they're called), but it doesn't help readability. we can use OFFSET() calls to improve the readability and avoid this issue. Also IMO, anyone adding constraints should know how this would affect the asm code. + " movq 16(%1), %%rdx;" + " movq 24(%1), %%r8;" + " movq 32(%1), %%r9;" + " movq 40(%1), %%r10;" + " movq 48(%1), %%r11;" + " movq 56(%1), %%r12;" + " movq 64(%1), %%r13;" + " movq 72(%1), %%r14;" + " movq 80(%1), %%r15;" This is extremely unsafe, and wasteful. Putting the onus on the caller to zero out unused registers, with no mechanism to enforce/encourage doing so, For encouragement, we can add a comment to this function about the caller's responsibility. makes it likely that the kernel will leak information to the VMM, e.g. in the form of stack data due to a partially initialized "regs". Unless you create sub-functions for each use case, callers cannot avoid this responsibility. And although TDVMCALL is anything but speedy, requiring multiple memory operations just to set a single register is unnecessary. 
Not to mention several of these registers are never used in the GHCI-defined TDVMCALLs. This function is common between TDCALL and TDVMCALL. The extra registers you mentioned are related to other TDCALL use cases. And, since the caller defines the mask (which I also dislike), it's possible/likely that many of these memory operations are wasteful even for registers that are used by _some_ TDVMCALLs. Unnecessary accesses are inevitable if we want a common helper, but this is too much. Using a single function makes the code easier to maintain, more readable, and less error prone. But I agree there are many unnecessary accesses for many users. + TDCALL ";" + /* Save TDCALL success/failure to regs->rax */ + " movq %%rax, (%1);" + /* Save rcx and rdx contents to regs->r[c-d]x */ + " movq %%rcx, 8(%1);" + " movq %%rdx, 16(%1);" + /* Move content of registers R8-R15 regs->r[8-15] */ + " movq %%r8, 24(%1);" + " movq %%r9, 32(%1);" + " movq %%r10, 40(%1);" + " movq %%r11, 48(%1);" + " movq %%r12, 56(%1);" + " movq %%r13, 64(%1);" + " movq %%r14, 72(%1);" + " movq %%r15, 80(%1);" + + : + : "r" (leafid), "r" (regs) + : "memory", "rax", "rbx", "rcx", "rdx", "r8", + "r9", "r10", "r11", "r12", "r13", "r14", "r15" All these clobbers mean even more memory operations... + ); + +} + +void tdvmcall(u64 subid, struct tdcall_regs *regs) +{ + /* Expose GPRs R8-R15 to VMM */ + regs->rcx = 0xff00; + /* R10 = 0 (standard TDVMCALL) */ + regs->r10 = TDVMCALL_STANDARD; + /* Save subid to r11 register */ + regs->r11 = subid; + + tdcall(TDVMCALL, regs); This implies the caller is responsible for _all_ error checking. The base TDCALL should never fail; if it does, something is horribly wrong with TDX-Module and panicking is absolutely the best option. I haven't added error checking to the common function because some use cases like MSR and IO access does n
[PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
Implement common helper functions to communicate with the TDX Module and VMM (using TDCALL instruction). tdvmcall() function can be used to request services from VMM. tdcall() function can be used to communicate with the TDX Module. Using common helper functions makes the code more readable and less error prone compared to distributed and use case specific inline assembly code. Only downside in using this approach is, it adds a few extra instructions for every TDCALL use case when compared to distributed checks. Although it's a bit less efficient, it's worth it to make the code more readable. Signed-off-by: Kuppuswamy Sathyanarayanan --- Hi All, As you have suggested, I have created common helper functions for all tdcall() and tdvmcall() use cases. It uses inline assembly and passes GPRs R8-15 and r[a-c]x registers to TDX Module/VMM. Please take a look at it and let me know your comments. If you agree with the design, I can re-submit the patchset with changes related to using these new APIs. Please let me know. arch/x86/include/asm/tdx.h | 27 arch/x86/kernel/tdx.c | 52 ++ 2 files changed, 79 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 0b9d571b1f95..311252a90cfb 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -3,8 +3,27 @@ #ifndef _ASM_X86_TDX_H #define _ASM_X86_TDX_H +#include + #define TDX_CPUID_LEAF_ID 0x21 +#define TDVMCALL 0 + +/* TDVMCALL R10 Input */ +#define TDVMCALL_STANDARD 0 + +/* + * TDCALL instruction is newly added in TDX architecture, + * used by TD for requesting the host VMM to provide + * (untrusted) services. Supported in Binutils >= 2.36 + */ +#define TDCALL ".byte 0x66,0x0f,0x01,0xcc" + +struct tdcall_regs { + u64 rax, rcx, rdx; + u64 r8, r9, r10, r11, r12, r13, r14, r15; +}; + #ifdef CONFIG_INTEL_TDX_GUEST /* Common API to check TDX support in decompression and common kernel code. 
*/ @@ -12,6 +31,10 @@ bool is_tdx_guest(void); void __init tdx_early_init(void); +void tdcall(u64 leafid, struct tdcall_regs *regs); + +void tdvmcall(u64 subid, struct tdcall_regs *regs); + #else // !CONFIG_INTEL_TDX_GUEST static inline bool is_tdx_guest(void) @@ -21,6 +44,10 @@ static inline bool is_tdx_guest(void) static inline void tdx_early_init(void) { }; +static inline void tdcall(u64 leafid, struct tdcall_regs *regs) { }; + +static inline void tdvmcall(u64 subid, struct tdcall_regs *regs) { }; + #endif /* CONFIG_INTEL_TDX_GUEST */ #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e44e55d1e519..7ae1d25e272b 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -4,6 +4,58 @@ #include #include +void tdcall(u64 leafid, struct tdcall_regs *regs) +{ + asm volatile( + /* RAX = leafid (TDCALL LEAF ID) */ + " movq %0, %%rax;" + /* Move regs->r[*] data to regs r[a-c]x, r8-r5 */ + " movq 8(%1), %%rcx;" + " movq 16(%1), %%rdx;" + " movq 24(%1), %%r8;" + " movq 32(%1), %%r9;" + " movq 40(%1), %%r10;" + " movq 48(%1), %%r11;" + " movq 56(%1), %%r12;" + " movq 64(%1), %%r13;" + " movq 72(%1), %%r14;" + " movq 80(%1), %%r15;" + TDCALL ";" + /* Save TDCALL success/failure to regs->rax */ + " movq %%rax, (%1);" + /* Save rcx and rdx contents to regs->r[c-d]x */ + " movq %%rcx, 8(%1);" + " movq %%rdx, 16(%1);" + /* Move content of registers R8-R15 regs->r[8-15] */ + " movq %%r8, 24(%1);" + " movq %%r9, 32(%1);" + " movq %%r10, 40(%1);" + " movq %%r11, 48(%1);" + " movq %%r12, 56(%1);" + " movq %%r13, 64(%1);" + " movq %%r14, 72(%1);" + " movq %%r15, 80(%1);" + + : + : "r" (leafid), "r" (regs) + : "memory", "rax", "rbx", "rcx", "rdx", "r8", + "r9", "r10", "r11", "r12", "r13", "r14", "r15" + ); + +} + +void tdvmcall(u64 subid, struct tdcall_regs *regs) +{ + /* Expose GPRs R8-R15 to VMM */ + regs->rcx = 0xff00; + /* R10 = 0 (standard TDVMCALL) */ + regs->r10 = TDVMCALL_STANDARD; + /* Save subid to r11 register */ + regs->r11 
= subid; + + tdcall(TDVMCALL, regs); +} + static inline bool cpuid_has_tdx_guest(void) { u32 eax, signature[3]; -- 2.25.1
Re: [PATCH v13 0/5] Simplify PCIe native ownership
Hi Bjorn, On 1/22/21 5:11 PM, Kuppuswamy Sathyanarayanan wrote: Currently, PCIe capabilities ownership status is detected by verifying the status of pcie_ports_native and the _OSC negotiated results (cached in struct pci_host_bridge->native_* members). But this logic can be simplified, and we can use only the struct pci_host_bridge->native_* members to detect it. This patchset removes the distributed checks for the pcie_ports_native parameter. Any comments on this patch set? Changes since v12: * Rebased on top of v5.11-rc1 Changes since v11 (Bjorn's update): * Add bugfix for DPC with no AER Capability * Split OSC_OWNER trivial changes from pcie_ports_native changes * Temporarily drop pcie_ports_dpc_native changes (revisit it later). Changes since v10: * Addressed format issue reported by lkp test. Changes since v9: * Rebased on top of v5.10-rc1 Changes since v8: * Simplified setting _OSC ownership logic * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS. Changes since v7: * Fixed "fix array_size.cocci warnings". Changes since v6: * Created new patch for CONFIG_PCIEPORTBUS check in pci_init_host_bridge(). * Added warning message for a case when pcie_ports_native overrides _OSC negotiation result. Changes since v5: * Rebased on top of v5.8-rc1 Changes since v4: * Changed the patch set title (Original link: https://lkml.org/lkml/2020/5/26/1710) * Added AER/DPC dependency logic cleanup fixes. 
Bjorn Helgaas (2): PCI/DPC: Ignore devices with no AER Capability PCI/ACPI: Centralize pci_aer_available() checking Kuppuswamy Sathyanarayanan (3): PCI: Assume control of portdrv-related features only when portdrv enabled PCI/ACPI: Tidy _OSC control bit checking PCI/ACPI: Centralize pcie_ports_native checking drivers/acpi/pci_root.c | 49 --- drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 -- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/dpc.c| 3 ++ drivers/pci/pcie/portdrv_core.c | 11 +++ drivers/pci/probe.c | 6 ++-- 7 files changed, 51 insertions(+), 25 deletions(-) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
On 3/17/21 12:01 PM, Lukas Wunner wrote: On Wed, Mar 17, 2021 at 10:54:09AM -0700, Sathyanarayanan Kuppuswamy Natarajan wrote: Flush of hotplug event after successful recovery, and a simulated hotplug link down event after link recovery fails should solve the problems raised by Lukas. I assume Lukas' proposal adds this support. I will check his patch shortly. Thank you! I'd like to get a better understanding of the issues around hotplug/DPC, specifically I'm wondering: If DPC recovery was successful, what is the desired behavior by pciehp, should it ignore the Link Down/Up or bring the slot down and back up after DPC recovery? If the events are ignored, the driver of the device in the hotplug slot is not unbound and rebound. So the driver must be able to cope with loss of TLPs during DPC recovery and it must be able to cope with whatever state the endpoint device is in after DPC recovery. Is this really safe? How does the nvme driver deal with it? During DPC recovery, in the pcie_do_recovery() function, we use report_frozen_detected() to notify all devices attached to the port about the fatal error. After this notification, we expect all affected devices to halt their IO transactions. Regarding state restoration, after successful recovery, we use report_slot_reset() to notify about the slot/link reset. So device drivers are expected to restore the device to a working state after this notification. Also, if DPC is handled by firmware, your patch does not ignore the Link Down/Up events, Only for the pure firmware model. For the EDR case, we still ignore the Link Down/Up events. so pciehp brings down the slot when DPC is triggered, then brings it up after successful recovery. In a code comment, you write that this behavior is okay because there's "no race between hotplug and DPC recovery". 
My point is, there is no race in OS handlers (pciehp_ist() vs pcie_do_recovery()) However, Sinan wrote in 2018 that one of the issues with hotplug versus DPC is that pciehp may turn off slot power and thereby foil DPC recovery. (Power off = cold reset, whereas DPC recovery = warm reset.) This can occur as well if DPC is handled by firmware. I am not sure how pure firmware DPC recovery works. Is there a platform which uses this combination? For firmware DPC model, spec does not clarify following points. 1. Who will notify the affected device drivers to halt the IO transactions. 2. Who is responsible to restore the state of the device after link reset. IMO, pure firmware DPC does not support seamless recovery. I think after it clears the DPC trigger status, it might expect hotplug handler be responsible for device recovery. I don't want to add fix to the code path that I don't understand. This is the reason for extending this logic to pure firmware DPC case. So I guess pciehp should make an attempt to await DPC recovery even if it's handled by firmware? Or am I missing something? We may be able to achieve that by polling the DPC Trigger Status bit and DLLLA bit, but it won't work as perfectly as with native DPC support. Finally, you write in your commit message that there are "a lot of stability issues" if pciehp and DPC are allowed to recover freely without proper serialization. What are these issues exactly? In most cases, I see failure of DPC recovery handler (you can see the example dmesg in commit log). Other than this, we also noticed some extended delay or failure in link retraining (while waiting for LINK UP in pcie_wait_for_link(pdev, true)). In some cases, we noticed slot power on failures and that card will not be detected when running lscpi. Above mentioned cases are just observations we have made. we did not dig deeper on why the race between pciehp and DPC leads to such issues. 
(Beyond the slot power issue mentioned above, and that the endpoint device's driver should presumably not be unbound if DPC recovery was successful.) Thanks! Lukas -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
On 3/12/21 7:32 PM, sathyanarayanan.kuppusw...@linux.intel.com wrote: From: Kuppuswamy Sathyanarayanan When hotplug and DPC are both enabled on a Root port or Downstream Port, during DPC events that cause a DLLSC link down/up events, such events (DLLSC) must be suppressed to let the DPC driver own the recovery path. When DPC is present and enabled, hardware will put the port in containment state to allow SW to recover from the error condition in the seamless manner. But, during the DPC error recovery process, since the link is in disabled state, it will also raise the DLLSC event. In Linux kernel architecture, DPC events are handled by DPC driver and DLLSC events are handled by hotplug driver. If a hotplug driver is allowed to handle such DLLSC event (triggered by DPC containment), then we will have a race condition between error recovery handler (in DPC driver) and hotplug handler in recovering the contained port. Allowing such a race leads to a lot of stability issues while recovering the device. So skip DLLSC handling in the hotplug driver when the PCIe port associated with the hotplug event is in DPC triggered state and let the DPC driver be responsible for the port recovery. Following is the sample dmesg log which shows the contention between hotplug handler and error recovery handler. In this case, hotplug handler won the race and error recovery handler reported failure. 
pcieport :97:02.0: pciehp: Slot(4): Link Down pcieport :97:02.0: DPC: containment event, status:0x1f01 source:0x pcieport :97:02.0: DPC: unmasked uncorrectable error detected pcieport :97:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) pcieport :97:02.0: device [8086:347a] error status/mask=4000/00100020 pcieport :97:02.0:[14] CmpltTO(First) pci :98:00.0: AER: can't recover (no error_detected callback) pcieport :97:02.0: pciehp: Slot(4): Card present pcieport :97:02.0: DPC: Data Link Layer Link Active not set in 1000 msec pcieport :97:02.0: AER: subordinate device reset failed pcieport :97:02.0: AER: device recovery failed pci :98:00.0: [8086:0953] type 00 class 0x010802 nvme nvme1: pci function :98:00.0 nvme :98:00.0: enabling device (0140 -> 0142) nvme nvme1: 31/0/0 default/read/poll queues nvme1n2: p1 Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Dan Williams Reviewed-by: Raj Ashok --- Missed to add the change log. will include it in next version. Changes since v1: * Trimmed down the kernel log in commit history. * Removed usage of !! in is_dpc_reset_active(). * Addressed other minor comments from Bjorn. drivers/pci/hotplug/pciehp_hpc.c | 19 + drivers/pci/pci.h| 2 ++ drivers/pci/pcie/dpc.c | 36 ++-- include/linux/pci.h | 1 + 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index fb3840e222ad..55da5208c7e5 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -691,6 +691,25 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) goto out; } + /* +* If the DLLSC link up/down event is generated due to DPC containment +* in the PCIe port, skip the DLLSC event handling and let the DPC +* driver own the port recovery. Allowing both hotplug DLLSC event +* handler and DPC event trigger handler to attempt recovery on the +* same port leads to stability issues. 
If DPC recovery is successful, +* is_dpc_reset_active() will return false and the hotplug handler will +* not suppress the DLLSC event. If DPC recovery fails and the link is +* left in disabled state, once the user changes the faulty card, the +* hotplug handler can still handle the PRESENCE change event and bring +* the device back up. +*/ + if ((events == PCI_EXP_SLTSTA_DLLSC) && is_dpc_reset_active(pdev)) { + ctrl_info(ctrl, "Slot(%s): DLLSC event(DPC), skipped\n", + slot_name(ctrl)); + ret = IRQ_HANDLED; + goto out; + } + /* Check Attention Button Pressed */ if (events & PCI_EXP_SLTSTA_ABP) { ctrl_info(ctrl, "Slot(%s): Attention button pressed\n", diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index ef7c4661314f..cee7095483bd 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -446,10 +446,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool is_dpc_reset_active(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_re
Re: [PATCH v1 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
On 3/12/21 3:14 PM, Bjorn Helgaas wrote: On Fri, Mar 12, 2021 at 02:11:03PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 3/12/21 1:33 PM, Bjorn Helgaas wrote: On Mon, Mar 08, 2021 at 10:34:10PM -0800, sathyanarayanan.kuppusw...@linux.intel.com wrote: From: Kuppuswamy Sathyanarayanan +bool is_dpc_reset_active(struct pci_dev *dev) +{ + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); + u16 status; + + if (!dev->dpc_cap) + return false; + + /* +* If DPC is owned by firmware and EDR is not supported, there is +* no race between hotplug and DPC recovery handler. So return +* false. +*/ + if (!host->native_dpc && !IS_ENABLED(CONFIG_PCIE_EDR)) + return false; + + if (atomic_read_acquire(>dpc_reset_active)) + return true; + + pci_read_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_STATUS, ); + + return !!(status & PCI_EXP_DPC_STATUS_TRIGGER); I know it's somewhat common in drivers/pci/, but I'm not really a big fan of "!!". I can change it to use ternary operator. (status & PCI_EXP_DPC_STATUS_TRIGGER) ? true : false; Ternary isn't terrible, but what's wrong with: if (status & PCI_EXP_DPC_STATUS_TRIGGER) return true; return false; I am fine with above format. which matches the style of the rest of the function. Looking at this again, we return "true" if either dpc_reset_active or PCI_EXP_DPC_STATUS_TRIGGER. I haven't worked this all out, but that pattern feels racy. I guess the thought is that if PCI_EXP_DPC_STATUS_TRIGGER is set, dpc_reset_link() will be invoked soon and we don't want to interfere? Yes, the reason for checking dpc_reset_active before PCI_EXP_DPC_STATUS_TRIGGER is because, we want suppress DLLSC events till link comes back or it times out. 
137 atomic_inc_return_acquire(>dpc_reset_active); 138 139 pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS, 140 PCI_EXP_DPC_STATUS_TRIGGER); 141 142 if (!pcie_wait_for_link(pdev, true)) { 143 pci_info(pdev, "Data Link Layer Link Active not set in 1000 msec\n"); 144 status = PCI_ERS_RESULT_DISCONNECT; 145 } 146 147 atomic_dec_return_release(>dpc_reset_active); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v1 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
On 3/12/21 1:33 PM, Bjorn Helgaas wrote: [+cc Lukas, pciehp expert] On Mon, Mar 08, 2021 at 10:34:10PM -0800, sathyanarayanan.kuppusw...@linux.intel.com wrote: From: Kuppuswamy Sathyanarayanan When hotplug and DPC are both enabled on a Root port or Downstream Port, during DPC events that cause a DLLSC link down/up events, such events must be suppressed to let the DPC driver own the recovery path. I first thought you were saying "during DPC events, DPC events must be suppressed" which would be nonsensical. But I guess this is saying the "*link down/up* events must be suppressed"? Correct. Our intention is to suppress DLLSC event during the DPC event. When DPC is present and enabled, hardware will put the port in containment state to allow SW to recover from the error condition in the seamless manner. But, during the DPC error recovery process, since the link is in disabled state, it will also raise the DLLSC event. In Linux kernel architecture, DPC events are handled by DPC driver and DLLSC events are handled by hotplug driver. If a hotplug driver is allowed to handle such DLLSC event (triggered by DPC containment), then we will have a race condition between error recovery handler (in DPC driver) and hotplug handler in recovering the contained port. Allowing such a race leads to a lot of stability issues while recovering the device. So skip DLLSC handling in the hotplug driver when the PCIe port associated with the hotplug event is in DPC triggered state and let the DPC driver be responsible for the port recovery. Following is the sample dmesg log which shows the contention between hotplug handler and error recovery handler. In this case, hotplug handler won the race and error recovery handler reported failure. 
[ 724.974237] pcieport :97:02.0: pciehp: Slot(4): Link Down [ 724.974266] pcieport :97:02.0: DPC: containment event, status:0x1f01 source:0x [ 724.974269] pcieport :97:02.0: DPC: unmasked uncorrectable error detected [ 724.974275] pcieport :97:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) [ 724.974283] pcieport :97:02.0: device [8086:347a] error status/mask=4000/00100020 [ 724.974288] pcieport :97:02.0:[14] CmpltTO(First) [ 724.999181] pci :98:00.0: AER: can't recover (no error_detected callback) [ 724.999227] pci :98:00.0: Removing from iommu group 181 [ 726.063125] pcieport :97:02.0: pciehp: Slot(4): Card present [ 726.221117] pcieport :97:02.0: DPC: Data Link Layer Link Active not set in 1000 msec [ 726.221122] pcieport :97:02.0: AER: subordinate device reset failed [ 726.221162] pcieport :97:02.0: AER: device recovery failed [ 727.227176] pci :98:00.0: [8086:0953] type 00 class 0x010802 [ 727.227202] pci :98:00.0: reg 0x10: [mem 0x-0x3fff 64bit] [ 727.227234] pci :98:00.0: reg 0x30: [mem 0x-0x pref] [ 727.227246] pci :98:00.0: Max Payload Size set to 256 (was 128, max 256) [ 727.227251] pci :98:00.0: enabling Extended Tags [ 727.227736] pci :98:00.0: Adding to iommu group 181 [ 727.231150] pci :98:00.0: BAR 6: assigned [mem 0xd100-0xd100 pref] [ 727.231156] pci :98:00.0: BAR 0: assigned [mem 0xd101-0xd1013fff 64bit] [ 727.231170] pcieport :97:02.0: PCI bridge to [bus 98] [ 727.231174] pcieport :97:02.0: bridge window [io 0xc000-0xcfff] [ 727.231181] pcieport :97:02.0: bridge window [mem 0xd100-0xd10f] [ 727.231186] pcieport :97:02.0: bridge window [mem 0x2060-0x2060001f 64bit pref] [ 727.231555] nvme nvme1: pci function :98:00.0 [ 727.231581] nvme :98:00.0: enabling device (0140 -> 0142) [ 737.141132] nvme nvme1: 31/0/0 default/read/poll queues [ 737.146211] nvme1n2: p1 Quite a bit of the above really isn't relevant to the problem, so stripping it out would reduce distraction. 
E.g., Removing from iommu group reg ... Max Payload Size set enabling Extended Tags Adding to iommu group BAR X: assigned ... PCI bridge to [bus 98] bridge window ... Probably the timestamps are also only of incidental interest and could be removed? Agree. will fix this in next version. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Dan Williams Reviewed-by: Raj Ashok --- drivers/pci/hotplug/pciehp_hpc.c | 18 + drivers/pci/pci.h| 2 ++ drivers/pci/pcie/dpc.c | 33 ++-- include/linux/pci.h | 1 + 4 files changed, 52 insertions(+), 2 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index fb3840e222ad..8e7916abc60e 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -691,6 +691,24 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) goto out; } + /* +
Re: [PATCH 1/2] PCI: controller: thunder: fix compile testing
On 2/25/21 10:51 AM, Bjorn Helgaas wrote: Thanks for looking this over! I'd like to acknowledge your review, but I need an explicit Reviewed-by or similar. I don't want to put words in your mouth by converting "Looks good to me" to "Reviewed-by". will do so in future reviews. Reviewed-by: Kuppuswamy Sathyanarayanan -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH 1/2] PCI: controller: thunder: fix compile testing
On 2/25/21 6:37 AM, Arnd Bergmann wrote: From: Arnd Bergmann Compile-testing these drivers is currently broken. Enabling it causes a couple of build failures though: drivers/pci/controller/pci-thunder-ecam.c:119:30: error: shift count >= width of type [-Werror,-Wshift-count-overflow] drivers/pci/controller/pci-thunder-pem.c:54:2: error: implicit declaration of function 'writeq' [-Werror,-Wimplicit-function-declaration] drivers/pci/controller/pci-thunder-pem.c:392:8: error: implicit declaration of function 'acpi_get_rc_resources' [-Werror,-Wimplicit-function-declaration] Fix them with the obvious one-line changes. Looks good to me. Signed-off-by: Arnd Bergmann --- drivers/pci/controller/pci-thunder-ecam.c | 2 +- drivers/pci/controller/pci-thunder-pem.c | 13 +++-- drivers/pci/pci.h | 6 ++ 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/drivers/pci/controller/pci-thunder-ecam.c b/drivers/pci/controller/pci-thunder-ecam.c index f964fd26f7e0..ffd84656544f 100644 --- a/drivers/pci/controller/pci-thunder-ecam.c +++ b/drivers/pci/controller/pci-thunder-ecam.c @@ -116,7 +116,7 @@ static int thunder_ecam_p2_config_read(struct pci_bus *bus, unsigned int devfn, * the config space access window. Since we are working with * the high-order 32 bits, shift everything down by 32 bits. */ - node_bits = (cfg->res.start >> 32) & (1 << 12); + node_bits = upper_32_bits(cfg->res.start) & (1 << 12); v |= node_bits; set_val(v, where, size, val); diff --git a/drivers/pci/controller/pci-thunder-pem.c b/drivers/pci/controller/pci-thunder-pem.c index 1a3f70ac61fc..0660b9da204f 100644 --- a/drivers/pci/controller/pci-thunder-pem.c +++ b/drivers/pci/controller/pci-thunder-pem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "../pci.h" #if defined(CONFIG_PCI_HOST_THUNDER_PEM) || (defined(CONFIG_ACPI) && defined(CONFIG_PCI_QUIRKS)) @@ -324,9 +325,9 @@ static int thunder_pem_init(struct device *dev, struct pci_config_window *cfg, * structure here for the BAR. 
*/ bar4_start = res_pem->start + 0xf0; - pem_pci->ea_entry[0] = (u32)bar4_start | 2; - pem_pci->ea_entry[1] = (u32)(res_pem->end - bar4_start) & ~3u; - pem_pci->ea_entry[2] = (u32)(bar4_start >> 32); + pem_pci->ea_entry[0] = lower_32_bits(bar4_start) | 2; + pem_pci->ea_entry[1] = lower_32_bits(res_pem->end - bar4_start) & ~3u; + pem_pci->ea_entry[2] = upper_32_bits(bar4_start); cfg->priv = pem_pci; return 0; @@ -334,9 +335,9 @@ static int thunder_pem_init(struct device *dev, struct pci_config_window *cfg, #if defined(CONFIG_ACPI) && defined(CONFIG_PCI_QUIRKS) -#define PEM_RES_BASE 0x87e0c000UL -#define PEM_NODE_MASK GENMASK(45, 44) -#define PEM_INDX_MASK GENMASK(26, 24) +#define PEM_RES_BASE 0x87e0c000ULL +#define PEM_NODE_MASK GENMASK_ULL(45, 44) +#define PEM_INDX_MASK GENMASK_ULL(26, 24) #define PEM_MIN_DOM_IN_NODE 4 #define PEM_MAX_DOM_IN_NODE 10 diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 0a2b6d993fe1..022c2f433676 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -625,6 +625,12 @@ static inline int pci_dev_specific_reset(struct pci_dev *dev, int probe) #if defined(CONFIG_PCI_QUIRKS) && defined(CONFIG_ARM64) int acpi_get_rc_resources(struct device *dev, const char *hid, u16 segment, struct resource *res); +#else +static inline int acpi_get_rc_resources(struct device *dev, const char *hid, + u16 segment, struct resource *res) +{ + return -ENODEV; +} #endif int pci_rebar_get_current_size(struct pci_dev *pdev, int bar); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO
On 2/8/21 2:00 AM, Peter Zijlstra wrote: This needs a binutils version number. Yes, we will add it in next version. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
On 2/8/21 8:59 AM, Peter Zijlstra wrote: 'cute', might be useful to have that mentioned somewhere. we will add a note for it in comments. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode
From: Sean Christopherson Add a trampoline for booting APs in 64-bit mode via a software handoff with BIOS, and use the new trampoline for the ACPI MP wake protocol used by TDX. Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit mode. For the GDT pointer, create a new entry as the existing storage for the pointer occupies the zero entry in the GDT itself. Reported-by: Kai Huang Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/realmode.h | 1 + arch/x86/kernel/smpboot.c| 5 +++ arch/x86/realmode/rm/header.S| 1 + arch/x86/realmode/rm/trampoline_64.S | 49 +++- arch/x86/realmode/rm/trampoline_common.S | 5 ++- 5 files changed, 58 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h index 5db5d083c873..5066c8b35e7c 100644 --- a/arch/x86/include/asm/realmode.h +++ b/arch/x86/include/asm/realmode.h @@ -25,6 +25,7 @@ struct real_mode_header { u32 sev_es_trampoline_start; #endif #ifdef CONFIG_X86_64 + u32 trampoline_start64; u32 trampoline_pgd; #endif /* ACPI S3 wakeup */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 8ca66af96a54..11dd0deb4810 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1035,6 +1035,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle, unsigned long boot_error = 0; unsigned long timeout; +#ifdef CONFIG_X86_64 + if (is_tdx_guest()) + start_ip = real_mode_header->trampoline_start64; +#endif + idle->thread.sp = (unsigned long)task_pt_regs(idle); early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu); initial_code = (unsigned long)start_secondary; diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S index 8c1db5bf5d78..2eb62be6d256 100644 --- a/arch/x86/realmode/rm/header.S +++ b/arch/x86/realmode/rm/header.S @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header) .long pa_sev_es_trampoline_start #endif #ifdef 
CONFIG_X86_64 + .long pa_trampoline_start64 .long pa_trampoline_pgd; #endif /* ACPI S3 wakeup */ diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S index 84c5d1b33d10..12b734b1da8b 100644 --- a/arch/x86/realmode/rm/trampoline_64.S +++ b/arch/x86/realmode/rm/trampoline_64.S @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32) movl%eax, %cr3 # Set up EFER + movl$MSR_EFER, %ecx + rdmsr + cmp pa_tr_efer, %eax + jne .Lwrite_efer + cmp pa_tr_efer + 4, %edx + je .Ldone_efer +.Lwrite_efer: movlpa_tr_efer, %eax movlpa_tr_efer + 4, %edx - movl$MSR_EFER, %ecx wrmsr +.Ldone_efer: # Enable paging and in turn activate Long Mode - movl$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax + movl$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax movl%eax, %cr0 /* @@ -161,6 +168,19 @@ SYM_CODE_START(startup_32) ljmpl $__KERNEL_CS, $pa_startup_64 SYM_CODE_END(startup_32) +SYM_CODE_START(pa_trampoline_compat) + /* +* In compatibility mode. Prep ESP and DX for startup_32, then disable +* paging and complete the switch to legacy 32-bit mode. +*/ + movl$rm_stack_end, %esp + movw$__KERNEL_DS, %dx + + movl$(X86_CR0_NE | X86_CR0_PE), %eax + movl%eax, %cr0 + ljmpl $__KERNEL32_CS, $pa_startup_32 +SYM_CODE_END(pa_trampoline_compat) + .section ".text64","ax" .code64 .balign 4 @@ -169,6 +189,20 @@ SYM_CODE_START(startup_64) jmpq*tr_start(%rip) SYM_CODE_END(startup_64) +SYM_CODE_START(trampoline_start64) + /* +* APs start here on a direct transfer from 64-bit BIOS with identity +* mapped page tables. Load the kernel's GDT in order to gear down to +* 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load +* segment registers. Load the zero IDT so any fault triggers a +* shutdown instead of jumping back into BIOS. 
+*/ + lidttr_idt(%rip) + lgdttr_gdt64(%rip) + + ljmpl *tr_compat(%rip) +SYM_CODE_END(trampoline_start64) + .section ".rodata","a" # Duplicate the global descriptor table # so the kernel can live anywhere @@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt) .quad 0x00cf9300 # __KERNEL_DS SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end) +SYM_DATA_START(tr_gdt64) + .short tr_gdt_end - tr_gdt - 1 # gdt limit + .long pa_tr_gdt + .long 0 +SYM_DATA_END(tr_gdt64) + +SYM_DATA_START(tr_compat) + .long pa_t
[RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support
As per Guest-Host Communication Interface (GHCI) Specification for Intel TDX, sec 4.1, a new sub structure – multiprocessor wake-up structure - is added to the ACPI Multiple APIC Description Table (MADT) to describe the information of the mailbox. If a platform firmware produces the multiprocessor wake-up structure, then the BSP in OS may use this new mailbox-based mechanism to wake up the APs. Add ACPI MADT wake table parsing support and if MADT wake table is present, update apic->wakeup_secondary_cpu with new API which uses MADT wake mailbox to wake-up CPU. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- arch/x86/include/asm/apic.h | 3 ++ arch/x86/kernel/acpi/boot.c | 56 + arch/x86/kernel/apic/probe_32.c | 8 + arch/x86/kernel/apic/probe_64.c | 8 + drivers/acpi/tables.c | 9 ++ include/acpi/actbl2.h | 21 - 6 files changed, 104 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 34cb3c159481..63f970c61cbe 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -497,6 +497,9 @@ static inline unsigned int read_apic_id(void) return apic->get_apic_id(reg); } +typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip); +extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler); + extern int default_apic_id_valid(u32 apicid); extern int default_acpi_madt_oem_check(char *, char *); extern void default_setup_apic_routing(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 7bdc0239a943..37ada1908fb7 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata; static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE; #endif +static struct acpi_madt_mp_wake_mailbox *acpi_mp_wake_mailbox; +static u64 acpi_mp_wake_mailbox_paddr; + #ifdef CONFIG_X86_IO_APIC /* * Locks related to 
IOAPIC hotplug @@ -329,6 +332,29 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e return 0; } +static void acpi_mp_wake_mailbox_init(void) +{ + if (acpi_mp_wake_mailbox) + return; + + acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr, + sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB); +} + +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip) +{ + acpi_mp_wake_mailbox_init(); + + if (!acpi_mp_wake_mailbox) + return -EINVAL; + + WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid); + WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip); + WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP); + + return 0; +} + #endif /*CONFIG_X86_LOCAL_APIC */ #ifdef CONFIG_X86_IO_APIC @@ -1086,6 +1112,30 @@ static int __init acpi_parse_madt_lapic_entries(void) } return 0; } + +static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, + const unsigned long end) +{ + struct acpi_madt_mp_wake *mp_wake = NULL; + + if (!IS_ENABLED(CONFIG_SMP)) + return -ENODEV; + + mp_wake = (struct acpi_madt_mp_wake *)header; + if (BAD_MADT_ENTRY(mp_wake, end)) + return -EINVAL; + + if (acpi_mp_wake_mailbox) + return -EINVAL; + + acpi_table_print_madt_entry(>common); + + acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address; + + acpi_wake_cpu_handler_update(acpi_wakeup_cpu); + + return 0; +} #endif /* CONFIG_X86_LOCAL_APIC */ #ifdef CONFIG_X86_IO_APIC @@ -1284,6 +1334,12 @@ static void __init acpi_process_madt(void) smp_found_config = 1; } + + /* +* Parse MADT MP Wake entry. 
+*/ + acpi_table_parse_madt(ACPI_MADT_TYPE_MP_WAKE, + acpi_parse_mp_wake, 1); } if (error == -EINVAL) { /* diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c index a61f642b1b90..d450014841b2 100644 --- a/arch/x86/kernel/apic/probe_32.c +++ b/arch/x86/kernel/apic/probe_32.c @@ -207,3 +207,11 @@ int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id) } return 0; } + +void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler) +{ + struct apic **drv; + + for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) + (*drv)->wakeup_secondary_cpu = handler; +} diff --git a/arch/x86/kernel/apic/probe_64.c b/arch/x86/kernel/apic/pr
[RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions are not supported. So handle #VE due to these instructions as no ops. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- arch/x86/kernel/tdx.c | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index eff58329751e..8d1d7555fb56 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -451,6 +451,23 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs, case EXIT_REASON_EPT_VIOLATION: ve->instr_len = tdx_handle_mmio(regs, ve); break; + /* +* Per Guest-Host-Communication Interface (GHCI) for Intel Trust +* Domain Extensions (Intel TDX) specification, sec 2.4, +* some instructions that unconditionally cause #VE (such as WBINVD, +* MONITOR, MWAIT) do not have corresponding TDCALL +* [TDG.VP.VMCALL ] leaves, since the TD has been designed +* with no deterministic way to confirm the result of those operations +* performed by the host VMM. In those cases, the goal is for the TD +* #VE handler to increment the RIP appropriately based on the VE +* information provided via TDCALL. +*/ + case EXIT_REASON_WBINVD: + pr_warn_once("WBINVD #VE Exception\n"); + case EXIT_REASON_MWAIT_INSTRUCTION: + case EXIT_REASON_MONITOR_INSTRUCTION: + /* Handle as nops. */ + break; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; -- 2.25.1
[RFC v1 12/26] x86/tdx: Handle in-kernel MMIO
From: "Kirill A. Shutemov" Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION exit reason. For now we only handle subset of instruction that kernel uses for MMIO oerations. User-space access triggers SIGBUS. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/tdx.c | 120 ++ 1 file changed, 120 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 3846d2807a7a..eff58329751e 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -6,6 +6,8 @@ #include #include #include +#include +#include /* force_sig_fault() */ #ifdef CONFIG_KVM_GUEST #include "tdx-kvm.c" @@ -270,6 +272,121 @@ static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual) } } +static unsigned long tdx_mmio(int size, bool write, unsigned long addr, + unsigned long val) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_EPT_VIOLATION; + register long r12 asm("r12") = size; + register long r13 asm("r13") = write; + register long r14 asm("r14") = addr; + register long r15 asm("r15") = val; + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13), + "=r"(r14), "=r"(r15) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12), + "r"(r13), "r"(r14), "r"(r15) + : ); + + WARN_ON(ret || r10); + + return r11; +} + +static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn) +{ + static const int regoff[] = { + offsetof(struct pt_regs, ax), + offsetof(struct pt_regs, cx), + offsetof(struct pt_regs, dx), + offsetof(struct pt_regs, bx), + offsetof(struct pt_regs, sp), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), + offsetof(struct pt_regs, r8), + offsetof(struct pt_regs, r9), 
+ offsetof(struct pt_regs, r10), + offsetof(struct pt_regs, r11), + offsetof(struct pt_regs, r12), + offsetof(struct pt_regs, r13), + offsetof(struct pt_regs, r14), + offsetof(struct pt_regs, r15), + }; + int regno; + + regno = X86_MODRM_REG(insn->modrm.value); + if (X86_REX_R(insn->rex_prefix.value)) + regno += 8; + + return (void *)regs + regoff[regno]; +} + +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve) +{ + int size; + bool write; + unsigned long *reg; + struct insn insn; + unsigned long val = 0; + + /* + * User mode would mean the kernel exposed a device directly + * to ring3, which shouldn't happen except for things like + * DPDK. + */ + if (user_mode(regs)) { + pr_err("Unexpected user-mode MMIO access.\n"); + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)ve->gla); + return 0; + } + + kernel_insn_init(&insn, (void *)regs->ip, MAX_INSN_SIZE); + insn_get_length(&insn); + insn_get_opcode(&insn); + + write = ve->exit_qual & 0x2; + + size = insn.opnd_bytes; + switch (insn.opcode.bytes[0]) { + /* MOV r/m8 r8 */ + case 0x88: + /* MOV r8 r/m8 */ + case 0x8A: + /* MOV r/m8 imm8 */ + case 0xC6: + size = 1; + break; + } + + if (inat_has_immediate(insn.attr)) { + BUG_ON(!write); + val = insn.immediate.value; + tdx_mmio(size, write, ve->gpa, val); + return insn.length; + } + + BUG_ON(!inat_has_modrm(insn.attr)); + + reg = get_reg_ptr(regs, &insn); + + if (write) { + memcpy(&val, reg, size); + tdx_mmio(size, write, ve->gpa, val); + } else { + val = tdx_mmio(size, write, ve->gpa, val); + memset(reg, 0, size); + memcpy(reg, &val, size); + } + return insn.length; +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -331,6 +448,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
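The register-operand decode in get_reg_ptr() above is compact enough to sanity-check in user space. Below is a minimal sketch of the same logic; the two macros mirror the kernel's <asm/insn.h>, while decode_gpr_index() is a hypothetical helper name for illustration, not something from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Same bit layout the kernel's insn decoder uses (arch/x86/include/asm/insn.h). */
#define X86_MODRM_REG(modrm) (((modrm) & 0x38) >> 3)
#define X86_REX_R(rex)       ((rex) & 4)

/*
 * Map a ModRM reg field (plus the REX.R extension bit) to a GPR index
 * 0..15 in the pt_regs order used by get_reg_ptr(): ax, cx, dx, bx,
 * sp, bp, si, di, r8..r15.
 */
int decode_gpr_index(uint8_t rex_prefix, uint8_t modrm)
{
    int regno = X86_MODRM_REG(modrm);

    if (X86_REX_R(rex_prefix))
        regno += 8;    /* REX.R selects the high registers r8-r15 */
    return regno;
}
```

For example, modrm 0x08 encodes reg field 1 (cx); adding a REX prefix with the R bit set moves the same encoding to r9.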
[RFC v1 11/26] x86/tdx: Handle port I/O
From: "Kirill A. Shutemov" Unroll string operations and handle port I/O through TDVMCALLs. Also handle #VE due to I/O operations with the same TDVMCALLs. Decompression code uses port IO for earlyprintk. We must use paravirt calls there too if we want to allow earlyprintk. Decompresion code cannot deal with alternatives: use branches instead to implement inX() and outX() helpers. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/boot/compressed/Makefile | 1 + arch/x86/boot/compressed/tdx_io.S | 9 ++ arch/x86/include/asm/asm-prototypes.h | 1 + arch/x86/include/asm/io.h | 5 +- arch/x86/include/asm/tdx.h| 62 +-- arch/x86/kernel/Makefile | 2 +- arch/x86/kernel/tdx.c | 72 + arch/x86/kernel/tdx_io.S | 143 ++ 8 files changed, 284 insertions(+), 11 deletions(-) create mode 100644 arch/x86/boot/compressed/tdx_io.S create mode 100644 arch/x86/kernel/tdx_io.S diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile index a2554621cefe..54da333adc4e 100644 --- a/arch/x86/boot/compressed/Makefile +++ b/arch/x86/boot/compressed/Makefile @@ -97,6 +97,7 @@ endif vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx_io.o vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a diff --git a/arch/x86/boot/compressed/tdx_io.S b/arch/x86/boot/compressed/tdx_io.S new file mode 100644 index ..67498f67cb18 --- /dev/null +++ b/arch/x86/boot/compressed/tdx_io.S @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#include + +/* Do not export symbols in decompression code */ +#undef EXPORT_SYMBOL +#define EXPORT_SYMBOL(sym) + +#include "../../kernel/tdx_io.S" diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h index 51e2bf27cc9b..6bc97aa39a21 100644 --- 
a/arch/x86/include/asm/asm-prototypes.h +++ b/arch/x86/include/asm/asm-prototypes.h @@ -6,6 +6,7 @@ #include #include #include +#include #include diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index ef7a686a55a9..30a3b30395ad 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -43,6 +43,7 @@ #include #include #include +#include #define build_mmio_read(name, size, type, reg, barrier) \ static inline type name(const volatile void __iomem *addr) \ @@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port) \ \ static inline void outs##bwl(int port, const void *addr, unsigned long count) \ { \ - if (sev_key_active()) { \ + if (sev_key_active() || is_tdx_guest()) { \ unsigned type *value = (unsigned type *)addr; \ while (count) { \ out##bwl(*value, port); \ @@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \ \ static inline void ins##bwl(int port, void *addr, unsigned long count) \ { \ - if (sev_key_active()) { \ + if (sev_key_active() || is_tdx_guest()) { \ unsigned type *value = (unsigned type *)addr; \ while (count) { \ *value = in##bwl(port); \ diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 8c3e5af88643..b46ae140e39b 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -5,7 +5,16 @@ #define TDX_CPUID_LEAF_ID 0x21 -#ifdef CONFIG_INTEL_TDX_GUEST +#define TDVMCALL 0 +#define TDINFO 1 +#define TDGETVEINFO3 + +/* TDVMCALL R10 Input */ +#define TDVMCALL_STANDARD 0 +#define TDVMCALL_VENDOR1 + +#ifndef __ASSEMBLY__ +#include /* * TDCALL instruction is newly added in TDX architecture, @@ -14,19 +23,55 @@ */ #define TDCALL ".byte 0x66,0x0f,0x01,0xcc" -#define TDVMCALL 0 -#define TDINFO 1 -#define TDGETVEINFO3 - -/* TDVMCALL R10 Input */ -#define TDVMCALL_STANDARD 0 -#define TDVMCALL_VENDOR1 +#ifdef CONFIG_INTEL_TDX_GUEST /* Common API to check TDX support in decompression and common kernel code. */ bool is_t
[RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation
From: "Kirill A. Shutemov" The patch allows to override the implementation of the port IO helpers. TDX code will provide an implementation that redirect the helpers to paravirt calls. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/io.h | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index d726459d08e5..ef7a686a55a9 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; } #endif /* CONFIG_AMD_MEM_ENCRYPT */ +#ifndef __out +#define __out(bwl, bw) \ + asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port)) +#endif + +#ifndef __in +#define __in(bwl, bw) \ + asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port)) +#endif + #define BUILDIO(bwl, bw, type) \ static inline void out##bwl(unsigned type value, int port) \ { \ - asm volatile("out" #bwl " %" #bw "0, %w1" \ -: : "a"(value), "Nd"(port)); \ + __out(bwl, bw); \ } \ \ static inline unsigned type in##bwl(int port) \ { \ unsigned type value;\ - asm volatile("in" #bwl " %w1, %" #bw "0"\ -: "=a"(value) : "Nd"(port)); \ + __in(bwl, bw); \ return value; \ } \ \ -- 2.25.1
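The #ifndef guards above implement a compile-time override hook: whichever header defines __in/__out first wins, and BUILDIO stamps out the inX()/outX() helpers around that choice. Here is a user-space sketch of the same pattern — fake_ports and the override bodies are invented for illustration, only the macro structure mirrors the patch:

```c
#include <assert.h>

/* Pretend "port space": a paravirt implementation could redirect here
 * instead of issuing real in/out instructions. */
static unsigned char fake_ports[256];

/* Override hooks, same shape as the kernel's #ifndef __out/__in guards:
 * if a platform header defined __out/__in earlier, BUILDIO would pick
 * those up instead of these defaults. The macro bodies deliberately use
 * `value` and `port` from the enclosing function scope, as in the patch. */
#ifndef __out
#define __out(bwl, bw) (fake_ports[port] = (unsigned char)value)
#endif

#ifndef __in
#define __in(bwl, bw)  (value = fake_ports[port])
#endif

#define BUILDIO(bwl, bw, type)                                  \
static inline void out##bwl(unsigned type value, int port)      \
{                                                               \
    __out(bwl, bw);                                             \
}                                                               \
                                                                \
static inline unsigned type in##bwl(int port)                   \
{                                                               \
    unsigned type value;                                        \
    __in(bwl, bw);                                              \
    return value;                                               \
}

BUILDIO(b, b, char)   /* generates outb()/inb() against fake_ports */
```

The design choice here is that callers of inb()/outb() never change; only the macro definition visible at BUILDIO expansion time decides whether the body is inline asm or a paravirt call.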
[RFC v1 09/26] x86/tdx: Handle CPUID via #VE
From: "Kirill A. Shutemov" TDX has three classes of CPUID leaves: some CPUID leaves are always handled by the CPU, others are handled by the TDX module, and some others are handled by the VMM. Since the VMM cannot directly intercept the instruction these are reflected with a #VE exception to the guest, which then converts it into a TDCALL to the VMM, or handled directly. The TDX module EAS has a full list of CPUID leaves which are handled natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by the #VE method. In practice this typically only applies to the hypervisor specific CPUIDs unknown to the native CPU. Therefore there is no risk of causing this in early CPUID code which runs before the #VE handler is set up because it will never access those exotic CPUID leaves. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/tdx.c | 32 1 file changed, 32 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 5d961263601e..e98058c048b5 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned int low, return ret || r10 ? 
-EIO : 0; } +static void tdx_handle_cpuid(struct pt_regs *regs) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_CPUID; + register long r12 asm("r12") = regs->ax; + register long r13 asm("r13") = regs->cx; + register long r14 asm("r14"); + register long r15 asm("r15"); + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13), + "=r"(r14), "=r"(r15) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12), + "r"(r13) + : ); + + regs->ax = r12; + regs->bx = r13; + regs->cx = r14; + regs->dx = r15; + + WARN_ON(ret || r10); +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -227,6 +256,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs, case EXIT_REASON_MSR_WRITE: ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx); break; + case EXIT_REASON_CPUID: + tdx_handle_cpuid(regs); + break; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; -- 2.25.1
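All of these TDVMCALL wrappers build the same kind of RCX mask: bit N set in RCX tells the TDX module to expose guest GPR N to the VMM. A small sketch of that mask construction (tdvmcall_expose_mask() is an illustrative helper, not a kernel function):

```c
#include <assert.h>

#define BIT(n) (1UL << (n))

/* GPR-exposure mask passed to TDCALL in RCX: bit N set means guest
 * register N is passed through to (and may be clobbered by) the VMM. */
static inline unsigned long tdvmcall_expose_mask(int lo, int hi)
{
    unsigned long mask = 0;

    for (int n = lo; n <= hi; n++)
        mask |= BIT(n);
    return mask;
}
```

So the `BIT(10) | ... | BIT(15)` expression used by the CPUID and MMIO paths is simply 0xfc00, while the two-register HLT case is 0xc00.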
[RFC v1 08/26] x86/tdx: Add MSR support for TDX guest
From: "Kirill A. Shutemov" Operations on context-switched MSRs can be run natively. The rest of the MSRs should be handled through TDVMCALLs. TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide the MSR operations. You can find RDMSR and WRMSR details in the Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec 3.10, 3.11. Also, since the CSTAR MSR is not used for the SYSCALL instruction on Intel CPUs, ignore accesses to the CSTAR MSR. Ignoring accesses to the MSR keeps compatibility: there is no need to wrap callers in !is_tdx_guest() checks. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/tdx.c | 94 ++- 1 file changed, 93 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index bbefe639a2ed..5d961263601e 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -94,6 +94,84 @@ static __cpuidle void tdx_safe_halt(void) BUG_ON(ret || r10); } +static bool tdx_is_context_switched_msr(unsigned int msr) +{ + /* XXX: Update the list of context-switched MSRs */ + + switch (msr) { + case MSR_EFER: + case MSR_IA32_CR_PAT: + case MSR_FS_BASE: + case MSR_GS_BASE: + case MSR_KERNEL_GS_BASE: + case MSR_IA32_SYSENTER_CS: + case MSR_IA32_SYSENTER_EIP: + case MSR_IA32_SYSENTER_ESP: + case MSR_STAR: + case MSR_LSTAR: + case MSR_SYSCALL_MASK: + case MSR_IA32_XSS: + case MSR_TSC_AUX: + case MSR_IA32_BNDCFGS: + return true; + } + return false; +} + +static u64 tdx_read_msr_safe(unsigned int msr, int *err) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_MSR_READ; + register long r12 asm("r12") = msr; + register long rcx asm("rcx"); + long ret; + + WARN_ON_ONCE(tdx_is_context_switched_msr(msr)); + + if (msr == MSR_CSTAR) + return 0; + + /* Allow to pass R10, R11 and R12 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12) + :
"a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12) + : ); + + /* XXX: Better error handling needed? */ + *err = (ret || r10) ? -EIO : 0; + + return r11; +} + +static int tdx_write_msr_safe(unsigned int msr, unsigned int low, + unsigned int high) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_MSR_WRITE; + register long r12 asm("r12") = msr; + register long r13 asm("r13") = (u64)high << 32 | low; + register long rcx asm("rcx"); + long ret; + + WARN_ON_ONCE(tdx_is_context_switched_msr(msr)); + + if (msr == MSR_CSTAR) + return 0; + + /* Allow to pass R10, R11, R12 and R13 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12), + "r"(r13) + : ); + + return ret || r10 ? -EIO : 0; +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -132,17 +210,31 @@ unsigned long tdx_get_ve_info(struct ve_info *ve) int tdx_handle_virtualization_exception(struct pt_regs *regs, struct ve_info *ve) { + unsigned long val; + int ret = 0; + switch (ve->exit_reason) { case EXIT_REASON_HLT: tdx_halt(); break; + case EXIT_REASON_MSR_READ: + val = tdx_read_msr_safe(regs->cx, (unsigned int *)&ret); + if (!ret) { + regs->ax = val & UINT_MAX; + regs->dx = val >> 32; + } + break; + case EXIT_REASON_MSR_WRITE: + ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx); + break; default: pr_warn("Unexpected #VE: %d\n", ve->exit_reason); return -EFAULT; } /* After successful #VE handling, move the IP */ + if (!ret) + regs->ip += ve->instr_len; return ret; } -- 2.25.1
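The 32/64-bit value marshalling in this patch is easy to get wrong, so here is the split/join logic used by tdx_write_msr_safe() (EDX:EAX halves combined into R13) and by the EXIT_REASON_MSR_READ handler (R11 split back into regs->ax/regs->dx) as a standalone, testable sketch; the function names are invented:

```c
#include <assert.h>
#include <stdint.h>

/* WRMSR path: combine the EDX:EAX-style halves into the 64-bit value
 * passed in R13, as tdx_write_msr_safe() does with (u64)high << 32 | low. */
static inline uint64_t msr_join(uint32_t low, uint32_t high)
{
    return (uint64_t)high << 32 | low;
}

/* RDMSR path: split the 64-bit value returned in R11 back into
 * regs->ax / regs->dx, as the EXIT_REASON_MSR_READ handler does. */
static inline void msr_split(uint64_t val, uint32_t *low, uint32_t *high)
{
    *low  = (uint32_t)(val & UINT32_MAX);
    *high = (uint32_t)(val >> 32);
}
```

Joining and splitting are exact inverses, which is what lets the #VE handler round-trip an MSR value through the VMM without loss.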
[RFC v1 07/26] x86/tdx: Wire up KVM hypercalls
From: "Kirill A. Shutemov" KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/kvm_para.h | 21 ++ arch/x86/include/asm/tdx.h | 8 +++ arch/x86/kernel/tdx-kvm.c | 116 arch/x86/kernel/tdx.c | 4 ++ 4 files changed, 149 insertions(+) create mode 100644 arch/x86/kernel/tdx-kvm.c diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 338119852512..2fa85481520b 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -6,6 +6,7 @@ #include #include #include +#include extern void kvmclock_init(void); @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void) static inline long kvm_hypercall0(unsigned int nr) { long ret; + + if (is_tdx_guest()) + return tdx_kvm_hypercall0(nr); + asm volatile(KVM_HYPERCALL : "=a"(ret) : "a"(nr) @@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr) static inline long kvm_hypercall1(unsigned int nr, unsigned long p1) { long ret; + + if (is_tdx_guest()) + return tdx_kvm_hypercall1(nr, p1); + asm volatile(KVM_HYPERCALL : "=a"(ret) : "a"(nr), "b"(p1) @@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2) { long ret; + + if (is_tdx_guest()) + return tdx_kvm_hypercall2(nr, p1, p2); + asm volatile(KVM_HYPERCALL : "=a"(ret) : "a"(nr), "b"(p1), "c"(p2) @@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2, unsigned long p3) { long ret; + + if (is_tdx_guest()) + return tdx_kvm_hypercall3(nr, p1, p2, p3); + asm volatile(KVM_HYPERCALL : "=a"(ret) : "a"(nr), "b"(p1), "c"(p2), "d"(p3) @@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p4) { long ret; + + if (is_tdx_guest()) + return tdx_kvm_hypercall4(nr, p1, p2, p3, p4); + asm volatile(KVM_HYPERCALL : "=a"(ret) : "a"(nr), 
"b"(p1), "c"(p2), "d"(p3), "S"(p4) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index b98de067257b..8c3e5af88643 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -51,4 +51,12 @@ unsigned long tdx_get_ve_info(struct ve_info *ve); int tdx_handle_virtualization_exception(struct pt_regs *regs, struct ve_info *ve); +long tdx_kvm_hypercall0(unsigned int nr); +long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1); +long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2); +long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2, + unsigned long p3); +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2, + unsigned long p3, unsigned long p4); + #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c new file mode 100644 index ..323d43fcb338 --- /dev/null +++ b/arch/x86/kernel/tdx-kvm.c @@ -0,0 +1,116 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +long tdx_kvm_hypercall0(unsigned int nr) +{ + register long r10 asm("r10") = TDVMCALL_VENDOR; + register long r11 asm("r11") = nr; + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10 and R11 down to the VMM */ + rcx = BIT(10) | BIT(11); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11) + : "memory"); + + BUG_ON(ret); + return r10; +} +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0); + +long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1) +{ + register long r10 asm("r10") = TDVMCALL_VENDOR; + register long r11 asm("r11") = nr; + register long r12 asm("r12") = p1; + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10, R11 and R12 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12); + + asm volatile(TDCALL +
[RFC v1 06/26] x86/tdx: Add HLT support for TDX guest
From: "Kirill A. Shutemov" Per the Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec 3.8, TDVMCALL[Instruction.HLT] provides the HLT operation. Use it to implement the halt() and safe_halt() paravirtualization calls. The same TDVMCALL is used to handle the #VE exception due to EXIT_REASON_HLT. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/tdx.h | 5 arch/x86/kernel/tdx.c | 61 ++ 2 files changed, 60 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 90eb61b07d1f..b98de067257b 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -14,9 +14,14 @@ */ #define TDCALL ".byte 0x66,0x0f,0x01,0xcc" +#define TDVMCALL 0 #define TDINFO 1 #define TDGETVEINFO 3 +/* TDVMCALL R10 Input */ +#define TDVMCALL_STANDARD 0 +#define TDVMCALL_VENDOR 1 + /* Common API to check TDX support in decompression and common kernel code.
*/ bool is_tdx_guest(void); diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index ae2d5c847700..25dd33bc2e49 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -51,6 +51,45 @@ static void tdx_get_info(void) td_info.attributes = rdx; } +static __cpuidle void tdx_halt(void) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_HLT; + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10 and R11 down to the VMM */ + rcx = BIT(10) | BIT(11); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11) + : ); + + /* It should never fail */ + BUG_ON(ret || r10); +} + +static __cpuidle void tdx_safe_halt(void) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = EXIT_REASON_HLT; + register long rcx asm("rcx"); + long ret; + + /* Allow to pass R10 and R11 down to the VMM */ + rcx = BIT(10) | BIT(11); + + /* Enable interrupts next to the TDVMCALL to avoid performance degradation */ + asm volatile("sti\n\t" TDCALL + : "=a"(ret), "=r"(r10), "=r"(r11) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11) + : ); + + /* It should never fail */ + BUG_ON(ret || r10); +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -60,6 +99,9 @@ void __init tdx_early_init(void) tdx_get_info(); + pv_ops.irq.safe_halt = tdx_safe_halt; + pv_ops.irq.halt = tdx_halt; + pr_info("TDX guest is initialized\n"); } @@ -86,10 +128,17 @@ unsigned long tdx_get_ve_info(struct ve_info *ve) int tdx_handle_virtualization_exception(struct pt_regs *regs, struct ve_info *ve) { - /* -* TODO: Add handler support for various #VE exit -* reasons -*/ - pr_warn("Unexpected #VE: %d\n", ve->exit_reason); - return -EFAULT; + switch (ve->exit_reason) { + case EXIT_REASON_HLT: + tdx_halt(); + break; + default: + pr_warn("Unexpected #VE: %d\n", ve->exit_reason); + return -EFAULT; + } + + /* After successful #VE handling, move the IP */ + regs->ip += 
ve->instr_len; + + return 0; } -- 2.25.1
[RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process
From: Sean Christopherson Skip writing EFER during secondary_startup_64() if the current value is also the desired value. This avoids a #VE when running as a TDX guest, as the TDX-Module does not allow writes to EFER (even when writing the current, fixed value). Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE when running as a TDX guest. The TDX-Module (effectively part of the hypervisor) requires CR4.MCE to be set at all times and injects a #VE if the guest attempts to clear CR4.MCE. Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/boot/compressed/head_64.S | 5 - arch/x86/kernel/head_64.S | 13 +++-- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 37c2f37d4a0d..2d79e5f97360 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src) popl%ecx /* Enable PAE and LA57 (if required) paging modes */ - movl$X86_CR4_PAE, %eax + movl%cr4, %eax + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. */ + andl$X86_CR4_MCE, %eax + orl $X86_CR4_PAE, %eax testl %edx, %edx jz 1f orl $X86_CR4_LA57, %eax diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 04bddaaba8e2..92c77cf75542 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL) 1: /* Enable PAE mode, PGE and LA57 */ - movl$(X86_CR4_PAE | X86_CR4_PGE), %ecx + movq%cr4, %rcx + /* Clearing CR4.MCE will #VE on TDX guests. Leave it alone. 
*/ + andl$X86_CR4_MCE, %ecx + orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx #ifdef CONFIG_X86_5LEVEL testl $1, __pgtable_l5_enabled(%rip) jz 1f @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL) /* Setup EFER (Extended Feature Enable Register) */ movl$MSR_EFER, %ecx rdmsr + movl%eax, %edx btsl$_EFER_SCE, %eax/* Enable System Call */ btl $20,%edi/* No Execute supported? */ jnc 1f btsl$_EFER_NX, %eax btsq$_PAGE_BIT_NX,early_pmd_flags(%rip) -1: wrmsr /* Make changes effective */ + /* Skip the WRMSR if the current value matches the desired value. */ +1: cmpl%edx, %eax + je 1f + xor %edx, %edx + wrmsr /* Make changes effective */ +1: /* Setup cr0 */ movl$CR0_STATE, %eax /* Make changes effective */ -- 2.25.1
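The EFER change above boils down to read-modify-conditional-write: only issue the WRMSR when the value actually changes, so a TDX guest (where EFER writes trigger a #VE) boots cleanly as long as firmware already set the desired bits. A user-space model of that flow, with a fake MSR standing in for hardware — all names here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t fake_efer;   /* stand-in for the real MSR */
static int wrmsr_count;      /* counts how many writes reached "hardware" */

static void wrmsr_efer(uint64_t val)
{
    fake_efer = val;
    wrmsr_count++;
}

/* Same idea as the head_64.S change: rdmsr, compute the desired value,
 * then skip the wrmsr entirely when nothing changed (cmpl %edx,%eax; je). */
static void set_efer_bits(uint64_t wanted_bits)
{
    uint64_t cur = fake_efer;            /* rdmsr */
    uint64_t desired = cur | wanted_bits;

    if (desired != cur)
        wrmsr_efer(desired);             /* wrmsr only on change */
}
```

With SCE (and possibly NX) already set by firmware, the boot path performs zero EFER writes and therefore takes zero #VEs here.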
[RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
From: "Kirill A. Shutemov" Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into a separate config option. It provides a middle ground for not-so-deep paravirtulized environments. CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt calls that was hidden under CONFIG_PARAVIRT_XXL, but the rest of the config would be a bloat for TDX. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/Kconfig | 4 +++ arch/x86/boot/compressed/misc.h | 1 + arch/x86/include/asm/irqflags.h | 42 +++ arch/x86/include/asm/paravirt.h | 22 +++--- arch/x86/include/asm/paravirt_types.h | 3 +- arch/x86/kernel/paravirt.c| 4 ++- arch/x86/mm/mem_encrypt_identity.c| 1 + 7 files changed, 46 insertions(+), 31 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7b6dd10b162a..8fe91114bfee 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -776,8 +776,12 @@ config PARAVIRT over full virtualization. However, when run without a hypervisor the kernel is theoretically slower and slightly larger. +config PARAVIRT_XL + bool + config PARAVIRT_XXL bool + select PARAVIRT_XL config PARAVIRT_DEBUG bool "paravirt-ops debugging" diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index 901ea5ebec22..4b84abe43765 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -9,6 +9,7 @@ * paravirt and debugging variants are added.) 
*/ #undef CONFIG_PARAVIRT +#undef CONFIG_PARAVIRT_XL #undef CONFIG_PARAVIRT_XXL #undef CONFIG_PARAVIRT_SPINLOCKS #undef CONFIG_KASAN diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h index 2dfc8d380dab..299c9b1ed857 100644 --- a/arch/x86/include/asm/irqflags.h +++ b/arch/x86/include/asm/irqflags.h @@ -68,11 +68,33 @@ static inline __cpuidle void native_halt(void) #endif -#ifdef CONFIG_PARAVIRT_XXL +#ifdef CONFIG_PARAVIRT_XL #include #else #ifndef __ASSEMBLY__ #include +/* + * Used in the idle loop; sti takes one instruction cycle + * to complete: + */ +static inline __cpuidle void arch_safe_halt(void) +{ + native_safe_halt(); +} + +/* + * Used when interrupts are already enabled or to + * shutdown the processor: + */ +static inline __cpuidle void halt(void) +{ + native_halt(); +} +#endif /* !__ASSEMBLY__ */ +#endif /* CONFIG_PARAVIRT_XL */ + +#ifndef CONFIG_PARAVIRT_XXL +#ifndef __ASSEMBLY__ static __always_inline unsigned long arch_local_save_flags(void) { @@ -94,24 +116,6 @@ static __always_inline void arch_local_irq_enable(void) native_irq_enable(); } -/* - * Used in the idle loop; sti takes one instruction cycle - * to complete: - */ -static inline __cpuidle void arch_safe_halt(void) -{ - native_safe_halt(); -} - -/* - * Used when interrupts are already enabled or to - * shutdown the processor: - */ -static inline __cpuidle void halt(void) -{ - native_halt(); -} - /* * For spinlocks, etc: */ diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index f8dce11d2bc1..700b94abfd1b 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) PVOP_VCALL1(mmu.exit_mmap, mm); } +#ifdef CONFIG_PARAVIRT_XL +static inline void arch_safe_halt(void) +{ + PVOP_VCALL0(irq.safe_halt); +} + +static inline void halt(void) +{ + PVOP_VCALL0(irq.halt); +} +#endif + #ifdef CONFIG_PARAVIRT_XXL static inline void 
load_sp0(unsigned long sp0) { @@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } -static inline void arch_safe_halt(void) -{ - PVOP_VCALL0(irq.safe_halt); -} - -static inline void halt(void) -{ - PVOP_VCALL0(irq.halt); -} - static inline void wbinvd(void) { PVOP_VCALL0(cpu.wbinvd); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index b6b02b7c19cc..634482a0a60d 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -190,7 +190,8 @@ struct pv_irq_ops { struct paravirt_callee_save restore_fl; struct paravirt_callee_save irq_disable; struct paravirt_callee_save irq_enable; - +#endif +#ifdef CONFIG_PARAVIRT_XL void (*safe_halt)(void); void (*halt)(void); #endif diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 6c3407ba6ee9..85714a6389d6 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -327,9 +327,11 @@ struct paravirt_patch_template pv_ops = { .irq.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), .irq.irq_disable= __PV_IS_CALLEE_SAVE(native_irq_disable),
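The effect of the PARAVIRT_XL split is that halt()/safe_halt() become indirect calls through pv_ops, which a guest type can repoint during early boot (as the TDX patches later do with pv_ops.irq.halt = tdx_halt). A toy model of that dispatch — the struct and function names are simplified stand-ins for the real pv_ops plumbing, not kernel code:

```c
#include <assert.h>

/* Minimal model of the pv_ops irq table after this split: under
 * PARAVIRT_XL only safe_halt/halt are indirect. */
struct pv_irq_ops {
    void (*safe_halt)(void);
    void (*halt)(void);
};

static int native_halts, tdx_halts;

static void native_halt_stub(void) { native_halts++; }
static void tdx_halt_stub(void)    { tdx_halts++; }

/* Defaults point at the native implementation, like paravirt.c does. */
static struct pv_irq_ops irq_ops = { native_halt_stub, native_halt_stub };

/* Callers always go through the table and never change. */
static void halt(void) { irq_ops.halt(); }

/* Mirrors the tdx_early_init() hook-up of pv_ops.irq.{halt,safe_halt}. */
static void tdx_early_init_stub(void)
{
    irq_ops.halt = tdx_halt_stub;
    irq_ops.safe_halt = tdx_halt_stub;
}
```

Before the init hook runs, halt() lands in the native path; afterwards the same call site reaches the TDX implementation, which is exactly the property the config split buys without dragging in the rest of PARAVIRT_XXL.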
[RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
Add a helper function to detect TDX feature support. It will be used to protect TDX-specific code. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/boot/compressed/Makefile | 1 + arch/x86/boot/compressed/tdx.c| 32 +++ arch/x86/include/asm/tdx.h| 8 arch/x86/kernel/tdx.c | 6 ++ 4 files changed, 47 insertions(+) create mode 100644 arch/x86/boot/compressed/tdx.c diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile index e0bc3988c3fa..a2554621cefe 100644 --- a/arch/x86/boot/compressed/Makefile +++ b/arch/x86/boot/compressed/Makefile @@ -96,6 +96,7 @@ ifdef CONFIG_X86_64 endif vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c new file mode 100644 index ..0a87c1775b67 --- /dev/null +++ b/arch/x86/boot/compressed/tdx.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * tdx.c - Early boot code for TDX + */ + +#include + +static int __ro_after_init tdx_guest = -1; + +static inline bool native_cpuid_has_tdx_guest(void) +{ + u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0}; + + if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID) + return false; + + native_cpuid(&eax, &signature[0], &signature[1], &signature[2]); + + if (memcmp("IntelTDX    ", signature, 12)) + return false; + + return true; +} + +bool is_tdx_guest(void) +{ + if (tdx_guest < 0) + tdx_guest = native_cpuid_has_tdx_guest(); + + return !!tdx_guest; +} + diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 2cc246c0cecf..0b9d571b1f95 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -7,10 +7,18 @@ #ifdef CONFIG_INTEL_TDX_GUEST +/* Common API to check TDX support in decompression and common kernel code.
*/ +bool is_tdx_guest(void); + void __init tdx_early_init(void); #else // !CONFIG_INTEL_TDX_GUEST +static inline bool is_tdx_guest(void) +{ + return false; +} + static inline void tdx_early_init(void) { }; #endif /* CONFIG_INTEL_TDX_GUEST */ diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 473b4c1c0920..e44e55d1e519 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -20,6 +20,12 @@ static inline bool cpuid_has_tdx_guest(void) return true; } +bool is_tdx_guest(void) +{ + return static_cpu_has(X86_FEATURE_TDX_GUEST); +} +EXPORT_SYMBOL_GPL(is_tdx_guest); + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) -- 2.25.1
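The CPUID-signature probe in native_cpuid_has_tdx_guest() concatenates the EBX/ECX/EDX outputs of leaf 0x21 into a 12-byte vendor string and memcmp()s it. A user-space model of just the comparison — note the exact signature bytes and the little-endian register packing are assumptions based on this patch, so treat the constants as illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the signature check: the three CPUID output registers
 * (in EBX, ECX, EDX order, matching native_cpuid()'s argument order)
 * concatenate to a 12-byte vendor string. Little-endian byte order
 * is assumed, as on x86. */
static int signature_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    uint32_t sig[3] = { ebx, ecx, edx };

    return memcmp("IntelTDX    ", sig, 12) == 0;
}
```

The same layout trick is used by the "GenuineIntel"/"AuthenticAMD" vendor checks elsewhere in the kernel: treating three 32-bit registers as one contiguous byte array avoids any per-character shifting.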
[RFC v1 05/26] x86/traps: Add #VE support for TDX guest
From: "Kirill A. Shutemov" The TDX module injects a #VE exception to the guest TD in cases of disallowed instructions, disallowed MSR accesses and a subset of CPUID leaves. Also, it's theoretically possible for the CPU to inject #VE on an EPT violation, but the TDX module makes sure this does not happen, as long as all memory used is properly accepted using TDCALLs. You can find more details about it in the Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec 2.3. Add basic infrastructure to handle #VE. If there is no handler for a given #VE, since it's an unexpected event (fault case), treat it as a general protection fault and handle it using a do_general_protection() call. TDCALL[TDGETVEINFO] provides information about the #VE, such as the exit reason. More details on cases where #VE exceptions are allowed/not-allowed: The #VE exception does not occur in the paranoid entry paths, like NMIs. While other operations during an NMI might cause #VE, these are in the NMI code that can handle nesting, so there is no concern about reentrancy. This is similar to how #PF is handled in NMIs. The #VE exception also cannot happen in entry/exit code with the wrong gs, such as the SWAPGS code, so its entry point does not need "paranoid" handling. Any memory access can cause #VE if it causes an EPT violation. However, the VMM is only in direct control of some of the EPT tables. The Secure EPT tables are controlled by the TDX module, which guarantees no EPT violations will result in #VE for the guest, once the memory has been accepted. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kirill A.
Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/idtentry.h | 4 ++ arch/x86/include/asm/tdx.h | 14 +++ arch/x86/kernel/idt.c | 6 +++ arch/x86/kernel/tdx.c | 31 ++ arch/x86/kernel/traps.c | 73 ++--- 5 files changed, 105 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 247a60a47331..a2cbb68f9ae8 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -615,6 +615,10 @@ DECLARE_IDTENTRY_VC(X86_TRAP_VC, exc_vmm_communication); DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback); #endif +#ifdef CONFIG_INTEL_TDX_GUEST +DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception); +#endif + /* Device interrupts common/spurious */ DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt); #ifdef CONFIG_X86_LOCAL_APIC diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index f8cdc8eb1046..90eb61b07d1f 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -15,6 +15,7 @@ #define TDCALL ".byte 0x66,0x0f,0x01,0xcc" #define TDINFO 1 +#define TDGETVEINFO3 /* Common API to check TDX support in decompression and common kernel code. 
*/ bool is_tdx_guest(void); @@ -32,4 +33,17 @@ static inline void tdx_early_init(void) { }; #endif /* CONFIG_INTEL_TDX_GUEST */ +struct ve_info { + unsigned int exit_reason; + unsigned long exit_qual; + unsigned long gla; + unsigned long gpa; + unsigned int instr_len; + unsigned int instr_info; +}; + +unsigned long tdx_get_ve_info(struct ve_info *ve); +int tdx_handle_virtualization_exception(struct pt_regs *regs, + struct ve_info *ve); + #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index ee1a283f8e96..546b6b636c7d 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = { */ INTG(X86_TRAP_PF, asm_exc_page_fault), #endif +#ifdef CONFIG_INTEL_TDX_GUEST + INTG(X86_TRAP_VE, asm_exc_virtualization_exception), +#endif }; /* @@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = { INTG(X86_TRAP_MF, asm_exc_coprocessor_error), INTG(X86_TRAP_AC, asm_exc_alignment_check), INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error), +#ifdef CONFIG_INTEL_TDX_GUEST + INTG(X86_TRAP_VE, asm_exc_virtualization_exception), +#endif #ifdef CONFIG_X86_32 TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS), diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 13303bfdfdd1..ae2d5c847700 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -62,3 +62,34 @@ void __init tdx_early_init(void) pr_info("TDX guest is initialized\n"); } + +unsigned long tdx_get_ve_info(struct ve_info *ve) +{ + register long r8 asm("r8"); + register long r9 asm("r9"); + register long r10 asm("r10");
[RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO
From: "Kirill A. Shutemov" Per Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec 2.4.2, TDCALL[TDINFO] provides basic TD execution environment information not provided by CPUID. Call TDINFO during early boot so the information can be used in subsequent system initialization. The call provides info on which bit in the pfn is used to indicate that the page is shared with the host, and on attributes of the TD, such as debug. We don't save information about the number of CPUs, as there are no users so far. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/tdx.h | 9 + arch/x86/kernel/tdx.c | 27 +++ 2 files changed, 36 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 0b9d571b1f95..f8cdc8eb1046 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -7,6 +7,15 @@ #ifdef CONFIG_INTEL_TDX_GUEST +/* + * The TDCALL instruction is newly added in the TDX architecture, + * used by the TD to request the host VMM to provide + * (untrusted) services. + */ +#define TDCALL ".byte 0x66,0x0f,0x01,0xcc" + +#define TDINFO 1 + /* Common API to check TDX support in decompression and common kernel code. 
*/ bool is_tdx_guest(void); diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index e44e55d1e519..13303bfdfdd1 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -3,6 +3,14 @@ #include #include +#include +#include +#include + +static struct { + unsigned int gpa_width; + unsigned long attributes; +} td_info __ro_after_init; static inline bool cpuid_has_tdx_guest(void) { @@ -26,6 +34,23 @@ bool is_tdx_guest(void) } EXPORT_SYMBOL_GPL(is_tdx_guest); +static void tdx_get_info(void) +{ + register long rcx asm("rcx"); + register long rdx asm("rdx"); + register long r8 asm("r8"); + long ret; + + asm volatile(TDCALL +: "=a"(ret), "=c"(rcx), "=r"(rdx), "=r"(r8) +: "a"(TDINFO) +: "r9", "r10", "r11", "memory"); + BUG_ON(ret); + + td_info.gpa_width = rcx & GENMASK(5, 0); + td_info.attributes = rdx; +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -33,5 +58,7 @@ void __init tdx_early_init(void) setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); + tdx_get_info(); + pr_info("TDX guest is initialized\n"); } -- 2.25.1
[RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature
Add CPU feature detection for Trusted Domain Extensions support. The TDX feature adds capabilities to keep guest register state and memory isolated from the hypervisor. For TDX guest platforms, executing CPUID(0x21, 0) will return the following values in EAX, EBX, ECX and EDX. EAX: Maximum sub-leaf number: 0 EBX/EDX/ECX: Vendor string: EBX = "Inte", EDX = "lTDX", ECX = "    " (four spaces). When the above condition is true, set the X86_FEATURE_TDX_GUEST feature cap bit. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/tdx.h | 18 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/head64.c | 3 +++ arch/x86/kernel/tdx.c | 31 ++ 5 files changed, 54 insertions(+) create mode 100644 arch/x86/include/asm/tdx.h create mode 100644 arch/x86/kernel/tdx.c diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 84b887825f12..989e2b302880 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -238,6 +238,7 @@ #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */ #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */ #define X86_FEATURE_VM_PAGE_FLUSH ( 8*32+21) /* "" VM Page Flush MSR is supported */ +#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* Trusted Domain Extensions Guest */ /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */ #define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/ diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h new file mode 100644 index ..2cc246c0cecf --- /dev/null +++ b/arch/x86/include/asm/tdx.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2020 Intel Corporation */ +#ifndef _ASM_X86_TDX_H +#define _ASM_X86_TDX_H + +#define TDX_CPUID_LEAF_ID 0x21 + +#ifdef CONFIG_INTEL_TDX_GUEST + +void __init tdx_early_init(void); + +#else // !CONFIG_INTEL_TDX_GUEST + 
+static inline void tdx_early_init(void) { }; + +#endif /* CONFIG_INTEL_TDX_GUEST */ + +#endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 5eeb808eb024..ba8ee9300f23 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -128,6 +128,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)+= pvclock.o obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o +obj-$(CONFIG_INTEL_TDX_GUEST) += tdx.o obj-$(CONFIG_EISA) += eisa.o obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index 5e9beb77cafd..75f2401cb5db 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -40,6 +40,7 @@ #include #include #include +#include /* * Manage page tables very early on. @@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) kasan_early_init(); + tdx_early_init(); + idt_setup_early_handler(); copy_bootdata(__va(real_mode_data)); diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c new file mode 100644 index ..473b4c1c0920 --- /dev/null +++ b/arch/x86/kernel/tdx.c @@ -0,0 +1,31 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* Copyright (C) 2020 Intel Corporation */ + +#include +#include + +static inline bool cpuid_has_tdx_guest(void) +{ + u32 eax, signature[3]; + + if (cpuid_eax(0) < TDX_CPUID_LEAF_ID) + return false; + + cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0], + &signature[1], &signature[2]); + + if (memcmp("IntelTDX    ", signature, 12)) + return false; + + return true; +} + +void __init tdx_early_init(void) +{ + if (!cpuid_has_tdx_guest()) + return; + + setup_force_cpu_cap(X86_FEATURE_TDX_GUEST); + + pr_info("TDX guest is initialized\n"); +} -- 2.25.1
[RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMCALL
From: "Kirill A. Shutemov" MapGPA TDVMCALL requests the host VMM to map a GPA range as private or shared memory mappings. Shared GPA mappings can be used for communication between TD guest and host VMM, for example for paravirtualized IO. The new helper tdx_map_gpa() provides access to the operation. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/tdx.h | 2 ++ arch/x86/kernel/tdx.c | 28 2 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 9bbfe6520ea4..efffdef35c78 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -105,5 +105,7 @@ long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2, unsigned long p3, unsigned long p4); phys_addr_t tdx_shared_mask(void); + +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private); #endif #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 9681f4a0b4e0..f99fe54b4f88 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -14,6 +14,8 @@ #include "tdx-kvm.c" #endif +#define TDVMCALL_MAP_GPA 0x10001 + static struct { unsigned int gpa_width; unsigned long attributes; @@ -66,6 +68,32 @@ static void tdx_get_info(void) physical_mask &= ~tdx_shared_mask(); } +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private) +{ + register long r10 asm("r10") = TDVMCALL_STANDARD; + register long r11 asm("r11") = TDVMCALL_MAP_GPA; + register long r12 asm("r12") = gpa; + register long r13 asm("r13") = PAGE_SIZE * numpages; + register long rcx asm("rcx"); + long ret; + + if (!private) + r12 |= tdx_shared_mask(); + + /* Allow to pass R10, R11, R12 and R13 down to the VMM */ + rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13); + + asm volatile(TDCALL + : "=a"(ret), "=r"(r10) + : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12), + "r"(r13) + : ); + + // Host kernel doesn't implement it yet. 
+ // WARN_ON(ret || r10); + return ret || r10 ? -EIO : 0; +} + static __cpuidle void tdx_halt(void) { register long r10 asm("r10") = TDVMCALL_STANDARD; -- 2.25.1
Re: [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
Hi Andy, On 2/5/21 3:43 PM, Andy Lutomirski wrote: MWAIT turning into NOP is no good. How about suppressing X86_FEATURE_MWAIT instead? Yes, we can suppress it in tdx_early_init(). + setup_clear_cpu_cap(X86_FEATURE_MWAIT); But do you want to leave the MWAIT #VE handler as it is (just in case)? -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[RFC v1 25/26] x86/tdx: Make DMA pages shared
From: "Kirill A. Shutemov" Make force_dma_unencrypted() return true for TDX to get DMA pages mapped as shared. __set_memory_enc_dec() is now aware about TDX and sets Shared bit accordingly following with relevant TDVMCALL. Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when converting memory to private. If the VMM uses a common pool for private and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA (or on the first access to the private GPA), in which case TDX-Module will hold the page in a non-present "pending" state until it is explicitly accepted by the BUG() if TDACCEPTPAGE fails (except the above case), as the guest is completely hosed if it can't access memory. Tested-by: Kai Huang Signed-off-by: Kirill A. Shutemov Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/tdx.h | 3 +++ arch/x86/kernel/tdx.c| 29 ++--- arch/x86/mm/mem_encrypt_common.c | 4 ++-- arch/x86/mm/pat/set_memory.c | 23 ++- 4 files changed, 49 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index efffdef35c78..9b66c3a5cf83 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -8,6 +8,9 @@ #define TDVMCALL 0 #define TDINFO 1 #define TDGETVEINFO3 +#define TDACCEPTPAGE 6 + +#define TDX_PAGE_ALREADY_ACCEPTED 0x0B0A /* TDVMCALL R10 Input */ #define TDVMCALL_STANDARD 0 diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index f99fe54b4f88..f51a19168adc 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -68,7 +68,7 @@ static void tdx_get_info(void) physical_mask &= ~tdx_shared_mask(); } -int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private) +static int __tdx_map_gpa(phys_addr_t gpa, int numpages, bool private) { register long r10 asm("r10") = TDVMCALL_STANDARD; register long r11 asm("r11") = TDVMCALL_MAP_GPA; @@ -89,11 +89,34 @@ int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private) 
"r"(r13) : ); - // Host kernel doesn't implement it yet. - // WARN_ON(ret || r10); + WARN_ON(ret || r10); return ret || r10 ? -EIO : 0; } +static void tdx_accept_page(phys_addr_t gpa) +{ + u64 ret; + + asm volatile(TDCALL : "=a"(ret) : "a"(TDACCEPTPAGE), "c"(gpa)); + + BUG_ON(ret && (ret & ~0xffull) != TDX_PAGE_ALREADY_ACCEPTED); +} + + +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private) +{ + int ret, i; + + ret = __tdx_map_gpa(gpa, numpages, private); + if (ret || !private) + return ret; + + for (i = 0; i < numpages; i++) + tdx_accept_page(gpa + i*PAGE_SIZE); + + return 0; +} + static __cpuidle void tdx_halt(void) { register long r10 asm("r10") = TDVMCALL_STANDARD; diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c index 964e04152417..b6d93b0c5dcf 100644 --- a/arch/x86/mm/mem_encrypt_common.c +++ b/arch/x86/mm/mem_encrypt_common.c @@ -15,9 +15,9 @@ bool force_dma_unencrypted(struct device *dev) { /* -* For SEV, all DMA must be to unencrypted/shared addresses. +* For SEV and TDX, all DMA must be to unencrypted/shared addresses. 
*/ - if (sev_active()) + if (sev_active() || is_tdx_guest()) return true; /* diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 16f878c26667..6f23a9816ef0 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "../mm_internal.h" @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc) struct cpa_data cpa; int ret; - /* Nothing to do if memory encryption is not active */ - if (!mem_encrypt_active()) + /* Nothing to do if memory encryption and TDX are not active */ + if (!mem_encrypt_active() && !is_tdx_guest()) return 0; /* Should not be working on unaligned addresses */ @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc) memset(&cpa, 0, sizeof(cpa)); cpa.vaddr = &addr; cpa.numpages = numpages; - cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0); - cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC); + if (is_tdx_guest()) { + cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask()); + cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0); + } else { + cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0); +
[RFC v1 23/26] x86/tdx: Make pages shared in ioremap()
From: "Kirill A. Shutemov" All ioremap()ed pages that are not backed by normal memory (NONE or RESERVED) have to be mapped as shared. Reuse the infrastructure we have for AMD SEV. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/pgtable.h | 3 +++ arch/x86/mm/ioremap.c | 8 +--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a02c67291cfc..a82bab48379e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -21,6 +21,9 @@ #define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot))) #define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot))) +/* Make the page accessible by the VMM */ +#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | tdx_shared_mask()) + #ifndef __ASSEMBLY__ #include #include diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 9e5ccc56f8e0..a0ba760866d4 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res) } /* - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because - * there the whole memory is already encrypted. + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or + * private in TDX case) because there the whole memory is already encrypted. */ static unsigned int __ioremap_check_encrypted(struct resource *res) { - if (!sev_active()) + if (!sev_active() && !is_tdx_guest()) return 0; switch (res->desc) { @@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size, prot = PAGE_KERNEL_IO; if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted) prot = pgprot_encrypted(prot); + else if (is_tdx_guest()) + prot = pgprot_tdx_shared(prot); switch (pcm) { case _PAGE_CACHE_MODE_UC: -- 2.25.1
[RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest
From: "Kirill A. Shutemov" TDX doesn't allow DMA access to guest private memory. In order for DMA to work properly in a TD guest, use SWIOTLB bounce buffers. Move the AMD SEV initialization into common code and adapt it for TDX. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/pci-swiotlb.c| 2 +- arch/x86/kernel/tdx.c| 3 +++ arch/x86/mm/mem_encrypt.c| 44 --- arch/x86/mm/mem_encrypt_common.c | 45 4 files changed, 49 insertions(+), 45 deletions(-) diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c index c2cfa5e7c152..020e13749758 100644 --- a/arch/x86/kernel/pci-swiotlb.c +++ b/arch/x86/kernel/pci-swiotlb.c @@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void) * buffers are allocated and used for devices that do not support * the addressing range required for the encryption mask. */ - if (sme_active()) + if (sme_active() || is_tdx_guest()) swiotlb = 1; return swiotlb; diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index f51a19168adc..ccb9401bd706 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -9,6 +9,7 @@ #include #include #include /* force_sig_fault() */ +#include #ifdef CONFIG_KVM_GUEST #include "tdx-kvm.c" #endif @@ -472,6 +473,8 @@ void __init tdx_early_init(void) legacy_pic = _legacy_pic; + swiotlb_force = SWIOTLB_FORCE; + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug", NULL, tdx_cpu_offline_prepare); diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c index 11a6a7b3af7e..7fbbb2f3d426 100644 --- a/arch/x86/mm/mem_encrypt.c +++ b/arch/x86/mm/mem_encrypt.c @@ -408,47 +408,3 @@ void __init mem_encrypt_free_decrypted_mem(void) free_init_pages("unused decrypted", vaddr, vaddr_end); } - -static void print_mem_encrypt_feature_info(void) -{ - pr_info("AMD Memory Encryption Features active:"); - - /* Secure Memory Encryption */ - if (sme_active()) { - /* -* SME is mutually exclusive with any of the SEV -* features below. 
-*/ - pr_cont(" SME\n"); - return; - } - - /* Secure Encrypted Virtualization */ - if (sev_active()) - pr_cont(" SEV"); - - /* Encrypted Register State */ - if (sev_es_active()) - pr_cont(" SEV-ES"); - - pr_cont("\n"); -} - -/* Architecture __weak replacement functions */ -void __init mem_encrypt_init(void) -{ - if (!sme_me_mask) - return; - - /* Call into SWIOTLB to update the SWIOTLB DMA buffers */ - swiotlb_update_mem_attributes(); - - /* -* With SEV, we need to unroll the rep string I/O instructions. -*/ - if (sev_active()) - static_branch_enable(_enable_key); - - print_mem_encrypt_feature_info(); -} - diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c index b6d93b0c5dcf..6f3d90d4d68e 100644 --- a/arch/x86/mm/mem_encrypt_common.c +++ b/arch/x86/mm/mem_encrypt_common.c @@ -10,6 +10,7 @@ #include #include #include +#include /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */ bool force_dma_unencrypted(struct device *dev) @@ -36,3 +37,47 @@ bool force_dma_unencrypted(struct device *dev) return false; } + +static void print_mem_encrypt_feature_info(void) +{ + pr_info("AMD Memory Encryption Features active:"); + + /* Secure Memory Encryption */ + if (sme_active()) { + /* +* SME is mutually exclusive with any of the SEV +* features below. +*/ + pr_cont(" SME\n"); + return; + } + + /* Secure Encrypted Virtualization */ + if (sev_active()) + pr_cont(" SEV"); + + /* Encrypted Register State */ + if (sev_es_active()) + pr_cont(" SEV-ES"); + + pr_cont("\n"); +} + +/* Architecture __weak replacement functions */ +void __init mem_encrypt_init(void) +{ + if (!sme_me_mask && !is_tdx_guest()) + return; + + /* Call into SWIOTLB to update the SWIOTLB DMA buffers */ + swiotlb_update_mem_attributes(); + + /* +* With SEV, we need to unroll the rep string I/O instructions. +*/ + if (sev_active()) + static_branch_enable(_enable_key); + + if (!is_tdx_guest()) + print_mem_encrypt_feature_info(); +} -- 2.25.1
[RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
From: "Kirill A. Shutemov" Intel TDX doesn't allow the VMM to access guest memory. Any memory that is required for communication with the VMM is supposed to be shared explicitly by setting a bit in the page table entry. The shared memory is similar to unencrypted memory in AMD SME/SEV terminology. force_dma_unencrypted() has to return true for a TDX guest. Move it out of the AMD SME code. Introduce a new config option X86_MEM_ENCRYPT_COMMON that has to be selected by all x86 memory encryption features. This is preparation for TDX changes in DMA code. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/Kconfig | 8 +-- arch/x86/include/asm/io.h| 4 +++- arch/x86/mm/Makefile | 2 ++ arch/x86/mm/mem_encrypt.c| 30 - arch/x86/mm/mem_encrypt_common.c | 38 5 files changed, 49 insertions(+), 33 deletions(-) create mode 100644 arch/x86/mm/mem_encrypt_common.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 0374d9f262a5..8fa654d61ac2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS helps to determine the effectiveness of preserving large and huge page mappings when mapping protections are changed. +config X86_MEM_ENCRYPT_COMMON + select ARCH_HAS_FORCE_DMA_UNENCRYPTED + select DYNAMIC_PHYSICAL_MASK + def_bool n + config AMD_MEM_ENCRYPT bool "AMD Secure Memory Encryption (SME) support" depends on X86_64 && CPU_SUP_AMD select DMA_COHERENT_POOL - select DYNAMIC_PHYSICAL_MASK select ARCH_USE_MEMREMAP_PROT - select ARCH_HAS_FORCE_DMA_UNENCRYPTED select INSTRUCTION_DECODER + select X86_MEM_ENCRYPT_COMMON help Say yes to enable support for the encryption of system memory. 
This requires an AMD processor that supports Secure Memory diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index 30a3b30395ad..95e534cffa99 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -257,10 +257,12 @@ static inline void slow_down_io(void) #endif -#ifdef CONFIG_AMD_MEM_ENCRYPT #include extern struct static_key_false sev_enable_key; + +#ifdef CONFIG_AMD_MEM_ENCRYPT + static inline bool sev_key_active(void) { return static_branch_unlikely(_enable_key); diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 5864219221ca..b31cb52bf1bd 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)+= pkeys.o obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o +obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o + obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c index c79e5736ab2b..11a6a7b3af7e 100644 --- a/arch/x86/mm/mem_encrypt.c +++ b/arch/x86/mm/mem_encrypt.c @@ -15,10 +15,6 @@ #include #include #include -#include -#include -#include -#include #include #include @@ -389,32 +385,6 @@ bool noinstr sev_es_active(void) return sev_status & MSR_AMD64_SEV_ES_ENABLED; } -/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */ -bool force_dma_unencrypted(struct device *dev) -{ - /* -* For SEV, all DMA must be to unencrypted addresses. -*/ - if (sev_active()) - return true; - - /* -* For SME, all DMA must be to unencrypted addresses if the -* device does not support DMA to addresses that include the -* encryption mask. 
-*/ - if (sme_active()) { - u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask)); - u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask, - dev->bus_dma_limit); - - if (dma_dev_mask <= dma_enc_mask) - return true; - } - - return false; -} - void __init mem_encrypt_free_decrypted_mem(void) { unsigned long vaddr, vaddr_end, npages; diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c new file mode 100644 index ..964e04152417 --- /dev/null +++ b/arch/x86/mm/mem_encrypt_common.c @@ -0,0 +1,38 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * AMD Memory Encryption Support + * + * Copyright (C) 2016 Advanced Micro Devices, Inc. + * + * Author: Tom Lendacky + */ + +#include +#include +#include + +/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */ +bool force_dma_unencrypted(struct device *dev) +{ + /* +* For SEV, all DM
[RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
From: "Kirill A. Shutemov" tdx_shared_mask() returns the mask that has to be set in a page table entry to make a page shared with the VMM. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/Kconfig | 1 + arch/x86/include/asm/tdx.h | 1 + arch/x86/kernel/tdx.c | 8 3 files changed, 10 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8fa654d61ac2..f10a00c4ad7f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -875,6 +875,7 @@ config INTEL_TDX_GUEST select PARAVIRT_XL select X86_X2APIC select SECURITY_LOCKDOWN_LSM + select X86_MEM_ENCRYPT_COMMON help Provide support for running in a trusted domain on Intel processors equipped with Trusted Domain eXtensions. TDX is a new Intel diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index b46ae140e39b..9bbfe6520ea4 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -104,5 +104,6 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2, long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2, unsigned long p3, unsigned long p4); +phys_addr_t tdx_shared_mask(void); #endif #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index ae37498df981..9681f4a0b4e0 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -41,6 +41,11 @@ bool is_tdx_guest(void) } EXPORT_SYMBOL_GPL(is_tdx_guest); +phys_addr_t tdx_shared_mask(void) +{ + return 1ULL << (td_info.gpa_width - 1); +} + static void tdx_get_info(void) { register long rcx asm("rcx"); @@ -56,6 +61,9 @@ static void tdx_get_info(void) td_info.gpa_width = rcx & GENMASK(5, 0); td_info.attributes = rdx; + + /* Exclude Shared bit from the __PHYSICAL_MASK */ + physical_mask &= ~tdx_shared_mask(); } static __cpuidle void tdx_halt(void) -- 2.25.1
[RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option
Add INTEL_TDX_GUEST config option to selectively compile TDX guest support. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- arch/x86/Kconfig | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8fe91114bfee..0374d9f262a5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -868,6 +868,21 @@ config ACRN_GUEST IOT with small footprint and real-time features. More details can be found in https://projectacrn.org/. +config INTEL_TDX_GUEST + bool "Intel Trusted Domain eXtensions Guest Support" + depends on X86_64 && CPU_SUP_INTEL && PARAVIRT + depends on SECURITY + select PARAVIRT_XL + select X86_X2APIC + select SECURITY_LOCKDOWN_LSM + help + Provide support for running in a trusted domain on Intel processors + equipped with Trusted Domain eXtensions. TDX is a new Intel + technology that extends VMX and Memory Encryption with a new kind of + virtual machine guest called Trust Domain (TD). A TD is designed to + run in a CPU mode that protects the confidentiality of TD memory + contents and the TD's CPU state from other software, including the VMM. + endif #HYPERVISOR_GUEST source "arch/x86/Kconfig.cpu" -- 2.25.1
[RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms.
As per Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and sec 9.4, all unused CPUs are put in a spinning state by TDVF until the OS requests CPU bring-up via the mailbox address passed in the ACPI MADT table. Since by default all unused CPUs are always in a spinning state, there is no point in supporting a dynamic CPU online/offline feature. So the current generation of TDVF does not support CPU hotplug. It may be supported in the next generation. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Andi Kleen --- arch/x86/kernel/tdx.c | 14 ++ arch/x86/kernel/topology.c | 3 ++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index 8d1d7555fb56..a36b6ae14942 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -387,6 +387,17 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve) return insn.length; } +static int tdx_cpu_offline_prepare(unsigned int cpu) +{ + /* +* Per Intel TDX Virtual Firmware Design Guide, +* sec 4.3.5 and sec 9.4, Hotplug is not supported +* in TDX platforms. So don't support CPU +* offline feature once it's turned on. 
+*/ + return -EOPNOTSUPP; +} + void __init tdx_early_init(void) { if (!cpuid_has_tdx_guest()) @@ -399,6 +410,9 @@ void __init tdx_early_init(void) pv_ops.irq.safe_halt = tdx_safe_halt; pv_ops.irq.halt = tdx_halt; + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug", + NULL, tdx_cpu_offline_prepare); + pr_info("TDX guest is initialized\n"); } diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c index f5477eab5692..d879ea96d79c 100644 --- a/arch/x86/kernel/topology.c +++ b/arch/x86/kernel/topology.c @@ -34,6 +34,7 @@ #include #include #include +#include static DEFINE_PER_CPU(struct x86_cpu, cpu_devices); @@ -130,7 +131,7 @@ int arch_register_cpu(int num) } } } - if (num || cpu0_hotpluggable) + if ((num || cpu0_hotpluggable) && !is_tdx_guest()) per_cpu(cpu_devices, num).cpu.hotpluggable = 1; return register_cpu(_cpu(cpu_devices, num).cpu, num); -- 2.25.1
[RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests
From: Sean Christopherson Disable the legacy PIC (8259) for TDX guests as the PIC cannot be supported by the VMM. TDX Module does not allow direct IRQ injection, and using posted interrupt style delivery requires the guest to EOI the IRQ, which diverges from the legacy PIC behavior. Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/kernel/tdx.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c index a36b6ae14942..ae37498df981 100644 --- a/arch/x86/kernel/tdx.c +++ b/arch/x86/kernel/tdx.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include @@ -410,6 +411,8 @@ void __init tdx_early_init(void) pv_ops.irq.safe_halt = tdx_safe_halt; pv_ops.irq.halt = tdx_halt; + legacy_pic = _legacy_pic; + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug", NULL, tdx_cpu_offline_prepare); -- 2.25.1
[RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms
From: Sean Christopherson Avoid operations which will inject #VE during compressed boot, which is obviously fatal for TDX platforms. Details are: 1. The TDX module injects #VE if a TDX guest attempts to write EFER. So skip the WRMSR to set EFER.LME=1 if it's already set. TDX also forces EFER.LME=1, i.e. the branch will always be taken and thus the #VE avoided. 2. The TDX module also injects a #VE if the guest attempts to clear CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed boot. Setting CR0.NE should be a nop on all CPUs that support 64-bit mode. Signed-off-by: Sean Christopherson Reviewed-by: Andi Kleen Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/boot/compressed/head_64.S | 5 +++-- arch/x86/boot/compressed/pgtable.h | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index e94874f4bbc1..37c2f37d4a0d 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src) movl$MSR_EFER, %ecx rdmsr btsl$_EFER_LME, %eax + jc 1f wrmsr - popl%edx +1: popl%edx popl%ecx /* Enable PAE and LA57 (if required) paging modes */ @@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src) pushl %eax /* Enable paging again */ - movl$(X86_CR0_PG | X86_CR0_PE), %eax + movl$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax movl%eax, %cr0 lret diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h index 6ff7e81b5628..cc9b2529a086 100644 --- a/arch/x86/boot/compressed/pgtable.h +++ b/arch/x86/boot/compressed/pgtable.h @@ -6,7 +6,7 @@ #define TRAMPOLINE_32BIT_PGTABLE_OFFSET0 #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE -#define TRAMPOLINE_32BIT_CODE_SIZE 0x70 +#define TRAMPOLINE_32BIT_CODE_SIZE 0x80 #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE -- 2.25.1
[PATCH v13 4/5] PCI/ACPI: Centralize pcie_ports_native checking
If the user booted with "pcie_ports=native", we take control of the PCIe features unconditionally, regardless of what _OSC says. Centralize the testing of pcie_ports_native in acpi_pci_root_create(), where we interpret the _OSC results, so other places only have to check host_bridge->native_X and we don't have to sprinkle tests of pcie_ports_native everywhere. [bhelgaas: commit log, rework OSC_PCI_EXPRESS_CONTROL_MASKS, logging] Link: https://lore.kernel.org/r/bc87c9e675118960949043a832bed86bc22becbd.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Bjorn Helgaas --- drivers/acpi/pci_root.c | 19 +++ drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 --- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/portdrv_core.c | 11 --- 5 files changed, 25 insertions(+), 12 deletions(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index 601fbe905993..16ca58d58fef 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -880,6 +880,8 @@ static void acpi_pci_root_release_info(struct pci_host_bridge *bridge) flag = 0; \ } while (0) +#define FLAG(x)((x) ? 
'+' : '-') + struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, struct acpi_pci_root_ops *ops, struct acpi_pci_root_info *info, @@ -928,6 +930,23 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr); OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc); + if (pcie_ports_native) { + dev_info(&root->device->dev, "Taking control of PCIe-related features because \"pcie_ports=native\" specified; may conflict with firmware\n"); + host_bridge->native_pcie_hotplug = 1; + host_bridge->native_aer = 1; + host_bridge->native_pme = 1; + host_bridge->native_ltr = 1; + host_bridge->native_dpc = 1; + } + + dev_info(&root->device->dev, "OS native features: SHPCHotplug%c PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n", + FLAG(host_bridge->native_shpc_hotplug), + FLAG(host_bridge->native_pcie_hotplug), + FLAG(host_bridge->native_pme), + FLAG(host_bridge->native_aer), + FLAG(host_bridge->native_dpc), + FLAG(host_bridge->native_ltr)); + /* * Evaluate the "PCI Boot Configuration" _DSM Function. 
If it * exists and returns 0, we must preserve any PCI resource diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c index ad3393930ecb..d1831e6bf60a 100644 --- a/drivers/pci/hotplug/pciehp_core.c +++ b/drivers/pci/hotplug/pciehp_core.c @@ -256,7 +256,7 @@ static bool pme_is_native(struct pcie_device *dev) const struct pci_host_bridge *host; host = pci_find_host_bridge(dev->port->bus); - return pcie_ports_native || host->native_pme; + return host->native_pme; } static void pciehp_disable_interrupt(struct pcie_device *dev) diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c index 53502a751914..f6327cf0601b 100644 --- a/drivers/pci/pci-acpi.c +++ b/drivers/pci/pci-acpi.c @@ -800,9 +800,6 @@ bool pciehp_is_native(struct pci_dev *bridge) if (!(slot_cap & PCI_EXP_SLTCAP_HPC)) return false; - if (pcie_ports_native) - return true; - host = pci_find_host_bridge(bridge->bus); return host->native_pcie_hotplug; } diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 77b0f2c45bc0..7fdeaadc40fe 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -219,7 +219,7 @@ int pcie_aer_is_native(struct pci_dev *dev) if (!dev->aer_cap) return 0; - return pcie_ports_native || host->native_aer; + return host->native_aer; } int pci_enable_pcie_error_reporting(struct pci_dev *dev) diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index e1fed6649c41..ea1099908d5d 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev) struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); int services = 0; - if (dev->is_hotplug_bridge && - (pcie_ports_native || host->native_pcie_hotplug)) { + if (host->native_pcie_hotplug && dev->is_hotplug_bridge) { services |= PCIE_PORT_SERVICE_HP; /* @@ -221,8 +220,7 @@ static int get_po
[PATCH v13 1/5] PCI/DPC: Ignore devices with no AER Capability
From: Bjorn Helgaas Downstream Ports may support DPC regardless of whether they support AER (see PCIe r5.0, sec 6.2.10.2). Previously, if the user booted with "pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if the device had no AER Capability, but dpc_get_aer_uncorrect_severity() depends on the AER Capability. dpc_probe() previously failed if: !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native which, by De Morgan's law, is equivalent to: !(pcie_aer_is_native(pdev) || pcie_ports_dpc_native) so it succeeded if: pcie_aer_is_native(pdev) || pcie_ports_dpc_native Fail dpc_probe() if the device has no AER Capability. Signed-off-by: Bjorn Helgaas Cc: Olof Johansson --- drivers/pci/pcie/dpc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba86a317..ed0dbc43d018 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap; + if (!pdev->aer_cap) + return -ENOTSUPP; + if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native) return -ENOTSUPP; -- 2.25.1
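The De Morgan step in the commit log can be checked exhaustively with a toy model of the old gate (function names hypothetical, the boolean structure is from the log):

```c
#include <stdbool.h>

/* The old dpc_probe() bail-out condition... */
static bool old_probe_fails(bool aer_native, bool ports_dpc_native)
{
	return !aer_native && !ports_dpc_native;
}

/* ...and its De Morgan complement: the success condition. */
static bool old_probe_succeeds(bool aer_native, bool ports_dpc_native)
{
	return aer_native || ports_dpc_native;
}
```

The problem case is exactly `aer_native == false, ports_dpc_native == true`: probe succeeded even though the severity lookup later needs the missing AER Capability.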
[PATCH v13 0/5] Simplify PCIe native ownership
Currently, PCIe capabilities ownership status is detected by verifying the status of pcie_ports_native and the _OSC negotiated results (cached in struct pci_host_bridge->native_* members). But this logic can be simplified, and we can use only the struct pci_host_bridge->native_* members to detect it. This patchset removes the distributed checks for the pcie_ports_native parameter. Changes since v12: * Rebased on top of v5.11-rc1 Changes since v11 (Bjorn's update): * Add bugfix for DPC with no AER Capability * Split OSC_OWNER trivial changes from pcie_ports_native changes * Temporarily drop pcie_ports_dpc_native changes (revisit it later). Changes since v10: * Addressed format issue reported by lkp test. Changes since v9: * Rebased on top of v5.10-rc1 Changes since v8: * Simplified setting _OSC ownership logic * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS. Changes since v7: * Fixed "fix array_size.cocci warnings". Changes since v6: * Created new patch for CONFIG_PCIEPORTBUS check in pci_init_host_bridge(). * Added warning message for a case when pcie_ports_native overrides _OSC negotiation result. Changes since v5: * Rebased on top of v5.8-rc1 Changes since v4: * Changed the patch set title (Original link: https://lkml.org/lkml/2020/5/26/1710) * Added AER/DPC dependency logic cleanup fixes. Bjorn Helgaas (2): PCI/DPC: Ignore devices with no AER Capability PCI/ACPI: Centralize pci_aer_available() checking Kuppuswamy Sathyanarayanan (3): PCI: Assume control of portdrv-related features only when portdrv enabled PCI/ACPI: Tidy _OSC control bit checking PCI/ACPI: Centralize pcie_ports_native checking drivers/acpi/pci_root.c | 49 --- drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 -- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/dpc.c| 3 ++ drivers/pci/pcie/portdrv_core.c | 11 +++ drivers/pci/probe.c | 6 ++-- 7 files changed, 51 insertions(+), 25 deletions(-) -- 2.25.1
[PATCH v13 3/5] PCI/ACPI: Tidy _OSC control bit checking
Add OSC_OWNER() helper to prettify checking the _OSC control bits to learn whether the platform has granted us control of PCI features. No functional change intended. [bhelgaas: split to separate patch, commit log] Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Bjorn Helgaas --- drivers/acpi/pci_root.c | 29 + 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index 0bf072cef6cf..601fbe905993 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -874,6 +874,12 @@ static void acpi_pci_root_release_info(struct pci_host_bridge *bridge) __acpi_pci_root_release_info(bridge->release_data); } +#define OSC_OWNER(ctrl, bit, flag) \ + do {\ + if (!(ctrl & bit)) \ + flag = 0; \ + } while (0) + struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, struct acpi_pci_root_ops *ops, struct acpi_pci_root_info *info, @@ -885,6 +891,7 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, struct pci_bus *bus; struct pci_host_bridge *host_bridge; union acpi_object *obj; + u32 ctrl; info->root = root; info->bridge = device; @@ -910,18 +917,16 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, goto out_release_info; host_bridge = to_pci_host_bridge(bus->bridge); - if (!(root->osc_control_set & OSC_PCI_EXPRESS_NATIVE_HP_CONTROL)) - host_bridge->native_pcie_hotplug = 0; - if (!(root->osc_control_set & OSC_PCI_SHPC_NATIVE_HP_CONTROL)) - host_bridge->native_shpc_hotplug = 0; - if (!(root->osc_control_set & OSC_PCI_EXPRESS_AER_CONTROL)) - host_bridge->native_aer = 0; - if (!(root->osc_control_set & OSC_PCI_EXPRESS_PME_CONTROL)) - host_bridge->native_pme = 0; - if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL)) - host_bridge->native_ltr = 0; - if (!(root->osc_control_set & OSC_PCI_EXPRESS_DPC_CONTROL)) - host_bridge->native_dpc = 0; + + ctrl = root->osc_control_set; + OSC_OWNER(ctrl, OSC_PCI_EXPRESS_NATIVE_HP_CONTROL, + host_bridge->native_pcie_hotplug); + 
OSC_OWNER(ctrl, OSC_PCI_SHPC_NATIVE_HP_CONTROL, + host_bridge->native_shpc_hotplug); + OSC_OWNER(ctrl, OSC_PCI_EXPRESS_AER_CONTROL, host_bridge->native_aer); + OSC_OWNER(ctrl, OSC_PCI_EXPRESS_PME_CONTROL, host_bridge->native_pme); + OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr); + OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc); /* * Evaluate the "PCI Boot Configuration" _DSM Function. If it -- 2.25.1
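A standalone version of the `OSC_OWNER()` pattern shows the intent: clear each native_* flag whose _OSC control bit was not granted. The bit values below are fake (the real `OSC_PCI_*` masks live in the kernel's ACPI headers), and extra parentheses are added so the macro is robust outside its original context:

```c
/* Hypothetical control bits -- only the pattern matters here. */
#define FAKE_OSC_HP_CONTROL	(1u << 0)
#define FAKE_OSC_AER_CONTROL	(1u << 3)

#define OSC_OWNER(ctrl, bit, flag)	\
	do {				\
		if (!((ctrl) & (bit)))	\
			(flag) = 0;	\
	} while (0)

struct fake_host_bridge {
	unsigned char native_pcie_hotplug;
	unsigned char native_aer;
};

/* Applies the granted-control mask to the bridge's ownership flags. */
static void apply_osc_control(struct fake_host_bridge *b, unsigned int ctrl)
{
	OSC_OWNER(ctrl, FAKE_OSC_HP_CONTROL, b->native_pcie_hotplug);
	OSC_OWNER(ctrl, FAKE_OSC_AER_CONTROL, b->native_aer);
}
```

Flags can only be cleared here, never set: _OSC can take ownership away, but cannot grant a feature the kernel did not default to native.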
[PATCH v13 5/5] PCI/ACPI: Centralize pci_aer_available() checking
From: Bjorn Helgaas Check pci_aer_available() in acpi_pci_root_create() when we're interpreting _OSC results so host_bridge->native_aer becomes the single way to determine whether we control AER capabilities. Signed-off-by: Bjorn Helgaas --- drivers/acpi/pci_root.c | 3 +++ drivers/pci/pcie/portdrv_core.c | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index 16ca58d58fef..f7d2eed3975c 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -939,6 +939,9 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, host_bridge->native_dpc = 1; } + if (!pci_aer_available()) + host_bridge->native_aer = 0; + dev_info(&root->device->dev, "OS native features: SHPCHotplug%c PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n", FLAG(host_bridge->native_shpc_hotplug), FLAG(host_bridge->native_pcie_hotplug), diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index ea1099908d5d..a2b8a4bc91fa 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -220,7 +220,7 @@ static int get_port_device_capability(struct pci_dev *dev) { - if (host->native_aer && dev->aer_cap && pci_aer_available()) { + if (host->native_aer && dev->aer_cap) { services |= PCIE_PORT_SERVICE_AER; /* -- 2.25.1
[PATCH v13 2/5] PCI: Assume control of portdrv-related features only when portdrv enabled
Native control of PME, AER, DPC, and PCIe hotplug depends on the portdrv, so default to native handling of them only when CONFIG_PCIEPORTBUS is enabled. Native control of LTR and SHPC hotplug does not depend on portdrv, so we can always take control of them unless some platform interface, e.g., _OSC, tells us otherwise. [bhelgaas: commit log] Link: https://lore.kernel.org/r/fcbe8a624166a1101a755edfef44a185d32ff493.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Bjorn Helgaas --- drivers/pci/probe.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 953f15abc850..97498f61f5ad 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -588,12 +588,14 @@ static void pci_init_host_bridge(struct pci_host_bridge *bridge) * may implement its own AER handling and use _OSC to prevent the * OS from interfering. */ +#ifdef CONFIG_PCIEPORTBUS bridge->native_aer = 1; bridge->native_pcie_hotplug = 1; - bridge->native_shpc_hotplug = 1; bridge->native_pme = 1; - bridge->native_ltr = 1; bridge->native_dpc = 1; +#endif + bridge->native_ltr = 1; + bridge->native_shpc_hotplug = 1; device_initialize(>dev); } -- 2.25.1
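The compile-time split can be modeled with a runtime flag (hypothetical struct; the real code uses `#ifdef CONFIG_PCIEPORTBUS`, not a parameter):

```c
struct fake_bridge_flags {
	unsigned char native_aer, native_pcie_hotplug, native_pme,
		      native_dpc, native_ltr, native_shpc_hotplug;
};

/* portdrv-dependent features default to native only when the port
 * driver is built; LTR and SHPC hotplug always default to native. */
static void init_native_defaults(struct fake_bridge_flags *b, int portdrv)
{
	if (portdrv) {		/* stands in for #ifdef CONFIG_PCIEPORTBUS */
		b->native_aer = 1;
		b->native_pcie_hotplug = 1;
		b->native_pme = 1;
		b->native_dpc = 1;
	}
	b->native_ltr = 1;
	b->native_shpc_hotplug = 1;
}
```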
Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()
On 1/8/21 2:30 PM, Bjorn Helgaas wrote: Can we push this forward now? There are several pending patches in this area from Keith and Sathyanarayanan; I haven't gotten to them yet, so not sure whether they help address any of this. The following two patches should also address the same issue. My patch: https://patchwork.kernel.org/project/linux-pci/patch/6f63321637fef86b6cf0beebf98b987062f9e811.1610153755.git.sathyanarayanan.kuppusw...@linux.intel.com/ Keith's patch: https://patchwork.kernel.org/project/linux-pci/patch/20210104230300.1277180-4-kbu...@kernel.org/ -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
[PATCH v9 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") merged the fatal and non-fatal error recovery paths, and also made the recovery code depend on the hotplug handler for "remove affected device + rescan" support. But this change also complicated the error recovery path, which in turn led to the following issues. 1. We depend on the hotplug handler for removing the affected devices/drivers on a DLLSC link-down event (on DPC event trigger) and on the DPC handler for handling the error recovery. Since both handlers operate on the same set of affected devices, this leads to a race condition, which in turn leads to NULL pointer exceptions or error recovery failures. You can find more details about this issue in the following link. https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t 2. For non-hotplug capable devices, fatal (DPC) error recovery is currently broken. The current fatal error recovery implementation relies on the PCIe hotplug (pciehp) handler for detaching and re-enumerating the affected devices/drivers. So when dealing with non-hotplug capable devices, the recovery code does not restore the state of the affected devices correctly. You can find more details about this issue in the following links. https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ https://lore.kernel.org/linux-pci/12115.1588207324@famine/ https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ In order to fix the above two issues, we should stop relying on the hotplug handler for cleaning up the affected devices/drivers and let the error recovery handler own this functionality. So this patch reverts commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") and re-introduces the "remove affected device + rescan" functionality in the fatal error recovery handler. Also, holding pci_lock_rescan_remove() prevents the race between the hotplug and DPC handlers. 
Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Sinan Kaya --- Changes since v8: * Rebased on top of v5.11-rc1. Changes since v7: * Rebased on top of v5.10-rc1. Changes since v6: * Renamed pcie_do_nonfatal_recovery() to pcie_nonfatal_recovery(). * Renamed pcie_do_fatal_recovery() to pcie_fatal_recovery(). * Addressed some format issues. Changes since v5: * Fixed static/non-static declartion issue. Changes since v4: * Added new interfaces for error recovery (pcie_do_fatal_recovery() and pcie_do_nonfatal_recovery()). Documentation/PCI/pci-error-recovery.rst | 47 +++-- Documentation/PCI/pcieaer-howto.rst | 2 +- drivers/pci/pci.h| 7 +-- drivers/pci/pcie/aer.c | 10 ++-- drivers/pci/pcie/dpc.c | 2 +- drivers/pci/pcie/edr.c | 2 +- drivers/pci/pcie/err.c | 67 +--- 7 files changed, 91 insertions(+), 46 deletions(-) diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst index 84ceebb08cac..830c8af5838b 100644 --- a/Documentation/PCI/pci-error-recovery.rst +++ b/Documentation/PCI/pci-error-recovery.rst @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI error event will be platform-dependent, but will follow the general sequence described below. -STEP 0: Error Event +STEP 0: Error Event: ERR_NONFATAL --- A PCI bus error is detected by the PCI hardware. On powerpc, the slot is isolated, in that all I/O is blocked: all reads return 0x, @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and recovery proceeds to STEP 2 (MMIO Enable). If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), -then recovery proceeds to STEP 4 (Slot Reset). +then recovery proceeds to STEP 3 (Slot Reset). If the platform is unable to recover the slot, the next step -is STEP 6 (Permanent Failure). +is STEP 5 (Permanent Failure). .. note:: @@ -198,7 +198,7 @@ reset or some such, but not restart operations. 
This callback is made if all drivers on a segment agree that they can try to recover and if no automatic link reset was performed by the HW. If the platform can't just re-enable IOs without a slot reset or a link reset, it will not call this callback, and -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) +instead will have gone directly to STEP 3 (Slot Reset) .. note:: @@ -233,18 +233,12 @@ The driver should return one of the following result codes: The next step taken depends on the results returned by the drivers. If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). +proceeds to STEP 4 (Resume Operations). If any driver returned PCI_ERS_RESULT
[PATCH v9 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
Currently, if the report_error_detected() or report_mmio_enabled() functions request PCI_ERS_RESULT_NEED_RESET, the current pcie_do_recovery() implementation does not do the requested explicit device reset, but instead just calls report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. So call pci_bus_reset() before triggering the ->slot_reset() callback. Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Sinan Kaya Reviewed-by: Ashok Raj --- Changes since v8: * Rebased on top of v5.11-rc1. Changes since v7: * Rebased on top of v5.10-rc1. Changes since v6: * None. Changes since v5: * Added Ashok's Reviewed-by tag. Changes since v4: * Added check for pci_reset_bus() return value. drivers/pci/pcie/err.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 510f31f0ef6d..6c19e9948232 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -177,6 +177,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, struct pci_dev *bridge; pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); + int ret; /* * If the error was detected by a Root Port, Downstream Port, RCEC, @@ -214,11 +215,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, } if (status == PCI_ERS_RESULT_NEED_RESET) { - /* -* TODO: Should call platform-specific -* functions to reset slot before calling -* drivers' slot_reset callbacks? -*/ + ret = pci_reset_bus(bridge); + if (ret < 0) { + pci_err(dev, "Failed to reset %d\n", ret); + status = PCI_ERS_RESULT_DISCONNECT; + goto failed; + } status = PCI_ERS_RESULT_RECOVERED; pci_dbg(bridge, "broadcast slot_reset message\n"); pci_walk_bridge(bridge, report_slot_reset, &status); -- 2.25.1
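The new control flow around NEED_RESET reduces to this shape (hypothetical enum and stub functions, not the kernel API):

```c
enum ers_result { ERS_RECOVERED, ERS_NEED_RESET, ERS_DISCONNECT };

/* Perform the actual bus reset before telling drivers about it; a
 * failed reset means the device cannot be recovered, so report
 * DISCONNECT instead of pretending a reset happened. */
static enum ers_result handle_need_reset(int (*reset_bus)(void))
{
	if (reset_bus() < 0)
		return ERS_DISCONNECT;
	/* only now is it honest to broadcast ->slot_reset() */
	return ERS_RECOVERED;
}

static int fake_reset_ok(void)   { return 0; }
static int fake_reset_fail(void) { return -19; /* e.g. -ENODEV */ }
```

The key property: drivers' ->slot_reset() callbacks only ever run after a reset has genuinely occurred.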
Re: linux-next: manual merge of the amdgpu tree with the pci tree
On 12/14/20 3:37 PM, Bjorn Helgaas wrote: On Mon, Dec 14, 2020 at 06:18:54PM -0500, Alex Deucher wrote: On Mon, Dec 14, 2020 at 6:16 PM Bjorn Helgaas wrote: On Tue, Dec 15, 2020 at 07:34:31AM +1100, Stephen Rothwell wrote: Hi all, On Tue, 8 Dec 2020 13:56:20 +1100 Stephen Rothwell wrote: I don't plan to merge this upstream via my tree. I was just carrying it in my drm-next branch because we have a number of users that depend on it for working DPC and a number of people use this branch for testing. OK, thanks. FWIW, it's currently marked "Changes Requested" in patchwork, so it isn't really going anywhere right now: https://patchwork.kernel.org/project/linux-pci/patch/cbba08a5e9ca62778c8937f44eda2192a2045da7.1595617529.git.sathyanarayanan.kuppusw...@linux.intel.com/ There is a newer version of this patch set. Please use it when merging this patch. https://patchwork.kernel.org/project/linux-pci/list/?series=370855 I have merged Sean's series, so this would be a good time to try to move this one forward. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v8 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
On 10/26/20 12:37 PM, Kuppuswamy Sathyanarayanan wrote: Currently, if the report_error_detected() or report_mmio_enabled() functions request PCI_ERS_RESULT_NEED_RESET, the current pcie_do_recovery() implementation does not do the requested explicit device reset, but instead just calls report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. So call pci_bus_reset() before triggering the ->slot_reset() callback. Gentle ping! Any comments on this patch set? Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Sinan Kaya Reviewed-by: Ashok Raj --- Changes since v7: * Rebased on top of v5.10-rc1. Changes since v6: * None. Changes since v5: * Added Ashok's Reviewed-by tag. Changes since v4: * Added check for pci_reset_bus() return value. drivers/pci/pcie/err.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..315a4d559c4c 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -152,6 +152,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, { pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; struct pci_bus *bus; + int ret; /* * Error recovery runs on all subordinates of the first downstream port. @@ -181,11 +182,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, } if (status == PCI_ERS_RESULT_NEED_RESET) { - /* -* TODO: Should call platform-specific -* functions to reset slot before calling -* drivers' slot_reset callbacks? -*/ + ret = pci_reset_bus(dev); + if (ret < 0) { + pci_err(dev, "Failed to reset %d\n", ret); + status = PCI_ERS_RESULT_DISCONNECT; + goto failed; + } status = PCI_ERS_RESULT_RECOVERED; pci_dbg(dev, "broadcast slot_reset message\n"); pci_walk_bus(bus, report_slot_reset, &status); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v12 0/5] Simplify PCIe native ownership
Hi Bjorn, On 11/30/20 5:11 PM, Kuppuswamy, Sathyanarayanan wrote: Hi Bjorn, On 11/25/20 7:48 PM, Kuppuswamy, Sathyanarayanan wrote: Along with above patch, you also left following two cleanup patches. Is this intentional? Following fixes have no dependency on pcie_ports_dpc_native change. [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic [PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver Any comment? If you want me to add these patches in my re-submission, please let me know. Gentle ping. Any comments? -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid
Hi, On 12/7/20 5:08 AM, Paul Menzel wrote: [Bringing the issue up on the list in case the Linux Bugzilla is not monitored/used.] Dear Linux folks, On an Intel Tiger Lake Dell laptop, Linux logs the error below [1]. [ 0.507307] pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid [ 0.508835] pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid $ lspci -nn -s 00:07 00:07.0 PCI bridge [0604]: Intel Corporation Tiger Lake-LP Thunderbolt PCI Express Root Port #0 [8086:9a23] (rev 01) 00:07.2 PCI bridge [0604]: Intel Corporation Tiger Lake-LP Thunderbolt PCI Express Root Port #2 [8086:9a27] (rev 01) Commit 2700561817 (PCI/DPC: Cache DPC capabilities in pci_init_capabilities()) [2] probably introduced it in Linux 5.7. What does this error actually mean? pdev->dpc_rp_log_size = (cap & PCI_EXP_DPC_RP_PIO_LOG_SIZE) >> 8; if (pdev->dpc_rp_log_size < 4 || pdev->dpc_rp_log_size > 9) { pci_err(pdev, "RP PIO log size %u is invalid\n", pdev->dpc_rp_log_size); pdev->dpc_rp_log_size = 0; As per PCIe spec r5.0, sec 7.9.15.2, a valid RP log size is 4 or greater. Please see the text copied from the spec - - - - RP PIO Log Size - This field indicates how many DWORDs are allocated for the RP PIO log registers, comprised by the RP PIO Header Log, the RP PIO ImpSpec Log, and RP PIO TLP Prefix Log. If the Root Port supports RP Extensions for DPC, the value of this field must be 4 or greater; otherwise, the value of this field must be 0. See Section 7.9.15.11 , Section 7.9.15.12 , and Section 7.9.15.13 . - - - - In this case, since the "(!(cap & PCI_EXP_DPC_CAP_RP_EXT))" condition is false, RP EXT is supported. If RP EXT is supported, the valid log size should be at least 4. } (I guess `cap & PCI_EXP_DPC_RP_PIO_LOG_SIZE` is zero too?) Is it a firmware issue or a hardware issue? I think this could be a hardware issue. 
Kind regards, Paul [1]: https://bugzilla.kernel.org/show_bug.cgi?id=209943 "pci :00:07.0: DPC: RP PIO log size 0 is invalid" [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=27005618178ef9e9bf9c42fd91101771c92e9308 -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
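The validation Paul hit boils down to one bit-field extraction. A standalone sketch (the 0x0F00 mask, bits 11:8 of the DPC Capability register, is my reading of `PCI_EXP_DPC_RP_PIO_LOG_SIZE`; treat it as an assumption):

```c
#define RP_PIO_LOG_SIZE_MASK 0x0F00	/* assumed: bits 11:8 of DPC cap */

/* Returns the RP PIO log size in DWORDs, or 0 when the advertised
 * value violates the spec's "4 or greater" rule for RP Extensions --
 * the case that triggers "RP PIO log size %u is invalid". */
static unsigned int rp_pio_log_size(unsigned int cap)
{
	unsigned int size = (cap & RP_PIO_LOG_SIZE_MASK) >> 8;

	if (size < 4 || size > 9)
		return 0;
	return size;
}
```

So a Root Port that claims RP Extensions but advertises a log size of 0, as here, is inconsistent with the spec regardless of how the kernel handles it.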
Re: [PATCH v12 0/5] Simplify PCIe native ownership
Hi Bjorn, On 11/25/20 7:48 PM, Kuppuswamy, Sathyanarayanan wrote: Along with above patch, you also left following two cleanup patches. Is this intentional? Following fixes have no dependency on pcie_ports_dpc_native change. [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic [PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver Any comment? If you want me to add these patches in my re-submission, please let me know. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v12 10/15] PCI/ERR: Limit AER resets in pcie_do_recovery()
On 11/30/20 4:25 PM, Bjorn Helgaas wrote: I think EDR is the same as DPC? Yes, EDR is same as DPC. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability
On 11/28/20 3:25 PM, Bjorn Helgaas wrote: On Sat, Nov 28, 2020 at 01:56:23PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/28/20 1:53 PM, Bjorn Helgaas wrote: On Sat, Nov 28, 2020 at 01:49:46PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/28/20 12:24 PM, Bjorn Helgaas wrote: On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Bjorn Helgaas Downstream Ports may support DPC regardless of whether they support AER (see PCIe r5.0, sec 6.2.10.2). Previously, if the user booted with "pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if the device had no AER Capability, but dpc_get_aer_uncorrect_severity() depends on the AER Capability. dpc_probe() previously failed if: !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law so it succeeded if: pcie_aer_is_native() || pcie_ports_dpc_native Fail dpc_probe() if the device has no AER Capability. Signed-off-by: Bjorn Helgaas Cc: Olof Johansson --- drivers/pci/pcie/dpc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba86a317..ed0dbc43d018 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap; + if (!pdev->aer_cap) + return -ENOTSUPP; Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ? We don't enable DPC service, if AER service is not enabled. And AER service is only enabled if AER capability is supported. So dpc_probe() should not happen if AER capability is not supported? I don't think that's always true. If I'm reading this right, we have this: get_port_device_capability(...) { #ifdef CONFIG_PCIEAER if (dev->aer_cap && ...) 
services |= PCIE_PORT_SERVICE_AER; #endif if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() && (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER))) services |= PCIE_PORT_SERVICE_DPC; } and in the case where: - CONFIG_PCIEAER=y - booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true) - "dev" has no AER capability - "dev" has DPC capability I think we do enable PCIE_PORT_SERVICE_DPC. Got it. But further looking into it, I am wondering whether we should keep this dependency? Currently we just use it to dump the error information. Do we need to create dependency between DPC and AER (which is functionality not dependent) just to see more details about the error? That's a good question, but I don't really want to get into the actual operation of the AER and DPC drivers in this series, so maybe something we should explore later. In that case, can you move this check to drivers/pci/pcie/portdrv_core.c? I don't see the point of distributed checks in both get_port_device_capability() and dpc_probe(). I totally agree that these distributed checks are terrible, but my long-term hope is to get rid of portdrv and handle these "services" more like we handle other capabilities. For example, maybe we can squash dpc_probe() into pci_dpc_init(), so I'd actually like to move things from get_port_device_capability() into dpc_probe(). Removing the service driver model will be a major overhaul. It would affect even the error recovery drivers. You can find motivation for service drivers in Documentation/PCI/pciebus-howto.rst. But till we fix this part, I recommend grouping all dependency checks to one place (either dpc_probe() or portdrv service driver). -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability
On 11/28/20 12:24 PM, Bjorn Helgaas wrote: On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Bjorn Helgaas Downstream Ports may support DPC regardless of whether they support AER (see PCIe r5.0, sec 6.2.10.2). Previously, if the user booted with "pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if the device had no AER Capability, but dpc_get_aer_uncorrect_severity() depends on the AER Capability. dpc_probe() previously failed if: !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law so it succeeded if: pcie_aer_is_native() || pcie_ports_dpc_native Fail dpc_probe() if the device has no AER Capability. Signed-off-by: Bjorn Helgaas Cc: Olof Johansson --- drivers/pci/pcie/dpc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba86a317..ed0dbc43d018 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap; + if (!pdev->aer_cap) + return -ENOTSUPP; Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ? We don't enable DPC service, if AER service is not enabled. And AER service is only enabled if AER capability is supported. So dpc_probe() should not happen if AER capability is not supported? I don't think that's always true. If I'm reading this right, we have this: get_port_device_capability(...) { #ifdef CONFIG_PCIEAER if (dev->aer_cap && ...) 
services |= PCIE_PORT_SERVICE_AER; #endif if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() && (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER))) services |= PCIE_PORT_SERVICE_DPC; } and in the case where: - CONFIG_PCIEAER=y - booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true) - "dev" has no AER capability - "dev" has DPC capability I think we do enable PCIE_PORT_SERVICE_DPC. Got it. But further looking into it, I am wondering whether we should keep this dependency? Currently we just use it to dump the error information. Do we need to create a dependency between DPC and AER (which are not functionally dependent) just to see more details about the error? 206 static int get_port_device_capability(struct pci_dev *dev) ... ... 251 if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && 252 host->native_dpc && 253 (host->native_dpc || (services & PCIE_PORT_SERVICE_AER))) 254 services |= PCIE_PORT_SERVICE_DPC; 255 + if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native) return -ENOTSUPP; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
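The De Morgan rewrite quoted in the commit log above is easy to verify mechanically. A minimal sketch in plain C, where fails_v1(), fails_v2(), and succeeds() are illustrative names (not kernel functions) standing in for the old dpc_probe() condition:

```c
#include <assert.h>
#include <stdbool.h>

/* Old dpc_probe() failure condition, first spelling. */
static bool fails_v1(bool aer_native, bool dpc_native)
{
	return !aer_native && !dpc_native;
}

/* Same condition after applying De Morgan's law. */
static bool fails_v2(bool aer_native, bool dpc_native)
{
	return !(aer_native || dpc_native);
}

/* So probe succeeded iff either flag was set. */
static bool succeeds(bool aer_native, bool dpc_native)
{
	return aer_native || dpc_native;
}
```

Exhausting all four input combinations shows the two spellings agree, which is the equivalence the commit log relies on.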
Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability
On 11/28/20 1:53 PM, Bjorn Helgaas wrote: On Sat, Nov 28, 2020 at 01:49:46PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/28/20 12:24 PM, Bjorn Helgaas wrote: On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote: On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Bjorn Helgaas Downstream Ports may support DPC regardless of whether they support AER (see PCIe r5.0, sec 6.2.10.2). Previously, if the user booted with "pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if the device had no AER Capability, but dpc_get_aer_uncorrect_severity() depends on the AER Capability. dpc_probe() previously failed if: !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law so it succeeded if: pcie_aer_is_native() || pcie_ports_dpc_native Fail dpc_probe() if the device has no AER Capability. Signed-off-by: Bjorn Helgaas Cc: Olof Johansson --- drivers/pci/pcie/dpc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba86a317..ed0dbc43d018 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap; + if (!pdev->aer_cap) + return -ENOTSUPP; Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ? We don't enable DPC service, if AER service is not enabled. And AER service is only enabled if AER capability is supported. So dpc_probe() should not happen if AER capability is not supported? I don't think that's always true. If I'm reading this right, we have this: get_port_device_capability(...) { #ifdef CONFIG_PCIEAER if (dev->aer_cap && ...) 
services |= PCIE_PORT_SERVICE_AER; #endif if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() && (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER))) services |= PCIE_PORT_SERVICE_DPC; } and in the case where: - CONFIG_PCIEAER=y - booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true) - "dev" has no AER capability - "dev" has DPC capability I think we do enable PCIE_PORT_SERVICE_DPC. Got it. But further looking into it, I am wondering whether we should keep this dependency? Currently we just use it to dump the error information. Do we need to create dependency between DPC and AER (which is functionality not dependent) just to see more details about the error? That's a good question, but I don't really want to get into the actual operation of the AER and DPC drivers in this series, so maybe something we should explore later. In that case, can you move this check to drivers/pci/pcie/portdrv_core.c? I don't see the point of distributed checks in both get_port_device_capability() and dpc_probe(). -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
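The corner case walked through above (CONFIG_PCIEAER=y, booted with "pcie_ports=dpc-native", DPC capability present, AER capability absent) can be checked against a toy model of the quoted get_port_device_capability() logic. The booleans and service constants below are illustrative stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <stdbool.h>

#define SVC_AER 0x1	/* stand-in for PCIE_PORT_SERVICE_AER */
#define SVC_DPC 0x2	/* stand-in for PCIE_PORT_SERVICE_DPC */

/* Toy model of the DPC/AER gating in get_port_device_capability(). */
static int port_services(bool has_aer_cap, bool has_dpc_cap,
			 bool aer_available, bool dpc_native)
{
	int services = 0;

	if (has_aer_cap && aer_available)
		services |= SVC_AER;

	if (has_dpc_cap && aer_available &&
	    (dpc_native || (services & SVC_AER)))
		services |= SVC_DPC;

	return services;
}
```

With dpc_native set, the DPC service is enabled even though the AER service is not, which is exactly the combination dpc_probe() then has to cope with.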
Re: [PATCH v12 0/5] Simplify PCIe native ownership
On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Bjorn Helgaas This is Sathy's work with a few tweaks on top. I dropped the DPC pcie_ports_dpc_native changes for now just because I haven't had time to understand it all. We currently ignore the OSC_PCI_EXPRESS_DPC_CONTROL bit, which seems wrong. We might want to start looking at it, but we should try to make that a separate patch that's as small as possible. Along with the above patch, you also left out the following two cleanup patches. Is this intentional? The following fixes have no dependency on the pcie_ports_dpc_native change. [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic [PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver Changes since v11: * Add bugfix for DPC with no AER Capability * Split OSC_OWNER trivial changes from pcie_ports_native changes * Temporarily drop pcie_ports_dpc_native changes v11 posting: https://lore.kernel.org/r/cover.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com Bjorn Helgaas (2): PCI/DPC: Ignore devices with no AER Capability PCI/ACPI: Centralize pci_aer_available() checking Kuppuswamy Sathyanarayanan (3): PCI: Assume control of portdrv-related features only when portdrv enabled PCI/ACPI: Tidy _OSC control bit checking PCI/ACPI: Centralize pcie_ports_native checking drivers/acpi/pci_root.c | 49 --- drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 -- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/dpc.c| 3 ++ drivers/pci/pcie/portdrv_core.c | 9 ++ drivers/pci/probe.c | 6 ++-- 7 files changed, 50 insertions(+), 24 deletions(-) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH 4/5] PCI/ACPI: Centralize pcie_ports_native checking
On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Kuppuswamy Sathyanarayanan If the user booted with "pcie_ports=native", we take control of the PCIe features unconditionally, regardless of what _OSC says. Centralize the testing of pcie_ports_native in acpi_pci_root_create(), where we interpret the _OSC results, so other places only have to check host_bridge->native_X and we don't have to sprinkle tests of pcie_ports_native everywhere. [bhelgaas: commit log, rework OSC_PCI_EXPRESS_CONTROL_MASKS, logging] Link: https://lore.kernel.org/r/bc87c9e675118960949043a832bed86bc22becbd.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Bjorn Helgaas --- drivers/acpi/pci_root.c | 19 +++ drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 --- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/portdrv_core.c | 9 +++-- 5 files changed, 24 insertions(+), 11 deletions(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index 6db071038fd5..36142ed7b8f8 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -882,6 +882,8 @@ static void acpi_pci_root_release_info(struct pci_host_bridge *bridge) flag = 0; \ } while (0) +#define FLAG(x) ((x) ? 
'+' : '-') + struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, struct acpi_pci_root_ops *ops, struct acpi_pci_root_info *info, @@ -930,6 +932,23 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr); OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc); + if (pcie_ports_native) { + dev_info(&root->device->dev, "Taking control of PCIe-related features because \"pcie_ports=native\" specified; may conflict with firmware\n"); + host_bridge->native_pcie_hotplug = 1; + host_bridge->native_aer = 1; + host_bridge->native_pme = 1; + host_bridge->native_ltr = 1; + host_bridge->native_dpc = 1; + } Won't it be better if it's merged with the above OSC_OWNER() calls? If pcie_ports_native is set, then the above OSC_OWNER() calls for PCIe-related features are redundant. Let me know. + + dev_info(&root->device->dev, "OS native features: SHPCHotplug%c PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n", + FLAG(host_bridge->native_shpc_hotplug), + FLAG(host_bridge->native_pcie_hotplug), + FLAG(host_bridge->native_pme), + FLAG(host_bridge->native_aer), + FLAG(host_bridge->native_dpc), + FLAG(host_bridge->native_ltr)); + /* * Evaluate the "PCI Boot Configuration" _DSM Function. 
If it * exists and returns 0, we must preserve any PCI resource diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c index ad3393930ecb..d1831e6bf60a 100644 --- a/drivers/pci/hotplug/pciehp_core.c +++ b/drivers/pci/hotplug/pciehp_core.c @@ -256,7 +256,7 @@ static bool pme_is_native(struct pcie_device *dev) const struct pci_host_bridge *host; host = pci_find_host_bridge(dev->port->bus); - return pcie_ports_native || host->native_pme; + return host->native_pme; } static void pciehp_disable_interrupt(struct pcie_device *dev) diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c index bf03648c2072..a84f75ec6df8 100644 --- a/drivers/pci/pci-acpi.c +++ b/drivers/pci/pci-acpi.c @@ -800,9 +800,6 @@ bool pciehp_is_native(struct pci_dev *bridge) if (!(slot_cap & PCI_EXP_SLTCAP_HPC)) return false; - if (pcie_ports_native) - return true; - host = pci_find_host_bridge(bridge->bus); return host->native_pcie_hotplug; } diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 65dff5f3457a..79bb441139c2 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -219,7 +219,7 @@ int pcie_aer_is_native(struct pci_dev *dev) if (!dev->aer_cap) return 0; - return pcie_ports_native || host->native_aer; + return host->native_aer; } int pci_enable_pcie_error_reporting(struct pci_dev *dev) diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index 50a9522ab07d..2a1190e8db60 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev) struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); int services = 0; - if (dev->is_hotplug_bridge && - (pcie_
Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability
On 11/25/20 5:18 PM, Bjorn Helgaas wrote: From: Bjorn Helgaas Downstream Ports may support DPC regardless of whether they support AER (see PCIe r5.0, sec 6.2.10.2). Previously, if the user booted with "pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if the device had no AER Capability, but dpc_get_aer_uncorrect_severity() depends on the AER Capability. dpc_probe() previously failed if: !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law so it succeeded if: pcie_aer_is_native() || pcie_ports_dpc_native Fail dpc_probe() if the device has no AER Capability. Signed-off-by: Bjorn Helgaas Cc: Olof Johansson --- drivers/pci/pcie/dpc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index e05aba86a317..ed0dbc43d018 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev) int status; u16 ctl, cap; + if (!pdev->aer_cap) + return -ENOTSUPP; Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ? We don't enable DPC service, if AER service is not enabled. And AER service is only enabled if AER capability is supported. So dpc_probe() should not happen if AER capability is not supported? 206 static int get_port_device_capability(struct pci_dev *dev) ... ... 251 if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && 252 host->native_dpc && 253 (host->native_dpc || (services & PCIE_PORT_SERVICE_AER))) 254 services |= PCIE_PORT_SERVICE_DPC; 255 + if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native) return -ENOTSUPP; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.
On 11/25/20 12:28 PM, Bjorn Helgaas wrote: On Mon, Oct 26, 2020 at 07:57:05PM -0700, Kuppuswamy Sathyanarayanan wrote: pcie_ports_native is set only if user requests native handling of PCIe capabilities via pcie_port_setup command line option. User input takes precedence over _OSC based control negotiation result. So consider the _OSC negotiated result only if pcie_ports_native is unset. Also, since struct pci_host_bridge ->native_* members caches the ownership status of various PCIe capabilities, use them instead of distributed checks for pcie_ports_native. diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index 50a9522ab07d..ccd5e0ce5605 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev) struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); int services = 0; - if (dev->is_hotplug_bridge && - (pcie_ports_native || host->native_pcie_hotplug)) { + if (dev->is_hotplug_bridge && host->native_pcie_hotplug) { This is a nit, but I think this and similar checks should be reordered so we do the most generic test first, i.e., if (host->native_pcie_hotplug && dev->is_hotplug_bridge) ok. I can add this fix on top of your update. Logically there's no point in looking at the device things if we don't have native control. services |= PCIE_PORT_SERVICE_HP; /* @@ -221,8 +220,7 @@ static int get_port_device_capability(struct pci_dev *dev) } #ifdef CONFIG_PCIEAER - if (dev->aer_cap && pci_aer_available() && - (pcie_ports_native || host->native_aer)) { + if (dev->aer_cap && pci_aer_available() && host->native_aer) { Can't we clear host->native_aer when pci_aer_available() returns false? I'd like to have all the checks about whether we have native control to be in one place instead of being scattered. 
Something like this above: OSC_OWNER(ctrl, OSC_PCI_EXPRESS_AER_CONTROL, host_bridge->native_aer); if (!pci_aer_available()) host_bridge->native_aer = 0; So this test would become: if (host->native_aer && dev->aer_cap) We already use it in pci_root.c to confirm AER availability before we request control over it via _OSC. So if pci_aer_available() is false, the OSC_*AER_CONTROL bit in ctrl_set will be left zero (which means ->native_aer will also be zero). 490 if (pci_aer_available()) 491 control |= OSC_PCI_EXPRESS_AER_CONTROL; So, maybe we don't need the pci_aer_available() check here. services |= PCIE_PORT_SERVICE_AER; /* @@ -238,8 +236,7 @@ static int get_port_device_capability(struct pci_dev *dev) * Event Collectors can also generate PMEs, but we don't handle * those yet. */ - if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT && - (pcie_ports_native || host->native_pme)) { + if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT && host->native_pme) { services |= PCIE_PORT_SERVICE_PME; Also here: if (host->native_pme && pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
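Bjorn's suggestion above, folding pci_aer_available() into host_bridge->native_aer once so later code only tests the flag, can be sketched as a pure function. The name and parameters are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Compute the effective native_aer flag once: an _OSC grant (or the
 * "pcie_ports=native" override) gives the OS ownership, but it is
 * forced off when AER support is compiled out or disabled.
 */
static bool effective_native_aer(bool osc_granted, bool aer_available,
				 bool ports_native)
{
	bool native = osc_granted || ports_native;

	if (!aer_available)
		native = false;
	return native;
}
```

With this shape, get_port_device_capability() can reduce its check to `if (host->native_aer && dev->aer_cap)` as suggested in the thread.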
Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.
On 11/25/20 2:25 PM, Bjorn Helgaas wrote: I've been fiddling with this, so let me post a v12 tonight and you can see what you think. Ok. I will wait for your update. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.
Hi Bjorn, Thanks for the review. On 11/25/20 12:12 PM, Bjorn Helgaas wrote: On Mon, Oct 26, 2020 at 07:57:05PM -0700, Kuppuswamy Sathyanarayanan wrote: pcie_ports_native is set only if user requests native handling of PCIe capabilities via pcie_port_setup command line option. User input takes precedence over _OSC based control negotiation result. So consider the _OSC negotiated result only if pcie_ports_native is unset. Also, since struct pci_host_bridge ->native_* members caches the ownership status of various PCIe capabilities, use them instead of distributed checks for pcie_ports_native. Signed-off-by: Kuppuswamy Sathyanarayanan --- drivers/acpi/pci_root.c | 35 ++- drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c| 3 --- drivers/pci/pcie/aer.c| 2 +- drivers/pci/pcie/portdrv_core.c | 9 +++- include/linux/acpi.h | 2 ++ 6 files changed, 32 insertions(+), 21 deletions(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index c12b5fb3e8fb..a9e6b782622d 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -41,6 +41,12 @@ static int acpi_pci_root_scan_dependent(struct acpi_device *adev) | OSC_PCI_CLOCK_PM_SUPPORT \ | OSC_PCI_MSI_SUPPORT) + + if (pcie_ports_native) { + decode_osc_control(root, "OS forcibly taking over", + OSC_PCI_EXPRESS_CONTROL_MASKS); The only place OSC_PCI_EXPRESS_CONTROL_MASKS is used is right here, so it's kind of pointless. I think I'd rather have this: dev_info(>device->dev, "Ignoring PCIe-related _OSC results because \"pcie_ports=native\" specified\n"); I was trying to keep the same print format. In pci_root.c, decode_os_control() is repeatedly used to print info related to PCIe capability ownership. But either way is fine with me. I can use the format you mentioned. followed by something like this after we're done fiddling with all the host_bridge->native* bits: #define FLAG(x) ((x) ? 
'+' : '-') dev_info(>device->dev, "OS native features: SHPCHotplug%c PCIeCapability%c PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n", FLAG(host_bridge->native_shpc_hotplug), ?, FLAG(host_bridge->native_pcie_hotplug), ...); But I don't know how to handle OSC_PCI_EXPRESS_CAPABILITY_CONTROL since we don't track it the same way. Maybe we'd have to omit it from this message for now? I will add it in next version. But for now, its not worry about OSC_PCI_EXPRESS_CAPABILITY_CONTROL. /* * Evaluate the "PCI Boot Configuration" _DSM Function. If it -- 2.17.1 -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
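The FLAG() helper discussed above is small enough to demonstrate standalone. In this sketch, render() and its static buffer are illustrative; only the macro itself is taken from the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render a native-feature bit as '+' (OS-owned) or '-' (firmware-owned). */
#define FLAG(x) ((x) ? '+' : '-')

/* Illustrative: format a two-feature summary like the dev_info() line. */
static const char *render(int aer, int dpc)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "AER%c DPC%c", FLAG(aer), FLAG(dpc));
	return buf;
}
```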
Re: [PATCH v7 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
Hi Guilherme, On 11/24/20 10:45 AM, Guilherme G. Piccoli wrote: Hi Kuppuswamy Sathyanarayanan (and all involved here), thanks for the patch! I'd like to ask what is the status of this patchset - I just "parachuted" in the issue, and by tracking the linux-pci ML, I found this V7 (and all previous versions since V2). Also, noticed that Jay's email might have gotten lost in translation (he's not CCed in latest versions of the patchset). I was able to find even another interesting thread that might be related, Ethan's patchset. So, if any of the developers can clarify the current status of this patchset or if the functionality hereby proposed ended up being implemented in another patch, I appreciate a lot. Thanks for bringing this up. It's waiting for Bjorn's comments/approval. Bjorn, any comments? Some of our customers are also looking for this issue fix. Please let me know. Thanks in advance! Below, some references to lore archives. Cheers, -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.
Hi, On 11/19/20 2:08 PM, Raj, Ashok wrote: If an Attention Button is present, the current behavior is to bring up the hotplug slot as soon as presence or link is detected. We don't wait for a button press. This is intended as a convenience feature to bring up slots as quickly as possible, but the behavior can be changed if the button press and 5 second delay is preferred. No personal preference, I thought that is how the code in Linux was earlier. Looks like we still don't subscribe to PDC if ATTN is present? So you don't get an event until someone presses the button to process hotplug right? pciehp_hpc.c:pcie_enable_notification() { * Always enable link events: thus link-up and link-down shall * always be treated as hotplug and unplug respectively. Enable * presence detect only if Attention Button is not present. */ cmd = PCI_EXP_SLTCTL_DLLSCE; if (ATTN_BUTTN(ctrl)) cmd |= PCI_EXP_SLTCTL_ABPE; else cmd |= PCI_EXP_SLTCTL_PDCE; } Looks like we still wait for button press to process. When you don't have a power control yes the DLLSC would kick in and we would process them. But if you have power control, you won't turn on until the button press? Yes, as per current logic, if attention button is present, then we don't subscribe to PDC event changes. We wait for button press to turn on/off the slot. -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
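The pcie_enable_notification() policy quoted above, link events always enabled and button-press events enabled instead of (never in addition to) presence-detect events when an Attention Button exists, reduces to a pure function. The PCI_EXP_SLTCTL_* values below should match include/uapi/linux/pci_regs.h, but the helper itself is a sketch, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_EXP_SLTCTL_ABPE	0x0001	/* Attention Button Pressed Enable */
#define PCI_EXP_SLTCTL_PDCE	0x0008	/* Presence Detect Changed Enable */
#define PCI_EXP_SLTCTL_DLLSCE	0x1000	/* Data Link Layer State Changed Enable */

/* Sketch of the slot event mask chosen by pcie_enable_notification(). */
static unsigned int slot_event_mask(bool attn_button)
{
	unsigned int cmd = PCI_EXP_SLTCTL_DLLSCE;

	if (attn_button)
		cmd |= PCI_EXP_SLTCTL_ABPE;
	else
		cmd |= PCI_EXP_SLTCTL_PDCE;
	return cmd;
}
```

The asserts below pin down the point made in the thread: with an Attention Button present, presence-detect events are never subscribed to.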
Re: [PATCH v11 14/16] PCI/AER: Add pcie_walk_rcec() to RCEC AER handling
On 11/17/20 11:19 AM, Sean V Kelley wrote: Root Complex Event Collectors (RCEC) appear as peers to Root Ports and also have the AER capability. In addition, actions need to be taken for associated RCiEPs. In such cases the RCECs will need to be walked in order to find and act upon their respective RCiEPs. Extend the existing ability to link the RCECs with a walking function pcie_walk_rcec(). Add RCEC support to the current AER service driver and attach the AER service driver to the RCEC device. Reviewed-by: Kuppuswamy Sathyanarayanan [bhelgaas: kernel doc, whitespace cleanup] Co-developed-by: Qiuxu Zhuo Link: https://lore.kernel.org/r/20201002184735.1229220-13-seanvk@oregontracks.org Signed-off-by: Qiuxu Zhuo Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Reviewed-by: Jonathan Cameron --- drivers/pci/pci.h | 6 ++ drivers/pci/pcie/aer.c | 29 ++--- drivers/pci/pcie/rcec.c | 37 + 3 files changed, 65 insertions(+), 7 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index ae2ee4df1cff..e988cc930d0b 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -473,10 +473,16 @@ static inline void pci_dpc_init(struct pci_dev *pdev) {} void pci_rcec_init(struct pci_dev *dev); void pci_rcec_exit(struct pci_dev *dev); void pcie_link_rcec(struct pci_dev *rcec); +void pcie_walk_rcec(struct pci_dev *rcec, + int (*cb)(struct pci_dev *, void *), + void *userdata); #else static inline void pci_rcec_init(struct pci_dev *dev) {} static inline void pci_rcec_exit(struct pci_dev *dev) {} static inline void pcie_link_rcec(struct pci_dev *rcec) {} +static inline void pcie_walk_rcec(struct pci_dev *rcec, + int (*cb)(struct pci_dev *, void *), + void *userdata) {} #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 2869212af8b4..72723a1b67af 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -300,7 +300,8 @@ int pci_aer_raw_clear_status(struct pci_dev *dev) return -EIO; port_type = pci_pcie_type(dev); - if 
(port_type == PCI_EXP_TYPE_ROOT_PORT) { + if (port_type == PCI_EXP_TYPE_ROOT_PORT || + port_type == PCI_EXP_TYPE_RC_EC) { pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, ); pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, status); } @@ -595,7 +596,8 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj, if ((a == _attr_aer_rootport_total_err_cor.attr || a == _attr_aer_rootport_total_err_fatal.attr || a == _attr_aer_rootport_total_err_nonfatal.attr) && - pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) + ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) && +(pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))) return 0; return a->mode; @@ -916,7 +918,10 @@ static bool find_source_device(struct pci_dev *parent, if (result) return true; - pci_walk_bus(parent->subordinate, find_device_iter, e_info); + if (pci_pcie_type(parent) == PCI_EXP_TYPE_RC_EC) + pcie_walk_rcec(parent, find_device_iter, e_info); + else + pci_walk_bus(parent->subordinate, find_device_iter, e_info); if (!e_info->error_dev_num) { pci_info(parent, "can't find device of ID%04x\n", e_info->id); @@ -1053,6 +1058,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) if (!(info->status & ~info->mask)) return 0; } else if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || + pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC || pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM || info->severity == AER_NONFATAL) { @@ -1205,6 +1211,7 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data) int type = pci_pcie_type(dev); if ((type == PCI_EXP_TYPE_ROOT_PORT) || + (type == PCI_EXP_TYPE_RC_EC) || (type == PCI_EXP_TYPE_UPSTREAM) || (type == PCI_EXP_TYPE_DOWNSTREAM)) { if (enable) @@ -1229,9 +1236,12 @@ static void set_downstream_devices_error_reporting(struct pci_dev *dev, { set_device_error_reporting(dev, ); - if (!dev->subordinate) - return; - pci_walk_bus(dev->subordinate, set_device_error_reporting, ); + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) + 
pcie_walk_rcec(dev, set_device_error_reporting, ); + else if (dev->subordinate) + pci_walk_bus(dev->subordinate, set_device_error_reporting, +); + } /** @@ -1329,6 +1339,
Re: [PATCH v11 15/16] PCI/PME: Add pcie_walk_rcec() to RCEC PME handling
On 11/17/20 11:19 AM, Sean V Kelley wrote: Root Complex Event Collectors (RCEC) appear as peers of Root Ports and also have the PME capability. As with AER, there is a need to be able to walk the RCiEPs associated with their RCEC for purposes of acting upon them with callbacks. Add RCEC support through the use of pcie_walk_rcec() to the current PME service driver and attach the PME service driver to the RCEC device. Co-developed-by: Qiuxu Zhuo Link: https://lore.kernel.org/r/20201002184735.1229220-14-seanvk@oregontracks.org Signed-off-by: Qiuxu Zhuo Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas --- drivers/pci/pcie/pme.c | 15 +++ drivers/pci/pcie/portdrv_core.c | 9 +++-- 2 files changed, 14 insertions(+), 10 deletions(-) diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c index 6a32970bb731..87799166c96a 100644 --- a/drivers/pci/pcie/pme.c +++ b/drivers/pci/pcie/pme.c @@ -310,7 +310,10 @@ static int pcie_pme_can_wakeup(struct pci_dev *dev, void *ign) static void pcie_pme_mark_devices(struct pci_dev *port) { pcie_pme_can_wakeup(port, NULL); - if (port->subordinate) + + if (pci_pcie_type(port) == PCI_EXP_TYPE_RC_EC) + pcie_walk_rcec(port, pcie_pme_can_wakeup, NULL); + else if (port->subordinate) pci_walk_bus(port->subordinate, pcie_pme_can_wakeup, NULL); } @@ -320,10 +323,15 @@ static void pcie_pme_mark_devices(struct pci_dev *port) */ static int pcie_pme_probe(struct pcie_device *srv) { - struct pci_dev *port; + struct pci_dev *port = srv->port; struct pcie_pme_service_data *data; int ret; + /* Limit to Root Ports or Root Complex Event Collectors */ + if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) && + (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT)) may be you can save the value pci_pcie_type(port)? 
+ return -ENODEV; + data = kzalloc(sizeof(*data), GFP_KERNEL); if (!data) return -ENOMEM; @@ -333,7 +341,6 @@ static int pcie_pme_probe(struct pcie_device *srv) data->srv = srv; set_service_data(srv, data); - port = srv->port; pcie_pme_interrupt_enable(port, false); pcie_clear_root_pme_status(port); @@ -445,7 +452,7 @@ static void pcie_pme_remove(struct pcie_device *srv) static struct pcie_port_service_driver pcie_pme_driver = { .name = "pcie_pme", - .port_type = PCI_EXP_TYPE_ROOT_PORT, + .port_type = PCIE_ANY_PORT, .service= PCIE_PORT_SERVICE_PME, .probe = pcie_pme_probe, diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c index 50a9522ab07d..e1fed6649c41 100644 --- a/drivers/pci/pcie/portdrv_core.c +++ b/drivers/pci/pcie/portdrv_core.c @@ -233,12 +233,9 @@ static int get_port_device_capability(struct pci_dev *dev) } #endif - /* -* Root ports are capable of generating PME too. Root Complex -* Event Collectors can also generate PMEs, but we don't handle -* those yet. -*/ - if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT && + /* Root Ports and Root Complex Event Collectors may generate PMEs */ + if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || +pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) && (pcie_ports_native || host->native_pme)) { services |= PCIE_PORT_SERVICE_PME; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 16/16] PCI/AER: Add RCEC AER error injection support
On 11/17/20 11:19 AM, Sean V Kelley wrote: From: Qiuxu Zhuo Root Complex Event Collectors (RCEC) appear as peers to Root Ports and may also have the AER capability. Add RCEC support to the AER error injection driver. Reviewed-by: Kuppuswamy Sathyanarayanan Co-developed-by: Sean V Kelley Link: https://lore.kernel.org/r/20201002184735.1229220-15-seanvk@oregontracks.org Signed-off-by: Qiuxu Zhuo Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas --- drivers/pci/pcie/aer_inject.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c index c2cbf425afc5..767f8859b99b 100644 --- a/drivers/pci/pcie/aer_inject.c +++ b/drivers/pci/pcie/aer_inject.c @@ -333,8 +333,11 @@ static int aer_inject(struct aer_error_inj *einj) if (!dev) return -ENODEV; rpdev = pcie_find_root_port(dev); + /* If Root Port not found, try to find an RCEC */ + if (!rpdev) + rpdev = dev->rcec; if (!rpdev) { - pci_err(dev, "Root port not found\n"); + pci_err(dev, "Neither Root Port nor RCEC found\n"); ret = -ENODEV; goto out_put; } -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 09/16] PCI/ERR: Avoid negated conditional for clarity
On 11/17/20 11:19 AM, Sean V Kelley wrote: Reverse the sense of the Root Port/Downstream Port conditional for clarity. No functional change intended. [bhelgaas: split to separate patch] Link: https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Acked-by: Jonathan Cameron Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/err.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 46a5b84f8842..931e75f2549d 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -159,11 +159,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, * Error recovery runs on all subordinates of the bridge. If the * bridge detected the error, it is cleared at the end. */ - if (!(type == PCI_EXP_TYPE_ROOT_PORT || - type == PCI_EXP_TYPE_DOWNSTREAM)) - bridge = pci_upstream_bridge(dev); - else + if (type == PCI_EXP_TYPE_ROOT_PORT || + type == PCI_EXP_TYPE_DOWNSTREAM) bridge = dev; + else + bridge = pci_upstream_bridge(dev); bus = bridge->subordinate; pci_dbg(bridge, "broadcast error_detected message\n"); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
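The reversal above is purely mechanical; a sketch showing that the two spellings pick the same device (the type values and the int stand-ins for struct pci_dev pointers are illustrative):

```c
#include <assert.h>

#define TYPE_ROOT_PORT	0x4	/* stand-in for PCI_EXP_TYPE_ROOT_PORT */
#define TYPE_DOWNSTREAM	0x6	/* stand-in for PCI_EXP_TYPE_DOWNSTREAM */

/* Original, negated spelling: bridge = upstream unless dev is a port. */
static int pick_negated(int type, int dev, int upstream)
{
	if (!(type == TYPE_ROOT_PORT || type == TYPE_DOWNSTREAM))
		return upstream;
	else
		return dev;
}

/* Reversed spelling from the patch: same result, easier to read. */
static int pick_direct(int type, int dev, int upstream)
{
	if (type == TYPE_ROOT_PORT || type == TYPE_DOWNSTREAM)
		return dev;
	else
		return upstream;
}
```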
Re: [PATCH v11 08/16] PCI/ERR: Use "bridge" for clarity in pcie_do_recovery()
On 11/17/20 11:19 AM, Sean V Kelley wrote: pcie_do_recovery() may be called with "dev" being either a bridge (Root Port or Switch Downstream Port) or an Endpoint. The bulk of the function deals with the bridge, so if we start with an Endpoint, we reset "dev" to be the bridge leading to it. For clarity, replace "dev" in the body of the function with "bridge". No functional change intended. [bhelgaas: commit log, split pieces out so this is pure rename, also replace "dev" with "bridge" in messages and pci_uevent_ers()] Suggested-by: Bjorn Helgaas Link: https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Acked-by: Jonathan Cameron Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/err.c | 37 - 1 file changed, 20 insertions(+), 17 deletions(-) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 07/16] PCI/ERR: Simplify by computing pci_pcie_type() once
Hi, On 11/17/20 11:19 AM, Sean V Kelley wrote: Instead of calling pci_pcie_type(dev) twice, call it once and save the result. No functional change intended. The same optimization can be applied to drivers/pci/pcie/portdrv_pci.c and drivers/pci/pcie/aer.c. Can you fix them together? [bhelgaas: split to separate patch] Link: https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Acked-by: Jonathan Cameron --- drivers/pci/pcie/err.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 05f61da5ed9d..7a5af873d8bc 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -150,6 +150,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_channel_state_t state, pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev)) { + int type = pci_pcie_type(dev); pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; struct pci_bus *bus; @@ -157,8 +158,8 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, * Error recovery runs on all subordinates of the first downstream port. * If the downstream port detected the error, it is cleared at the end. */ - if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || - pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)) + if (!(type == PCI_EXP_TYPE_ROOT_PORT || + type == PCI_EXP_TYPE_DOWNSTREAM)) dev = pci_upstream_bridge(dev); bus = dev->subordinate; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 06/16] PCI/ERR: Simplify by using pci_upstream_bridge()
On 11/17/20 11:19 AM, Sean V Kelley wrote: Use pci_upstream_bridge() in place of dev->bus->self. No functional change intended. [bhelgaas: split to separate patch] Link: https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Acked-by: Jonathan Cameron Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/err.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index db149c6ce4fb..05f61da5ed9d 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -159,7 +159,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, */ if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT || pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)) - dev = dev->bus->self; + dev = pci_upstream_bridge(dev); Makes it easier to read. bus = dev->subordinate; pci_dbg(dev, "broadcast error_detected message\n"); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 05/16] PCI/ERR: Rename reset_link() to reset_subordinates()
Hi, On 11/17/20 11:19 AM, Sean V Kelley wrote: reset_link() appears to be misnamed. The point is to reset any devices below a given bridge, so rename it to reset_subordinates() to make it clear that we are passing a bridge with the intent to reset the devices below it. [bhelgaas: fix reset_subordinate_device() typo, shorten name] Suggested-by: Bjorn Helgaas Link: https://lore.kernel.org/r/20201002184735.1229220-5-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Bjorn Helgaas Acked-by: Jonathan Cameron Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pci.h | 4 ++-- drivers/pci/pcie/err.c | 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 12dcad8f3635..3c4570a3058f 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -572,8 +572,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev) /* PCI error reporting and recovery */ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, - pci_channel_state_t state, - pci_ers_result_t (*reset_link)(struct pci_dev *pdev)); + pci_channel_state_t state, + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev)); bool pcie_wait_for_link(struct pci_dev *pdev, bool active); #ifdef CONFIG_PCIEASPM diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..db149c6ce4fb 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -147,8 +147,8 @@ static int report_resume(struct pci_dev *dev, void *data) } pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, - pci_channel_state_t state, - pci_ers_result_t (*reset_link)(struct pci_dev *pdev)) + pci_channel_state_t state, + pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev)) { pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; struct pci_bus *bus; @@ -165,9 +165,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, 
report_frozen_detected, &status); - status = reset_link(dev); + status = reset_subordinates(dev); if (status != PCI_ERS_RESULT_RECOVERED) { - pci_warn(dev, "link reset failed\n"); + pci_warn(dev, "subordinate device reset failed\n"); goto failed; } } else { -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 03/16] PCI/RCEC: Bind RCEC devices to the Root Port driver
On 11/17/20 11:19 AM, Sean V Kelley wrote: From: Qiuxu Zhuo If a Root Complex Integrated Endpoint (RCiEP) is implemented, it may signal errors through a Root Complex Event Collector (RCEC). Each RCiEP must be associated with no more than one RCEC. For an RCEC (which is technically not a Bridge), error messages "received" from associated RCiEPs must be enabled for "transmission" in order to cause a System Error via the Root Control register or (when the Advanced Error Reporting Capability is present) reporting via the Root Error Command register and logging in the Root Error Status register and Error Source Identification register. Given the commonality with Root Ports and the need to also support AER and PME services for RCECs, extend the Root Port driver to support RCEC devices by adding the RCEC Class ID to the driver structure. Co-developed-by: Sean V Kelley Link: https://lore.kernel.org/r/20201002184735.1229220-3-seanvk@oregontracks.org Signed-off-by: Sean V Kelley Signed-off-by: Qiuxu Zhuo Signed-off-by: Bjorn Helgaas Reviewed-by: Jonathan Cameron Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/pci/pcie/portdrv_pci.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 3a3ce40ae1ab..4d880679b9b1 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -106,7 +106,8 @@ static int pcie_portdrv_probe(struct pci_dev *dev, if (!pci_is_pcie(dev) || ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) && -(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) +(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM) && +(pci_pcie_type(dev) != PCI_EXP_TYPE_RC_EC))) return -ENODEV; status = pcie_port_device_register(dev); @@ -195,6 +196,8 @@ static const struct pci_device_id port_pci_ids[] = { { PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) }, /* subtractive decode PCI-to-PCI bridge, class type is 060401h */ { 
PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x01), ~0) }, + /* handle any Root Complex Event Collector */ + { PCI_DEVICE_CLASS(((PCI_CLASS_SYSTEM_RCEC << 8) | 0x00), ~0) }, { }, }; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 02/16] PCI/RCEC: Add RCEC class code and extended capability
Hi, On 11/17/20 11:19 AM, Sean V Kelley wrote: From: Qiuxu Zhuo A PCIe Root Complex Event Collector (RCEC) has base class 0x08, sub-class 0x07, and programming interface 0x00. Add the class code 0x0807 to identify RCEC devices and add #defines for the RCEC Endpoint Association Extended Capability. See PCIe r5.0, sec 1.3.4 ("Root Complex Event Collector") and sec 7.9.10 ("Root Complex Event Collector Endpoint Association Extended Capability"). Why not merge this change with the usage patch? Keeping the changes together will help if the code ever needs to be reverted. Link: https://lore.kernel.org/r/20201002184735.1229220-2-seanvk@oregontracks.org Signed-off-by: Qiuxu Zhuo Signed-off-by: Bjorn Helgaas Reviewed-by: Jonathan Cameron --- include/linux/pci_ids.h | 1 + include/uapi/linux/pci_regs.h | 7 +++ 2 files changed, 8 insertions(+) diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index 1ab1e24bcbce..d8156a5dbee8 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -81,6 +81,7 @@ #define PCI_CLASS_SYSTEM_RTC 0x0803 #define PCI_CLASS_SYSTEM_PCI_HOTPLUG 0x0804 #define PCI_CLASS_SYSTEM_SDHCI 0x0805 +#define PCI_CLASS_SYSTEM_RCEC 0x0807 #define PCI_CLASS_SYSTEM_OTHER 0x0880 #define PCI_BASE_CLASS_INPUT 0x09 diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index a95d55f9f257..bccd3e35cb65 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -831,6 +831,13 @@ #define PCI_PWR_CAP_BUDGET(x) ((x) & 1) /* Included in system budget */ #define PCI_EXT_CAP_PWR_SIZEOF 16 +/* Root Complex Event Collector Endpoint Association */ +#define PCI_RCEC_RCIEP_BITMAP 4 /* Associated Bitmap for RCiEPs */ +#define PCI_RCEC_BUSN 8 /* RCEC Associated Bus Numbers */ +#define PCI_RCEC_BUSN_REG_VER 0x02 /* Least version with BUSN present */ +#define PCI_RCEC_BUSN_NEXT(x) (((x) >> 8) & 0xff) +#define PCI_RCEC_BUSN_LAST(x) (((x) >> 16) & 0xff) + /* Vendor-Specific (VSEC, PCI_EXT_CAP_ID_VNDR) */ #define PCI_VNDR_HEADER 4 /*
Vendor-Specific Header */ #define PCI_VNDR_HEADER_ID(x) ((x) & 0xffff) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 01/16] AER: aer_root_reset() non-native handling
On 11/17/20 11:19 AM, Sean V Kelley wrote: If an OS has not been granted AER control via _OSC, then the OS should not make changes to PCI_ERR_ROOT_COMMAND and PCI_ERR_ROOT_STATUS related registers. Per section 4.5.1 of the System Firmware Intermediary (SFI) _OSC and DPC Updates ECN [1], this bit also covers these aspects of the PCI Express Advanced Error Reporting. Based on the above and earlier discussion [2], make the following changes: add a check for the native case (i.e., AER control via _OSC). Note that the current "clear, reset, enable" order suggests that the reset might cause errors that we should ignore. Lacking historical context, that order is retained. [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24, 2020, affecting PCI Firmware Specification, Rev. 3.2 https://members.pcisig.com/wg/PCI-SIG/document/14076 [2] https://lore.kernel.org/linux-pci/20201020162820.GA370938@bjorn-Precision-5520/ Reviewed-by: Kuppuswamy Sathyanarayanan -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v11 0/5] Simplify PCIe native ownership detection logic
Hi Bjorn, On 10/26/20 7:57 PM, Kuppuswamy Sathyanarayanan wrote: Currently, PCIe capabilities ownership status is detected by verifying the status of pcie_ports_native, pcie_ports_dpc_native and _OSC negotiated results (cached in struct pci_host_bridge ->native_* members). But this logic can be simplified, and we can use only struct pci_host_bridge ->native_* members to detect it. This patchset removes the distributed checks for pcie_ports_native, pcie_ports_dpc_native parameters. Any comments on this series? Changes since v10: * Addressed format issue reported by lkp test. Changes since v9: * Rebased on top of v5.10-rc1 Changes since v8: * Simplified setting _OSC ownership logic * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS. Changes since v7: * Fixed "fix array_size.cocci warnings". Changes since v6: * Created new patch for CONFIG_PCIEPORTBUS check in pci_init_host_bridge(). * Added warning message for a case when pcie_ports_native overrides _OSC negotiation result. Changes since v5: * Rebased on top of v5.8-rc1 Changes since v4: * Changed the patch set title (Original link: https://lkml.org/lkml/2020/5/26/1710) * Added AER/DPC dependency logic cleanup fixes. Kuppuswamy Sathyanarayanan (5): PCI: Conditionally initialize host bridge native_* members ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set. ACPI/PCI: Ignore _OSC DPC negotiation result if pcie_ports_dpc_native is set. PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic PCI/DPC: Move AER/DPC dependency checks out of DPC driver drivers/acpi/pci_root.c | 39 +++ drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/pci-acpi.c | 3 --- drivers/pci/pcie/aer.c | 2 +- drivers/pci/pcie/dpc.c | 3 --- drivers/pci/pcie/portdrv.h | 2 -- drivers/pci/pcie/portdrv_core.c | 13 --- drivers/pci/probe.c | 6 +++-- include/linux/acpi.h | 2 ++ include/linux/pci.h | 2 ++ 10 files changed, 44 insertions(+), 30 deletions(-) -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v8 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
Hi Bjorn, On 10/26/20 12:37 PM, Kuppuswamy Sathyanarayanan wrote: Currently, if the report_error_detected() or report_mmio_enabled() callbacks request PCI_ERS_RESULT_NEED_RESET, the current pcie_do_recovery() implementation does not do the requested explicit device reset, but instead just calls report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. So call pci_bus_reset() before triggering the ->slot_reset() callback. Any comments on this series? Signed-off-by: Kuppuswamy Sathyanarayanan Reviewed-by: Sinan Kaya Reviewed-by: Ashok Raj --- Changes since v7: * Rebased on top of v5.10-rc1. Changes since v6: * None. Changes since v5: * Added Ashok's Reviewed-by tag. Changes since v4: * Added check for pci_reset_bus() return value. drivers/pci/pcie/err.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..315a4d559c4c 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -152,6 +152,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, { pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER; struct pci_bus *bus; + int ret; /* * Error recovery runs on all subordinates of the first downstream port. @@ -181,11 +182,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, } if (status == PCI_ERS_RESULT_NEED_RESET) { - /* -* TODO: Should call platform-specific -* functions to reset slot before calling -* drivers' slot_reset callbacks? -*/ + ret = pci_reset_bus(dev); + if (ret < 0) { + pci_err(dev, "Failed to reset %d\n", ret); + status = PCI_ERS_RESULT_DISCONNECT; + goto failed; + } status = PCI_ERS_RESULT_RECOVERED; pci_dbg(dev, "broadcast slot_reset message\n"); pci_walk_bus(bus, report_slot_reset, &status); -- Sathyanarayanan Kuppuswamy Linux Kernel Developer
Re: [PATCH v4 5/6] PCI/ACPI: Replace open coded variant of resource_union()
On 11/2/20 1:00 PM, Andy Shevchenko wrote: Since we have resource_union() helper, let's utilize it here. Signed-off-by: Andy Shevchenko Cc: Kuppuswamy Sathyanarayanan Cc: Bjorn Helgaas Cc: linux-...@vger.kernel.org Reviewed-by: Rafael J. Wysocki Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/acpi/pci_root.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index c12b5fb3e8fb..0bf072cef6cf 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -722,9 +722,7 @@ static void acpi_pci_root_validate_resources(struct device *dev, * our resources no longer match the ACPI _CRS, but * the kernel resource tree doesn't allow overlaps. */ - if (resource_overlaps(res1, res2)) { - res2->start = min(res1->start, res2->start); - res2->end = max(res1->end, res2->end); + if (resource_union(res1, res2, res2)) { dev_info(dev, "host bridge window expanded to %pR; %pR ignored\n", res2, res1); free = true; -- Sathyanarayanan Kuppuswamy Linux Kernel Developer