Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Kuppuswamy, Sathyanarayanan




On 4/20/21 4:53 PM, Dan Williams wrote:

On Tue, Apr 20, 2021 at 4:12 PM Kuppuswamy, Sathyanarayanan
 wrote:
[..]

Also, do you *REALLY* need to do this from assembly?  Can't it be done
in the C wrapper?

It's common to all use cases of TDVMCALL (vendor specific, in/out, etc.),
so I added it here.




Can I ask a favor?

Please put a line break between quoted lines and your reply.


will do




That's not a good reason.  You could just as easily have a C wrapper
which all uses of TDVMCALL go through.


...because this runs together when reading otherwise.


Is there any reason for not preferring it in assembly code?
Also, using a wrapper will add more complication for the in/out instruction
substitution use case. Please check the use case in the following patch.
https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543


This commit still has open coded assembly for the TDVMCALL? I thought
we talked about it being unified with the common definition, or has
this patch not been reworked with that feedback yet? I expect there is
no performance reason why in/out need to get their own custom coded
TDVMCALL implementation. It should also be the case that failure
behaves the same as native in/out failure, i.e. all ones on read
failure and silent drops on write failure.



That link is for an older version. My next version addresses your review
comments (it re-uses the TDVMCALL() helper). Although the patch is ready, I am
waiting to fix other review comments before sending the next version. I
have only shared that link to explain the use case.
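For reference, here is a rough sketch of how the in/out case could go through
the common helper instead of open-coded assembly. The leaf number and register
layout below are my reading of the GHCI Instruction.IO call, and passing a NULL
output pointer is an assumption; none of this is taken verbatim from the posted
patch:

#define TDVMCALL_IO_INSTRUCTION	30	/* assumed GHCI sub-function number */

/* Hypothetical byte-sized port write routed through the common helper. */
static u64 tdg_io_out_byte(u16 port, u8 value)
{
	/* r12 = size (1 byte), r13 = direction (1 = write), r14 = port, r15 = value */
	return __tdvmcall(TDVMCALL_IO_INSTRUCTION, 1, 1, port, value, NULL);
}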

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Kuppuswamy, Sathyanarayanan




On 4/20/21 12:59 PM, Dave Hansen wrote:

On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:

approach is, it adds a few extra instructions for every
TDCALL use case when compared to distributed checks. Although
it's a bit less efficient, it's worth it to make the code more
readable.


What's a "distributed check"?


It should be "distributed TDVMCALL/TDCALL inline assembly calls"


It's still not clear to what that refers.


I am just comparing the performance cost of the generic TDCALL()/TDVMCALL()
function implementation with "usage specific" (GetQuote, MapGPA, HLT, etc.) custom
TDCALL()/TDVMCALL() inline assembly implementations.



This also doesn't talk at all about why this approach was chosen versus
inline assembly.  You're going to be asked "why not use inline asm?"

"To make the core more readable and less error prone." I have added this
info in above paragraph. Do you think we need more argument to
justify our approach?


Yes, you need much more justification.  That's pretty generic and
non-specific.

Readability is one of the main motivations for not choosing inline
assembly. Since the number of lines of instructions (with comments) is over
70, using inline assembly made it hard to read. Another reason is that, since we
are using many registers (R8-R15, R[A-D]X) in the TDVMCALL/TDCALL operation, we are
not sure whether some older compilers can honor our specified inline assembly
constraints.
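To make the concern concrete, here is an illustrative (not proposed) sketch of
roughly the register plumbing an inline-asm TDVMCALL would need. R10-R15 have
no constraint letters, so each one needs an explicit register variable, and the
list only grows once RDX/R8/R9/R14/R15 outputs are added; everything here is an
assumption for illustration only:

static inline u64 tdvmcall_inline_sketch(u64 fn, u64 a0, u64 a1)
{
	register u64 r10 asm("r10") = 0;	/* 0 = standard TDVMCALL */
	register u64 r11 asm("r11") = fn;
	register u64 r12 asm("r12") = a0;
	register u64 r13 asm("r13") = a1;
	u64 rax = 0;				/* TDCALL leaf 0 = TDG.VP.VMCALL */

	asm volatile(".byte 0x66, 0x0f, 0x01, 0xcc"	/* tdcall */
		     : "+a"(rax), "+r"(r10), "+r"(r11), "+r"(r12), "+r"(r13)
		     : "c"(0x3c00UL)	/* expose only R10-R13 in this sketch */
		     : "memory");

	return r10;	/* TDVMCALL leaf-specific status */
}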



+    /*
+ * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
+ * defined in the GHCI.  Note, RAX and RCX are consumed, but
only by
+ * TDX-Module and so don't need to be listed in the mask.
+ */


"GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
more context.

OK, will add it. Do you want a GHCI spec reference with the section ID here?


Absolutely not.  I dislike all of the section references as-is.  Doesn't
a comment like this say what you said above and also add context?

Expose every register currently used in the
guest-to-host communication interface (GHCI).

ok.



+    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+    tdcall
+
+    /* Panic if TDCALL reports failure. */
+    test %rax, %rax
+    jnz 2f


Why panic?

As per the spec, TDCALL itself should never fail. Note that this has nothing to do
with a TDVMCALL function-specific failure (which is reported via R10).


You've introduced two concepts here, without differentiating them.  You
need to work to differentiate these two kinds of failure somewhere.  You
can't simply refer to both as "failure".

Will clarify it. I had assumed that once the reader goes through the spec, it's easier
to understand.
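Something like the following could make the two failure classes explicit for
readers who have not gone through the spec. It is only a sketch, assuming the
__tdvmcall() prototype from this patch and that a NULL output pointer is
acceptable when no output registers are needed:

/* Hypothetical caller illustrating the two failure classes. */
static int tdg_example_vmcall(u64 leaf, u64 arg)
{
	/*
	 * If the TDCALL instruction itself fails (RAX != 0), the assembly
	 * stub never returns here -- that is a TDX-Module/spec violation
	 * and it panics.  What is returned here is R10, the leaf-specific
	 * TDVMCALL status, which is a normal runtime error.
	 */
	u64 ret = __tdvmcall(leaf, arg, 0, 0, 0, NULL);

	if (ret) {
		pr_warn("TDVMCALL leaf %llu failed: %#llx\n", leaf, ret);
		return -EIO;
	}

	return 0;
}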



Also, do you *REALLY* need to do this from assembly?  Can't it be done
in the C wrapper?

It's common to all use cases of TDVMCALL (vendor specific, in/out, etc.),
so I added it here.


That's not a good reason.  You could just as easily have a C wrapper
which all uses of TDVMCALL go through.

Is there any reason for not preferring it in assembly code?
Also, using a wrapper will add more complication for the in/out instruction
substitution use case. Please check the use case in the following patch.
https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543
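For completeness, a minimal sketch of the kind of single C funnel being
suggested (an assumption about its shape, not the posted patch): the logic that
is common to all TDVMCALL users lives in C, and the assembly stub stays a thin
register shuffle:

/* Hypothetical common C wrapper that every TDVMCALL user goes through. */
static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
	u64 err;

	/* __tdvmcall() returns the leaf-specific status from R10. */
	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);

	/* Common policy for every user, done once in C instead of in asm. */
	WARN_ONCE(err, "TDVMCALL leaf %llu failed: %#llx\n", fn, err);

	return err;
}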



+    /* Propagate TDVMCALL success/failure to return value. */
+    mov %r10, %rax


You just said it panic's on failure.  How can this propagate failure?

We use panic only for TDCALL failure. The R10 content is used to identify
whether the given TDVMCALL function operation is successful or not.


As I said above, please endeavor to differentiate the two classes of
failures.

Also, if the spec is violated, do you *REALLY* want to blow up the whole
world with a panic?  I guess you can argue that it could have been
something security-related that failed.  But, either way, you owe a
description of why panic'ing is a good idea, not just blindly deferring
to "the spec says this can't happen".

OK, will add some comments justifying our case.



+    xor %r10d, %r10d
+    xor %r11d, %r11d
+    xor %r12d, %r12d
+    xor %r13d, %r13d
+    xor %r14d, %r14d
+    xor %r15d, %r15d
+
+    pop %r12
+    pop %r13
+    pop %r14
+    pop %r15
+
+    FRAME_END
+    ret
+2:
+    ud2
+.endm
+
+SYM_FUNC_START(__tdvmcall)
+    xor %r10, %r10
+    tdvmcall_core
+SYM_FUNC_END(__tdvmcall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0d00dd50a6ff..1147e7e765d6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -3,6 +3,36 @@
     #include 
   +/*
+ * Wrapper for the common case with standard output value (R10).
+ */


... and oddly enough there is no explicit mention of R10 anywhere.  Why?

For a guest-to-host call, R10 holds the TDCALL function ID (which is 0 for
TDVMCALL), so we don't need a special argument.
After TDVMCALL execution, the R10 value is returned via RAX.


OK... so this is how it works.  But why mention R10 in the comment?  Why
is *THAT* worth mentioning?

It's not needed.

Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-04-20 Thread Kuppuswamy, Sathyanarayanan




On 4/20/21 10:36 AM, Dave Hansen wrote:

On 3/26/21 4:38 PM, Kuppuswamy Sathyanarayanan wrote:

Implement common helper functions to communicate with
the TDX Module and VMM (using TDCALL instruction).


This is missing any kind of background.  I'd say:

Guests communicate with VMMs with hypercalls. Historically, these are
implemented using instructions that are known to cause VMEXITs like
.  However, with TDX, VMEXITs no longer expose guest
state from the host.  This prevents the old hypercall mechanisms from
working

... and then go on to talk about what you are introducing, why there are
two of them and so forth.

Ok. I will add it.



__tdvmcall() function can be used to request services
from VMM.


^ "from a VMM" or "from the VMM", please


will use "from the VMM".

__tdcall() function can be used to communicate with the
TDX Module.

Using common helper functions makes the code more readable
and less error prone compared to distributed and use case
specific inline assembly code. Only downside in using this


 ^ "The only downside..."

will fix it.



approach is, it adds a few extra instructions for every
TDCALL use case when compared to distributed checks. Although
it's a bit less efficient, it's worth it to make the code more
readable.


What's a "distributed check"?


It should be "distributed TDVMCALL/TDCALL inline assembly calls"


This also doesn't talk at all about why this approach was chosen versus
inline assembly.  You're going to be asked "why not use inline asm?"

"To make the core more readable and less error prone." I have added this info
in above paragraph. Do you think we need more argument to justify our approach?



--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,35 @@
  #ifdef CONFIG_INTEL_TDX_GUEST
  
  #include 

+#include 
+
+struct tdcall_output {
+   u64 rcx;
+   u64 rdx;
+   u64 r8;
+   u64 r9;
+   u64 r10;
+   u64 r11;
+};
+
+struct tdvmcall_output {
+   u64 r11;
+   u64 r12;
+   u64 r13;
+   u64 r14;
+   u64 r15;
+};
  
  /* Common API to check TDX support in decompression and common kernel code. */

  bool is_tdx_guest(void);
  
  void __init tdx_early_init(void);
  
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);

+
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+  struct tdvmcall_output *out);


Some one-liner comments about what these do would be nice.

will do.
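For instance, something along these lines (the wording is only a suggestion):

/* Communicate with the TDX Module (e.g. to get TD info or #VE info). */
u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);

/* Request services from the VMM via TDVMCALL (TDCALL leaf 0). */
u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
	       struct tdvmcall_output *out);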



  #else // !CONFIG_INTEL_TDX_GUEST
  
  static inline bool is_tdx_guest(void)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)+= pvclock.o
  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
  
  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o

-obj-$(CONFIG_INTEL_TDX_GUEST)  += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)  += tdcall.o tdx.o
  
  obj-$(CONFIG_EISA)		+= eisa.o

  obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..72de0b49467e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
  #include 
  #endif
  
+#ifdef CONFIG_INTEL_TDX_GUEST

+#include 
+#endif
+
  #ifdef CONFIG_X86_32
  # include "asm-offsets_32.c"
  #else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
  #endif
  
+#ifdef CONFIG_INTEL_TDX_GUEST

+   BLANK();
+   /* Offset for fields in tdcall_output */
+   OFFSET(TDCALL_rcx, tdcall_output, rcx);
+   OFFSET(TDCALL_rdx, tdcall_output, rdx);
+   OFFSET(TDCALL_r8, tdcall_output, r8);
+   OFFSET(TDCALL_r9, tdcall_output, r9);
+   OFFSET(TDCALL_r10, tdcall_output, r10);
+   OFFSET(TDCALL_r11, tdcall_output, r11);


 ^ vertically align this


will fix it.

+   /* Offset for fields in tdvmcall_output */
+   OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+   OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+   OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+   OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+   OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index ..a73b67c0b407
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define TDVMCALL_EXPOSE_REGS_MASK  0xfc00


This looks like an undocumented magic number.
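One way to make it self-documenting, sketched here as an assumption (bit n of
the RCX mask passed to TDCALL selects GPR n, so bits 10-15 expose R10-R15 and
0xfc00 is just that range spelled out; this also assumes BIT() expands cleanly
in this .S file):

/*
 * Bits 10-15 of RCX expose R10-R15 to the VMM; RAX and RCX are consumed
 * by the TDX-Module itself and need no bit here.
 */
#define TDVMCALL_EXPOSE_REGS_MASK	\
	(BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15))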


+/*
+ * TDCALL instruction is newly added in TDX architecture,

Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface

2021-04-01 Thread Kuppuswamy, Sathyanarayanan




On 4/1/21 2:19 PM, Dave Hansen wrote:

On 4/1/21 2:15 PM, Kuppuswamy, Sathyanarayanan wrote:

On 4/1/21 2:08 PM, Dave Hansen wrote:

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:

+bool is_tdx_guest(void)
+{
+    return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}


Why do you need is_tdx_guest() as opposed to calling
cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?


is_tdx_guest() is also implemented/used in compressed
code (which uses native_cpuid calls). I don't think
we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
compressed code right? Also is_tdx_guest() looks easy
to read and use.


OK, but how many of the is_tdx_guest() uses are in the compressed code?
  Why has its use spread beyond that?

It's only used in handling in/out instructions in compressed code. But this
code is shared with the in/out handling in non-compressed code.

#define __out(bwl, bw)						\
do {								\
	if (is_tdx_guest()) {					\
		asm volatile("call tdg_out" #bwl : :		\
				"a"(value), "d"(port));		\
	} else {						\
		asm volatile("out" #bwl " %" #bw "0, %w1" : :	\
				"a"(value), "Nd"(port));	\
	}							\
} while (0)

#define __in(bwl, bw)						\
do {								\
	if (is_tdx_guest()) {					\
		asm volatile("call tdg_in" #bwl :		\
				"=a"(value) : "d"(port));	\
	} else {						\
		asm volatile("in" #bwl " %w1, %" #bw "0" :	\
				"=a"(value) : "Nd"(port));	\
	}							\
} while (0)
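The macros are then consumed by the regular port I/O helpers, roughly like the
following simplified illustration (the real outb()/inb() wiring in the series
may differ):

static inline void outb(u8 value, u16 port)
{
	__out(b, b);		/* expands to tdg_outb() or a plain outb */
}

static inline u8 inb(u16 port)
{
	u8 value;

	__in(b, b);		/* expands to tdg_inb() or a plain inb */
	return value;
}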





--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface

2021-04-01 Thread Kuppuswamy, Sathyanarayanan




On 4/1/21 2:08 PM, Dave Hansen wrote:

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:

+bool is_tdx_guest(void)
+{
+   return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}


Why do you need is_tdx_guest() as opposed to calling
cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?


is_tdx_guest() is also implemented/used in compressed
code (which uses native_cpuid calls). I don't think
we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
compressed code right? Also is_tdx_guest() looks easy
to read and use.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[PATCH v5 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Kuppuswamy Sathyanarayanan
When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

You can find the list of unsupported instructions in Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.
   
To prevent the TD guest from using these unsupported instructions,
the following measures are adopted:

1. For the MWAIT/MONITOR instructions, support is already disabled
by the TDX module (SEAM), so the CPUID flags for these instructions
should be in the disabled state. Also, just to be sure that these
instructions are disabled, forcefully unset the X86_FEATURE_MWAIT
CPU cap in the OS.

2. For the WBINVD instruction, audit the code that uses this
instruction and disable those users for the TD.

If the TD guest still executes these instructions after the above
preventive measures, print appropriate warning messages from the #VE
handler.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since v4:
 * Fixed commit log and comments as per Dave's comments
 * Used WARN_ONCE for MWAIT/MONITOR #VE.
 * Removed X86_FEATURE_MWAIT suppression code.

Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fix the commit log and comments to address review comments
   from Dave & Sean.

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..9bc84caf4096 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -362,6 +362,22 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   case EXIT_REASON_WBINVD:
+   /*
+* WBINVD is not supported inside TDX guests. All in-
+* kernel uses should have been disabled.
+*/
+   pr_err("TD Guest used unsupported WBINVD instruction\n");
+   BUG();
+   break;
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /*
+* Something in the kernel used MONITOR or MWAIT despite
+* X86_FEATURE_MWAIT being cleared for TDX guests.
+*/
+   WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+   break;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Kuppuswamy, Sathyanarayanan




On 3/31/21 2:49 PM, Dave Hansen wrote:

On 3/31/21 2:09 PM, Kuppuswamy Sathyanarayanan wrote:

As per Guest-Host Communication Interface (GHCI) Specification
for Intel TDX, sec 2.4.1, TDX architecture does not support
MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
if MWAIT/MONITOR instructions are executed with CPL != 0 it will
trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
triggered. WBINVD instruction behavior is also similar to
MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
of #UD.


Could we give it a go to try this in plain English before jumping in and
quoting the exact spec section?  Also, the CPL language is nice and
precise for talking inside Intel, but it's generally easier for me to
read kernel descriptions when we just talk about the kernel.

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work.  If the guest kernel
uses these instructions, the hardware generates a #VE.

I will fix it in the next version.


Which reminds me...  The SDM says: MWAIT will "#UD ... If
CPUID.01H:ECX.MONITOR[bit 3] = 0".  So, is this an architectural change?
  The guest is *supposed* to see that CPUID bit as 0, so shouldn't it
also get a #UD?  Or is this all so that if SEAM *forgets* to clear the
CPUID bit, the guest gets #VE?

AFAIK, we are only concerned about the case where the instruction support
is not disabled by SEAM. For the disabled case, it should get #UD.
Sean, can you confirm it?


What are we *actually* mitigating here?

We add support for handling the #VE that is raised when the TD guest
kernel executes one of these unsupported instructions.


Also, FWIW, MWAIT/MONITOR and WBINVD are pretty different beasts.  I
think this would all have been a lot more clear if this would have been
two patches instead of shoehorning them into one.

Since all of them are unsupported instructions, I have grouped them
together. Even if we split it, there would be some duplication in the commit
log (since the handling is similar). But let me know if that is the desired
approach. I can split it into two patches.



To prevent the TD guest from using these unsupported instructions,
the following measures are adopted:

1. For the MWAIT/MONITOR instructions, support is already disabled
by the TDX module (SEAM), so the CPUID flags for these instructions
should be in the disabled state. Also, just to be sure that these
instructions are disabled, forcefully unset the X86_FEATURE_MWAIT
CPU cap in the OS.

2. For the WBINVD instruction, audit the code that uses this
instruction and disable those users for the TD.


Really?  Where are those patches?

For MWAIT/MONITOR, the change is included in the same patch.
For WBINVD, we will have some patches included in the next
series.



+static inline bool cpuid_has_mwait(void)
+{
+   if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
+   return true;
+
+   return false;
+}
+
  bool is_tdx_guest(void)
  {
return static_cpu_has(X86_FEATURE_TDX_GUEST);
@@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct 
ve_info *ve)
return insn.length;
  }
  
+/* Initialize TDX specific CPU capabilities */

+static void __init tdx_cpu_cap_init(void)
+{
+   setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+   if (cpuid_has_mwait()) {
+   WARN(1, "TDX Module failed to disable MWAIT\n");


WARN(1, "TDX guest enumerated support for MWAIT, disabling it").

Will fix it in the next version.



+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+   }
+
+}


Extra newline.


  void __init tdx_early_init(void)
  {
if (!cpuid_has_tdx_guest())
return;
  
-	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+   tdx_cpu_cap_init();
  
  	tdg_get_info();
  
@@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,

case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   case EXIT_REASON_WBINVD:
+   /*
+* TDX architecture does not support WBINVD instruction.
+* Currently, usage of this instruction is prevented by
+* disabling the drivers which uses it. So if we still
+* reach here, it needs user attention.
+*/


This comment is awfully vague.  "TDX architecture..." what?  Any CPUs
supporting the TDX architecture?  TDX VMM's?  TDX Guests?

Let's also not waste byte on stating the obvious.  If it didn't need
attention we wouldn't be warning about it, eh?

So, let's halve the size of the comment and say:

/*
 * WBINVD is not supported inside TDX guests.  All in-
 * kernel uses should have been disabled.
 */

OK, will fix it in the next version.



+   pr_err("TD Guest us

Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Kuppuswamy, Sathyanarayanan




On 3/31/21 3:11 PM, Dave Hansen wrote:

On 3/31/21 3:06 PM, Sean Christopherson wrote:

I've no objection to a nice message in the #VE handler.  What I'm objecting to
is sanity checking the CPUID model provided by the TDX module.  If we don't
trust the TDX module to honor the spec, then there are a huge pile of things
that are far higher priority than MONITOR/MWAIT.


In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
check it to comply with the spec.  If something doesn't comply, we'll
get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
programming here.

That sounds sane to me.

But I think there are cases (like MCE) where SEAM does not disable them because
there will be future support for them. We should at least suppress such features
in the kernel.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC v1 00/26] Add TDX Guest Support

2021-03-31 Thread Kuppuswamy, Sathyanarayanan

Hi All,

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.


We are currently working on a solution to fix the issues raised in the
"Add #VE support for TDX guest" patch. While we fix that issue, I would
like to know if there are issues in the other patches in this series. So, if
possible, can you please review the other patches in the series and let us
know your comments?

If you want me to rebase the series on top of a v5.12-rcX kernel and repost it,
please let me know.



Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately. Also support for advanced TD guest features like attestation
or debug-mode will be submitted separately. Also, at this point it is not
secure with some known holes in drivers, and also hasn’t been fully audited
and fuzzed yet.

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
into the guest TD in cases of instructions that need to be emulated, disallowed
MSR accesses, a subset of CPUID leaves, etc.
3. By default memory is marked as private, and the TD will selectively share it
with the VMM based on need.
4. Remote attestation is supported to enable a third party (either the owner of
the workload or a user of the services provided by the workload) to establish
that the workload is running on an Intel-TDX-enabled platform located within a
TD prior to providing that workload data.

You can find TDX related documents in the following link.

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

This RFC series has been reviewed by Dave Hansen.

Kirill A. Shutemov (16):
   x86/paravirt: Introduce CONFIG_PARAVIRT_XL
   x86/tdx: Get TD execution environment information via TDINFO
   x86/traps: Add #VE support for TDX guest
   x86/tdx: Add HLT support for TDX guest
   x86/tdx: Wire up KVM hypercalls
   x86/tdx: Add MSR support for TDX guest
   x86/tdx: Handle CPUID via #VE
   x86/io: Allow to override inX() and outX() implementation
   x86/tdx: Handle port I/O
   x86/tdx: Handle in-kernel MMIO
   x86/mm: Move force_dma_unencrypted() to common code
   x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
   x86/tdx: Make pages shared in ioremap()
   x86/tdx: Add helper to do MapGPA TDVMCALL
   x86/tdx: Make DMA pages shared
   x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (6):
   x86/cpufeatures: Add TDX Guest CPU feature
   x86/cpufeatures: Add is_tdx_guest() interface
   x86/tdx: Handle MWAIT, MONITOR and WBINVD
   ACPI: tables: Add multiprocessor wake-up support
   x86/topology: Disable CPU hotplug support for TDX platforms.
   x86/tdx: Introduce INTEL_TDX_GUEST config option

Sean Christopherson (4):
   x86/boot: Add a trampoline for APs booting in 64-bit mode
   x86/boot: Avoid #VE during compressed boot for TDX platforms
   x86/boot: Avoid unnecessary #VE during boot process
   x86/tdx: Forcefully disable legacy PIC for TDX guests

  arch/x86/Kconfig |  28 +-
  arch/x86/boot/compressed/Makefile|   2 +
  arch/x86/boot/compressed/head_64.S   |  10 +-
  arch/x86/boot/compressed/misc.h  |   1 +
  arch/x86/boot/compressed/pgtable.h   |   2 +-
  arch/x86/boot/compressed/tdx.c   |  32 ++
  arch/x86/boot/compressed/tdx_io.S|   9 +
  arch/x86/include/asm/apic.h  |   3 +
  arch/x86/include/asm/asm-prototypes.h|   1 +
  arch/x86/include/asm/cpufeatures.h   |   1 +
  arch/x86/include/asm/idtentry.h  |   4 +
  arch/x86/include/asm/io.h|  25 +-
  arch/x86/include/asm/irqflags.h  |  42 +-
  arch/x86/include/asm/kvm_para.h  |  21 +
  arch/x86/include/asm/paravirt.h  |  22 +-
  arch/x86/include/asm/paravirt_types.h|   3 +-
  arch/x

[PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Kuppuswamy Sathyanarayanan
As per Guest-Host Communication Interface (GHCI) Specification
for Intel TDX, sec 2.4.1, TDX architecture does not support
MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
if MWAIT/MONITOR instructions are executed with CPL != 0 it will
trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
triggered. WBINVD instruction behavior is also similar to
MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
of #UD.

To prevent the TD guest from using these unsupported instructions,
the following measures are adopted:

1. For the MWAIT/MONITOR instructions, support is already disabled
by the TDX module (SEAM), so the CPUID flags for these instructions
should be in the disabled state. Also, just to be sure that these
instructions are disabled, forcefully unset the X86_FEATURE_MWAIT
CPU cap in the OS.

2. For the WBINVD instruction, audit the code that uses this
instruction and disable those users for the TD.

If the TD guest still executes these instructions after the above
preventive measures, print appropriate warning messages from the #VE
handler.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fix the commit log and comments to address review comments
   from Dave & Sean.

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 44 ++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..82b411b828a5 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -63,6 +63,14 @@ static inline bool cpuid_has_tdx_guest(void)
return true;
 }
 
+static inline bool cpuid_has_mwait(void)
+{
+   if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
+   return true;
+
+   return false;
+}
+
 bool is_tdx_guest(void)
 {
return static_cpu_has(X86_FEATURE_TDX_GUEST);
@@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct 
ve_info *ve)
return insn.length;
 }
 
+/* Initialize TDX specific CPU capabilities */
+static void __init tdx_cpu_cap_init(void)
+{
+   setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+   if (cpuid_has_mwait()) {
+   WARN(1, "TDX Module failed to disable MWAIT\n");
+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+   }
+
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
return;
 
-   setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+   tdx_cpu_cap_init();
 
tdg_get_info();
 
@@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   case EXIT_REASON_WBINVD:
+   /*
+* TDX architecture does not support WBINVD instruction.
+* Currently, usage of this instruction is prevented by
+* disabling the drivers which uses it. So if we still
+* reach here, it needs user attention.
+*/
+   pr_err("TD Guest used unsupported WBINVD instruction\n");
+   BUG();
+   break;
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /*
+* MWAIT/MONITOR features are disabled by TDX Module (SEAM)
+* and also re-suppressed in kernel by clearing
+* X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
+* if TD guest still executes MWAIT/MONITOR instruction with
+* above suppression, it needs user attention.
+*/
+   WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+   break;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-30 Thread Kuppuswamy, Sathyanarayanan




On 3/30/21 8:10 AM, Dave Hansen wrote:

On 3/30/21 8:00 AM, Andi Kleen wrote:

+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);

In fact, the MWAIT bit returned by the CPUID instruction is zero for a TD guest. This
is enforced by the SEAM module.

Good point.

Do we still need to safeguard it by setup_clear_cpu_cap() here?

I guess it doesn't hurt to do it explicitly.


If this MWAIT behavior (clearing the CPUID bit) is part of the guest
architecture, then it would also be appropriate to WARN() rather than
silently clearing the X86_FEATURE bit.

Makes sense. It will be useful for finding problems with the TDX Module.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Kuppuswamy, Sathyanarayanan




On 3/29/21 4:23 PM, Andy Lutomirski wrote:



On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan 
 wrote:

In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions
appropriately.


Is there something I missed elsewhere in the code that checks CPL?

We don't check the CPL explicitly. But if we are reaching here, then we are
executing these instructions with the wrong CPL.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Kuppuswamy Sathyanarayanan
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions
appropriately.

Since the impact of executing the WBINVD instruction in non-ring-0 mode
can be heavy, use BUG() to report it. For the others, raise a WARNING
message.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..4c6336a055a3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
tdg_get_info();
 
pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,32 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   case EXIT_REASON_WBINVD:
+   /*
+* WBINVD is a privileged instruction, can only be executed
+* in ring 0. Since we reached here, the kernel is in buggy
+* state.
+*/
+   pr_err("WBINVD #VE Exception\n");
+   BUG();
+   break;
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   /*
+* MONITOR is a privileged instruction, can only be executed
+* in ring 0. So we are not supposed to reach here. Raise a
+* warning message.
+*/
+   WARN(1, "MONITOR unexpected #VE Exception\n");
+   break;
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /*
+* MWAIT feature is suppressed in firmware and in
+* tdx_early_init() by clearing X86_FEATURE_MWAIT CPU feature
+* flag. Since we are not supposed to reach here, raise a
+* warning message and return -EFAULT.
+*/
+   WARN(1, "MWAIT unexpected #VE Exception\n");
+   return -EFAULT;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Kuppuswamy, Sathyanarayanan




On 3/29/21 3:12 PM, Dave Hansen wrote:

On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote:

+    case EXIT_REASON_MWAIT_INSTRUCTION:
+    /* MWAIT is supressed, not supposed to reach here. */
+    WARN(1, "MWAIT unexpected #VE Exception\n");
+    return -EFAULT;


How is MWAIT "supppressed"?

I am clearing the MWAIT feature flag in early init code. We should also
disable this feature in firmware.
setup_clear_cpu_cap(X86_FEATURE_MWAIT);


I'd be more explicit about that.  Maybe even reference the code that
clears the X86_FEATURE.

This change is part of the same patch.


Right, but if someone goes and looks at the switch() statement in 10
years is it going to be obvious how MWAIT was "suppressed"?

Ok. I can add a comment line for it.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Kuppuswamy, Sathyanarayanan




On 3/29/21 3:02 PM, Dave Hansen wrote:

On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote:


MONITOR is a privileged instruction, right?  So we can only end up in
here if the kernel screws up and isn't reading CPUID correctly, right?

That dosen't seem to me like something we want to suppress.  This needs
a warning, at least.  I assume that having a MONITOR instruction
immediately return doesn't do any harm.

Agree. Since we are not supposed to come here, I will use BUG.


"This is unexpected" is a WARN()able offense.

"This is unexpected and might be corrupting data" is where we want to
use BUG().

Does broken MONITOR risk data corruption?

We will be reaching this point only if something is buggy in the kernel. I am
not sure about the impact of this buggy state. But the MONITOR instruction by
itself should not cause data corruption.




+    case EXIT_REASON_MWAIT_INSTRUCTION:
+    /* MWAIT is supressed, not supposed to reach here. */
+    WARN(1, "MWAIT unexpected #VE Exception\n");
+    return -EFAULT;


How is MWAIT "supppressed"?

I am clearing the MWAIT feature flag in early init code. We should also
disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT);


I'd be more explicit about that.  Maybe even reference the code that
clears the X86_FEATURE.

This change is part of the same patch.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Kuppuswamy, Sathyanarayanan




On 3/29/21 10:14 AM, Dave Hansen wrote:

On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote:

+   /*
+* Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+* Domain Extensions (Intel TDX) specification, sec 2.4,
+* some instructions that unconditionally cause #VE (such as WBINVD,
+* MONITOR, MWAIT) do not have corresponding TDCALL
+* [TDG.VP.VMCALL ] leaves, since the TD has been designed
+* with no deterministic way to confirm the result of those operations
+* performed by the host VMM.  In those cases, the goal is for the TD
+* #VE handler to increment the RIP appropriately based on the VE
+* information provided via TDCALL.
+*/




That's an awfully big comment.  Could you pare it down, please?  Maybe
focus on the fact that we should never get here and why, rather than
talking about some silly spec?

I will remove this and add an individual one-line comment for the WBINVD and MONITOR
instructions. Something like "Privileged instruction, can only be executed
in ring 0. So raise a BUG."



+   case EXIT_REASON_WBINVD:
+   pr_warn_once("WBINVD #VE Exception\n");


I actually think WBINVD in here should oops.  We use it for some really
important things.  If it can't be executed, and we're depending on it,
the kernel is in deep, deep trouble.

Agree. I will call BUG().


I think a noop here is dangerous.


+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   /* Handle as nops. */
+   break;


MONITOR is a privileged instruction, right?  So we can only end up in
here if the kernel screws up and isn't reading CPUID correctly, right?

That dosen't seem to me like something we want to suppress.  This needs
a warning, at least.  I assume that having a MONITOR instruction
immediately return doesn't do any harm.

Agree. Since we are not supposed to come here, I will use BUG.



+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /* MWAIT is supressed, not supposed to reach here. */
+   WARN(1, "MWAIT unexpected #VE Exception\n");
+   return -EFAULT;


How is MWAIT "supppressed"?

I am clearing the MWAIT feature flag in early init code. We should also disable
this feature in firmware.
setup_clear_cpu_cap(X86_FEATURE_MWAIT);




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-27 Thread Kuppuswamy, Sathyanarayanan




On 3/16/21 9:13 PM, Lukas Wunner wrote:

On Fri, Mar 12, 2021 at 07:32:08PM -0800, 
sathyanarayanan.kuppusw...@linux.intel.com wrote:

+   if ((events == PCI_EXP_SLTSTA_DLLSC) && is_dpc_reset_active(pdev)) {
+   ctrl_info(ctrl, "Slot(%s): DLLSC event(DPC), skipped\n",
+ slot_name(ctrl));
+   ret = IRQ_HANDLED;
+   goto out;
+   }


Two problems here:

(1) If recovery fails, the link will *remain* down, so there'll be
 no Link Up event.  You've filtered the Link Down event, thus the
 slot will remain in ON_STATE even though the device in the slot is
 no longer accessible.  That's not good, the slot should be brought
 down in this case.

(2) If recovery succeeds, there's a race where pciehp may call
 is_dpc_reset_active() *after* dpc_reset_link() has finished.
 So both the DPC Trigger Status bit as well as pdev->dpc_reset_active
 will be cleared.  Thus, the Link Up event is not filtered by pciehp
 and the slot is brought down and back up even though DPC recovery
 was succesful, which seems undesirable.

The only viable solution I see is to wait until DPC has completed.
Sinan (+cc) proposed something along those lines a couple years back:

https://patchwork.ozlabs.org/project/linux-pci/patch/20180818065126.77912-1-ok...@kernel.org/

Included below please find my suggestion for how to fix this.
I've sort of combined yours and Sinan's approach, but I'm
using a waitqueue (Sinan used polling) and I'm using atomic bitops
on pdev->priv_flags (you're using an atomic_t instead, which needs
additionally space in struct pci_dev).  Note: It's compile-tested
only, I don't have any DPC-capable hardware at my disposal.

Would this work for you?  If so, I can add a commit message to the
patch and submit it properly.  Let me know what you think.  Thanks!





-- >8 --
Subject: [PATCH] PCI: pciehp: Ignore Link Down/Up caused by DPC

Signed-off-by: Lukas Wunner 
---
  drivers/pci/hotplug/pciehp_hpc.c | 11 +
  drivers/pci/pci.h|  4 
  drivers/pci/pcie/dpc.c   | 51 ++--
  3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index fb3840e..bcc018e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -707,6 +707,17 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
}
  
  	/*

+* Ignore Link Down/Up caused by Downstream Port Containment
+* if recovery from the error succeeded.
+*/
+   if ((events & PCI_EXP_SLTSTA_DLLSC) && pci_dpc_recovered(pdev) &&
+   ctrl->state == ON_STATE) {
+   atomic_and(~PCI_EXP_SLTSTA_DLLSC, &ctrl->pending_events);

Why modify pending_events here? It should already be zero, right?

+   if (pciehp_check_link_active(ctrl) > 0)
+   events &= ~PCI_EXP_SLTSTA_DLLSC;
+   }
+
+   /*
 * Disable requests have higher priority than Presence Detect Changed
 * or Data Link Layer State Changed events.
 */
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9684b46..e5ae5e8 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -392,6 +392,8 @@ static inline bool pci_dev_is_disconnected(const struct 
pci_dev *dev)
  
  /* pci_dev priv_flags */

  #define PCI_DEV_ADDED 0
+#define PCI_DPC_RECOVERED 1
+#define PCI_DPC_RECOVERING 2
  
  static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)

  {
@@ -446,10 +448,12 @@ struct rcec_ea {
  void pci_dpc_init(struct pci_dev *pdev);
  void dpc_process_error(struct pci_dev *pdev);
  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_dpc_recovered(struct pci_dev *pdev);
  #else
  static inline void pci_save_dpc_state(struct pci_dev *dev) {}
  static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
  static inline void pci_dpc_init(struct pci_dev *pdev) {}
+static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
  #endif
  
  #ifdef CONFIG_PCIEPORTBUS

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba8..7328d9c4 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,44 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
  }
  
+static bool dpc_completed(struct pci_dev *pdev)

+{
+   u16 status;
+
+   pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
+   if (status & PCI_EXP_DPC_STATUS_TRIGGER)
+   return false;
+
+   if (test_bit(PCI_DPC_RECOVERING, &pdev->priv_flags))
+   return false;
+
+   return true;
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(dpc_completed_waitqueue);
+
+bool pci_dpc_recovered(struct pci_dev *pdev)
+{
+   struct pci_host_bridge *host;
+
+   if (!pdev->dpc_cap)
+   return false;
+
+   /*
+* 

[PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-27 Thread Kuppuswamy Sathyanarayanan
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions as no ops.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..3c6158a53c17 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
tdg_get_info();
 
pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   /*
+* Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+* Domain Extensions (Intel TDX) specification, sec 2.4,
+* some instructions that unconditionally cause #VE (such as WBINVD,
+* MONITOR, MWAIT) do not have corresponding TDCALL
+* [TDG.VP.VMCALL ] leaves, since the TD has been designed
+* with no deterministic way to confirm the result of those operations
+* performed by the host VMM.  In those cases, the goal is for the TD
+* #VE handler to increment the RIP appropriately based on the VE
+* information provided via TDCALL.
+*/
+   case EXIT_REASON_WBINVD:
+   pr_warn_once("WBINVD #VE Exception\n");
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   /* Handle as nops. */
+   break;
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /* MWAIT is supressed, not supposed to reach here. */
+   WARN(1, "MWAIT unexpected #VE Exception\n");
+   return -EFAULT;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-26 Thread Kuppuswamy, Sathyanarayanan




On 3/26/21 7:40 PM, Andy Lutomirski wrote:




On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan 
 wrote:

In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions as no ops.


These should at least be WARN.

I will change it to WARN.


Does TDX send #UD if these instructions have the wrong CPL?  

No, TDX does not trigger #UD for these instructions.
If the #VE came from user mode, we should send an appropriate signal instead.
It will be mapped into a #GP(0) fault. This should be enough notification, right?




Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since previous series:
* Suppressed MWAIT feature as per Andi's comment.
* Added warning debug log for MWAIT #VE exception.

arch/x86/kernel/tdx.c | 23 +++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..fb7d22b846fc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)

setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+/* MWAIT is not supported in TDX platform, so suppress it */
+setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
tdg_get_info();

pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+/*
+ * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+ * Domain Extensions (Intel TDX) specification, sec 2.4,
+ * some instructions that unconditionally cause #VE (such as WBINVD,
+ * MONITOR, MWAIT) do not have corresponding TDCALL
+ * [TDG.VP.VMCALL ] leaves, since the TD has been designed
+ * with no deterministic way to confirm the result of those operations
+ * performed by the host VMM.  In those cases, the goal is for the TD
+ * #VE handler to increment the RIP appropriately based on the VE
+ * information provided via TDCALL.
+ */
+case EXIT_REASON_WBINVD:
+pr_warn_once("WBINVD #VE Exception\n");
+case EXIT_REASON_MONITOR_INSTRUCTION:
+/* Handle as nops. */
+break;
+case EXIT_REASON_MWAIT_INSTRUCTION:
+/* MWAIT is supressed, not supposed to reach here. */
+pr_warn("MWAIT unexpected #VE Exception\n");
+return -EFAULT;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
--
2.25.1



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-26 Thread Kuppuswamy Sathyanarayanan
In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions as no ops.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..fb7d22b846fc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+   /* MWAIT is not supported in TDX platform, so suppress it */
+   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
tdg_get_info();
 
pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdg_handle_mmio(regs, ve);
break;
+   /*
+* Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+* Domain Extensions (Intel TDX) specification, sec 2.4,
+* some instructions that unconditionally cause #VE (such as WBINVD,
+* MONITOR, MWAIT) do not have corresponding TDCALL
+* [TDG.VP.VMCALL ] leaves, since the TD has been designed
+* with no deterministic way to confirm the result of those operations
+* performed by the host VMM.  In those cases, the goal is for the TD
+* #VE handler to increment the RIP appropriately based on the VE
+* information provided via TDCALL.
+*/
+   case EXIT_REASON_WBINVD:
+   pr_warn_once("WBINVD #VE Exception\n");
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   /* Handle as nops. */
+   break;
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   /* MWAIT is supressed, not supposed to reach here. */
+   pr_warn("MWAIT unexpected #VE Exception\n");
+   return -EFAULT;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



[PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions

2021-03-26 Thread Kuppuswamy Sathyanarayanan
Implement common helper functions to communicate with
the TDX Module and VMM (using TDCALL instruction).

__tdvmcall() function can be used to request services
from VMM.

__tdcall() function can be used to communicate with the
TDX Module.

Using common helper functions makes the code more readable
and less error prone compared to distributed and use case
specific inline assembly code. Only downside in using this
approach is, it adds a few extra instructions for every
TDCALL use case when compared to distributed checks. Although
it's a bit less efficient, it's worth it to make the code more
readable.

Originally-by: Sean Christopherson 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---

Hi All,

Please let me know your review comments. If you agree with this patch
and want to see the use of these APIs in the rest of the patches, I will
re-send the patch series with updated code.

Changes since v1:
 * Implemented tdvmcall and tdcall helper functions as assembly code.
 * Followed suggestion provided by Sean & Dave.

 arch/x86/include/asm/tdx.h|  23 +
 arch/x86/kernel/Makefile  |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 +
 arch/x86/kernel/tdcall.S  | 163 ++
 arch/x86/kernel/tdx.c |  30 +++
 5 files changed, 239 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..ce6212ce5f45 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,35 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include 
+#include 
+
+struct tdcall_output {
+   u64 rcx;
+   u64 rdx;
+   u64 r8;
+   u64 r9;
+   u64 r10;
+   u64 r11;
+};
+
+struct tdvmcall_output {
+   u64 r11;
+   u64 r12;
+   u64 r13;
+   u64 r14;
+   u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
+
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+  struct tdvmcall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)  += jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)  += tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)  += tdcall.o tdx.o
 
 obj-$(CONFIG_EISA) += eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)  += pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..72de0b49467e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include 
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include 
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+   BLANK();
+   /* Offset for fields in tdcall_output */
+   OFFSET(TDCALL_rcx, tdcall_output, rcx);
+   OFFSET(TDCALL_rdx, tdcall_output, rdx);
+   OFFSET(TDCALL_r8, tdcall_output, r8);
+   OFFSET(TDCALL_r9, tdcall_output, r9);
+   OFFSET(TDCALL_r10, tdcall_output, r10);
+   OFFSET(TDCALL_r11, tdcall_output, r11);
+
+   /* Offset for fields in tdvmcall_output */
+   OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+   OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+   OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+   OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+   OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
BLANK();
OFFSET(BP_scratch, boot_params, scratch);
OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index ..a73b67c0b407
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define TDVMCALL_EXPOSE_REGS_MASK  0xfc00
+
+/*
+ * TDCALL instruction is newly added in TDX architecture,
+ * used by TD for requesting the host VMM to provide
+ * (untrusted) services. Supported in Binutils >= 2.36
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/* Only for non TDVMCALL use cases */
+SYM_FUNC_START(__tdcall)
+   FRAME_BEGIN
+
+   /* Save/restore non-volatile GPRs that are exposed to the VMM. */
+   push %r15
+   push %r14
+   push %r13
+   push %r12
+
+   /*
+* RDI  => RAX = TDCALL leaf
+* RSI  => RCX = input param 1
+* RDX 

Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions

2021-03-19 Thread Kuppuswamy, Sathyanarayanan




On 3/19/21 11:22 AM, Dave Hansen wrote:

On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote:

@@ -4,6 +4,58 @@
   #include 
   #include 
   +void tdcall(u64 leafid, struct tdcall_regs *regs)
+{
+    asm volatile(
+    /* RAX = leafid (TDCALL LEAF ID) */
+    "  movq %0, %%rax;"
+    /* Move regs->r[*] data to regs r[a-c]x,  r8-r5 */
+    "  movq 8(%1), %%rcx;"


I am super duper opposed to using inline asm.  Large blocks are
hard to read,

I think this point is arguable. Based on the review comments I have
received so far, people prefer inline assembly to asm sub-functions.


It's arguable, but Sean makes a pretty compelling case.

I actually think inline assembly is a monstrosity.  It's insanely arcane
and, as I hope you have noted, does not scale nicely beyond one or two
instructions.


and even harder to maintain.  E.g. the %1 usage falls apart if an
output constraint is added; that can be avoided by defining a local
const/imm (I forget what they're called), but it doesn't help
readability.

we can use OFFSET() calls to improve the readability and avoid this
issue. Also, IMO, anyone adding constraints should know how this
would affect the asm code.


This is about *maintainability*.  How _easily_ can someone change this
code in the future?  Sean's arguing that it's *hard* to correctly add a
constraint.  Unfortunately, our supply of omnipotent kernel developers
is a bit short.


+    "  movq 16(%1), %%rdx;"
+    "  movq 24(%1), %%r8;"
+    "  movq 32(%1), %%r9;"
+    "  movq 40(%1), %%r10;"
+    "  movq 48(%1), %%r11;"
+    "  movq 56(%1), %%r12;"
+    "  movq 64(%1), %%r13;"
+    "  movq 72(%1), %%r14;"
+    "  movq 80(%1), %%r15;"


This is extremely unsafe, and wasteful.  Putting the onus on the
caller to zero out unused registers, with no mechanism to
enforce/encourage doing so,

For encouragement, we can add a comment to this function about the
caller's responsibility.

makes it likely that the kernel will leak information to the VMM, e.g. in
the form of stack data due to a partially initialized "regs".

Unless you create sub-functions for each use case, callers cannot
avoid this responsibility.


I don't think we're quite at the point where we throw up our hands.

It would be pretty simple to have an initializer that zeros the
registers out, or looks at the argument mask and does it more precisely.
  Surely we can do *something*.
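
A minimal sketch of that idea (illustrative only, not taken from any posted
patch; it assumes the struct tdcall_regs layout from the patch under review):

/*
 * Illustrative sketch: zero the whole register file before the caller
 * fills in only the registers it needs, so stale stack data is never
 * exposed to the VMM.  Relies on memset() from <linux/string.h>; a
 * mask-aware variant could instead clear only the registers the caller
 * does not explicitly pass.
 */
static inline void tdcall_regs_init(struct tdcall_regs *regs)
{
	memset(regs, 0, sizeof(*regs));
}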


 /* Offset for fields in tdvmcall_output */
 OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
 OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
 OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
 OFFSET(TDVMCALL_r15, tdvmcall_output, r15);

SYM_FUNC_START(__tdvmcall)
 FRAME_BEGIN

 /* Save/restore non-volatile GPRs that are exposed to the VMM. */
  push %r15
  push %r14
  push %r13
  push %r12


I might have some tweaks for the assembly once someone puts a real patch
together.  But, that looks a lot more sane than the inline assembly to me.

Ok. Let me use Sean's proposal and submit a tested version of this patch.

Also, any thoughts on whether you want to use a single function for
tdcall and tdvmcall?




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions

2021-03-19 Thread Kuppuswamy, Sathyanarayanan

Hi Sean,

Thanks for the review.

On 3/19/21 9:55 AM, Sean Christopherson wrote:

On Thu, Mar 18, 2021, Kuppuswamy Sathyanarayanan wrote:

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e44e55d1e519..7ae1d25e272b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,58 @@
  #include 
  #include 
  
+void tdcall(u64 leafid, struct tdcall_regs *regs)

+{
+   asm volatile(
+   /* RAX = leafid (TDCALL LEAF ID) */
+   "  movq %0, %%rax;"
+   /* Move regs->r[*] data to registers r[a-c]x, r8-r15 */
+   "  movq 8(%1), %%rcx;"


I am super duper opposed to using inline asm.  Large blocks are hard to read,

I think this point is arguable. Based on the review comments I received so far,
people prefer inline assembly over asm sub-functions.

and even harder to maintain.  E.g. the %1 usage falls apart if an output
constraint is added; that can be avoided by defining a local const/imm (I forget
what they're called), but it doesn't help readability.

we can use OFFSET() calls to improve the readability and avoid this issue. Also,
IMO, anyone adding constraints should know how this would affect the asm code.



+   "  movq 16(%1), %%rdx;"
+   "  movq 24(%1), %%r8;"
+   "  movq 32(%1), %%r9;"
+   "  movq 40(%1), %%r10;"
+   "  movq 48(%1), %%r11;"
+   "  movq 56(%1), %%r12;"
+   "  movq 64(%1), %%r13;"
+   "  movq 72(%1), %%r14;"
+   "  movq 80(%1), %%r15;"


This is extremely unsafe, and wasteful.  Putting the onus on the caller to zero
out unused registers, with no mechanism to enforce/encourage doing so,

For encouragement, we can add a comment to this function about the caller's
responsibility.

makes it likely that the kernel will leak information to the VMM, e.g. in the form of
stack data due to a partially initialized "regs".

Unless you create sub-functions for each use case, callers cannot avoid this
responsibility.


And although TDVMCALL is anything but speedy, requiring multiple memory
operations just to set a single register is unnecessary.  Not to mention
several of these registers are never used in the GHCI-defined TDVMCALLs. 

This function is common between TDCALL and TDVMCALL. The extra registers you
mentioned are related to other TDCALL use cases.

And, since the caller defines the mask (which I also dislike), it's possible/likely
that many of these memory operations are wasteful even for registers that are
used by _some_ TDVMCALLs.  Unnecessary accesses are inevitable if we want a
common helper, but this is too much.

Using a single function makes the code easier to maintain, more readable, and
less error prone. But I agree there are many unnecessary accesses for many users.



+   TDCALL ";"
+   /* Save TDCALL success/failure to regs->rax */
+   "  movq %%rax, (%1);"
+   /* Save rcx and rdx contents to regs->r[c-d]x */
+   "  movq %%rcx, 8(%1);"
+   "  movq %%rdx, 16(%1);"
+   /* Move content of registers R8-R15 regs->r[8-15] */
+   "  movq %%r8, 24(%1);"
+   "  movq %%r9, 32(%1);"
+   "  movq %%r10, 40(%1);"
+   "  movq %%r11, 48(%1);"
+   "  movq %%r12, 56(%1);"
+   "  movq %%r13, 64(%1);"
+   "  movq %%r14, 72(%1);"
+   "  movq %%r15, 80(%1);"
+
+   :
+   : "r" (leafid), "r" (regs)
+   : "memory", "rax", "rbx", "rcx", "rdx", "r8",
+ "r9", "r10", "r11", "r12", "r13", "r14", "r15"


All these clobbers mean even more memory operations...


+   );
+
+}
+
+void tdvmcall(u64 subid, struct tdcall_regs *regs)
+{
+   /* Expose GPRs R8-R15 to VMM */
+   regs->rcx = 0xff00;
+   /* R10 = 0 (standard TDVMCALL) */
+   regs->r10 = TDVMCALL_STANDARD;
+   /* Save subid to r11 register */
+   regs->r11 = subid;
+
+   tdcall(TDVMCALL, regs);


This implies the caller is responsible for _all_ error checking.  The base
TDCALL should never fail; if it does, something is horribly wrong with 
TDX-Module
and panicking is absolutely the best option.

I haven't added error checking to common function because some use cases like
MSR and IO access does n

[PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions

2021-03-18 Thread Kuppuswamy Sathyanarayanan
Implement common helper functions to communicate with
the TDX Module and VMM (using TDCALL instruction).

tdvmcall() function can be used to request services
from VMM.

tdcall() function can be used to communicate with the
TDX Module.

Using common helper functions makes the code more readable
and less error prone compared to distributed, use-case-specific
inline assembly code. The only downside of this approach is that
it adds a few extra instructions for every TDCALL use case when
compared to distributed TDCALL/TDVMCALL inline assembly calls.
Although it's a bit less efficient, it's worth it to make the code
more readable.

Signed-off-by: Kuppuswamy Sathyanarayanan 

---

Hi All,

As you have suggested, I have created common helper functions
for all tdcall() and tdvmcall() use cases. They use inline
assembly and pass GPRs R8-R15 and R[A-C]X to the TDX
module/VMM. Please take a look at them and let me know your
comments. If you agree with the design, I can re-submit the
patchset with changes related to using these new APIs. Please
let me know.
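
For context only, a hypothetical caller of the proposed API could look like
the sketch below (not part of this patch): an MSR read via the standard
TDVMCALL, with EXIT_REASON_MSR_READ taken from <asm/vmx.h> and the register
usage following the GHCI spec as used elsewhere in this series.

static u64 example_tdx_read_msr(unsigned int msr, int *err)
{
	/* Zero unused GPRs before exposing them to the VMM. */
	struct tdcall_regs regs = { 0 };

	regs.r12 = msr;				/* RDMSR input: MSR index */
	tdvmcall(EXIT_REASON_MSR_READ, &regs);

	/* regs.rax holds the TDCALL status, regs.r10 the TDVMCALL status. */
	*err = (regs.rax || regs.r10) ? -EIO : 0;

	return regs.r11;			/* MSR value is returned in R11 */
}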

 arch/x86/include/asm/tdx.h | 27 
 arch/x86/kernel/tdx.c  | 52 ++
 2 files changed, 79 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0b9d571b1f95..311252a90cfb 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,8 +3,27 @@
 #ifndef _ASM_X86_TDX_H
 #define _ASM_X86_TDX_H
 
+#include 
+
 #define TDX_CPUID_LEAF_ID  0x21
 
+#define TDVMCALL   0
+
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD  0
+
+/*
+ * TDCALL instruction is newly added in TDX architecture,
+ * used by TD for requesting the host VMM to provide
+ * (untrusted) services. Supported in Binutils >= 2.36
+ */
+#define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
+
+struct tdcall_regs {
+   u64 rax, rcx, rdx;
+   u64 r8, r9, r10, r11, r12, r13, r14, r15;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 /* Common API to check TDX support in decompression and common kernel code. */
@@ -12,6 +31,10 @@ bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+void tdcall(u64 leafid, struct tdcall_regs *regs);
+
+void tdvmcall(u64 subid, struct tdcall_regs *regs);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -21,6 +44,10 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline void tdcall(u64 leafid, struct tdcall_regs *regs) { };
+
+static inline void tdvmcall(u64 subid, struct tdcall_regs *regs) { };
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e44e55d1e519..7ae1d25e272b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,58 @@
 #include 
 #include 
 
+void tdcall(u64 leafid, struct tdcall_regs *regs)
+{
+   asm volatile(
+   /* RAX = leafid (TDCALL LEAF ID) */
+   "  movq %0, %%rax;"
+   /* Move regs->r[*] data to registers r[a-c]x, r8-r15 */
+   "  movq 8(%1), %%rcx;"
+   "  movq 16(%1), %%rdx;"
+   "  movq 24(%1), %%r8;"
+   "  movq 32(%1), %%r9;"
+   "  movq 40(%1), %%r10;"
+   "  movq 48(%1), %%r11;"
+   "  movq 56(%1), %%r12;"
+   "  movq 64(%1), %%r13;"
+   "  movq 72(%1), %%r14;"
+   "  movq 80(%1), %%r15;"
+   TDCALL ";"
+   /* Save TDCALL success/failure to regs->rax */
+   "  movq %%rax, (%1);"
+   /* Save rcx and rdx contents to regs->r[c-d]x */
+   "  movq %%rcx, 8(%1);"
+   "  movq %%rdx, 16(%1);"
+   /* Move content of registers R8-R15 regs->r[8-15] */
+   "  movq %%r8, 24(%1);"
+   "  movq %%r9, 32(%1);"
+   "  movq %%r10, 40(%1);"
+   "  movq %%r11, 48(%1);"
+   "  movq %%r12, 56(%1);"
+   "  movq %%r13, 64(%1);"
+   "  movq %%r14, 72(%1);"
+   "  movq %%r15, 80(%1);"
+
+   :
+   : "r" (leafid), "r" (regs)
+   : "memory", "rax", "rbx", "rcx", "rdx", "r8",
+ "r9", "r10", "r11", "r12", "r13", "r14", "r15"
+   );
+
+}
+
+void tdvmcall(u64 subid, struct tdcall_regs *regs)
+{
+   /* Expose GPRs R8-R15 to VMM */
+   regs->rcx = 0xff00;
+   /* R10 = 0 (standard TDVMCALL) */
+   regs->r10 = TDVMCALL_STANDARD;
+   /* Save subid to r11 register */
+   regs->r11 = subid;
+
+   tdcall(TDVMCALL, regs);
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
u32 eax, signature[3];
-- 
2.25.1



Re: [PATCH v13 0/5] Simplify PCIe native ownership

2021-03-18 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

On 1/22/21 5:11 PM, Kuppuswamy Sathyanarayanan wrote:

Currently, PCIe capabilities ownership status is detected by
verifying the status of pcie_ports_native and the _OSC negotiated
results (cached in struct pci_host_bridge->native_* members).
But this logic can be simplified, and we can use only the struct
pci_host_bridge->native_* members to detect it.

This patchset removes the distributed checks for the
pcie_ports_native parameter.
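
To illustrate the kind of simplification this enables (hypothetical sketch,
not lifted from the patches; host_owns_aer() is a made-up name):

/*
 * Before this series a caller has to combine the pcie_ports_native boot
 * parameter with the cached _OSC result.  After it, the override is folded
 * into the cached bridge->native_* flags in one central place, so a single
 * flag is authoritative.
 */
static bool host_owns_aer(struct pci_host_bridge *bridge)
{
	/* before: return pcie_ports_native || bridge->native_aer; */
	return bridge->native_aer;
}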

Any comments on this patch set?


Changes since v12:
  * Rebased on top of v5.11-rc1

Changes since v11 (Bjorns update):
  * Add bugfix for DPC with no AER Capability
  * Split OSC_OWNER trivial changes from pcie_ports_native changes
  * Temporarily drop pcie_ports_dpc_native changes (revisit it later).

Changes since v10:
  * Addressed format issue reported by lkp test.

Changes since v9:
  * Rebased on top of v5.10-rc1

Changes since v8:
  * Simplified setting _OSC ownership logic
  * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS.

Changes since v7:
  * Fixed "fix array_size.cocci warnings".

Changes since v6:
  * Created new patch for CONFIG_PCIEPORTBUS check in
pci_init_host_bridge().
  * Added warning message for a case when pcie_ports_native
overrides _OSC negotiation result.

Changes since v5:
  * Rebased on top of v5.8-rc1

Changes since v4:
  * Changed the patch set title (Original link: 
https://lkml.org/lkml/2020/5/26/1710)
  * Added AER/DPC dependency logic cleanup fixes.

Bjorn Helgaas (2):
   PCI/DPC: Ignore devices with no AER Capability
   PCI/ACPI: Centralize pci_aer_available() checking

Kuppuswamy Sathyanarayanan (3):
   PCI: Assume control of portdrv-related features only when portdrv
 enabled
   PCI/ACPI: Tidy _OSC control bit checking
   PCI/ACPI: Centralize pcie_ports_native checking

  drivers/acpi/pci_root.c   | 49 ---
  drivers/pci/hotplug/pciehp_core.c |  2 +-
  drivers/pci/pci-acpi.c|  3 --
  drivers/pci/pcie/aer.c|  2 +-
  drivers/pci/pcie/dpc.c|  3 ++
  drivers/pci/pcie/portdrv_core.c   | 11 +++
  drivers/pci/probe.c   |  6 ++--
  7 files changed, 51 insertions(+), 25 deletions(-)



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-17 Thread Kuppuswamy, Sathyanarayanan




On 3/17/21 12:01 PM, Lukas Wunner wrote:

On Wed, Mar 17, 2021 at 10:54:09AM -0700, Sathyanarayanan Kuppuswamy Natarajan 
wrote:

Flush of hotplug event after successful recovery, and a simulated
hotplug link down event after link recovery fails should solve the
problems raised by Lukas. I assume Lukas' proposal adds this support.
I will check his patch shortly.


Thank you!

I'd like to get a better understanding of the issues around hotplug/DPC,
specifically I'm wondering:

If DPC recovery was successful, what is the desired behavior by pciehp,
should it ignore the Link Down/Up or bring the slot down and back up
after DPC recovery?

If the events are ignored, the driver of the device in the hotplug slot
is not unbound and rebound.  So the driver must be able to cope with
loss of TLPs during DPC recovery and it must be able to cope with
whatever state the endpoint device is in after DPC recovery.
Is this really safe?  How does the nvme driver deal with it?

During DPC recovery, in the pcie_do_recovery() function, we use
report_frozen_detected() to notify all devices attached to the port
about the fatal error. After this notification, we expect all
affected devices to halt their I/O transactions.

Regarding state restoration, after successful recovery, we use
report_slot_reset() to notify about the slot/link reset. So device
drivers are expected to restore the device to a working state after this
notification.
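
As a rough sketch of what that contract looks like from an endpoint driver's
side (illustrative only; my_halt_io() and my_restore_device() are made-up
helpers): report_frozen_detected() ends up calling ->error_detected() with
pci_channel_io_frozen, and report_slot_reset() calls ->slot_reset() after the
link reset.

static pci_ers_result_t my_error_detected(struct pci_dev *pdev,
					  pci_channel_state_t state)
{
	my_halt_io(pdev);			/* stop issuing new MMIO/DMA */
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	return PCI_ERS_RESULT_NEED_RESET;	/* request a slot/link reset */
}

static pci_ers_result_t my_slot_reset(struct pci_dev *pdev)
{
	my_restore_device(pdev);		/* reprogram device state */
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers my_err_handler = {
	.error_detected	= my_error_detected,
	.slot_reset	= my_slot_reset,
};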


Also, if DPC is handled by firmware, your patch does not ignore the
Link Down/Up events, 

Only for the pure firmware model. For the EDR case, we still ignore the Link
Down/Up events.

so pciehp brings down the slot when DPC is
triggered, then brings it up after successful recovery.  In a code
comment, you write that this behavior is okay because there's "no
race between hotplug and DPC recovery".

My point is, there is no race in the OS handlers (pciehp_ist() vs
pcie_do_recovery()).

However, Sinan wrote in
2018 that one of the issues with hotplug versus DPC is that pciehp
may turn off slot power and thereby foil DPC recovery.  (Power off =
cold reset, whereas DPC recovery = warm reset.)  This can occur
as well if DPC is handled by firmware.

I am not sure how pure firmware DPC recovery works. Is there a platform
which uses this combination? For the firmware DPC model, the spec does not
clarify the following points:

1. Who will notify the affected device drivers to halt the I/O transactions.
2. Who is responsible for restoring the state of the device after the link reset.

IMO, pure firmware DPC does not support seamless recovery. I think after it
clears the DPC trigger status, it might expect the hotplug handler to be
responsible for device recovery.

I don't want to add a fix to a code path that I don't understand. This is the
reason for not extending this logic to the pure firmware DPC case.



So I guess pciehp should make an attempt to await DPC recovery even
if it's handled by firmware?  Or am I missing something?  We may be
able to achieve that by polling the DPC Trigger Status bit and
DLLLA bit, but it won't work as perfectly as with native DPC support.
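
For reference, polling those two bits would look roughly like the sketch
below (illustrative only, using the standard config accessors; pdev is the
Downstream Port and the timeout value is arbitrary):

static bool dpc_done_and_link_up(struct pci_dev *pdev)
{
	int timeout = 100;
	u16 status;

	/* Wait for (firmware) DPC to clear the Trigger Status bit. */
	do {
		pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
				     &status);
		if (!(status & PCI_EXP_DPC_STATUS_TRIGGER))
			break;
		msleep(10);
	} while (--timeout);

	/* Then check whether the link has retrained. */
	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &status);
	return !!(status & PCI_EXP_LNKSTA_DLLLA);
}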

Finally, you write in your commit message that there are "a lot of
stability issues" if pciehp and DPC are allowed to recover freely
without proper serialization.  What are these issues exactly?

In most cases, I see failure of the DPC recovery handler (you can see the
example dmesg in the commit log). Other than this, we also noticed some
extended delay or failure in link retraining (while waiting for LINK UP
in pcie_wait_for_link(pdev, true)).
In some cases, we noticed slot power-on failures, and the card would not
be detected when running lspci.

The above mentioned cases are just observations we have made. We did not dig
deeper into why the race between pciehp and DPC leads to such issues.

(Beyond the slot power issue mentioned above, and that the endpoint
device's driver should presumably not be unbound if DPC recovery
was successful.)

Thanks!

Lukas



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-12 Thread Kuppuswamy, Sathyanarayanan




On 3/12/21 7:32 PM, sathyanarayanan.kuppusw...@linux.intel.com wrote:

From: Kuppuswamy Sathyanarayanan 

When hotplug and DPC are both enabled on a Root Port or
Downstream Port, during DPC events that cause DLLSC link
down/up events, such events (DLLSC) must be suppressed to
let the DPC driver own the recovery path.

When DPC is present and enabled, hardware will put the port in a
containment state to allow SW to recover from the error condition
in a seamless manner. But, during the DPC error recovery process,
since the link is in a disabled state, it will also raise the DLLSC
event. In the Linux kernel architecture, DPC events are handled by the
DPC driver and DLLSC events are handled by the hotplug driver. If the
hotplug driver is allowed to handle such a DLLSC event (triggered by DPC
containment), then we will have a race condition between the error
recovery handler (in the DPC driver) and the hotplug handler in recovering
the contained port. Allowing such a race leads to a lot of stability
issues while recovering the device. So skip DLLSC handling in the
hotplug driver when the PCIe port associated with the hotplug event is
in the DPC triggered state, and let the DPC driver be responsible for the
port recovery.

Following is the sample dmesg log which shows the contention
between hotplug handler and error recovery handler. In this
case, hotplug handler won the race and error recovery
handler reported failure.

pcieport :97:02.0: pciehp: Slot(4): Link Down
pcieport :97:02.0: DPC: containment event, status:0x1f01 source:0x
pcieport :97:02.0: DPC: unmasked uncorrectable error detected
pcieport :97:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), 
type=Transaction Layer, (Requester ID)
pcieport :97:02.0:   device [8086:347a] error status/mask=4000/00100020
pcieport :97:02.0:[14] CmpltTO(First)
pci :98:00.0: AER: can't recover (no error_detected callback)
pcieport :97:02.0: pciehp: Slot(4): Card present
pcieport :97:02.0: DPC: Data Link Layer Link Active not set in 1000 msec
pcieport :97:02.0: AER: subordinate device reset failed
pcieport :97:02.0: AER: device recovery failed
pci :98:00.0: [8086:0953] type 00 class 0x010802
nvme nvme1: pci function :98:00.0
nvme :98:00.0: enabling device (0140 -> 0142)
nvme nvme1: 31/0/0 default/read/poll queues
  nvme1n2: p1

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Dan Williams 
Reviewed-by: Raj Ashok 
---

I missed adding the change log; I will include it in the next version.

Changes since v1:
 * Trimmed down the kernel log in commit history.
 * Removed usage of !! in is_dpc_reset_active().
 * Addressed other minor comments from Bjorn.


  drivers/pci/hotplug/pciehp_hpc.c | 19 +
  drivers/pci/pci.h|  2 ++
  drivers/pci/pcie/dpc.c   | 36 ++--
  include/linux/pci.h  |  1 +
  4 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index fb3840e222ad..55da5208c7e5 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -691,6 +691,25 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
goto out;
}
  
+	/*

+* If the DLLSC link up/down event is generated due to DPC containment
+* in the PCIe port, skip the DLLSC event handling and let the DPC
+* driver own the port recovery. Allowing both hotplug DLLSC event
+* handler and DPC event trigger handler to attempt recovery on the
+* same port leads to stability issues. If DPC recovery is successful,
+* is_dpc_reset_active() will return false and the hotplug handler will
+* not suppress the DLLSC event. If DPC recovery fails and the link is
+* left in disabled state, once the user changes the faulty card, the
+* hotplug handler can still handle the PRESENCE change event and bring
+* the device back up.
+*/
+   if ((events == PCI_EXP_SLTSTA_DLLSC) && is_dpc_reset_active(pdev)) {
+   ctrl_info(ctrl, "Slot(%s): DLLSC event(DPC), skipped\n",
+ slot_name(ctrl));
+   ret = IRQ_HANDLED;
+   goto out;
+   }
+
/* Check Attention Button Pressed */
if (events & PCI_EXP_SLTSTA_ABP) {
ctrl_info(ctrl, "Slot(%s): Attention button pressed\n",
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index ef7c4661314f..cee7095483bd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -446,10 +446,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
  void pci_dpc_init(struct pci_dev *pdev);
  void dpc_process_error(struct pci_dev *pdev);
  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool is_dpc_reset_active(struct pci_dev *pdev);
  #else
  static inline void pci_save_dpc_state(struct pci_dev *dev) {}
  static inline void pci_re

Re: [PATCH v1 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-12 Thread Kuppuswamy, Sathyanarayanan




On 3/12/21 3:14 PM, Bjorn Helgaas wrote:

On Fri, Mar 12, 2021 at 02:11:03PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 3/12/21 1:33 PM, Bjorn Helgaas wrote:

On Mon, Mar 08, 2021 at 10:34:10PM -0800, 
sathyanarayanan.kuppusw...@linux.intel.com wrote:

From: Kuppuswamy Sathyanarayanan 



+bool is_dpc_reset_active(struct pci_dev *dev)
+{
+   struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+   u16 status;
+
+   if (!dev->dpc_cap)
+   return false;
+
+   /*
+* If DPC is owned by firmware and EDR is not supported, there is
+* no race between hotplug and DPC recovery handler. So return
+* false.
+*/
+   if (!host->native_dpc && !IS_ENABLED(CONFIG_PCIE_EDR))
+   return false;
+
+   if (atomic_read_acquire(&dev->dpc_reset_active))
+   return true;
+
+   pci_read_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
+
+   return !!(status & PCI_EXP_DPC_STATUS_TRIGGER);


I know it's somewhat common in drivers/pci/, but I'm not really a
big fan of "!!".

I can change it to use ternary operator.
(status & PCI_EXP_DPC_STATUS_TRIGGER) ? true : false;


Ternary isn't terrible, but what's wrong with:

   if (status & PCI_EXP_DPC_STATUS_TRIGGER)
 return true;
   return false;

I am fine with above format.


which matches the style of the rest of the function.

Looking at this again, we return "true" if either dpc_reset_active or
PCI_EXP_DPC_STATUS_TRIGGER.  I haven't worked this all out, but that
pattern feels racy.  I guess the thought is that if
PCI_EXP_DPC_STATUS_TRIGGER is set, dpc_reset_link() will be invoked
soon and we don't want to interfere?

Yes, the reason for checking dpc_reset_active before
PCI_EXP_DPC_STATUS_TRIGGER is that we want to suppress DLLSC events
until the link comes back up or the wait times out.

137 atomic_inc_return_acquire(&pdev->dpc_reset_active);
138
139 pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
140   PCI_EXP_DPC_STATUS_TRIGGER);
141
142 if (!pcie_wait_for_link(pdev, true)) {
143 pci_info(pdev, "Data Link Layer Link Active not set in 1000 
msec\n");
144 status = PCI_ERS_RESULT_DISCONNECT;
145 }
146
147 atomic_dec_return_release(&pdev->dpc_reset_active);





--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v1 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-12 Thread Kuppuswamy, Sathyanarayanan




On 3/12/21 1:33 PM, Bjorn Helgaas wrote:

[+cc Lukas, pciehp expert]

On Mon, Mar 08, 2021 at 10:34:10PM -0800, 
sathyanarayanan.kuppusw...@linux.intel.com wrote:

From: Kuppuswamy Sathyanarayanan 

When hotplug and DPC are both enabled on a Root port or
Downstream Port, during DPC events that cause a DLLSC link
down/up events, such events must be suppressed to let the DPC
driver own the recovery path.


I first thought you were saying "during DPC events, DPC events must be
suppressed" which would be nonsensical.  But I guess this is saying
the "*link down/up* events must be suppressed"?

Correct. Our intention is to suppress DLLSC event during the DPC event.



When DPC is present and enabled, hardware will put the port in
containment state to allow SW to recover from the error condition
in the seamless manner. But, during the DPC error recovery process,
since the link is in disabled state, it will also raise the DLLSC
event. In Linux kernel architecture, DPC events are handled by DPC
driver and DLLSC events are handled by hotplug driver. If a hotplug
driver is allowed to handle such DLLSC event (triggered by DPC
containment), then we will have a race condition between error
recovery handler (in DPC driver) and hotplug handler in recovering
the contained port. Allowing such a race leads to a lot of stability
issues while recovering the  device. So skip DLLSC handling in the
hotplug driver when the PCIe port associated with the hotplug event is
in DPC triggered state and let the DPC driver be responsible for the
port recovery.

Following is the sample dmesg log which shows the contention
between hotplug handler and error recovery handler. In this
case, hotplug handler won the race and error recovery
handler reported failure.

[  724.974237] pcieport :97:02.0: pciehp: Slot(4): Link Down
[  724.974266] pcieport :97:02.0: DPC: containment event, status:0x1f01 
source:0x
[  724.974269] pcieport :97:02.0: DPC: unmasked uncorrectable error detected
[  724.974275] pcieport :97:02.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Requester ID)
[  724.974283] pcieport :97:02.0:   device [8086:347a] error 
status/mask=4000/00100020
[  724.974288] pcieport :97:02.0:[14] CmpltTO(First)
[  724.999181] pci :98:00.0: AER: can't recover (no error_detected callback)
[  724.999227] pci :98:00.0: Removing from iommu group 181
[  726.063125] pcieport :97:02.0: pciehp: Slot(4): Card present
[  726.221117] pcieport :97:02.0: DPC: Data Link Layer Link Active not set 
in 1000 msec
[  726.221122] pcieport :97:02.0: AER: subordinate device reset failed
[  726.221162] pcieport :97:02.0: AER: device recovery failed
[  727.227176] pci :98:00.0: [8086:0953] type 00 class 0x010802
[  727.227202] pci :98:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
[  727.227234] pci :98:00.0: reg 0x30: [mem 0x-0x pref]
[  727.227246] pci :98:00.0: Max Payload Size set to 256 (was 128, max 256)
[  727.227251] pci :98:00.0: enabling Extended Tags
[  727.227736] pci :98:00.0: Adding to iommu group 181
[  727.231150] pci :98:00.0: BAR 6: assigned [mem 0xd100-0xd100 
pref]
[  727.231156] pci :98:00.0: BAR 0: assigned [mem 0xd101-0xd1013fff 
64bit]
[  727.231170] pcieport :97:02.0: PCI bridge to [bus 98]
[  727.231174] pcieport :97:02.0:   bridge window [io  0xc000-0xcfff]
[  727.231181] pcieport :97:02.0:   bridge window [mem 
0xd100-0xd10f]
[  727.231186] pcieport :97:02.0:   bridge window [mem 
0x2060-0x2060001f 64bit pref]
[  727.231555] nvme nvme1: pci function :98:00.0
[  727.231581] nvme :98:00.0: enabling device (0140 -> 0142)
[  737.141132] nvme nvme1: 31/0/0 default/read/poll queues
[  737.146211]  nvme1n2: p1


Quite a bit of the above really isn't relevant to the problem, so
stripping it out would reduce distraction.  E.g.,

   Removing from iommu group
   reg ...
   Max Payload Size set
   enabling Extended Tags
   Adding to iommu group
   BAR X: assigned ...
   PCI bridge to [bus 98]
   bridge window ...

Probably the timestamps are also only of incidental interest and could
be removed?

Agree. will fix this in next version.



Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Dan Williams 
Reviewed-by: Raj Ashok 
---
  drivers/pci/hotplug/pciehp_hpc.c | 18 +
  drivers/pci/pci.h|  2 ++
  drivers/pci/pcie/dpc.c   | 33 ++--
  include/linux/pci.h  |  1 +
  4 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index fb3840e222ad..8e7916abc60e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -691,6 +691,24 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
goto out;
}
  
+	/*

+

Re: [PATCH 1/2] PCI: controller: thunder: fix compile testing

2021-02-25 Thread Kuppuswamy, Sathyanarayanan




On 2/25/21 10:51 AM, Bjorn Helgaas wrote:

Thanks for looking this over!  I'd like to acknowledge your review,
but I need an explicit Reviewed-by or similar.  I don't want to put
words in your mouth by converting "Looks good to me" to "Reviewed-by".


will do so in future reviews.

Reviewed-by: Kuppuswamy Sathyanarayanan 


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 1/2] PCI: controller: thunder: fix compile testing

2021-02-25 Thread Kuppuswamy, Sathyanarayanan




On 2/25/21 6:37 AM, Arnd Bergmann wrote:

From: Arnd Bergmann 

Compile-testing these drivers is currently broken. Enabling
it causes a couple of build failures though:

drivers/pci/controller/pci-thunder-ecam.c:119:30: error: shift count >= width 
of type [-Werror,-Wshift-count-overflow]
drivers/pci/controller/pci-thunder-pem.c:54:2: error: implicit declaration of 
function 'writeq' [-Werror,-Wimplicit-function-declaration]
drivers/pci/controller/pci-thunder-pem.c:392:8: error: implicit declaration of 
function 'acpi_get_rc_resources' [-Werror,-Wimplicit-function-declaration]

Fix them with the obvious one-line changes.

Looks good to me.


Signed-off-by: Arnd Bergmann 
---
  drivers/pci/controller/pci-thunder-ecam.c |  2 +-
  drivers/pci/controller/pci-thunder-pem.c  | 13 +++--
  drivers/pci/pci.h |  6 ++
  3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/controller/pci-thunder-ecam.c 
b/drivers/pci/controller/pci-thunder-ecam.c
index f964fd26f7e0..ffd84656544f 100644
--- a/drivers/pci/controller/pci-thunder-ecam.c
+++ b/drivers/pci/controller/pci-thunder-ecam.c
@@ -116,7 +116,7 @@ static int thunder_ecam_p2_config_read(struct pci_bus *bus, 
unsigned int devfn,
 * the config space access window.  Since we are working with
 * the high-order 32 bits, shift everything down by 32 bits.
 */
-   node_bits = (cfg->res.start >> 32) & (1 << 12);
+   node_bits = upper_32_bits(cfg->res.start) & (1 << 12);
  
  	v |= node_bits;

set_val(v, where, size, val);
diff --git a/drivers/pci/controller/pci-thunder-pem.c 
b/drivers/pci/controller/pci-thunder-pem.c
index 1a3f70ac61fc..0660b9da204f 100644
--- a/drivers/pci/controller/pci-thunder-pem.c
+++ b/drivers/pci/controller/pci-thunder-pem.c
@@ -12,6 +12,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "../pci.h"
  
  #if defined(CONFIG_PCI_HOST_THUNDER_PEM) || (defined(CONFIG_ACPI) && defined(CONFIG_PCI_QUIRKS))

@@ -324,9 +325,9 @@ static int thunder_pem_init(struct device *dev, struct 
pci_config_window *cfg,
 * structure here for the BAR.
 */
bar4_start = res_pem->start + 0xf0;
-   pem_pci->ea_entry[0] = (u32)bar4_start | 2;
-   pem_pci->ea_entry[1] = (u32)(res_pem->end - bar4_start) & ~3u;
-   pem_pci->ea_entry[2] = (u32)(bar4_start >> 32);
+   pem_pci->ea_entry[0] = lower_32_bits(bar4_start) | 2;
+   pem_pci->ea_entry[1] = lower_32_bits(res_pem->end - bar4_start) & ~3u;
+   pem_pci->ea_entry[2] = upper_32_bits(bar4_start);
  
  	cfg->priv = pem_pci;

return 0;
@@ -334,9 +335,9 @@ static int thunder_pem_init(struct device *dev, struct 
pci_config_window *cfg,
  
  #if defined(CONFIG_ACPI) && defined(CONFIG_PCI_QUIRKS)
  
-#define PEM_RES_BASE		0x87e0c000UL

-#define PEM_NODE_MASK  GENMASK(45, 44)
-#define PEM_INDX_MASK  GENMASK(26, 24)
+#define PEM_RES_BASE   0x87e0c000ULL
+#define PEM_NODE_MASK  GENMASK_ULL(45, 44)
+#define PEM_INDX_MASK  GENMASK_ULL(26, 24)
  #define PEM_MIN_DOM_IN_NODE   4
  #define PEM_MAX_DOM_IN_NODE   10
  
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h

index 0a2b6d993fe1..022c2f433676 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -625,6 +625,12 @@ static inline int pci_dev_specific_reset(struct pci_dev 
*dev, int probe)
  #if defined(CONFIG_PCI_QUIRKS) && defined(CONFIG_ARM64)
  int acpi_get_rc_resources(struct device *dev, const char *hid, u16 segment,
  struct resource *res);
+#else
+static inline int acpi_get_rc_resources(struct device *dev, const char *hid,
+   u16 segment, struct resource *res)
+{
+   return -ENODEV;
+}
  #endif
  
  int pci_rebar_get_current_size(struct pci_dev *pdev, int bar);




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO

2021-02-08 Thread Kuppuswamy, Sathyanarayanan




On 2/8/21 2:00 AM, Peter Zijlstra wrote:

This needs a binutils version number.

Yes, we will add it in next version.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Kuppuswamy, Sathyanarayanan




On 2/8/21 8:59 AM, Peter Zijlstra wrote:

'cute', might be useful to have that mentioned somewhere.

we will add a note for it in comments.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: Sean Christopherson 

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang 
Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/realmode.h  |  1 +
 arch/x86/kernel/smpboot.c|  5 +++
 arch/x86/realmode/rm/header.S|  1 +
 arch/x86/realmode/rm/trampoline_64.S | 49 +++-
 arch/x86/realmode/rm/trampoline_common.S |  5 ++-
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
u32 sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+   u32 trampoline_start64;
u32 trampoline_pgd;
 #endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8ca66af96a54..11dd0deb4810 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1035,6 +1035,11 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle,
unsigned long boot_error = 0;
unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+   if (is_tdx_guest())
+   start_ip = real_mode_header->trampoline_start64;
+#endif
+
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
initial_code = (unsigned long)start_secondary;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
.long   pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+   .long   pa_trampoline_start64
.long   pa_trampoline_pgd;
 #endif
/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S 
b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
movl%eax, %cr3
 
# Set up EFER
+   movl$MSR_EFER, %ecx
+   rdmsr
+   cmp pa_tr_efer, %eax
+   jne .Lwrite_efer
+   cmp pa_tr_efer + 4, %edx
+   je  .Ldone_efer
+.Lwrite_efer:
movlpa_tr_efer, %eax
movlpa_tr_efer + 4, %edx
-   movl$MSR_EFER, %ecx
wrmsr
 
+.Ldone_efer:
# Enable paging and in turn activate Long Mode
-   movl$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+   movl$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
movl%eax, %cr0
 
/*
@@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
ljmpl   $__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+   /*
+* In compatibility mode.  Prep ESP and DX for startup_32, then disable
+* paging and complete the switch to legacy 32-bit mode.
+*/
+   movl$rm_stack_end, %esp
+   movw$__KERNEL_DS, %dx
+
+   movl$(X86_CR0_NE | X86_CR0_PE), %eax
+   movl%eax, %cr0
+   ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
.section ".text64","ax"
.code64
.balign 4
@@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+   /*
+* APs start here on a direct transfer from 64-bit BIOS with identity
+* mapped page tables.  Load the kernel's GDT in order to gear down to
+* 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+* segment registers.  Load the zero IDT so any fault triggers a
+* shutdown instead of jumping back into BIOS.
+*/
+   lidt	tr_idt(%rip)
+   lgdt	tr_gdt64(%rip)
+
+   ljmpl   *tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
.section ".rodata","a"
# Duplicate the global descriptor table
# so the kernel can live anywhere
@@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
.quad   0x00cf9300  # __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+   .short  tr_gdt_end - tr_gdt - 1 # gdt limit
+   .long   pa_tr_gdt
+   .long   0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+   .long   pa_t

[RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support

2021-02-05 Thread Kuppuswamy Sathyanarayanan
As per the Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 4.1, a new sub-structure -
the multiprocessor wake-up structure - is added to the
ACPI Multiple APIC Description Table (MADT) to describe the
mailbox information. If the platform firmware produces the
multiprocessor wake-up structure, then the BSP in the OS may use this
new mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake table parsing support, and if the MADT wake table is
present, update apic->wakeup_secondary_cpu with a new API which
uses the MADT wake mailbox to wake up CPUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---
 arch/x86/include/asm/apic.h |  3 ++
 arch/x86/kernel/acpi/boot.c | 56 +
 arch/x86/kernel/apic/probe_32.c |  8 +
 arch/x86/kernel/apic/probe_64.c |  8 +
 drivers/acpi/tables.c   |  9 ++
 include/acpi/actbl2.h   | 21 -
 6 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 34cb3c159481..63f970c61cbe 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -497,6 +497,9 @@ static inline unsigned int read_apic_id(void)
return apic->get_apic_id(reg);
 }
 
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 7bdc0239a943..37ada1908fb7 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata;
 static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 #endif
 
+static struct acpi_madt_mp_wake_mailbox *acpi_mp_wake_mailbox;
+static u64 acpi_mp_wake_mailbox_paddr;
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -329,6 +332,29 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, 
const unsigned long e
return 0;
 }
 
+static void acpi_mp_wake_mailbox_init(void)
+{
+   if (acpi_mp_wake_mailbox)
+   return;
+
+   acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+   sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB);
+}
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+   acpi_mp_wake_mailbox_init();
+
+   if (!acpi_mp_wake_mailbox)
+   return -EINVAL;
+
+   WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid);
+   WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+   WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+   return 0;
+}
+
 #endif /*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1086,6 +1112,30 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
 }
+
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+   struct acpi_madt_mp_wake *mp_wake = NULL;
+
+   if (!IS_ENABLED(CONFIG_SMP))
+   return -ENODEV;
+
+   mp_wake = (struct acpi_madt_mp_wake *)header;
+   if (BAD_MADT_ENTRY(mp_wake, end))
+   return -EINVAL;
+
+   if (acpi_mp_wake_mailbox)
+   return -EINVAL;
+
+   acpi_table_print_madt_entry(&mp_wake->common);
+
+   acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
+
+   acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+   return 0;
+}
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1284,6 +1334,12 @@ static void __init acpi_process_madt(void)
 
smp_found_config = 1;
}
+
+   /*
+* Parse MADT MP Wake entry.
+*/
+   acpi_table_parse_madt(ACPI_MADT_TYPE_MP_WAKE,
+ acpi_parse_mp_wake, 1);
}
if (error == -EINVAL) {
/*
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index a61f642b1b90..d450014841b2 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -207,3 +207,11 @@ int __init default_acpi_madt_oem_check(char *oem_id, char 
*oem_table_id)
}
return 0;
 }
+
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+   struct apic **drv;
+
+   for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+   (*drv)->wakeup_secondary_cpu = handler;
+}
diff --git a/arch/x86/kernel/apic/probe_64.c b/arch/x86/kernel/apic/pr

[RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-02-05 Thread Kuppuswamy Sathyanarayanan
In non-root TDX guest mode, the MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions as no-ops.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---
 arch/x86/kernel/tdx.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index eff58329751e..8d1d7555fb56 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -451,6 +451,23 @@ int tdx_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_EPT_VIOLATION:
ve->instr_len = tdx_handle_mmio(regs, ve);
break;
+   /*
+* Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+* Domain Extensions (Intel TDX) specification, sec 2.4,
+* some instructions that unconditionally cause #VE (such as WBINVD,
+* MONITOR, MWAIT) do not have corresponding TDCALL
+* [TDG.VP.VMCALL ] leaves, since the TD has been designed
+* with no deterministic way to confirm the result of those operations
+* performed by the host VMM.  In those cases, the goal is for the TD
+* #VE handler to increment the RIP appropriately based on the VE
+* information provided via TDCALL.
+*/
+   case EXIT_REASON_WBINVD:
+   pr_warn_once("WBINVD #VE Exception\n");
+   case EXIT_REASON_MWAIT_INSTRUCTION:
+   case EXIT_REASON_MONITOR_INSTRUCTION:
+   /* Handle as nops. */
+   break;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



[RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Handle #VE due to MMIO operations. MMIO triggers #VE with the EPT_VIOLATION
exit reason.

For now we only handle the subset of instructions that the kernel uses for MMIO
operations. User-space access triggers SIGBUS.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/tdx.c | 120 ++
 1 file changed, 120 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3846d2807a7a..eff58329751e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,8 @@
 #include 
 #include 
 #include 
+#include 
+#include  /* force_sig_fault() */
 
 #ifdef CONFIG_KVM_GUEST
 #include "tdx-kvm.c"
@@ -270,6 +272,121 @@ static void tdx_handle_io(struct pt_regs *regs, u32 
exit_qual)
}
 }
 
+static unsigned long tdx_mmio(int size, bool write, unsigned long addr,
+   unsigned long val)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_EPT_VIOLATION;
+   register long r12 asm("r12") = size;
+   register long r13 asm("r13") = write;
+   register long r14 asm("r14") = addr;
+   register long r15 asm("r15") = val;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+ "=r"(r14), "=r"(r15)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+ "r"(r13), "r"(r14), "r"(r15)
+   : );
+
+   WARN_ON(ret || r10);
+
+   return r11;
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+   static const int regoff[] = {
+   offsetof(struct pt_regs, ax),
+   offsetof(struct pt_regs, cx),
+   offsetof(struct pt_regs, dx),
+   offsetof(struct pt_regs, bx),
+   offsetof(struct pt_regs, sp),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+   offsetof(struct pt_regs, r8),
+   offsetof(struct pt_regs, r9),
+   offsetof(struct pt_regs, r10),
+   offsetof(struct pt_regs, r11),
+   offsetof(struct pt_regs, r12),
+   offsetof(struct pt_regs, r13),
+   offsetof(struct pt_regs, r14),
+   offsetof(struct pt_regs, r15),
+   };
+   int regno;
+
+   regno = X86_MODRM_REG(insn->modrm.value);
+   if (X86_REX_R(insn->rex_prefix.value))
+   regno += 8;
+
+   return (void *)regs + regoff[regno];
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+   int size;
+   bool write;
+   unsigned long *reg;
+   struct insn insn;
+   unsigned long val = 0;
+
+   /*
+* User mode would mean the kernel exposed a device directly
+* to ring3, which shouldn't happen except for things like
+* DPDK.
+*/
+   if (user_mode(regs)) {
+   pr_err("Unexpected user-mode MMIO access.\n");
+   force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+   return 0;
+   }
+
+   kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+   insn_get_length(&insn);
+   insn_get_opcode(&insn);
+
+   write = ve->exit_qual & 0x2;
+
+   size = insn.opnd_bytes;
+   switch (insn.opcode.bytes[0]) {
+   /* MOV r/m8 r8  */
+   case 0x88:
+   /* MOV r8   r/m8*/
+   case 0x8A:
+   /* MOV r/m8 imm8*/
+   case 0xC6:
+   size = 1;
+   break;
+   }
+
+   if (inat_has_immediate(insn.attr)) {
+   BUG_ON(!write);
+   val = insn.immediate.value;
+   tdx_mmio(size, write, ve->gpa, val);
+   return insn.length;
+   }
+
+   BUG_ON(!inat_has_modrm(insn.attr));
+
+   reg = get_reg_ptr(regs, &insn);
+
+   if (write) {
+   memcpy(&val, reg, size);
+   tdx_mmio(size, write, ve->gpa, val);
+   } else {
+   val = tdx_mmio(size, write, ve->gpa, val);
+   memset(reg, 0, size);
+   memcpy(reg, &val, size);
+   }
+   return insn.length;
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -331,6 +448,9 @@ int tdx_handle_virtualization_exception(struct pt_regs 
*regs,

[RFC v1 11/26] x86/tdx: Handle port I/O

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Unroll string operations and handle port I/O through TDVMCALLs.
Also handle #VE due to I/O operations with the same TDVMCALLs.

Decompression code uses port IO for earlyprintk. We must use
paravirt calls there too if we want to allow earlyprintk.

Decompression code cannot deal with alternatives: use branches
instead to implement inX() and outX() helpers.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/boot/compressed/Makefile |   1 +
 arch/x86/boot/compressed/tdx_io.S |   9 ++
 arch/x86/include/asm/asm-prototypes.h |   1 +
 arch/x86/include/asm/io.h |   5 +-
 arch/x86/include/asm/tdx.h|  62 +--
 arch/x86/kernel/Makefile  |   2 +-
 arch/x86/kernel/tdx.c |  72 +
 arch/x86/kernel/tdx_io.S  | 143 ++
 8 files changed, 284 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx_io.S
 create mode 100644 arch/x86/kernel/tdx_io.S

diff --git a/arch/x86/boot/compressed/Makefile 
b/arch/x86/boot/compressed/Makefile
index a2554621cefe..54da333adc4e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx_io.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx_io.S 
b/arch/x86/boot/compressed/tdx_io.S
new file mode 100644
index ..67498f67cb18
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx_io.S
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include 
+
+/* Do not export symbols in decompression code */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(sym)
+
+#include "../../kernel/tdx_io.S"
diff --git a/arch/x86/include/asm/asm-prototypes.h 
b/arch/x86/include/asm/asm-prototypes.h
index 51e2bf27cc9b..6bc97aa39a21 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..30a3b30395ad 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)   
\
\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {  \
-   if (sev_key_active()) { \
+   if (sev_key_active() || is_tdx_guest()) {   \
unsigned type *value = (unsigned type *)addr;   \
while (count) { \
out##bwl(*value, port); \
@@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, 
unsigned long count) \
\
 static inline void ins##bwl(int port, void *addr, unsigned long count) \
 {  \
-   if (sev_key_active()) { \
+   if (sev_key_active() || is_tdx_guest()) {   \
unsigned type *value = (unsigned type *)addr;   \
while (count) { \
*value = in##bwl(port); \
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c3e5af88643..b46ae140e39b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,7 +5,16 @@
 
 #define TDX_CPUID_LEAF_ID  0x21
 
-#ifdef CONFIG_INTEL_TDX_GUEST
+#define TDVMCALL   0
+#define TDINFO 1
+#define TDGETVEINFO3
+
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD  0
+#define TDVMCALL_VENDOR1
+
+#ifndef __ASSEMBLY__
+#include 
 
 /*
  * TDCALL instruction is newly added in TDX architecture,
@@ -14,19 +23,55 @@
  */
 #define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
 
-#define TDVMCALL   0
-#define TDINFO 1
-#define TDGETVEINFO3
-
-/* TDVMCALL R10 Input */
-#define TDVMCALL_STANDARD  0
-#define TDVMCALL_VENDOR1
+#ifdef CONFIG_INTEL_TDX_GUEST
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_t

[RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

This patch allows overriding the implementation of the port I/O
helpers. TDX code will provide an implementation that redirects the
helpers to paravirt calls.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/io.h | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..ef7a686a55a9 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+#ifndef __out
+#define __out(bwl, bw) \
+   asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw)  \
+   asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
 #define BUILDIO(bwl, bw, type) \
 static inline void out##bwl(unsigned type value, int port) \
 {  \
-   asm volatile("out" #bwl " %" #bw "0, %w1"   \
-: : "a"(value), "Nd"(port));   \
+   __out(bwl, bw); \
 }  \
\
 static inline unsigned type in##bwl(int port)  \
 {  \
unsigned type value;\
-   asm volatile("in" #bwl " %w1, %" #bw "0"\
-: "=a"(value) : "Nd"(port));   \
+   __in(bwl, bw);  \
return value;   \
 }  \
\
-- 
2.25.1



[RFC v1 09/26] x86/tdx: Handle CPUID via #VE

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

TDX has three classes of CPUID leaves: some CPUID leaves
are always handled by the CPU, others are handled by the TDX module,
and some others are handled by the VMM. Since the VMM cannot directly
intercept the instruction these are reflected with a #VE exception
to the guest, which then converts it into a TDCALL to the VMM,
or handled directly.

The TDX module EAS has a full list of CPUID leaves which are handled
natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by
the #VE method. In practice this typically only applies to the
hypervisor specific CPUIDs unknown to the native CPU.

Therefore there is no risk of causing this in early CPUID code which
runs before the #VE handler is set up because it will never access
those exotic CPUID leaves.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/tdx.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5d961263601e..e98058c048b5 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned 
int low,
return ret || r10 ? -EIO : 0;
 }
 
+static void tdx_handle_cpuid(struct pt_regs *regs)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_CPUID;
+   register long r12 asm("r12") = regs->ax;
+   register long r13 asm("r13") = regs->cx;
+   register long r14 asm("r14");
+   register long r15 asm("r15");
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+ "=r"(r14), "=r"(r15)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+ "r"(r13)
+   : );
+
+   regs->ax = r12;
+   regs->bx = r13;
+   regs->cx = r14;
+   regs->dx = r15;
+
+   WARN_ON(ret || r10);
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -227,6 +256,9 @@ int tdx_handle_virtualization_exception(struct pt_regs 
*regs,
case EXIT_REASON_MSR_WRITE:
ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
break;
+   case EXIT_REASON_CPUID:
+   tdx_handle_cpuid(regs);
+   break;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
-- 
2.25.1



[RFC v1 08/26] x86/tdx: Add MSR support for TDX guest

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Operations on context-switched MSRs can be run natively. The rest of
the MSRs should be handled through TDVMCALLs.

TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide
MSR operations.

You can find RDMSR and WRMSR details in Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 3.10, 3.11.

Also, since the CSTAR MSR is not used by the SYSCALL instruction on
Intel CPUs, ignore accesses to it. Ignoring the MSR is done for
compatibility: there is no need to wrap callers in !is_tdx_guest().
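
For illustration only (not part of this patch), a VMM-emulated MSR such as a
KVM paravirt MSR takes the #VE path, while a context-switched MSR stays
native:

  u64 val;

  /* Unknown to the hardware: #VE -> tdx_read_msr_safe() -> TDVMCALL */
  rdmsrl(MSR_KVM_WALL_CLOCK, val);

  /* Context-switched: handled natively by the CPU, no #VE */
  rdmsrl(MSR_FS_BASE, val);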

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/tdx.c | 94 ++-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index bbefe639a2ed..5d961263601e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -94,6 +94,84 @@ static __cpuidle void tdx_safe_halt(void)
BUG_ON(ret || r10);
 }
 
+static bool tdx_is_context_switched_msr(unsigned int msr)
+{
+   /*  XXX: Update the list of context-switched MSRs */
+
+   switch (msr) {
+   case MSR_EFER:
+   case MSR_IA32_CR_PAT:
+   case MSR_FS_BASE:
+   case MSR_GS_BASE:
+   case MSR_KERNEL_GS_BASE:
+   case MSR_IA32_SYSENTER_CS:
+   case MSR_IA32_SYSENTER_EIP:
+   case MSR_IA32_SYSENTER_ESP:
+   case MSR_STAR:
+   case MSR_LSTAR:
+   case MSR_SYSCALL_MASK:
+   case MSR_IA32_XSS:
+   case MSR_TSC_AUX:
+   case MSR_IA32_BNDCFGS:
+   return true;
+   }
+   return false;
+}
+
+static u64 tdx_read_msr_safe(unsigned int msr, int *err)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_MSR_READ;
+   register long r12 asm("r12") = msr;
+   register long rcx asm("rcx");
+   long ret;
+
+   WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
+
+   if (msr == MSR_CSTAR)
+   return 0;
+
+   /* Allow to pass R10, R11 and R12 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12)
+   : );
+
+   /* XXX: Better error handling needed? */
+   *err = (ret || r10) ? -EIO : 0;
+
+   return r11;
+}
+
+static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
+ unsigned int high)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_MSR_WRITE;
+   register long r12 asm("r12") = msr;
+   register long r13 asm("r13") = (u64)high << 32 | low;
+   register long rcx asm("rcx");
+   long ret;
+
+   WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
+
+   if (msr == MSR_CSTAR)
+   return 0;
+
+   /* Allow to pass R10, R11, R12 and R13 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+ "r"(r13)
+   : );
+
+   return ret || r10 ? -EIO : 0;
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -132,17 +210,31 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
 {
+   unsigned long val;
+   int ret = 0;
+
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
tdx_halt();
break;
+   case EXIT_REASON_MSR_READ:
+   val = tdx_read_msr_safe(regs->cx, (unsigned int *)&ret);
+   if (!ret) {
+   regs->ax = val & UINT_MAX;
+   regs->dx = val >> 32;
+   }
+   break;
+   case EXIT_REASON_MSR_WRITE:
+   ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
+   break;
default:
pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
return -EFAULT;
}
 
/* After successful #VE handling, move the IP */
-   regs->ip += ve->instr_len;
+   if (!ret)
+   regs->ip += ve->instr_len;
 
return ret;
 }
-- 
2.25.1



[RFC v1 07/26] x86/tdx: Wire up KVM hypercalls

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.
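
Callers are unchanged; for example (illustrative only), a kick-CPU hypercall
issued from a TD guest:

  unsigned long flags = 0;
  unsigned long apicid = 1;       /* example target CPU */

  /*
   * On a TD guest this routes to tdx_kvm_hypercall2(), which issues
   * TDCALL with R10 = TDVMCALL_VENDOR and R11 = KVM_HC_KICK_CPU
   * instead of the native VMCALL instruction.
   */
  kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);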

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/kvm_para.h |  21 ++
 arch/x86/include/asm/tdx.h  |   8 +++
 arch/x86/kernel/tdx-kvm.c   | 116 
 arch/x86/kernel/tdx.c   |   4 ++
 4 files changed, 149 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
long ret;
+
+   if (is_tdx_guest())
+   return tdx_kvm_hypercall0(nr);
+
asm volatile(KVM_HYPERCALL
 : "=a"(ret)
 : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
long ret;
+
+   if (is_tdx_guest())
+   return tdx_kvm_hypercall1(nr, p1);
+
asm volatile(KVM_HYPERCALL
 : "=a"(ret)
 : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned 
long p1,
  unsigned long p2)
 {
long ret;
+
+   if (is_tdx_guest())
+   return tdx_kvm_hypercall2(nr, p1, p2);
+
asm volatile(KVM_HYPERCALL
 : "=a"(ret)
 : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned 
long p1,
  unsigned long p2, unsigned long p3)
 {
long ret;
+
+   if (is_tdx_guest())
+   return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
asm volatile(KVM_HYPERCALL
 : "=a"(ret)
 : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned 
long p1,
  unsigned long p4)
 {
long ret;
+
+   if (is_tdx_guest())
+   return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
asm volatile(KVM_HYPERCALL
 : "=a"(ret)
 : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b98de067257b..8c3e5af88643 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -51,4 +51,12 @@ unsigned long tdx_get_ve_info(struct ve_info *ve);
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve);
 
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+   unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+   unsigned long p3, unsigned long p4);
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index ..323d43fcb338
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+   register long r10 asm("r10") = TDVMCALL_VENDOR;
+   register long r11 asm("r11") = nr;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10 and R11 down to the VMM */
+   rcx = BIT(10) | BIT(11);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+   : "memory");
+
+   BUG_ON(ret);
+   return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+   register long r10 asm("r10") = TDVMCALL_VENDOR;
+   register long r11 asm("r11") = nr;
+   register long r12 asm("r12") = p1;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10, R11 and R12 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12);
+
+   asm volatile(TDCALL
+

[RFC v1 06/26] x86/tdx: Add HLT support for TDX guest

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides the HLT operation. Use it to implement
halt() and safe_halt() paravirtualization calls.

The same TDVMCALL is used to handle #VE exception due to
EXIT_REASON_HLT.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/tdx.h |  5 
 arch/x86/kernel/tdx.c  | 61 ++
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 90eb61b07d1f..b98de067257b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -14,9 +14,14 @@
  */
 #define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
 
+#define TDVMCALL   0
 #define TDINFO 1
 #define TDGETVEINFO3
 
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD  0
+#define TDVMCALL_VENDOR1
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ae2d5c847700..25dd33bc2e49 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -51,6 +51,45 @@ static void tdx_get_info(void)
td_info.attributes = rdx;
 }
 
+static __cpuidle void tdx_halt(void)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_HLT;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10 and R11 down to the VMM */
+   rcx = BIT(10) | BIT(11);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+   : );
+
+   /* It should never fail */
+   BUG_ON(ret || r10);
+}
+
+static __cpuidle void tdx_safe_halt(void)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = EXIT_REASON_HLT;
+   register long rcx asm("rcx");
+   long ret;
+
+   /* Allow to pass R10 and R11 down to the VMM */
+   rcx = BIT(10) | BIT(11);
+
+   /* Enable interrupts next to the TDVMCALL to avoid performance 
degradation */
+   asm volatile("sti\n\t" TDCALL
+   : "=a"(ret), "=r"(r10), "=r"(r11)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+   : );
+
+   /* It should never fail */
+   BUG_ON(ret || r10);
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -60,6 +99,9 @@ void __init tdx_early_init(void)
 
tdx_get_info();
 
+   pv_ops.irq.safe_halt = tdx_safe_halt;
+   pv_ops.irq.halt = tdx_halt;
+
pr_info("TDX guest is initialized\n");
 }
 
@@ -86,10 +128,17 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
struct ve_info *ve)
 {
-   /*
-* TODO: Add handler support for various #VE exit
-* reasons
-*/
-   pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
-   return -EFAULT;
+   switch (ve->exit_reason) {
+   case EXIT_REASON_HLT:
+   tdx_halt();
+   break;
+   default:
+   pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
+   return -EFAULT;
+   }
+
+   /* After successful #VE handling, move the IP */
+   regs->ip += ve->instr_len;
+
+   return 0;
 }
-- 
2.25.1



[RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: Sean Christopherson 

Skip writing EFER during secondary_startup_64() if the current value is
also the desired value. This avoids a #VE when running as a TDX guest,
as the TDX-Module does not allow writes to EFER (even when writing the
current, fixed value).

Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
when running as a TDX guest. The TDX-Module (effectively part of the
hypervisor) requires CR4.MCE to be set at all times and injects a #VE
if the guest attempts to clear CR4.MCE.

Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/boot/compressed/head_64.S |  5 -
 arch/x86/kernel/head_64.S  | 13 +++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S 
b/arch/x86/boot/compressed/head_64.S
index 37c2f37d4a0d..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
popl%ecx
 
/* Enable PAE and LA57 (if required) paging modes */
-   movl$X86_CR4_PAE, %eax
+   movl%cr4, %eax
+   /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+   andl$X86_CR4_MCE, %eax
+   orl $X86_CR4_PAE, %eax
testl   %edx, %edx
jz  1f
orl $X86_CR4_LA57, %eax
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, 
SYM_L_GLOBAL)
 1:
 
/* Enable PAE mode, PGE and LA57 */
-   movl$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+   movq%cr4, %rcx
+   /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+   andl$X86_CR4_MCE, %ecx
+   orl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
testl   $1, __pgtable_l5_enabled(%rip)
jz  1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, 
SYM_L_GLOBAL)
/* Setup EFER (Extended Feature Enable Register) */
movl$MSR_EFER, %ecx
rdmsr
+   movl%eax, %edx
btsl$_EFER_SCE, %eax/* Enable System Call */
btl $20,%edi/* No Execute supported? */
jnc 1f
btsl$_EFER_NX, %eax
btsq$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1: wrmsr   /* Make changes effective */
 
+   /* Skip the WRMSR if the current value matches the desired value. */
+1: cmpl%edx, %eax
+   je  1f
+   xor %edx, %edx
+   wrmsr   /* Make changes effective */
+1:
/* Setup cr0 */
movl$CR0_STATE, %eax
/* Make changes effective */
-- 
2.25.1



[RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
a separate config option. It provides a middle ground for
not-so-deep paravirtualized environments.

CONFIG_PARAVIRT_XL will be used by TDX, which needs a couple of paravirt
calls that were hidden under CONFIG_PARAVIRT_XXL, but for which the rest of
that config would be bloat.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/Kconfig  |  4 +++
 arch/x86/boot/compressed/misc.h   |  1 +
 arch/x86/include/asm/irqflags.h   | 42 +++
 arch/x86/include/asm/paravirt.h   | 22 +++---
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c|  4 ++-
 arch/x86/mm/mem_encrypt_identity.c|  1 +
 7 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7b6dd10b162a..8fe91114bfee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -776,8 +776,12 @@ config PARAVIRT
  over full virtualization.  However, when run without a hypervisor
  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_XL
+   bool
+
 config PARAVIRT_XXL
bool
+   select PARAVIRT_XL
 
 config PARAVIRT_DEBUG
bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..4b84abe43765 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 2dfc8d380dab..299c9b1ed857 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -68,11 +68,33 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_XL
 #include 
 #else
 #ifndef __ASSEMBLY__
 #include 
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline __cpuidle void arch_safe_halt(void)
+{
+   native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline __cpuidle void halt(void)
+{
+   native_halt();
+}
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_XL */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
 
 static __always_inline unsigned long arch_local_save_flags(void)
 {
@@ -94,24 +116,6 @@ static __always_inline void arch_local_irq_enable(void)
native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static inline __cpuidle void arch_safe_halt(void)
-{
-   native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static inline __cpuidle void halt(void)
-{
-   native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index f8dce11d2bc1..700b94abfd1b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct 
*mm)
PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_XL
+static inline void arch_safe_halt(void)
+{
+   PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+   PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-   PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-   PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..634482a0a60d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -190,7 +190,8 @@ struct pv_irq_ops {
struct paravirt_callee_save restore_fl;
struct paravirt_callee_save irq_disable;
struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_XL
void (*safe_halt)(void);
void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 6c3407ba6ee9..85714a6389d6 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -327,9 +327,11 @@ struct paravirt_patch_template pv_ops = {
.irq.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
.irq.irq_disable= __PV_IS_CALLEE_SAVE(native_irq_disable),
   

[RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface

2021-02-05 Thread Kuppuswamy Sathyanarayanan
Add a helper function to detect TDX feature support. It will be used
to guard TDX-specific code.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdx.c| 32 +++
 arch/x86/include/asm/tdx.h|  8 
 arch/x86/kernel/tdx.c |  6 ++
 4 files changed, 47 insertions(+)
 create mode 100644 arch/x86/boot/compressed/tdx.c

diff --git a/arch/x86/boot/compressed/Makefile 
b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..a2554621cefe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -96,6 +96,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index ..0a87c1775b67
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include 
+
+static int __ro_after_init tdx_guest = -1;
+
+static inline bool native_cpuid_has_tdx_guest(void)
+{
+   u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0};
+
+   if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+   return false;
+
+   native_cpuid(&eax, &signature[0], &signature[1], &signature[2]);
+
+   if (memcmp("IntelTDX    ", signature, 12))
+   return false;
+
+   return true;
+}
+
+bool is_tdx_guest(void)
+{
+   if (tdx_guest < 0)
+   tdx_guest = native_cpuid_has_tdx_guest();
+
+   return !!tdx_guest;
+}
+
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2cc246c0cecf..0b9d571b1f95 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,10 +7,18 @@
 
 #ifdef CONFIG_INTEL_TDX_GUEST
 
+/* Common API to check TDX support in decompression and common kernel code. */
+bool is_tdx_guest(void);
+
 void __init tdx_early_init(void);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
+static inline bool is_tdx_guest(void)
+{
+   return false;
+}
+
 static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 473b4c1c0920..e44e55d1e519 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -20,6 +20,12 @@ static inline bool cpuid_has_tdx_guest(void)
return true;
 }
 
+bool is_tdx_guest(void)
+{
+   return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+EXPORT_SYMBOL_GPL(is_tdx_guest);
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
-- 
2.25.1



[RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses and a subset of CPUID
leaves. Also, it's theoretically possible for the CPU to inject a #VE
exception on an EPT violation, but the TDX module makes sure this does
not happen, as long as all memory used is properly accepted using
TDCALLs. You can find more details in the Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 2.3.

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, it is an unexpected event (fault case); treat it as a
general protection fault and handle it using the do_general_protection()
call.

TDCALL[TDGETVEINFO] provides information about #VE such as exit reason.

More details on cases where #VE exceptions are allowed/not-allowed:

The #VE exception does not occur in the paranoid entry paths, like NMIs.
While other operations during an NMI might cause #VE, these are in the
NMI code that can handle nesting, so there is no concern about
reentrancy. This is similar to how #PF is handled in NMIs.

The #VE exception also cannot happen in entry/exit code with the
wrong gs, such as the SWAPGS code, so its entry point does not
need "paranoid" handling.

Any memory access can cause #VE if it causes an EPT
violation.  However, the VMM is only in direct control of some of the
EPT tables.  The Secure EPT tables are controlled by the TDX module
which guarantees no EPT violations will result in #VE for the guest,
once the memory has been accepted.
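
The traps.c glue is conceptually a thin wrapper around the two new helpers; a
minimal sketch, assuming the fallback is the do_general_protection() path
named above (illustrative, the exact code in the patch may differ):

  DEFINE_IDTENTRY(exc_virtualization_exception)
  {
          struct ve_info ve;

          /* Retrieve exit reason/qualification via TDCALL[TDGETVEINFO]. */
          tdx_get_ve_info(&ve);

          cond_local_irq_enable(regs);
          /* An unhandled #VE is unexpected; treat it like a #GP. */
          if (tdx_handle_virtualization_exception(regs, &ve))
                  do_general_protection(regs, 0);
          cond_local_irq_disable(regs);
  }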

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h  | 14 +++
 arch/x86/kernel/idt.c   |  6 +++
 arch/x86/kernel/tdx.c   | 31 ++
 arch/x86/kernel/traps.c | 73 ++---
 5 files changed, 105 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..a2cbb68f9ae8 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -615,6 +615,10 @@ DECLARE_IDTENTRY_VC(X86_TRAP_VC,   exc_vmm_communication);
 DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER, exc_xen_hypervisor_callback);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,  exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,   common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f8cdc8eb1046..90eb61b07d1f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 #define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
 
 #define TDINFO 1
+#define TDGETVEINFO3
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
@@ -32,4 +33,17 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+struct ve_info {
+   unsigned int exit_reason;
+   unsigned long exit_qual;
+   unsigned long gla;
+   unsigned long gpa;
+   unsigned int instr_len;
+   unsigned int instr_info;
+};
+
+unsigned long tdx_get_ve_info(struct ve_info *ve);
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+   struct ve_info *ve);
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 */
INTG(X86_TRAP_PF,   asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+   INTG(X86_TRAP_VE,   asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
INTG(X86_TRAP_MF,   asm_exc_coprocessor_error),
INTG(X86_TRAP_AC,   asm_exc_alignment_check),
INTG(X86_TRAP_XF,   asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+   INTG(X86_TRAP_VE,   asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
TSKG(X86_TRAP_DF,   GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 13303bfdfdd1..ae2d5c847700 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -62,3 +62,34 @@ void __init tdx_early_init(void)
 
pr_info("TDX guest is initialized\n");
 }
+
+unsigned long tdx_get_ve_info(struct ve_info *ve)
+{
+   register long r8 asm("r8");
+   register long r9 asm("r9");
+   register long r10 asm("r10");

[RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2,
TDCALL[TDINFO] provides basic TD execution environment information, not
provided by CPUID.

Call TDINFO during early boot; its output will be used for subsequent
system initialization.

The call provides info on which bit in the pfn is used to indicate that
the page is shared with the host, and on the attributes of the TD, such
as debug.

We don't save information about the number of CPUs, as there are no
users so far.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/tdx.h |  9 +
 arch/x86/kernel/tdx.c  | 27 +++
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0b9d571b1f95..f8cdc8eb1046 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,15 @@
 
 #ifdef CONFIG_INTEL_TDX_GUEST
 
+/*
+ * TDCALL instruction is newly added in TDX architecture,
+ * used by TD for requesting the host VMM to provide
+ * (untrusted) services.
+ */
+#define TDCALL ".byte 0x66,0x0f,0x01,0xcc"
+
+#define TDINFO 1
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e44e55d1e519..13303bfdfdd1 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -3,6 +3,14 @@
 
 #include 
 #include 
+#include 
+#include 
+#include 
+
+static struct {
+   unsigned int gpa_width;
+   unsigned long attributes;
+} td_info __ro_after_init;
 
 static inline bool cpuid_has_tdx_guest(void)
 {
@@ -26,6 +34,23 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+static void tdx_get_info(void)
+{
+   register long rcx asm("rcx");
+   register long rdx asm("rdx");
+   register long r8 asm("r8");
+   long ret;
+
+   asm volatile(TDCALL
+: "=a"(ret), "=c"(rcx), "=r"(rdx), "=r"(r8)
+: "a"(TDINFO)
+: "r9", "r10", "r11", "memory");
+   BUG_ON(ret);
+
+   td_info.gpa_width = rcx & GENMASK(5, 0);
+   td_info.attributes = rdx;
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -33,5 +58,7 @@ void __init tdx_early_init(void)
 
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+   tdx_get_info();
+
pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



[RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature

2021-02-05 Thread Kuppuswamy Sathyanarayanan
Add CPU feature detection for Trusted Domain Extensions support.
The TDX feature adds capabilities to keep guest register state and
memory isolated from the hypervisor.

On TDX guest platforms, executing CPUID(0x21, 0) returns the
following values in EAX, EBX, ECX and EDX.

EAX:  Maximum sub-leaf number:  0
EBX/EDX/ECX:  Vendor string:

EBX =  "Inte"
EDX =  "lTDX"
ECX =  "    "

When the above condition is true, set the X86_FEATURE_TDX_GUEST
feature cap bit.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/tdx.h | 18 +
 arch/x86/kernel/Makefile   |  1 +
 arch/x86/kernel/head64.c   |  3 +++
 arch/x86/kernel/tdx.c  | 31 ++
 5 files changed, 54 insertions(+)
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..989e2b302880 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL( 8*32+19) /* "" VMware prefers 
VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted 
Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH  ( 8*32+21) /* "" VM Page Flush MSR is 
supported */
+#define X86_FEATURE_TDX_GUEST  ( 8*32+22) /* Trusted Domain Extensions 
Guest */
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE   ( 9*32+ 0) /* RDFSBASE, WRFSBASE, 
RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index ..2cc246c0cecf
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#define TDX_CPUID_LEAF_ID  0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+
+#else // !CONFIG_INTEL_TDX_GUEST
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5eeb808eb024..ba8ee9300f23 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)  += jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST)  += tdx.o
 
 obj-$(CONFIG_EISA) += eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)  += pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e9beb77cafd..75f2401cb5db 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Manage page tables very early on.
@@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * 
real_mode_data)
 
kasan_early_init();
 
+   tdx_early_init();
+
idt_setup_early_handler();
 
copy_bootdata(__va(real_mode_data));
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index ..473b4c1c0920
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright (C) 2020 Intel Corporation */
+
+#include 
+#include 
+
+static inline bool cpuid_has_tdx_guest(void)
+{
+   u32 eax, signature[3];
+
+   if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+   return false;
+
+   cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0],
+   &signature[1], &signature[2]);
+
+   if (memcmp("IntelTDX    ", signature, 12))
+   return false;
+
+   return true;
+}
+
+void __init tdx_early_init(void)
+{
+   if (!cpuid_has_tdx_guest())
+   return;
+
+   setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+   pr_info("TDX guest is initialized\n");
+}
-- 
2.25.1



[RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMCALL

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
shared memory. Shared GPA mappings can be used for communication
between the TD guest and the host VMM, for example for
paravirtualized IO.

The new helper tdx_map_gpa() provides access to the operation.
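
A typical caller (illustrative only; vaddr stands for an already-mapped
kernel virtual address) converts one page to shared before handing it to the
VMM:

  phys_addr_t pa = __pa(vaddr);
  int ret;

  /* private=false requests a shared mapping (shared bit set in the GPA) */
  ret = tdx_map_gpa(pa, 1, false);
  if (ret)
          pr_err("MapGPA failed for %pa\n", &pa);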

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/tdx.h |  2 ++
 arch/x86/kernel/tdx.c  | 28 
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9bbfe6520ea4..efffdef35c78 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -105,5 +105,7 @@ long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, 
unsigned long p2,
unsigned long p3, unsigned long p4);
 
 phys_addr_t tdx_shared_mask(void);
+
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private);
 #endif
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9681f4a0b4e0..f99fe54b4f88 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -14,6 +14,8 @@
 #include "tdx-kvm.c"
 #endif
 
+#define TDVMCALL_MAP_GPA   0x10001
+
 static struct {
unsigned int gpa_width;
unsigned long attributes;
@@ -66,6 +68,32 @@ static void tdx_get_info(void)
physical_mask &= ~tdx_shared_mask();
 }
 
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+{
+   register long r10 asm("r10") = TDVMCALL_STANDARD;
+   register long r11 asm("r11") = TDVMCALL_MAP_GPA;
+   register long r12 asm("r12") = gpa;
+   register long r13 asm("r13") = PAGE_SIZE * numpages;
+   register long rcx asm("rcx");
+   long ret;
+
+   if (!private)
+   r12 |= tdx_shared_mask();
+
+   /* Allow to pass R10, R11, R12 and R13 down to the VMM */
+   rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13);
+
+   asm volatile(TDCALL
+   : "=a"(ret), "=r"(r10)
+   : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+ "r"(r13)
+   : );
+
+   // Host kernel doesn't implement it yet.
+   // WARN_ON(ret || r10);
+   return ret || r10 ? -EIO : 0;
+}
+
 static __cpuidle void tdx_halt(void)
 {
register long r10 asm("r10") = TDVMCALL_STANDARD;
-- 
2.25.1



Re: [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-02-05 Thread Kuppuswamy, Sathyanarayanan

Hi Andy,

On 2/5/21 3:43 PM, Andy Lutomirski wrote:

MWAIT turning into NOP is no good.  How about suppressing
X86_FEATURE_MWAIT instead?

Yes, we can suppress it in tdx_early_init().

 + setup_clear_cpu_cap(X86_FEATURE_MWAIT);

But do you want to leave the MWAIT #VE handler as it as
(just in case)?


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[RFC v1 25/26] x86/tdx: Make DMA pages shared

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Make force_dma_unencrypted() return true for TDX to get DMA pages mapped
as shared.

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case the TDX-Module
will hold the page in a non-present "pending" state until it is explicitly
accepted by the guest.

BUG() if TDACCEPTPAGE fails (except the above case), as the guest is
completely hosed if it can't access memory.
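
The resulting conversion flow, as a sketch of the call sequence described
above (names as used in this series):

  /*
   * set_memory_decrypted(addr, n)                 - share with the VMM
   *   __set_memory_enc_dec(addr, n, enc=false)
   *     set tdx_shared_mask() bit in the PTEs
   *     tdx_map_gpa(__pa(addr), n, private=false) -> MapGPA TDVMCALL
   *
   * set_memory_encrypted(addr, n)                 - convert back to private
   *   __set_memory_enc_dec(addr, n, enc=true)
   *     clear tdx_shared_mask() bit in the PTEs
   *     tdx_map_gpa(__pa(addr), n, private=true)  -> MapGPA TDVMCALL, then
   *       tdx_accept_page() on each 4k page       -> TDACCEPTPAGE
   */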

Tested-by: Kai Huang 
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/tdx.h   |  3 +++
 arch/x86/kernel/tdx.c| 29 ++---
 arch/x86/mm/mem_encrypt_common.c |  4 ++--
 arch/x86/mm/pat/set_memory.c | 23 ++-
 4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index efffdef35c78..9b66c3a5cf83 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,9 @@
 #define TDVMCALL   0
 #define TDINFO 1
 #define TDGETVEINFO3
+#define TDACCEPTPAGE   6
+
+#define TDX_PAGE_ALREADY_ACCEPTED   0x0B0A
 
 /* TDVMCALL R10 Input */
 #define TDVMCALL_STANDARD  0
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f99fe54b4f88..f51a19168adc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -68,7 +68,7 @@ static void tdx_get_info(void)
physical_mask &= ~tdx_shared_mask();
 }
 
-int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+static int __tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
 {
register long r10 asm("r10") = TDVMCALL_STANDARD;
register long r11 asm("r11") = TDVMCALL_MAP_GPA;
@@ -89,11 +89,34 @@ int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
  "r"(r13)
: );
 
-   // Host kernel doesn't implement it yet.
-   // WARN_ON(ret || r10);
+   WARN_ON(ret || r10);
return ret || r10 ? -EIO : 0;
 }
 
+static void tdx_accept_page(phys_addr_t gpa)
+{
+   u64 ret;
+
+   asm volatile(TDCALL : "=a"(ret) : "a"(TDACCEPTPAGE), "c"(gpa));
+
+   BUG_ON(ret && (ret & ~0xffull) != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+{
+   int ret, i;
+
+   ret = __tdx_map_gpa(gpa, numpages, private);
+   if (ret || !private)
+   return ret;
+
+   for (i = 0; i < numpages; i++)
+   tdx_accept_page(gpa + i*PAGE_SIZE);
+
+   return 0;
+}
+
 static __cpuidle void tdx_halt(void)
 {
register long r10 asm("r10") = TDVMCALL_STANDARD;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..b6d93b0c5dcf 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,9 +15,9 @@
 bool force_dma_unencrypted(struct device *dev)
 {
/*
-* For SEV, all DMA must be to unencrypted/shared addresses.
+* For SEV and TDX, all DMA must be to unencrypted/shared addresses.
 */
-   if (sev_active())
+   if (sev_active() || is_tdx_guest())
return true;
 
/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..6f23a9816ef0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../mm_internal.h"
 
@@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int 
numpages, bool enc)
struct cpa_data cpa;
int ret;
 
-   /* Nothing to do if memory encryption is not active */
-   if (!mem_encrypt_active())
+   /* Nothing to do if memory encryption and TDX are not active */
+   if (!mem_encrypt_active() && !is_tdx_guest())
return 0;
 
/* Should not be working on unaligned addresses */
@@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int 
numpages, bool enc)
	memset(&cpa, 0, sizeof(cpa));
	cpa.vaddr = &addr;
cpa.numpages = numpages;
-   cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-   cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+   if (is_tdx_guest()) {
+   cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
+   cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
+   } else {
+   cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
+

[RFC v1 23/26] x86/tdx: Make pages shared in ioremap()

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/include/asm/pgtable.h | 3 +++
 arch/x86/mm/ioremap.c  | 8 +---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a82bab48379e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,9 @@
 #define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))
 
+/* Make the page accessible to the VMM */
+#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | tdx_shared_mask())
+
 #ifndef __ASSEMBLY__
 #include 
 #include 
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..a0ba760866d4 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource 
*res)
 }
 
 /*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
  */
 static unsigned int __ioremap_check_encrypted(struct resource *res)
 {
-   if (!sev_active())
+   if (!sev_active() && !is_tdx_guest())
return 0;
 
switch (res->desc) {
@@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long 
size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+   else if (is_tdx_guest())
+   prot = pgprot_tdx_shared(prot);
 
switch (pcm) {
case _PAGE_CACHE_MODE_UC:
-- 
2.25.1



[RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

TDX doesn't allow DMA access to guest private memory.
In order for DMA to work properly in a TD guest, use SWIOTLB bounce
buffers.

Move the AMD SEV initialization into common code and adapt it for TDX.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/pci-swiotlb.c|  2 +-
 arch/x86/kernel/tdx.c|  3 +++
 arch/x86/mm/mem_encrypt.c| 44 ---
 arch/x86/mm/mem_encrypt_common.c | 45 
 4 files changed, 49 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..020e13749758 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
-   if (sme_active())
+   if (sme_active() || is_tdx_guest())
swiotlb = 1;
 
return swiotlb;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f51a19168adc..ccb9401bd706 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include  /* force_sig_fault() */
+#include 
 
 #ifdef CONFIG_KVM_GUEST
 #include "tdx-kvm.c"
@@ -472,6 +473,8 @@ void __init tdx_early_init(void)
 
legacy_pic = _legacy_pic;
 
+   swiotlb_force = SWIOTLB_FORCE;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
  NULL, tdx_cpu_offline_prepare);
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 11a6a7b3af7e..7fbbb2f3d426 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -408,47 +408,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-   pr_info("AMD Memory Encryption Features active:");
-
-   /* Secure Memory Encryption */
-   if (sme_active()) {
-   /*
-* SME is mutually exclusive with any of the SEV
-* features below.
-*/
-   pr_cont(" SME\n");
-   return;
-   }
-
-   /* Secure Encrypted Virtualization */
-   if (sev_active())
-   pr_cont(" SEV");
-
-   /* Encrypted Register State */
-   if (sev_es_active())
-   pr_cont(" SEV-ES");
-
-   pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-   if (!sme_me_mask)
-   return;
-
-   /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-   swiotlb_update_mem_attributes();
-
-   /*
-* With SEV, we need to unroll the rep string I/O instructions.
-*/
-   if (sev_active())
-   static_branch_enable(_enable_key);
-
-   print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index b6d93b0c5dcf..6f3d90d4d68e 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -36,3 +37,47 @@ bool force_dma_unencrypted(struct device *dev)
 
return false;
 }
+
+static void print_mem_encrypt_feature_info(void)
+{
+   pr_info("AMD Memory Encryption Features active:");
+
+   /* Secure Memory Encryption */
+   if (sme_active()) {
+   /*
+* SME is mutually exclusive with any of the SEV
+* features below.
+*/
+   pr_cont(" SME\n");
+   return;
+   }
+
+   /* Secure Encrypted Virtualization */
+   if (sev_active())
+   pr_cont(" SEV");
+
+   /* Encrypted Register State */
+   if (sev_es_active())
+   pr_cont(" SEV-ES");
+
+   pr_cont("\n");
+}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+   if (!sme_me_mask && !is_tdx_guest())
+   return;
+
+   /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+   swiotlb_update_mem_attributes();
+
+   /*
+* With SEV, we need to unroll the rep string I/O instructions.
+*/
+   if (sev_active())
+   static_branch_enable(_enable_key);
+
+   if (!is_tdx_guest())
+   print_mem_encrypt_feature_info();
+}
-- 
2.25.1



[RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

Intel TDX doesn't allow the VMM to access guest memory. Any memory that
is required for communication with the VMM must be shared explicitly by
setting a bit in the page table entry. Shared memory is similar to
unencrypted memory in AMD SME/SEV terminology.

force_dma_unencrypted() has to return true for a TDX guest. Move it out
of the AMD SME code.

Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features.

This is preparation for TDX changes in DMA code.
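
For context (illustrative only), the DMA-direct allocation path consults this
hook, roughly as follows:

  /*
   * Simplified view of kernel/dma/direct.c: when force_dma_unencrypted(dev)
   * returns true, a freshly allocated coherent buffer is converted with
   * set_memory_decrypted() before it is handed to the device, which on TDX
   * ends up in the shared-mapping (MapGPA) path added later in this series.
   */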

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/Kconfig |  8 +--
 arch/x86/include/asm/io.h|  4 +++-
 arch/x86/mm/Makefile |  2 ++
 arch/x86/mm/mem_encrypt.c| 30 -
 arch/x86/mm/mem_encrypt_common.c | 38 
 5 files changed, 49 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0374d9f262a5..8fa654d61ac2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select DYNAMIC_PHYSICAL_MASK
+   def_bool n
+
 config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
-   select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
-   select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
+   select X86_MEM_ENCRYPT_COMMON
help
  Say yes to enable support for the encryption of system memory.
  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 30a3b30395ad..95e534cffa99 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,10 +257,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include 
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
return static_branch_unlikely(_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)+= 
pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)   += mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c79e5736ab2b..11a6a7b3af7e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
 
 #include 
 #include 
@@ -389,32 +385,6 @@ bool noinstr sev_es_active(void)
return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-   /*
-* For SEV, all DMA must be to unencrypted addresses.
-*/
-   if (sev_active())
-   return true;
-
-   /*
-* For SME, all DMA must be to unencrypted addresses if the
-* device does not support DMA to addresses that include the
-* encryption mask.
-*/
-   if (sme_active()) {
-   u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-   u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-   dev->bus_dma_limit);
-
-   if (dma_dev_mask <= dma_enc_mask)
-   return true;
-   }
-
-   return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index ..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky 
+ */
+
+#include 
+#include 
+#include 
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+   /*
+* For SEV, all DM

[RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: "Kirill A. Shutemov" 

tdx_shared_mask() returns the mask that has to be set in a page table
entry to make the page shared with the VMM.
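
A worked example (illustrative): with a TDINFO-reported gpa_width of 48, the
shared bit is bit 47:

  /*
   *   tdx_shared_mask() = 1ULL << (48 - 1) = 0x0000800000000000
   *   physical_mask    &= ~0x0000800000000000
   *
   * so bit 47 is no longer treated as part of the physical address and can
   * instead be ORed into a PTE to mark the page shared with the VMM.
   */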

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/Kconfig   | 1 +
 arch/x86/include/asm/tdx.h | 1 +
 arch/x86/kernel/tdx.c  | 8 
 3 files changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fa654d61ac2..f10a00c4ad7f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -875,6 +875,7 @@ config INTEL_TDX_GUEST
select PARAVIRT_XL
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
+   select X86_MEM_ENCRYPT_COMMON
help
  Provide support for running in a trusted domain on Intel processors
	  equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b46ae140e39b..9bbfe6520ea4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -104,5 +104,6 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, 
unsigned long p2,
 long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4);
 
+phys_addr_t tdx_shared_mask(void);
 #endif
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ae37498df981..9681f4a0b4e0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -41,6 +41,11 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+phys_addr_t tdx_shared_mask(void)
+{
+   return 1ULL << (td_info.gpa_width - 1);
+}
+
 static void tdx_get_info(void)
 {
register long rcx asm("rcx");
@@ -56,6 +61,9 @@ static void tdx_get_info(void)
 
td_info.gpa_width = rcx & GENMASK(5, 0);
td_info.attributes = rdx;
+
+   /* Exclude Shared bit from the __PHYSICAL_MASK */
+   physical_mask &= ~tdx_shared_mask();
 }
 
 static __cpuidle void tdx_halt(void)
-- 
2.25.1



[RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option

2021-02-05 Thread Kuppuswamy Sathyanarayanan
Add INTEL_TDX_GUEST config option to selectively compile
TDX guest support.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---
 arch/x86/Kconfig | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fe91114bfee..0374d9f262a5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -868,6 +868,21 @@ config ACRN_GUEST
  IOT with small footprint and real-time features. More details can be
  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+   bool "Intel Trusted Domain eXtensions Guest Support"
+   depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+   depends on SECURITY
+   select PARAVIRT_XL
+   select X86_X2APIC
+   select SECURITY_LOCKDOWN_LSM
+   help
+ Provide support for running in a trusted domain on Intel processors
+ equipped with Trusted Domain eXtenstions. TDX is an new Intel
+ technology that extends VMX and Memory Encryption with a new kind of
+ virtual machine guest called Trust Domain (TD). A TD is designed to
+ run in a CPU mode that protects the confidentiality of TD memory
+ contents and the TD’s CPU state from other software, including VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
-- 
2.25.1



[RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms.

2021-02-05 Thread Kuppuswamy Sathyanarayanan
As per the Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and
sec 9.4, all unused CPUs are put in a spinning state by
TDVF until the OS requests CPU bring-up via the mailbox address passed
in the ACPI MADT table. Since by default all unused CPUs are always in
a spinning state, there is no point in supporting a dynamic CPU
online/offline feature. So the current generation of TDVF does not
support CPU hotplug; it may be supported in a future generation.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Andi Kleen 
---
 arch/x86/kernel/tdx.c  | 14 ++
 arch/x86/kernel/topology.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 8d1d7555fb56..a36b6ae14942 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -387,6 +387,17 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct 
ve_info *ve)
return insn.length;
 }
 
+static int tdx_cpu_offline_prepare(unsigned int cpu)
+{
+   /*
+* Per Intel TDX Virtual Firmware Design Guide,
+* sec 4.3.5 and sec 9.4, hotplug is not supported
+* on TDX platforms. So don't support the CPU
+* offline feature once it's turned on.
+*/
+   return -EOPNOTSUPP;
+}
+
 void __init tdx_early_init(void)
 {
if (!cpuid_has_tdx_guest())
@@ -399,6 +410,9 @@ void __init tdx_early_init(void)
pv_ops.irq.safe_halt = tdx_safe_halt;
pv_ops.irq.halt = tdx_halt;
 
+   cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
+ NULL, tdx_cpu_offline_prepare);
+
pr_info("TDX guest is initialized\n");
 }
 
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index f5477eab5692..d879ea96d79c 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -130,7 +131,7 @@ int arch_register_cpu(int num)
}
}
}
-   if (num || cpu0_hotpluggable)
+   if ((num || cpu0_hotpluggable) && !is_tdx_guest())
per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
 
return register_cpu(_cpu(cpu_devices, num).cpu, num);
-- 
2.25.1



[RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: Sean Christopherson 

Disable the legacy PIC (8259) for TDX guests as the PIC cannot be
supported by the VMM. TDX Module does not allow direct IRQ injection,
and using posted interrupt style delivery requires the guest to EOI
the IRQ, which diverges from the legacy PIC behavior.

Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/kernel/tdx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index a36b6ae14942..ae37498df981 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -410,6 +411,8 @@ void __init tdx_early_init(void)
pv_ops.irq.safe_halt = tdx_safe_halt;
pv_ops.irq.halt = tdx_halt;
 
+   legacy_pic = _legacy_pic;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
  NULL, tdx_cpu_offline_prepare);
 
-- 
2.25.1



[RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms

2021-02-05 Thread Kuppuswamy Sathyanarayanan
From: Sean Christopherson 

Avoid operations which will inject #VE during compressed
boot, which is obviously fatal for TDX platforms.

Details are,

 1. TDX module injects #VE if a TDX guest attempts to write
EFER. So skip the WRMSR to set EFER.LME=1 if it's already
set. TDX also forces EFER.LME=1, i.e. the branch will always
be taken and thus the #VE avoided.

 2. TDX module also injects a #VE if the guest attempts to clear
CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
boot. Setting CR0.NE should be a no-op on all CPUs that
support 64-bit mode.

Signed-off-by: Sean Christopherson 
Reviewed-by: Andi Kleen 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---
 arch/x86/boot/compressed/head_64.S | 5 +++--
 arch/x86/boot/compressed/pgtable.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S 
b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..37c2f37d4a0d 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
	movl	$MSR_EFER, %ecx
	rdmsr
	btsl	$_EFER_LME, %eax
+	jc	1f
	wrmsr
-	popl	%edx
+1:	popl	%edx
	popl	%ecx
 
/* Enable PAE and LA57 (if required) paging modes */
@@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
pushl   %eax
 
	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
	movl	%eax, %cr0
 
lret
diff --git a/arch/x86/boot/compressed/pgtable.h 
b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
 
 #define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
 
-- 
2.25.1



[PATCH v13 4/5] PCI/ACPI: Centralize pcie_ports_native checking

2021-01-22 Thread Kuppuswamy Sathyanarayanan
If the user booted with "pcie_ports=native", we take control of the PCIe
features unconditionally, regardless of what _OSC says.

Centralize the testing of pcie_ports_native in acpi_pci_root_create(),
where we interpret the _OSC results, so other places only have to check
host_bridge->native_X and we don't have to sprinkle tests of
pcie_ports_native everywhere.

[bhelgaas: commit log, rework OSC_PCI_EXPRESS_CONTROL_MASKS, logging]
Link: 
https://lore.kernel.org/r/bc87c9e675118960949043a832bed86bc22becbd.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Bjorn Helgaas 
---
 drivers/acpi/pci_root.c   | 19 +++
 drivers/pci/hotplug/pciehp_core.c |  2 +-
 drivers/pci/pci-acpi.c|  3 ---
 drivers/pci/pcie/aer.c|  2 +-
 drivers/pci/pcie/portdrv_core.c   | 11 ---
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index 601fbe905993..16ca58d58fef 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -880,6 +880,8 @@ static void acpi_pci_root_release_info(struct 
pci_host_bridge *bridge)
flag = 0;   \
} while (0)
 
+#define FLAG(x)((x) ? '+' : '-')
+
 struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root,
 struct acpi_pci_root_ops *ops,
 struct acpi_pci_root_info *info,
@@ -928,6 +930,23 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root 
*root,
OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr);
OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc);
 
+   if (pcie_ports_native) {
+   dev_info(>device->dev, "Taking control of PCIe-related 
features because \"pcie_ports=native\" specified; may conflict with 
firmware\n");
+   host_bridge->native_pcie_hotplug = 1;
+   host_bridge->native_aer = 1;
+   host_bridge->native_pme = 1;
+   host_bridge->native_ltr = 1;
+   host_bridge->native_dpc = 1;
+   }
+
+   dev_info(>device->dev, "OS native features: SHPCHotplug%c 
PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n",
+   FLAG(host_bridge->native_shpc_hotplug),
+   FLAG(host_bridge->native_pcie_hotplug),
+   FLAG(host_bridge->native_pme),
+   FLAG(host_bridge->native_aer),
+   FLAG(host_bridge->native_dpc),
+   FLAG(host_bridge->native_ltr));
+
/*
 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it
 * exists and returns 0, we must preserve any PCI resource
diff --git a/drivers/pci/hotplug/pciehp_core.c 
b/drivers/pci/hotplug/pciehp_core.c
index ad3393930ecb..d1831e6bf60a 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -256,7 +256,7 @@ static bool pme_is_native(struct pcie_device *dev)
const struct pci_host_bridge *host;
 
host = pci_find_host_bridge(dev->port->bus);
-   return pcie_ports_native || host->native_pme;
+   return host->native_pme;
 }
 
 static void pciehp_disable_interrupt(struct pcie_device *dev)
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 53502a751914..f6327cf0601b 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -800,9 +800,6 @@ bool pciehp_is_native(struct pci_dev *bridge)
if (!(slot_cap & PCI_EXP_SLTCAP_HPC))
return false;
 
-   if (pcie_ports_native)
-   return true;
-
host = pci_find_host_bridge(bridge->bus);
return host->native_pcie_hotplug;
 }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 77b0f2c45bc0..7fdeaadc40fe 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -219,7 +219,7 @@ int pcie_aer_is_native(struct pci_dev *dev)
if (!dev->aer_cap)
return 0;
 
-   return pcie_ports_native || host->native_aer;
+   return host->native_aer;
 }
 
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index e1fed6649c41..ea1099908d5d 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev)
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
int services = 0;
 
-   if (dev->is_hotplug_bridge &&
-   (pcie_ports_native || host->native_pcie_hotplug)) {
+   if (host->native_pcie_hotplug && dev->is_hotplug_bridge) {
services |= PCIE_PORT_SERVICE_HP;
 
/*
@@ -221,8 +220,7 @@ static int get_po

[PATCH v13 1/5] PCI/DPC: Ignore devices with no AER Capability

2021-01-22 Thread Kuppuswamy Sathyanarayanan
From: Bjorn Helgaas 

Downstream Ports may support DPC regardless of whether they support AER
(see PCIe r5.0, sec 6.2.10.2).  Previously, if the user booted with
"pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if
the device had no AER Capability, but dpc_get_aer_uncorrect_severity()
depends on the AER Capability.

dpc_probe() previously failed if:

  !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native
  !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law

so it succeeded if:

  pcie_aer_is_native() || pcie_ports_dpc_native

Fail dpc_probe() if the device has no AER Capability.

Signed-off-by: Bjorn Helgaas 
Cc: Olof Johansson 
---
 drivers/pci/pcie/dpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba86a317..ed0dbc43d018 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
 
+   if (!pdev->aer_cap)
+   return -ENOTSUPP;
+
if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native)
return -ENOTSUPP;
 
-- 
2.25.1



[PATCH v13 0/5] Simplify PCIe native ownership

2021-01-22 Thread Kuppuswamy Sathyanarayanan
Currently, ownership of PCIe capabilities is detected by checking both
the pcie_ports_native parameter and the _OSC negotiation results
(cached in the struct pci_host_bridge->native_* members). But this
logic can be simplified so that only the struct pci_host_bridge
->native_* members need to be consulted; a simplified before/after
sketch follows below.

This patchset removes the distributed checks for the pcie_ports_native
parameter.
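
To illustrate the intent, here is a minimal before/after sketch (simplified
for this cover letter; "host" is the cached struct pci_host_bridge, and the
AER case stands in for all affected features; see the individual patches for
the real hunks):

  /* Before: each call site repeats the boot-parameter check */
  if (dev->aer_cap && (pcie_ports_native || host->native_aer))
          services |= PCIE_PORT_SERVICE_AER;

  /* After: acpi_pci_root_create() folds the parameter into the cached
   * ownership flags once, while interpreting the _OSC result ...
   */
  if (pcie_ports_native)
          host_bridge->native_aer = 1;

  /* ... so call sites only consult the cached flag */
  if (host->native_aer && dev->aer_cap)
          services |= PCIE_PORT_SERVICE_AER;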

Changes since v12:
 * Rebased on top of v5.11-rc1

Changes since v11 (Bjorns update):
 * Add bugfix for DPC with no AER Capability
 * Split OSC_OWNER trivial changes from pcie_ports_native changes
 * Temporarily drop pcie_ports_dpc_native changes (revisit it later).

Changes since v10:
 * Addressed format issue reported by lkp test.

Changes since v9:
 * Rebased on top of v5.10-rc1

Changes since v8:
 * Simplified setting _OSC ownership logic
 * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS.

Changes since v7:
 * Fixed "fix array_size.cocci warnings".

Changes since v6:
 * Created new patch for CONFIG_PCIEPORTBUS check in
   pci_init_host_bridge().
 * Added warning message for a case when pcie_ports_native
   overrides _OSC negotiation result.

Changes since v5:
 * Rebased on top of v5.8-rc1

Changes since v4:
 * Changed the patch set title (Original link: 
https://lkml.org/lkml/2020/5/26/1710)
 * Added AER/DPC dependency logic cleanup fixes.

Bjorn Helgaas (2):
  PCI/DPC: Ignore devices with no AER Capability
  PCI/ACPI: Centralize pci_aer_available() checking

Kuppuswamy Sathyanarayanan (3):
  PCI: Assume control of portdrv-related features only when portdrv
enabled
  PCI/ACPI: Tidy _OSC control bit checking
  PCI/ACPI: Centralize pcie_ports_native checking

 drivers/acpi/pci_root.c   | 49 ---
 drivers/pci/hotplug/pciehp_core.c |  2 +-
 drivers/pci/pci-acpi.c|  3 --
 drivers/pci/pcie/aer.c|  2 +-
 drivers/pci/pcie/dpc.c|  3 ++
 drivers/pci/pcie/portdrv_core.c   | 11 +++
 drivers/pci/probe.c   |  6 ++--
 7 files changed, 51 insertions(+), 25 deletions(-)

-- 
2.25.1



[PATCH v13 3/5] PCI/ACPI: Tidy _OSC control bit checking

2021-01-22 Thread Kuppuswamy Sathyanarayanan
Add OSC_OWNER() helper to prettify checking the _OSC control bits to learn
whether the platform has granted us control of PCI features.  No functional
change intended.

[bhelgaas: split to separate patch, commit log]
Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Bjorn Helgaas 
---
 drivers/acpi/pci_root.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index 0bf072cef6cf..601fbe905993 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -874,6 +874,12 @@ static void acpi_pci_root_release_info(struct 
pci_host_bridge *bridge)
__acpi_pci_root_release_info(bridge->release_data);
 }
 
+#define OSC_OWNER(ctrl, bit, flag) \
+   do {\
+   if (!(ctrl & bit))  \
+   flag = 0;   \
+   } while (0)
+
 struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root,
 struct acpi_pci_root_ops *ops,
 struct acpi_pci_root_info *info,
@@ -885,6 +891,7 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root 
*root,
struct pci_bus *bus;
struct pci_host_bridge *host_bridge;
union acpi_object *obj;
+   u32 ctrl;
 
info->root = root;
info->bridge = device;
@@ -910,18 +917,16 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root 
*root,
goto out_release_info;
 
host_bridge = to_pci_host_bridge(bus->bridge);
-   if (!(root->osc_control_set & OSC_PCI_EXPRESS_NATIVE_HP_CONTROL))
-   host_bridge->native_pcie_hotplug = 0;
-   if (!(root->osc_control_set & OSC_PCI_SHPC_NATIVE_HP_CONTROL))
-   host_bridge->native_shpc_hotplug = 0;
-   if (!(root->osc_control_set & OSC_PCI_EXPRESS_AER_CONTROL))
-   host_bridge->native_aer = 0;
-   if (!(root->osc_control_set & OSC_PCI_EXPRESS_PME_CONTROL))
-   host_bridge->native_pme = 0;
-   if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL))
-   host_bridge->native_ltr = 0;
-   if (!(root->osc_control_set & OSC_PCI_EXPRESS_DPC_CONTROL))
-   host_bridge->native_dpc = 0;
+
+   ctrl = root->osc_control_set;
+   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_NATIVE_HP_CONTROL,
+ host_bridge->native_pcie_hotplug);
+   OSC_OWNER(ctrl, OSC_PCI_SHPC_NATIVE_HP_CONTROL,
+ host_bridge->native_shpc_hotplug);
+   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_AER_CONTROL, host_bridge->native_aer);
+   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_PME_CONTROL, host_bridge->native_pme);
+   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr);
+   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc);
 
/*
 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it
-- 
2.25.1



[PATCH v13 5/5] PCI/ACPI: Centralize pci_aer_available() checking

2021-01-22 Thread Kuppuswamy Sathyanarayanan
From: Bjorn Helgaas 

Check pci_aer_available() in acpi_pci_root_create() when we're interpreting
_OSC results so host_bridge->native_aer becomes the single way to determine
whether we control AER capabilities.

Signed-off-by: Bjorn Helgaas 
---
 drivers/acpi/pci_root.c | 3 +++
 drivers/pci/pcie/portdrv_core.c | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index 16ca58d58fef..f7d2eed3975c 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -939,6 +939,9 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root 
*root,
host_bridge->native_dpc = 1;
}
 
+   if (!pci_aer_available())
+   host_bridge->native_aer = 0;
+
dev_info(>device->dev, "OS native features: SHPCHotplug%c 
PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n",
FLAG(host_bridge->native_shpc_hotplug),
FLAG(host_bridge->native_pcie_hotplug),
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index ea1099908d5d..a2b8a4bc91fa 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -220,7 +220,7 @@ static int get_port_device_capability(struct pci_dev *dev)
}
 
 #ifdef CONFIG_PCIEAER
-   if (host->native_aer && dev->aer_cap && pci_aer_available()) {
+   if (host->native_aer && dev->aer_cap) {
services |= PCIE_PORT_SERVICE_AER;
 
/*
-- 
2.25.1



[PATCH v13 2/5] PCI: Assume control of portdrv-related features only when portdrv enabled

2021-01-22 Thread Kuppuswamy Sathyanarayanan
Native control of PME, AER, DPC, and PCIe hotplug depends on the portdrv,
so default to native handling of them only when CONFIG_PCIEPORTBUS is
enabled.

Native control LTR and SHPC hotplug does not depend on portdrv, so we can
always take control of them unless some platform interface, e.g., _OSC,
tells us otherwise.

[bhelgaas: commit log]
Link: 
https://lore.kernel.org/r/fcbe8a624166a1101a755edfef44a185d32ff493.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Bjorn Helgaas 
---
 drivers/pci/probe.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 953f15abc850..97498f61f5ad 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -588,12 +588,14 @@ static void pci_init_host_bridge(struct pci_host_bridge 
*bridge)
 * may implement its own AER handling and use _OSC to prevent the
 * OS from interfering.
 */
+#ifdef CONFIG_PCIEPORTBUS
bridge->native_aer = 1;
bridge->native_pcie_hotplug = 1;
-   bridge->native_shpc_hotplug = 1;
bridge->native_pme = 1;
-   bridge->native_ltr = 1;
bridge->native_dpc = 1;
+#endif
+   bridge->native_ltr = 1;
+   bridge->native_shpc_hotplug = 1;
 
device_initialize(>dev);
 }
-- 
2.25.1



Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

2021-01-08 Thread Kuppuswamy, Sathyanarayanan




On 1/8/21 2:30 PM, Bjorn Helgaas wrote:

Can we push this forward now?  There are several pending patches in
this area from Keith and Sathyanarayanan; I haven't gotten to them
yet, so not sure whether they help address any of this.


Following two patches should also address the same issue.

My patch:

https://patchwork.kernel.org/project/linux-pci/patch/6f63321637fef86b6cf0beebf98b987062f9e811.1610153755.git.sathyanarayanan.kuppusw...@linux.intel.com/

Keith's patch:

https://patchwork.kernel.org/project/linux-pci/patch/20210104230300.1277180-4-kbu...@kernel.org/



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


[PATCH v9 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2021-01-08 Thread Kuppuswamy Sathyanarayanan
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on the hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path, which in turn led to the following
issues.

1. We depend on the hotplug handler for removing the affected
devices/drivers on a DLLSC link-down event (on DPC event
trigger) and on the DPC handler for handling the error recovery. Since
both handlers operate on the same set of affected devices, this leads
to a race condition, which in turn leads to NULL pointer
exceptions or error recovery failures. You can find more details
about this issue in the following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t

2. For non-hotplug-capable devices, fatal (DPC) error recovery
is currently broken. The current fatal error recovery implementation
relies on the PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug-capable devices, the recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/

In order to fix the above two issues, we should stop relying on the
hotplug handler for cleaning up the affected devices/drivers and let the
error recovery handler own this functionality. So this patch reverts
commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") and
re-introduces the "remove affected device + rescan" functionality in the
fatal error recovery handler.

Also, holding pci_lock_rescan_remove() prevents the race between the
hotplug and DPC handlers.
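
As a rough illustration only (a simplified flow with placeholder variables,
not the actual hunks below), the fatal path now amounts to:

  pci_lock_rescan_remove();             /* serialize with the hotplug handler */
  pci_stop_and_remove_bus_device(dev);  /* remove the affected device/driver  */
  pci_reset_bus(bridge);                /* reset the upstream bridge/slot     */
  pci_rescan_bus(bridge->bus);          /* re-enumerate and re-bind drivers   */
  pci_unlock_rescan_remove();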

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
---
 Changes since v8:
  * Rebased on top of v5.11-rc1.

 Changes since v7:
  * Rebased on top of v5.10-rc1.

 Changes since v6:
  * Renamed pcie_do_nonfatal_recovery() to pcie_nonfatal_recovery().
  * Renamed pcie_do_fatal_recovery() to pcie_fatal_recovery().
  * Addressed some format issues.

 Changes since v5:
  * Fixed static/non-static declaration issue.

 Changes since v4:
  * Added new interfaces for error recovery (pcie_do_fatal_recovery()
and pcie_do_nonfatal_recovery()).
 Documentation/PCI/pci-error-recovery.rst | 47 +++--
 Documentation/PCI/pcieaer-howto.rst  |  2 +-
 drivers/pci/pci.h|  7 +--
 drivers/pci/pcie/aer.c   | 10 ++--
 drivers/pci/pcie/dpc.c   |  2 +-
 drivers/pci/pcie/edr.c   |  2 +-
 drivers/pci/pcie/err.c   | 67 +---
 7 files changed, 91 insertions(+), 46 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 84ceebb08cac..830c8af5838b 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI 
error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
 ---
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0x,
@@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
recovery
 proceeds to STEP 2 (MMIO Enable).
 
 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
-then recovery proceeds to STEP 4 (Slot Reset).
+then recovery proceeds to STEP 3 (Slot Reset).
 
 If the platform is unable to recover the slot, the next step
-is STEP 6 (Permanent Failure).
+is STEP 5 (Permanent Failure).
 
 .. note::
 
@@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
 without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+instead will have gone directly to STEP 3 (Slot Reset)
 
 .. note::
 
@@ -233,18 +233,12 @@ The driver should return one of the following result 
codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 4 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT

[PATCH v9 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2021-01-08 Thread Kuppuswamy Sathyanarayanan
Currently, if the report_error_detected() or report_mmio_enabled()
functions request PCI_ERS_RESULT_NEED_RESET, the current
pcie_do_recovery() implementation does not do the requested
explicit device reset, but instead just calls
report_slot_reset() on all affected devices. Notifying about the
reset via report_slot_reset() without doing the actual device
reset is incorrect. So call pci_bus_reset() before triggering
the ->slot_reset() callback.

Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
Reviewed-by: Ashok Raj 
---
 Changes since v7:
  * Rebased on top of v5.11-rc1.

 Changes since v7:
  * Rebased on top of v5.10-rc1.

 Changes since v6:
  * None.

 Changes since v5:
  * Added Ashok's Reviewed-by tag.

 Changes since v4:
  * Added check for pci_reset_bus() return value.

 drivers/pci/pcie/err.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 510f31f0ef6d..6c19e9948232 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -177,6 +177,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
struct pci_dev *bridge;
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+   int ret;
 
/*
 * If the error was detected by a Root Port, Downstream Port, RCEC,
@@ -214,11 +215,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
}
 
if (status == PCI_ERS_RESULT_NEED_RESET) {
-   /*
-* TODO: Should call platform-specific
-* functions to reset slot before calling
-* drivers' slot_reset callbacks?
-*/
+   ret = pci_reset_bus(bridge);
+   if (ret < 0) {
+   pci_err(dev, "Failed to reset %d\n", ret);
+   status = PCI_ERS_RESULT_DISCONNECT;
+   goto failed;
+   }
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(bridge, "broadcast slot_reset message\n");
		pci_walk_bridge(bridge, report_slot_reset, &status);
-- 
2.25.1



Re: linux-next: manual merge of the amdgpu tree with the pci tree

2020-12-14 Thread Kuppuswamy, Sathyanarayanan




On 12/14/20 3:37 PM, Bjorn Helgaas wrote:

On Mon, Dec 14, 2020 at 06:18:54PM -0500, Alex Deucher wrote:

On Mon, Dec 14, 2020 at 6:16 PM Bjorn Helgaas  wrote:

On Tue, Dec 15, 2020 at 07:34:31AM +1100, Stephen Rothwell wrote:

Hi all,

On Tue, 8 Dec 2020 13:56:20 +1100 Stephen Rothwell  
wrote:






I don't plan to merge this upstream via my tree.  I was just carrying
it in my drm-next branch because we have a number of users that depend
on it for working DPC and a number of people use this branch for
testing.


OK, thanks.  FWIW, it's currently marked "Changes Requested" in
patchwork, so it isn't really going anywhere right now:

https://patchwork.kernel.org/project/linux-pci/patch/cbba08a5e9ca62778c8937f44eda2192a2045da7.1595617529.git.sathyanarayanan.kuppusw...@linux.intel.com/

There is a newer version of this patch set. Please use it when merging this 
patch.
https://patchwork.kernel.org/project/linux-pci/list/?series=370855


I have merged Sean's series, so this would be a good time to try to
move this one forward.



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v8 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-12-12 Thread Kuppuswamy, Sathyanarayanan




On 10/26/20 12:37 PM, Kuppuswamy Sathyanarayanan wrote:

Currently if report_error_detected() or report_mmio_enabled()
functions requests PCI_ERS_RESULT_NEED_RESET, current
pcie_do_recovery() implementation does not do the requested
explicit device reset, but instead just calls the
report_slot_reset() on all affected devices. Notifying about the
reset via report_slot_reset() without doing the actual device
reset is incorrect. So call pci_bus_reset() before triggering
->slot_reset() callback.

Gentle ping! Any comments on this patch set?


Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
Reviewed-by: Ashok Raj 
---
  Changes since v7:
   * Rebased on top of v5.10-rc1.

  Changes since v6:
   * None.

  Changes since v5:
   * Added Ashok's Reviewed-by tag.

  Changes since v4:
   * Added check for pci_reset_bus() return value.

  drivers/pci/pcie/err.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..315a4d559c4c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -152,6 +152,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  {
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_bus *bus;
+   int ret;
  
  	/*

 * Error recovery runs on all subordinates of the first downstream port.
@@ -181,11 +182,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
}
  
  	if (status == PCI_ERS_RESULT_NEED_RESET) {

-   /*
-* TODO: Should call platform-specific
-* functions to reset slot before calling
-* drivers' slot_reset callbacks?
-*/
+   ret = pci_reset_bus(dev);
+   if (ret < 0) {
+   pci_err(dev, "Failed to reset %d\n", ret);
+   status = PCI_ERS_RESULT_DISCONNECT;
+   goto failed;
+   }
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(dev, "broadcast slot_reset message\n");
		pci_walk_bus(bus, report_slot_reset, &status);



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v12 0/5] Simplify PCIe native ownership

2020-12-07 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

On 11/30/20 5:11 PM, Kuppuswamy, Sathyanarayanan wrote:

Hi Bjorn,

On 11/25/20 7:48 PM, Kuppuswamy, Sathyanarayanan wrote:

Along with above patch, you also left following two cleanup patches. Is this
intentional? Following fixes have no dependency on pcie_ports_dpc_native change.

[PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC 
enable logic
[PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver


Any comment? If you want me to add these patches in my re-submission, please
let me know.

Gentle ping. Any comments?




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid

2020-12-07 Thread Kuppuswamy, Sathyanarayanan

Hi,

On 12/7/20 5:08 AM, Paul Menzel wrote:

[Bringing the issue up on the list in case the Linux Bugzilla is not 
monitored/used.]


Dear Linux folks,


On Intel Tiger Lake Dell laptop, Linux logs the error below [1].

     [    0.507307] pci 0000:00:07.0: DPC: RP PIO log size 0 is invalid
     [    0.508835] pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid

     $ lspci -nn -s 00:07
     00:07.0 PCI bridge [0604]: Intel Corporation Tiger Lake-LP Thunderbolt PCI Express Root Port #0 
[8086:9a23] (rev 01)
     00:07.2 PCI bridge [0604]: Intel Corporation Tiger Lake-LP Thunderbolt PCI Express Root Port #2 
[8086:9a27] (rev 01)


Commit 2700561817 (PCI/DPC: Cache DPC capabilities in pci_init_capabilities()) [2] probably 
introduced it in Linux 5.7.


What does this error actually mean?

 pdev->dpc_rp_log_size = (cap & PCI_EXP_DPC_RP_PIO_LOG_SIZE) >> 8;
 if (pdev->dpc_rp_log_size < 4 || pdev->dpc_rp_log_size > 9) {
     pci_err(pdev, "RP PIO log size %u is invalid\n",
     pdev->dpc_rp_log_size);
     pdev->dpc_rp_log_size = 0;

As per PCIe spec r5.0, sec 7.9.15.2, a valid RP PIO log size is 4 or greater.
Please see the text copied from the spec below.

- - - -
RP PIO Log Size - This field indicates how many DWORDs are allocated for the RP
PIO log registers, comprised by the RP PIO Header Log, the RP PIO ImpSpec Log,
and RP PIO TLP Prefix Log. If the Root Port supports RP Extensions for DPC, the
value of this field must be 4 or greater; otherwise, the value of
this field must be 0. See Section 7.9.15.11 , Section 7.9.15.12 , and Section 
7.9.15.13 .
- - - -

In this case, since "(!(cap & PCI_EXP_DPC_CAP_RP_EXT))" condition is false, RP
EXT is supported. If RP EXT is supported, valid log size should be at-least 4.



 }

(I guess `cap & PCI_EXP_DPC_RP_PIO_LOG_SIZE` is zero too?)

Is it a firmware issue or a hardware issue?

I think this could be a hardware issue.



Kind regards,

Paul


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=209943
  "pci :00:07.0: DPC: RP PIO log size 0 is invalid"
[2]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=27005618178ef9e9bf9c42fd91101771c92e9308 



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v12 0/5] Simplify PCIe native ownership

2020-11-30 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

On 11/25/20 7:48 PM, Kuppuswamy, Sathyanarayanan wrote:

Along with above patch, you also left following two cleanup patches. Is this
intentional? Following fixes have no dependency on pcie_ports_dpc_native change.

[PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC 
enable logic
[PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver


Any comments? If you want me to add these patches to my re-submission, please
let me know.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v12 10/15] PCI/ERR: Limit AER resets in pcie_do_recovery()

2020-11-30 Thread Kuppuswamy, Sathyanarayanan



On 11/30/20 4:25 PM, Bjorn Helgaas wrote:

I think EDR is the same as DPC?

Yes, EDR is the same as DPC.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability

2020-11-28 Thread Kuppuswamy, Sathyanarayanan




On 11/28/20 3:25 PM, Bjorn Helgaas wrote:

On Sat, Nov 28, 2020 at 01:56:23PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/28/20 1:53 PM, Bjorn Helgaas wrote:

On Sat, Nov 28, 2020 at 01:49:46PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/28/20 12:24 PM, Bjorn Helgaas wrote:

On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Downstream Ports may support DPC regardless of whether they support AER
(see PCIe r5.0, sec 6.2.10.2).  Previously, if the user booted with
"pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if
the device had no AER Capability, but dpc_get_aer_uncorrect_severity()
depends on the AER Capability.

dpc_probe() previously failed if:

  !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native
  !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law

so it succeeded if:

  pcie_aer_is_native() || pcie_ports_dpc_native

Fail dpc_probe() if the device has no AER Capability.

Signed-off-by: Bjorn Helgaas 
Cc: Olof Johansson 
---
 drivers/pci/pcie/dpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba86a317..ed0dbc43d018 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
+   if (!pdev->aer_cap)
+   return -ENOTSUPP;

Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ?

We don't enable DPC service, if AER service is not enabled. And AER
service is only enabled if AER capability is supported.

So dpc_probe() should not happen if AER capability is not supported?


I don't think that's always true.  If I'm reading this right, we have
this:

 get_port_device_capability(...)
 {
 #ifdef CONFIG_PCIEAER
   if (dev->aer_cap && ...)
 services |= PCIE_PORT_SERVICE_AER;
 #endif

   if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
   pci_aer_available() &&
   (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
 services |= PCIE_PORT_SERVICE_DPC;
 }

and in the case where:

 - CONFIG_PCIEAER=y
 - booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true)
 - "dev" has no AER capability
 - "dev" has DPC capability

I think we do enable PCIE_PORT_SERVICE_DPC.

Got it. But further looking into it, I am wondering whether
we should keep this dependency? Currently we just use it to
dump the error information. Do we need to create dependency
between DPC and AER (which is functionality not dependent) just
to see more details about the error?


That's a good question, but I don't really want to get into the actual
operation of the AER and DPC drivers in this series, so maybe
something we should explore later.



In that case, can you move this check to
drivers/pci/pcie/portdrv_core.c?  I don't see the point of
distributed checks in both get_port_device_capability() and
dpc_probe().


I totally agree that these distributed checks are terrible, but my
long-term hope is to get rid of portdrv and handle these "services"
more like we handle other capabilities.  For example, maybe we can
squash dpc_probe() into pci_dpc_init(), so I'd actually like to move
things from get_port_device_capability() into dpc_probe().

Removing the service driver model will be a major overhaul. It would
affect even the error recovery drivers. You can find motivation
for service drivers in Documentation/PCI/pciebus-howto.rst.

But till we fix this part, I recommend grouping all dependency checks
to one place (either dpc_probe() or portdrv service driver).




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability

2020-11-28 Thread Kuppuswamy, Sathyanarayanan




On 11/28/20 12:24 PM, Bjorn Helgaas wrote:

On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Downstream Ports may support DPC regardless of whether they support AER
(see PCIe r5.0, sec 6.2.10.2).  Previously, if the user booted with
"pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if
the device had no AER Capability, but dpc_get_aer_uncorrect_severity()
depends on the AER Capability.

dpc_probe() previously failed if:

!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native
!(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law

so it succeeded if:

pcie_aer_is_native() || pcie_ports_dpc_native

Fail dpc_probe() if the device has no AER Capability.

Signed-off-by: Bjorn Helgaas 
Cc: Olof Johansson 
---
   drivers/pci/pcie/dpc.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba86a317..ed0dbc43d018 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
+   if (!pdev->aer_cap)
+   return -ENOTSUPP;

Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ?

We don't enable DPC service, if AER service is not enabled. And AER
service is only enabled if AER capability is supported.

So dpc_probe() should not happen if AER capability is not supported?


I don't think that's always true.  If I'm reading this right, we have
this:

   get_port_device_capability(...)
   {
   #ifdef CONFIG_PCIEAER
 if (dev->aer_cap && ...)
   services |= PCIE_PORT_SERVICE_AER;
   #endif

 if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
 pci_aer_available() &&
 (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
   services |= PCIE_PORT_SERVICE_DPC;
   }

and in the case where:

   - CONFIG_PCIEAER=y
   - booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true)
   - "dev" has no AER capability
   - "dev" has DPC capability

I think we do enable PCIE_PORT_SERVICE_DPC.

Got it. But looking into it further, I am wondering whether
we should keep this dependency. Currently we just use it to
dump the error information. Do we need to create a dependency
between DPC and AER (which are not functionally dependent) just
to see more details about the error?



206 static int get_port_device_capability(struct pci_dev *dev)
...
...
251 if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
252 host->native_dpc &&
253 (host->native_dpc || (services & PCIE_PORT_SERVICE_AER)))
254 services |= PCIE_PORT_SERVICE_DPC;
255


+
if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native)
return -ENOTSUPP;



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability

2020-11-28 Thread Kuppuswamy, Sathyanarayanan




On 11/28/20 1:53 PM, Bjorn Helgaas wrote:

On Sat, Nov 28, 2020 at 01:49:46PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/28/20 12:24 PM, Bjorn Helgaas wrote:

On Wed, Nov 25, 2020 at 06:01:57PM -0800, Kuppuswamy, Sathyanarayanan wrote:

On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Downstream Ports may support DPC regardless of whether they support AER
(see PCIe r5.0, sec 6.2.10.2).  Previously, if the user booted with
"pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if
the device had no AER Capability, but dpc_get_aer_uncorrect_severity()
depends on the AER Capability.

dpc_probe() previously failed if:

 !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native
 !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law

so it succeeded if:

 pcie_aer_is_native() || pcie_ports_dpc_native

Fail dpc_probe() if the device has no AER Capability.

Signed-off-by: Bjorn Helgaas 
Cc: Olof Johansson 
---
drivers/pci/pcie/dpc.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba86a317..ed0dbc43d018 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
+   if (!pdev->aer_cap)
+   return -ENOTSUPP;

Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c ?

We don't enable DPC service, if AER service is not enabled. And AER
service is only enabled if AER capability is supported.

So dpc_probe() should not happen if AER capability is not supported?


I don't think that's always true.  If I'm reading this right, we have
this:

get_port_device_capability(...)
{
#ifdef CONFIG_PCIEAER
  if (dev->aer_cap && ...)
services |= PCIE_PORT_SERVICE_AER;
#endif

  if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
  pci_aer_available() &&
  (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
services |= PCIE_PORT_SERVICE_DPC;
}

and in the case where:

- CONFIG_PCIEAER=y
- booted with "pcie_ports=dpc-native" (pcie_ports_dpc_native is true)
- "dev" has no AER capability
- "dev" has DPC capability

I think we do enable PCIE_PORT_SERVICE_DPC.

Got it. But further looking into it, I am wondering whether
we should keep this dependency? Currently we just use it to
dump the error information. Do we need to create dependency
between DPC and AER (which is functionality not dependent) just
to see more details about the error?


That's a good question, but I don't really want to get into the actual
operation of the AER and DPC drivers in this series, so maybe
something we should explore later.

In that case, can you move this check to drivers/pci/pcie/portdrv_core.c?
I don't see the point of distributed checks in both get_port_device_capability()
and dpc_probe().




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v12 0/5] Simplify PCIe native ownership

2020-11-25 Thread Kuppuswamy, Sathyanarayanan




On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

This is Sathy's work with a few tweaks on top.

I dropped the DPC pcie_ports_dpc_native changes for now just because I
haven't had time to understand it all.  We currently ignore the
OSC_PCI_EXPRESS_DPC_CONTROL bit, which seems wrong.  We might want to start
looking at it, but we should try to make that a separate patch that's as
small as possible.

Along with the above patch, you also left out the following two cleanup patches. Is this
intentional? The following fixes have no dependency on the pcie_ports_dpc_native change.

[PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC 
enable logic
[PATCH v11 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver


Changes since v11:
  * Add bugfix for DPC with no AER Capability
  * Split OSC_OWNER trivial changes from pcie_ports_native changes
  * Temporarily drop pcie_ports_dpc_native changes

v11 posting: 
https://lore.kernel.org/r/cover.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com

Bjorn Helgaas (2):
   PCI/DPC: Ignore devices with no AER Capability
   PCI/ACPI: Centralize pci_aer_available() checking

Kuppuswamy Sathyanarayanan (3):
   PCI: Assume control of portdrv-related features only when portdrv
 enabled
   PCI/ACPI: Tidy _OSC control bit checking
   PCI/ACPI: Centralize pcie_ports_native checking

  drivers/acpi/pci_root.c   | 49 ---
  drivers/pci/hotplug/pciehp_core.c |  2 +-
  drivers/pci/pci-acpi.c|  3 --
  drivers/pci/pcie/aer.c|  2 +-
  drivers/pci/pcie/dpc.c|  3 ++
  drivers/pci/pcie/portdrv_core.c   |  9 ++
  drivers/pci/probe.c   |  6 ++--
  7 files changed, 50 insertions(+), 24 deletions(-)



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 4/5] PCI/ACPI: Centralize pcie_ports_native checking

2020-11-25 Thread Kuppuswamy, Sathyanarayanan




On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Kuppuswamy Sathyanarayanan 

If the user booted with "pcie_ports=native", we take control of the PCIe
features unconditionally, regardless of what _OSC says.

Centralize the testing of pcie_ports_native in acpi_pci_root_create(),
where we interpret the _OSC results, so other places only have to check
host_bridge->native_X and we don't have to sprinkle tests of
pcie_ports_native everywhere.

[bhelgaas: commit log, rework OSC_PCI_EXPRESS_CONTROL_MASKS, logging]
Link: 
https://lore.kernel.org/r/bc87c9e675118960949043a832bed86bc22becbd.1603766889.git.sathyanarayanan.kuppusw...@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan 

Signed-off-by: Bjorn Helgaas 
---
  drivers/acpi/pci_root.c   | 19 +++
  drivers/pci/hotplug/pciehp_core.c |  2 +-
  drivers/pci/pci-acpi.c|  3 ---
  drivers/pci/pcie/aer.c|  2 +-
  drivers/pci/pcie/portdrv_core.c   |  9 +++--
  5 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index 6db071038fd5..36142ed7b8f8 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -882,6 +882,8 @@ static void acpi_pci_root_release_info(struct 
pci_host_bridge *bridge)
flag = 0;   \
} while (0)
  
+#define FLAG(x)		((x) ? '+' : '-')

+
  struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root,
 struct acpi_pci_root_ops *ops,
 struct acpi_pci_root_info *info,
@@ -930,6 +932,23 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root 
*root,
OSC_OWNER(ctrl, OSC_PCI_EXPRESS_LTR_CONTROL, host_bridge->native_ltr);
OSC_OWNER(ctrl, OSC_PCI_EXPRESS_DPC_CONTROL, host_bridge->native_dpc);
  
+	if (pcie_ports_native) {

+   dev_info(>device->dev, "Taking control of PCIe-related features because 
\"pcie_ports=native\" specified; may conflict with firmware\n");
+   host_bridge->native_pcie_hotplug = 1;
+   host_bridge->native_aer = 1;
+   host_bridge->native_pme = 1;
+   host_bridge->native_ltr = 1;
+   host_bridge->native_dpc = 1;
+   }

Won't it be better if this is merged with the above OSC_OWNER() calls? If
pcie_ports_native is set, then the above OSC_OWNER() calls for the PCIe-related
features are redundant. Let me know.

+
+   dev_info(>device->dev, "OS native features: SHPCHotplug%c 
PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n",
+   FLAG(host_bridge->native_shpc_hotplug),
+   FLAG(host_bridge->native_pcie_hotplug),
+   FLAG(host_bridge->native_pme),
+   FLAG(host_bridge->native_aer),
+   FLAG(host_bridge->native_dpc),
+   FLAG(host_bridge->native_ltr));
+
/*
 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it
 * exists and returns 0, we must preserve any PCI resource
diff --git a/drivers/pci/hotplug/pciehp_core.c 
b/drivers/pci/hotplug/pciehp_core.c
index ad3393930ecb..d1831e6bf60a 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -256,7 +256,7 @@ static bool pme_is_native(struct pcie_device *dev)
const struct pci_host_bridge *host;
  
  	host = pci_find_host_bridge(dev->port->bus);

-   return pcie_ports_native || host->native_pme;
+   return host->native_pme;
  }
  
  static void pciehp_disable_interrupt(struct pcie_device *dev)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index bf03648c2072..a84f75ec6df8 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -800,9 +800,6 @@ bool pciehp_is_native(struct pci_dev *bridge)
if (!(slot_cap & PCI_EXP_SLTCAP_HPC))
return false;
  
-	if (pcie_ports_native)

-   return true;
-
host = pci_find_host_bridge(bridge->bus);
return host->native_pcie_hotplug;
  }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 65dff5f3457a..79bb441139c2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -219,7 +219,7 @@ int pcie_aer_is_native(struct pci_dev *dev)
if (!dev->aer_cap)
return 0;
  
-	return pcie_ports_native || host->native_aer;

+   return host->native_aer;
  }
  
  int pci_enable_pcie_error_reporting(struct pci_dev *dev)

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 50a9522ab07d..2a1190e8db60 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev)
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
int services = 0;
  
-	if (dev->is_hotplug_bridge &&

-   (pcie_

Re: [PATCH 1/5] PCI/DPC: Ignore devices with no AER Capability

2020-11-25 Thread Kuppuswamy, Sathyanarayanan




On 11/25/20 5:18 PM, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Downstream Ports may support DPC regardless of whether they support AER
(see PCIe r5.0, sec 6.2.10.2).  Previously, if the user booted with
"pcie_ports=dpc-native", it was possible for dpc_probe() to succeed even if
the device had no AER Capability, but dpc_get_aer_uncorrect_severity()
depends on the AER Capability.

dpc_probe() previously failed if:

   !pcie_aer_is_native(pdev) && !pcie_ports_dpc_native
   !(pcie_aer_is_native() || pcie_ports_dpc_native)# by De Morgan's law

so it succeeded if:

   pcie_aer_is_native() || pcie_ports_dpc_native

Fail dpc_probe() if the device has no AER Capability.

Signed-off-by: Bjorn Helgaas 
Cc: Olof Johansson 
---
  drivers/pci/pcie/dpc.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e05aba86a317..ed0dbc43d018 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,9 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
  
+	if (!pdev->aer_cap)

+   return -ENOTSUPP;

Don't we check aer_cap support in drivers/pci/pcie/portdrv_core.c?

We don't enable the DPC service if the AER service is not enabled. And the AER
service is only enabled if the AER capability is supported.

So dpc_probe() should not happen if the AER capability is not supported?

206 static int get_port_device_capability(struct pci_dev *dev)
...
...
251 if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
252 host->native_dpc &&
253 (host->native_dpc || (services & PCIE_PORT_SERVICE_AER)))
254 services |= PCIE_PORT_SERVICE_DPC;
255


+
if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native)
return -ENOTSUPP;
  



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.

2020-11-25 Thread Kuppuswamy, Sathyanarayanan




On 11/25/20 12:28 PM, Bjorn Helgaas wrote:

On Mon, Oct 26, 2020 at 07:57:05PM -0700, Kuppuswamy Sathyanarayanan wrote:

pcie_ports_native is set only if user requests native handling
of PCIe capabilities via pcie_port_setup command line option.
User input takes precedence over _OSC based control negotiation
result. So consider the _OSC negotiated result only if
pcie_ports_native is unset.

Also, since struct pci_host_bridge ->native_* members caches the
ownership status of various PCIe capabilities, use them instead
of distributed checks for pcie_ports_native.



diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 50a9522ab07d..ccd5e0ce5605 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -208,8 +208,7 @@ static int get_port_device_capability(struct pci_dev *dev)
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
int services = 0;
  
-	if (dev->is_hotplug_bridge &&

-   (pcie_ports_native || host->native_pcie_hotplug)) {
+   if (dev->is_hotplug_bridge && host->native_pcie_hotplug) {


This is a nit, but I think this and similar checks should be reordered
so we do the most generic test first, i.e.,

   if (host->native_pcie_hotplug && dev->is_hotplug_bridge)

ok. I can add this fix on top of your update.


Logically there's no point in looking at the device things if we
don't have native control.


services |= PCIE_PORT_SERVICE_HP;
  
  		/*

@@ -221,8 +220,7 @@ static int get_port_device_capability(struct pci_dev *dev)
}
  
  #ifdef CONFIG_PCIEAER

-   if (dev->aer_cap && pci_aer_available() &&
-   (pcie_ports_native || host->native_aer)) {
+   if (dev->aer_cap && pci_aer_available() && host->native_aer) {


Can't we clear host->native_aer when pci_aer_available() returns
false?  I'd like to have all the checks about whether we have native
control to be in one place instead of being scattered.  Something like
this above:

   OSC_OWNER(ctrl, OSC_PCI_EXPRESS_AER_CONTROL, host_bridge->native_aer);
   if (!pci_aer_available())
 host_bridge->native_aer = 0;

So this test would become:

   if (host->native_aer && dev->aer_cap)

We already use it in pci_root.c to confirm AER availability before
we request control over it via _OSC. So if pci_aer_available() is false,
the OSC_*AER_CONTROL bit in ctrl_set will be left zero (which means
->native_aer will also be zero).

490 if (pci_aer_available())
491 control |= OSC_PCI_EXPRESS_AER_CONTROL;

So maybe we don't need the pci_aer_available() check here.




services |= PCIE_PORT_SERVICE_AER;
  
  		/*

@@ -238,8 +236,7 @@ static int get_port_device_capability(struct pci_dev *dev)
 * Event Collectors can also generate PMEs, but we don't handle
 * those yet.
 */
-   if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT &&
-   (pcie_ports_native || host->native_pme)) {
+   if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT && host->native_pme) {
services |= PCIE_PORT_SERVICE_PME;


Also here:

   if (host->native_pme && pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT)



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.

2020-11-25 Thread Kuppuswamy, Sathyanarayanan




On 11/25/20 2:25 PM, Bjorn Helgaas wrote:

I've been fiddling with this, so let me post a v12 tonight and you can
see what you think.

Ok. I will wait for your update.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 2/5] ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.

2020-11-25 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

Thanks for the review.

On 11/25/20 12:12 PM, Bjorn Helgaas wrote:

On Mon, Oct 26, 2020 at 07:57:05PM -0700, Kuppuswamy Sathyanarayanan wrote:

pcie_ports_native is set only if user requests native handling
of PCIe capabilities via pcie_port_setup command line option.
User input takes precedence over _OSC based control negotiation
result. So consider the _OSC negotiated result only if
pcie_ports_native is unset.

Also, since struct pci_host_bridge ->native_* members caches the
ownership status of various PCIe capabilities, use them instead
of distributed checks for pcie_ports_native.

Signed-off-by: Kuppuswamy Sathyanarayanan 

---
  drivers/acpi/pci_root.c   | 35 ++-
  drivers/pci/hotplug/pciehp_core.c |  2 +-
  drivers/pci/pci-acpi.c|  3 ---
  drivers/pci/pcie/aer.c|  2 +-
  drivers/pci/pcie/portdrv_core.c   |  9 +++-
  include/linux/acpi.h  |  2 ++
  6 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index c12b5fb3e8fb..a9e6b782622d 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -41,6 +41,12 @@ static int acpi_pci_root_scan_dependent(struct acpi_device 
*adev)
| OSC_PCI_CLOCK_PM_SUPPORT \
| OSC_PCI_MSI_SUPPORT)



+
+   if (pcie_ports_native) {
+   decode_osc_control(root, "OS forcibly taking over",
+  OSC_PCI_EXPRESS_CONTROL_MASKS);


The only place OSC_PCI_EXPRESS_CONTROL_MASKS is used is right here, so
it's kind of pointless.

I think I'd rather have this:

   dev_info(>device->dev, "Ignoring PCIe-related _OSC results because 
\"pcie_ports=native\" specified\n");

I was trying to keep the same print format. In pci_root.c,
decode_osc_control() is repeatedly used to print info related to
PCIe capability ownership.

But either way is fine with me. I can use the format you mentioned.



followed by something like this after we're done fiddling with all the
host_bridge->native* bits:




   #define FLAG(x) ((x) ? '+' : '-')

   dev_info(>device->dev, "OS native features: SHPCHotplug%c PCIeCapability%c 
PCIeHotplug%c PME%c AER%c DPC%c LTR%c\n",
FLAG(host_bridge->native_shpc_hotplug),
   ?,
FLAG(host_bridge->native_pcie_hotplug),
   ...);

But I don't know how to handle OSC_PCI_EXPRESS_CAPABILITY_CONTROL
since we don't track it the same way.  Maybe we'd have to omit it from
this message for now?

I will add it in the next version. But for now, let's not worry about
OSC_PCI_EXPRESS_CAPABILITY_CONTROL.





/*
 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it



--
2.17.1



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v7 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-11-24 Thread Kuppuswamy, Sathyanarayanan

Hi Guilherme,

On 11/24/20 10:45 AM, Guilherme G. Piccoli wrote:

Hi Kuppuswamy Sathyanarayanan (and all involved here), thanks for the
patch! I'd like to ask what is the status of this patchset - I just
"parachuted" in the issue, and by tracking the linux-pci ML, I found
this V7 (and all previous versions since V2). Also, noticed that Jay's
email might have gotten lost in translation (he's not CCed in latest
versions of the patchset).

I was able to find even another interesting thread that might be
related, Ethan's patchset. So, if any of the developers can clarify the
current status of this patchset or if the functionality hereby proposed
ended-up being implemented in another patch, I appreciate a lot.

Thanks for bringing this up. It's waiting for Bjorn's comments/approval.

Bjorn, any comments? Some of our customers are also looking for a fix for
this issue. Please let me know.


Thanks in advance! Below, some references to lore archives.
Cheers,








--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.

2020-11-20 Thread Kuppuswamy, Sathyanarayanan

Hi,

On 11/19/20 2:08 PM, Raj, Ashok wrote:

If an Attention Button is present, the current behavior is to bring up
the hotplug slot as soon as presence or link is detected.  We don't wait
for a button press.  This is intended as a convience feature to bring up
slots as quickly as possible, but the behavior can be changed if the
button press and 5 second delay is preferred.

No personal preference, I thought that is how the code in Linux was
earlier.

Looks like we still don't subscribe to PDC if ATTN is present? So you don't
get an event until someone presses the button to process hotplug right?

pciehp_hpc.c:pcie_enable_notification()
{


/*
 * Always enable link events: thus link-up and link-down shall
  * always be treated as hotplug and unplug respectively. Enable
  * presence detect only if Attention Button is not present.
  */
 cmd = PCI_EXP_SLTCTL_DLLSCE;
 if (ATTN_BUTTN(ctrl))
 cmd |= PCI_EXP_SLTCTL_ABPE;
 else
 cmd |= PCI_EXP_SLTCTL_PDCE;

}


Looks like we still wait for button press to process. When you don't have a
power control yes the DLLSC would kick in and we would process them. but if
you have power control, you won't turn on until the button press?


Yes, as per the current logic, if an Attention Button is present, we don't
subscribe to PDC event changes; we wait for a button press to turn the slot
on/off.
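
For reference, the alternative being discussed would change
pcie_enable_notification() roughly like this (a sketch only, not
something this patch does):

	/*
	 * Sketch: keep presence-detect events enabled even when an
	 * Attention Button is present, instead of treating them as
	 * mutually exclusive.
	 */
	cmd = PCI_EXP_SLTCTL_DLLSCE | PCI_EXP_SLTCTL_PDCE;
	if (ATTN_BUTTN(ctrl))
		cmd |= PCI_EXP_SLTCTL_ABPE;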

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



Re: [PATCH v11 14/16] PCI/AER: Add pcie_walk_rcec() to RCEC AER handling

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

Root Complex Event Collectors (RCEC) appear as peers to Root Ports and also
have the AER capability. In addition, actions need to be taken for
associated RCiEPs. In such cases the RCECs will need to be walked in order
to find and act upon their respective RCiEPs.

Extend the existing ability to link the RCECs with a walking function
pcie_walk_rcec(). Add RCEC support to the current AER service driver and
attach the AER service driver to the RCEC device.


Reviewed-by: Kuppuswamy Sathyanarayanan 


[bhelgaas: kernel doc, whitespace cleanup]
Co-developed-by: Qiuxu Zhuo 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-13-seanvk@oregontracks.org
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Jonathan Cameron 
---
  drivers/pci/pci.h   |  6 ++
  drivers/pci/pcie/aer.c  | 29 ++---
  drivers/pci/pcie/rcec.c | 37 +
  3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index ae2ee4df1cff..e988cc930d0b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -473,10 +473,16 @@ static inline void pci_dpc_init(struct pci_dev *pdev) {}
  void pci_rcec_init(struct pci_dev *dev);
  void pci_rcec_exit(struct pci_dev *dev);
  void pcie_link_rcec(struct pci_dev *rcec);
+void pcie_walk_rcec(struct pci_dev *rcec,
+   int (*cb)(struct pci_dev *, void *),
+   void *userdata);
  #else
  static inline void pci_rcec_init(struct pci_dev *dev) {}
  static inline void pci_rcec_exit(struct pci_dev *dev) {}
  static inline void pcie_link_rcec(struct pci_dev *rcec) {}
+static inline void pcie_walk_rcec(struct pci_dev *rcec,
+ int (*cb)(struct pci_dev *, void *),
+ void *userdata) {}
  #endif
  
  #ifdef CONFIG_PCI_ATS

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 2869212af8b4..72723a1b67af 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -300,7 +300,8 @@ int pci_aer_raw_clear_status(struct pci_dev *dev)
return -EIO;
  
  	port_type = pci_pcie_type(dev);

-   if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
+   if (port_type == PCI_EXP_TYPE_ROOT_PORT ||
+   port_type == PCI_EXP_TYPE_RC_EC) {
		pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, &status);
pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, status);
}
@@ -595,7 +596,8 @@ static umode_t aer_stats_attrs_are_visible(struct kobject 
*kobj,
	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
	     a == &dev_attr_aer_rootport_total_err_fatal.attr ||
	     a == &dev_attr_aer_rootport_total_err_nonfatal.attr) &&
-   pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT)
+   ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+(pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
return 0;
  
  	return a->mode;

@@ -916,7 +918,10 @@ static bool find_source_device(struct pci_dev *parent,
if (result)
return true;
  
-	pci_walk_bus(parent->subordinate, find_device_iter, e_info);

+   if (pci_pcie_type(parent) == PCI_EXP_TYPE_RC_EC)
+   pcie_walk_rcec(parent, find_device_iter, e_info);
+   else
+   pci_walk_bus(parent->subordinate, find_device_iter, e_info);
  
  	if (!e_info->error_dev_num) {

pci_info(parent, "can't find device of ID%04x\n", e_info->id);
@@ -1053,6 +1058,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct 
aer_err_info *info)
if (!(info->status & ~info->mask))
return 0;
} else if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+  pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC ||
   pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
   info->severity == AER_NONFATAL) {
  
@@ -1205,6 +1211,7 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)

int type = pci_pcie_type(dev);
  
  	if ((type == PCI_EXP_TYPE_ROOT_PORT) ||

+   (type == PCI_EXP_TYPE_RC_EC) ||
(type == PCI_EXP_TYPE_UPSTREAM) ||
(type == PCI_EXP_TYPE_DOWNSTREAM)) {
if (enable)
@@ -1229,9 +1236,12 @@ static void 
set_downstream_devices_error_reporting(struct pci_dev *dev,
  {
	set_device_error_reporting(dev, &enable);
  
-	if (!dev->subordinate)

-   return;
-	pci_walk_bus(dev->subordinate, set_device_error_reporting, &enable);
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
+		pcie_walk_rcec(dev, set_device_error_reporting, &enable);
+	else if (dev->subordinate)
+		pci_walk_bus(dev->subordinate, set_device_error_reporting,
+			     &enable);
+
  }
  
  /**

@@ -1329,6 +1339,

Re: [PATCH v11 15/16] PCI/PME: Add pcie_walk_rcec() to RCEC PME handling

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

Root Complex Event Collectors (RCEC) appear as peers of Root Ports and also
have the PME capability. As with AER, there is a need to be able to walk
the RCiEPs associated with their RCEC for purposes of acting upon them with
callbacks.

Add RCEC support through the use of pcie_walk_rcec() to the current PME
service driver and attach the PME service driver to the RCEC device.

Co-developed-by: Qiuxu Zhuo 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-14-seanvk@oregontracks.org
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
---
  drivers/pci/pcie/pme.c  | 15 +++
  drivers/pci/pcie/portdrv_core.c |  9 +++--
  2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c
index 6a32970bb731..87799166c96a 100644
--- a/drivers/pci/pcie/pme.c
+++ b/drivers/pci/pcie/pme.c
@@ -310,7 +310,10 @@ static int pcie_pme_can_wakeup(struct pci_dev *dev, void 
*ign)
  static void pcie_pme_mark_devices(struct pci_dev *port)
  {
pcie_pme_can_wakeup(port, NULL);
-   if (port->subordinate)
+
+   if (pci_pcie_type(port) == PCI_EXP_TYPE_RC_EC)
+   pcie_walk_rcec(port, pcie_pme_can_wakeup, NULL);
+   else if (port->subordinate)
pci_walk_bus(port->subordinate, pcie_pme_can_wakeup, NULL);
  }
  
@@ -320,10 +323,15 @@ static void pcie_pme_mark_devices(struct pci_dev *port)

   */
  static int pcie_pme_probe(struct pcie_device *srv)
  {
-   struct pci_dev *port;
+   struct pci_dev *port = srv->port;
struct pcie_pme_service_data *data;
int ret;
  
+	/* Limit to Root Ports or Root Complex Event Collectors */

+   if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) &&
+   (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))

Maybe you can save the value of pci_pcie_type(port) in a local variable and
reuse it?
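
Something like this (just an illustrative sketch of the suggestion, not
an actual patch):

	int type = pci_pcie_type(port);

	/* Limit to Root Ports or Root Complex Event Collectors */
	if (type != PCI_EXP_TYPE_RC_EC && type != PCI_EXP_TYPE_ROOT_PORT)
		return -ENODEV;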

+   return -ENODEV;
+
data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return -ENOMEM;
@@ -333,7 +341,6 @@ static int pcie_pme_probe(struct pcie_device *srv)
data->srv = srv;
set_service_data(srv, data);
  
-	port = srv->port;

pcie_pme_interrupt_enable(port, false);
pcie_clear_root_pme_status(port);
  
@@ -445,7 +452,7 @@ static void pcie_pme_remove(struct pcie_device *srv)
  
  static struct pcie_port_service_driver pcie_pme_driver = {

.name   = "pcie_pme",
-   .port_type  = PCI_EXP_TYPE_ROOT_PORT,
+   .port_type  = PCIE_ANY_PORT,
.service= PCIE_PORT_SERVICE_PME,
  
  	.probe		= pcie_pme_probe,

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 50a9522ab07d..e1fed6649c41 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -233,12 +233,9 @@ static int get_port_device_capability(struct pci_dev *dev)
}
  #endif
  
-	/*

-* Root ports are capable of generating PME too.  Root Complex
-* Event Collectors can also generate PMEs, but we don't handle
-* those yet.
-*/
-   if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT &&
+   /* Root Ports and Root Complex Event Collectors may generate PMEs */
+   if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC) &&
(pcie_ports_native || host->native_pme)) {
services |= PCIE_PORT_SERVICE_PME;
  



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 16/16] PCI/AER: Add RCEC AER error injection support

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

From: Qiuxu Zhuo 

Root Complex Event Collectors (RCEC) appear as peers to Root Ports and may
also have the AER capability.

Add RCEC support to the AER error injection driver.

Reviewed-by: Kuppuswamy Sathyanarayanan 



Co-developed-by: Sean V Kelley 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-15-seanvk@oregontracks.org
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
---
  drivers/pci/pcie/aer_inject.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer_inject.c b/drivers/pci/pcie/aer_inject.c
index c2cbf425afc5..767f8859b99b 100644
--- a/drivers/pci/pcie/aer_inject.c
+++ b/drivers/pci/pcie/aer_inject.c
@@ -333,8 +333,11 @@ static int aer_inject(struct aer_error_inj *einj)
if (!dev)
return -ENODEV;
rpdev = pcie_find_root_port(dev);
+   /* If Root Port not found, try to find an RCEC */
+   if (!rpdev)
+   rpdev = dev->rcec;
if (!rpdev) {
-   pci_err(dev, "Root port not found\n");
+   pci_err(dev, "Neither Root Port nor RCEC found\n");
ret = -ENODEV;
goto out_put;
}



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 09/16] PCI/ERR: Avoid negated conditional for clarity

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

Reverse the sense of the Root Port/Downstream Port conditional for clarity.
No functional change intended.

[bhelgaas: split to separate patch]
Link: 
https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Acked-by: Jonathan Cameron 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/pci/pcie/err.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 46a5b84f8842..931e75f2549d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -159,11 +159,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 * Error recovery runs on all subordinates of the bridge.  If the
 * bridge detected the error, it is cleared at the end.
 */
-   if (!(type == PCI_EXP_TYPE_ROOT_PORT ||
- type == PCI_EXP_TYPE_DOWNSTREAM))
-   bridge = pci_upstream_bridge(dev);
-   else
+   if (type == PCI_EXP_TYPE_ROOT_PORT ||
+   type == PCI_EXP_TYPE_DOWNSTREAM)
bridge = dev;
+   else
+   bridge = pci_upstream_bridge(dev);
  
  	bus = bridge->subordinate;

pci_dbg(bridge, "broadcast error_detected message\n");



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 08/16] PCI/ERR: Use "bridge" for clarity in pcie_do_recovery()

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

pcie_do_recovery() may be called with "dev" being either a bridge (Root
Port or Switch Downstream Port) or an Endpoint.  The bulk of the function
deals with the bridge, so if we start with an Endpoint, we reset "dev" to
be the bridge leading to it.

For clarity, replace "dev" in the body of the function with "bridge".  No
functional change intended.

[bhelgaas: commit log, split pieces out so this is pure rename, also
replace "dev" with "bridge" in messages and pci_uevent_ers()]
Suggested-by: Bjorn Helgaas 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Acked-by: Jonathan Cameron 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/pci/pcie/err.c | 37 -
  1 file changed, 20 insertions(+), 17 deletions(-)




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 07/16] PCI/ERR: Simplify by computing pci_pcie_type() once

2020-11-17 Thread Kuppuswamy, Sathyanarayanan

Hi,

On 11/17/20 11:19 AM, Sean V Kelley wrote:

Instead of calling pci_pcie_type(dev) twice, call it once and save the
result.  No functional change intended.


The same optimization can be applied to drivers/pci/pcie/portdrv_pci.c
and drivers/pci/pcie/aer.c.

Can you fix them together?
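
For example, in pcie_portdrv_probe() it could look roughly like this
(just a sketch of the idea, assuming the RC_EC case from patch 03/16 is
already in place):

	int type = pci_pcie_type(dev);

	if (!pci_is_pcie(dev) ||
	    (type != PCI_EXP_TYPE_ROOT_PORT &&
	     type != PCI_EXP_TYPE_UPSTREAM &&
	     type != PCI_EXP_TYPE_DOWNSTREAM &&
	     type != PCI_EXP_TYPE_RC_EC))
		return -ENODEV;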



[bhelgaas: split to separate patch]
Link: 
https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Acked-by: Jonathan Cameron 
---
  drivers/pci/pcie/err.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 05f61da5ed9d..7a5af873d8bc 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -150,6 +150,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_channel_state_t state,
pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
  {
+   int type = pci_pcie_type(dev);
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_bus *bus;
  
@@ -157,8 +158,8 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,

 * Error recovery runs on all subordinates of the first downstream port.
 * If the downstream port detected the error, it is cleared at the end.
 */
-   if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
- pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
+   if (!(type == PCI_EXP_TYPE_ROOT_PORT ||
+ type == PCI_EXP_TYPE_DOWNSTREAM))
dev = pci_upstream_bridge(dev);
bus = dev->subordinate;
  



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 06/16] PCI/ERR: Simplify by using pci_upstream_bridge()

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

Use pci_upstream_bridge() in place of dev->bus->self.  No functional change
intended.

[bhelgaas: split to separate patch]
Link: 
https://lore.kernel.org/r/20201002184735.1229220-6-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Acked-by: Jonathan Cameron 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/pci/pcie/err.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index db149c6ce4fb..05f61da5ed9d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -159,7 +159,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 */
if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
  pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
-   dev = dev->bus->self;
+   dev = pci_upstream_bridge(dev);

Makes it easier to read.

bus = dev->subordinate;
  
  	pci_dbg(dev, "broadcast error_detected message\n");




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 05/16] PCI/ERR: Rename reset_link() to reset_subordinates()

2020-11-17 Thread Kuppuswamy, Sathyanarayanan

Hi,

On 11/17/20 11:19 AM, Sean V Kelley wrote:

reset_link() appears to be misnamed.  The point is to reset any devices
below a given bridge, so rename it to reset_subordinates() to make it clear
that we are passing a bridge with the intent to reset the devices below it.

[bhelgaas: fix reset_subordinate_device() typo, shorten name]
Suggested-by: Bjorn Helgaas 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-5-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Bjorn Helgaas 
Acked-by: Jonathan Cameron 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/pci/pci.h  | 4 ++--
  drivers/pci/pcie/err.c | 8 
  2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12dcad8f3635..3c4570a3058f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -572,8 +572,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct 
pci_dev *dev)
  
  /* PCI error reporting and recovery */

  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
-   pci_channel_state_t state,
-   pci_ers_result_t (*reset_link)(struct pci_dev *pdev));
+   pci_channel_state_t state,
+   pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
  
  bool pcie_wait_for_link(struct pci_dev *pdev, bool active);

  #ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..db149c6ce4fb 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -147,8 +147,8 @@ static int report_resume(struct pci_dev *dev, void *data)
  }
  
  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,

-   pci_channel_state_t state,
-   pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
+   pci_channel_state_t state,
+   pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
  {
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_bus *bus;
@@ -165,9 +165,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_dbg(dev, "broadcast error_detected message\n");
if (state == pci_channel_io_frozen) {
		pci_walk_bus(bus, report_frozen_detected, &status);
-   status = reset_link(dev);
+   status = reset_subordinates(dev);
if (status != PCI_ERS_RESULT_RECOVERED) {
-   pci_warn(dev, "link reset failed\n");
+   pci_warn(dev, "subordinate device reset failed\n");
goto failed;
}
} else {



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 03/16] PCI/RCEC: Bind RCEC devices to the Root Port driver

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

From: Qiuxu Zhuo 

If a Root Complex Integrated Endpoint (RCiEP) is implemented, it may signal
errors through a Root Complex Event Collector (RCEC).  Each RCiEP must be
associated with no more than one RCEC.

For an RCEC (which is technically not a Bridge), error messages "received"
from associated RCiEPs must be enabled for "transmission" in order to cause
a System Error via the Root Control register or (when the Advanced Error
Reporting Capability is present) reporting via the Root Error Command
register and logging in the Root Error Status register and Error Source
Identification register.

Given the commonality with Root Ports and the need to also support AER and
PME services for RCECs, extend the Root Port driver to support RCEC devices
by adding the RCEC Class ID to the driver structure.

Co-developed-by: Sean V Kelley 
Link: 
https://lore.kernel.org/r/20201002184735.1229220-3-seanvk@oregontracks.org
Signed-off-by: Sean V Kelley 
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Jonathan Cameron 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/pci/pcie/portdrv_pci.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 3a3ce40ae1ab..4d880679b9b1 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -106,7 +106,8 @@ static int pcie_portdrv_probe(struct pci_dev *dev,
if (!pci_is_pcie(dev) ||
((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
 (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
-(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM) &&
+(pci_pcie_type(dev) != PCI_EXP_TYPE_RC_EC)))
return -ENODEV;
  
  	status = pcie_port_device_register(dev);

@@ -195,6 +196,8 @@ static const struct pci_device_id port_pci_ids[] = {
{ PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0) },
/* subtractive decode PCI-to-PCI bridge, class type is 060401h */
{ PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x01), ~0) },
+   /* handle any Root Complex Event Collector */
+   { PCI_DEVICE_CLASS(((PCI_CLASS_SYSTEM_RCEC << 8) | 0x00), ~0) },
{ },
  };
  



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 02/16] PCI/RCEC: Add RCEC class code and extended capability

2020-11-17 Thread Kuppuswamy, Sathyanarayanan

Hi,

On 11/17/20 11:19 AM, Sean V Kelley wrote:

From: Qiuxu Zhuo 

A PCIe Root Complex Event Collector (RCEC) has base class 0x08, sub-class
0x07, and programming interface 0x00.  Add the class code 0x0807 to
identify RCEC devices and add #defines for the RCEC Endpoint Association
Extended Capability.

See PCIe r5.0, sec 1.3.4 ("Root Complex Event Collector") and sec 7.9.10
("Root Complex Event Collector Endpoint Association Extended Capability").

Why not merge this change with the patch that uses these definitions?
Keeping the changes together will make it easier to revert them if needed.


Link: 
https://lore.kernel.org/r/20201002184735.1229220-2-seanvk@oregontracks.org
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Jonathan Cameron 
---
  include/linux/pci_ids.h   | 1 +
  include/uapi/linux/pci_regs.h | 7 +++
  2 files changed, 8 insertions(+)

diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 1ab1e24bcbce..d8156a5dbee8 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -81,6 +81,7 @@
  #define PCI_CLASS_SYSTEM_RTC  0x0803
  #define PCI_CLASS_SYSTEM_PCI_HOTPLUG  0x0804
  #define PCI_CLASS_SYSTEM_SDHCI0x0805
+#define PCI_CLASS_SYSTEM_RCEC  0x0807
  #define PCI_CLASS_SYSTEM_OTHER0x0880
  
  #define PCI_BASE_CLASS_INPUT		0x09

diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index a95d55f9f257..bccd3e35cb65 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -831,6 +831,13 @@
  #define  PCI_PWR_CAP_BUDGET(x)((x) & 1)   /* Included in system 
budget */
  #define PCI_EXT_CAP_PWR_SIZEOF16
  
+/* Root Complex Event Collector Endpoint Association  */

+#define PCI_RCEC_RCIEP_BITMAP  4   /* Associated Bitmap for RCiEPs */
+#define PCI_RCEC_BUSN  8   /* RCEC Associated Bus Numbers */
+#define  PCI_RCEC_BUSN_REG_VER 0x02/* Least version with BUSN present */
+#define  PCI_RCEC_BUSN_NEXT(x) (((x) >> 8) & 0xff)
+#define  PCI_RCEC_BUSN_LAST(x) (((x) >> 16) & 0xff)
+
  /* Vendor-Specific (VSEC, PCI_EXT_CAP_ID_VNDR) */
  #define PCI_VNDR_HEADER   4   /* Vendor-Specific Header */
  #define  PCI_VNDR_HEADER_ID(x)((x) & 0x)



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 01/16] AER: aer_root_reset() non-native handling

2020-11-17 Thread Kuppuswamy, Sathyanarayanan




On 11/17/20 11:19 AM, Sean V Kelley wrote:

If an OS has not been granted AER control via _OSC, then
the OS should not make changes to PCI_ERR_ROOT_COMMAND and
PCI_ERR_ROOT_STATUS related registers. Per section 4.5.1 of
the System Firmware Intermediary (SFI) _OSC and DPC Updates
ECN [1], this bit also covers these aspects of the PCI
Express Advanced Error Reporting. Based on the above and
earlier discussion [2], make the following changes:

Add a check for the native case (i.e., AER control via _OSC)

Note that the current "clear, reset, enable" order suggests that the
reset might cause errors that we should ignore. Lacking historical
context, this order is retained.

[1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24,
 2020, affecting PCI Firmware Specification, Rev. 3.2
 https://members.pcisig.com/wg/PCI-SIG/document/14076
[2] 
https://lore.kernel.org/linux-pci/20201020162820.GA370938@bjorn-Precision-5520/
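
For anyone skimming the thread, the check being added boils down to
gating the Root Error register accesses on AER ownership, roughly like
this (a sketch from memory, not the exact hunk):

	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);

	if ((host->native_aer || pcie_ports_native) && aer) {
		/* Disable Root's interrupt in response to error messages */
		pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, &reg32);
		reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
		pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
	}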


Reviewed-by: Kuppuswamy Sathyanarayanan 


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v11 0/5] Simplify PCIe native ownership detection logic

2020-11-13 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

On 10/26/20 7:57 PM, Kuppuswamy Sathyanarayanan wrote:

Currently, PCIe capabilities ownership status is detected by
verifying the status of pcie_ports_native, pcie_ports_dpc_native
and the _OSC negotiated results (cached in the struct pci_host_bridge
->native_* members). But this logic can be simplified: we can use
only the struct pci_host_bridge ->native_* members to detect it.

This patchset removes the distributed checks for the pcie_ports_native
and pcie_ports_dpc_native parameters.

Any comments on this series?


Changes since v10:
  * Addressed format issue reported by lkp test.

Changes since v9:
  * Rebased on top of v5.10-rc1

Changes since v8:
  * Simplified the _OSC ownership setting logic
  * Moved bridge->native_ltr out of #ifdef CONFIG_PCIEPORTBUS.

Changes since v7:
  * Fixed "fix array_size.cocci warnings".

Changes since v6:
  * Created new patch for CONFIG_PCIEPORTBUS check in
pci_init_host_bridge().
  * Added warning message for a case when pcie_ports_native
overrides _OSC negotiation result.

Changes since v5:
  * Rebased on top of v5.8-rc1

Changes since v4:
  * Changed the patch set title (Original link: 
https://lkml.org/lkml/2020/5/26/1710)
  * Added AER/DPC dependency logic cleanup fixes.
  


Kuppuswamy Sathyanarayanan (5):
   PCI: Conditionally initialize host bridge native_* members
   ACPI/PCI: Ignore _OSC negotiation result if pcie_ports_native is set.
   ACPI/PCI: Ignore _OSC DPC negotiation result if pcie_ports_dpc_native
 is set.
   PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable
 logic
   PCI/DPC: Move AER/DPC dependency checks out of DPC driver

  drivers/acpi/pci_root.c   | 39 +++
  drivers/pci/hotplug/pciehp_core.c |  2 +-
  drivers/pci/pci-acpi.c|  3 ---
  drivers/pci/pcie/aer.c|  2 +-
  drivers/pci/pcie/dpc.c|  3 ---
  drivers/pci/pcie/portdrv.h|  2 --
  drivers/pci/pcie/portdrv_core.c   | 13 ---
  drivers/pci/probe.c   |  6 +++--
  include/linux/acpi.h  |  2 ++
  include/linux/pci.h   |  2 ++
  10 files changed, 44 insertions(+), 30 deletions(-)



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v8 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-11-13 Thread Kuppuswamy, Sathyanarayanan

Hi Bjorn,

On 10/26/20 12:37 PM, Kuppuswamy Sathyanarayanan wrote:

Currently, if report_error_detected() or report_mmio_enabled()
requests PCI_ERS_RESULT_NEED_RESET, the current pcie_do_recovery()
implementation does not do the requested explicit device reset, but
instead just calls report_slot_reset() on all affected devices.
Notifying about the reset via report_slot_reset() without doing the
actual device reset is incorrect. So call pci_bus_reset() before
triggering the ->slot_reset() callback.

Any comments on this series?


Signed-off-by: Kuppuswamy Sathyanarayanan 

Reviewed-by: Sinan Kaya 
Reviewed-by: Ashok Raj 
---
  Changes since v7:
   * Rebased on top of v5.10-rc1.

  Changes since v6:
   * None.

  Changes since v5:
   * Added Ashok's Reviewed-by tag.

  Changes since v4:
   * Added check for pci_reset_bus() return value.

  drivers/pci/pcie/err.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..315a4d559c4c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -152,6 +152,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  {
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_bus *bus;
+   int ret;
  
  	/*

 * Error recovery runs on all subordinates of the first downstream port.
@@ -181,11 +182,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
}
  
  	if (status == PCI_ERS_RESULT_NEED_RESET) {

-   /*
-* TODO: Should call platform-specific
-* functions to reset slot before calling
-* drivers' slot_reset callbacks?
-*/
+   ret = pci_reset_bus(dev);
+   if (ret < 0) {
+   pci_err(dev, "Failed to reset %d\n", ret);
+   status = PCI_ERS_RESULT_DISCONNECT;
+   goto failed;
+   }
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(dev, "broadcast slot_reset message\n");
		pci_walk_bus(bus, report_slot_reset, &status);



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v4 5/6] PCI/ACPI: Replace open coded variant of resource_union()

2020-11-02 Thread Kuppuswamy, Sathyanarayanan




On 11/2/20 1:00 PM, Andy Shevchenko wrote:

Since we have resource_union() helper, let's utilize it here.

Signed-off-by: Andy Shevchenko 
Cc: Kuppuswamy Sathyanarayanan 
Cc: Bjorn Helgaas 
Cc: linux-...@vger.kernel.org
Reviewed-by: Rafael J. Wysocki 

Reviewed-by: Kuppuswamy Sathyanarayanan 


---
  drivers/acpi/pci_root.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index c12b5fb3e8fb..0bf072cef6cf 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -722,9 +722,7 @@ static void acpi_pci_root_validate_resources(struct device 
*dev,
 * our resources no longer match the ACPI _CRS, but
 * the kernel resource tree doesn't allow overlaps.
 */
-   if (resource_overlaps(res1, res2)) {
-   res2->start = min(res1->start, res2->start);
-   res2->end = max(res1->end, res2->end);
+   if (resource_union(res1, res2, res2)) {
dev_info(dev, "host bridge window expanded to %pR; 
%pR ignored\n",
 res2, res1);
free = true;

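For reference, resource_union() folds the overlap test and the min/max
expansion into one helper; it looks roughly like this (paraphrased from
include/linux/ioport.h, so treat it as a sketch):

	static inline bool resource_union(struct resource *r1, struct resource *r2,
					  struct resource *r)
	{
		if (!resource_overlaps(r1, r2))
			return false;
		r->start = min(r1->start, r2->start);
		r->end = max(r1->end, r2->end);
		return true;
	}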


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

