On 07/15/2015 04:07 AM, Ingo Molnar wrote:
> * Dave Hansen <dave.han...@linux.intel.com> wrote:
>>>     /*
>>> -    * Setup init_xstate_buf to represent the init state of
>>> +    * Setup init_xstate_ctx to represent the init state of
>>>      * all the features managed by the xsave
>>>      */
>>> -   init_xstate_buf = alloc_bootmem_align(xstate_size,
>>> -                                         __alignof__(struct xsave_struct));
>>> -   fx_finit(&init_xstate_buf->i387);
>>> +   fx_finit(&init_xstate_ctx.i387);
>>
>> This is causing memory corruption in 4.2-rc2.
...
>> This patch works around the problem, btw:
>>
>>      https://www.sr71.net/~dave/intel/bloat-xsave-gunk-2.patch
> 
> Yeah, so I got this prototype hardware boot crash reported in private mail 
> and 
> decoded it and after some debugging I suggested the +PAGE_SIZE hack - 
> possibly you 
> got that hack from the same person?

Nope, I came up with that gem of a patch all on my own.

I also wouldn't characterize this as prototype hardware.  There are
obviously plenty of folks depending on mainline to boot and function on
hardware that has AVX-512 support.  That's why two different Intel folks
came to you independently.

> My suggestion was to solve this properly: if we list xstate features as 
> supported 
> then we should size their max size correctly. The AVX bits are currently not 
> properly enumerated and sized - and I refuse to add feature support to the 
> kernel 
> where per task CPU state fields that the kernel saves/restores are opaque...

We might know the size and composition of the individual components, but
we do not know the size of the buffer.  Different implementations of a
given feature are quite free to have different data stored in the
buffer, or even to rearrange or pad it.  That's why the sizes are not
explicitly called out by the architecture and why we enumerated them
before your patch that caused this regression.

The component itself may not be opaque, but the size of the *buffer* is
not a simple sum of the component sizes.  Here's a real-world example:

[    0.000000] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[    0.000000] x86/fpu: xstate_offset[3]: 03c0, xstate_sizes[3]: 0040

Notice that component 3 is not at 0x240+0x100.  This breaks our
existing init_xstate_size(), and it is why any attempt to statically
size the buffer is broken.

I understand why you were misled by it, but the old "xsave_hdr_struct"
was wrong.  Fenghua even posted patches to remove it before the FPU
rework (you were cc'd):

        https://lkml.org/lkml/2015/4/18/164

> So please add proper AVX512 support structures to fpu/types.h and size 
> XSTATE_RESERVE correctly - or alternatively we can remove the current 
> incomplete 
> AVX512 bits.

The old code sized the buffer in a fully architectural way and it
worked.  The CPU *tells* you how much memory the 'xsave' instruction is
going to scribble on.  The new code just merrily calls it and lets it
scribble away.  This is as clear-cut a regression as I've ever seen.

The least we can do is detect that the kernel undersized the buffer and
disable support for the features that do not fit.  A very lightly tested
patch to do that is attached.  I'm not super eager to put that in to an
-rc2 kernel though.

This came out a lot more complicated than I would have liked.

Instead of simply enabling all of the XSAVE features that we
both know about and the CPU supports, we have to be careful
to not overflow our buffer in 'init_fpstate.xsave'.

To do this, we enable each XSAVE feature and then ask the CPU
how large that makes the buffer.  If we would overflow the
buffer we allocated, we turn off the feature.

This means that no matter what the CPU does, we will not
corrupt random memory like we did before this patch.  It also
means that we can fall back in a way which cripples the
system the least.

---

 b/arch/x86/include/asm/fpu/types.h |   20 ++++-
 b/arch/x86/kernel/fpu/xstate.c     |  140 +++++++++++++++++++++++++++++++------
 2 files changed, 134 insertions(+), 26 deletions(-)

diff -puN arch/x86/kernel/fpu/xstate.c~xsaves-are-my-friend arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~xsaves-are-my-friend	2015-07-15 15:44:58.280300715 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-07-15 17:12:28.744659856 -0700
@@ -161,11 +161,10 @@ void fpstate_sanitize_xstate(struct fpu
  */
 void fpu__init_cpu_xstate(void)
 {
-	if (!cpu_has_xsave || !xfeatures_mask)
+	if (!cpu_has_xsave)
 		return;
 
 	cr4_set_bits(X86_CR4_OSXSAVE);
-	xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
 }
 
 /*
@@ -291,25 +290,129 @@ static void __init setup_init_fpu_buf(vo
 }
 
 /*
- * Calculate total size of enabled xstates in XCR0/xfeatures_mask.
+ * This returns the size that the 'xsave*' instructions will
+ * write to the xsave buffer.  This is dynamic and will change
+ * as we enable more feature bits in XCR0.
  */
-static void __init init_xstate_size(void)
+unsigned int kernel_xstate_buffer_size(void)
 {
 	unsigned int eax, ebx, ecx, edx;
-	int i;
 
 	if (!cpu_has_xsaves) {
 		cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-		xstate_size = ebx;
-		return;
+		return ebx;
+	} else {
+		cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
+		return ebx;
 	}
+}
+
+u64 cpu_supported_xfeatures_mask(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	u64 ret;
+
+	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
+	ret = eax + ((u64)edx << 32);
+
+	return ret;
+}
+
+/*
+ * We enable the features in sets.  Some implementations are
+ * buggy and do not handle it if only one of two related
+ * xfeatures is enabled.
+ */
+u64 xfeature_sets[] = {
+	XSTATE_YMM,
+	XSTATE_BNDREGS | XSTATE_BNDCSR,
+	XSTATE_AVX512,
+};
+
+/*
+ * Enable each supported feature in XCR0, determine the required
+ * size of the save buffer with the feature enabled, and determine
+ * if it fits in the buffer we allocated.
+ *
+ * This function will initialize these global variables:
+ * 1. xfeatures_mask
+ * 2. xstate_size
+ */
+static void __init enable_supported_xfeatures(void)
+{
+	int i;
+	int xbufsz = sizeof(init_fpstate.xsave);
+	u64 cpu_support_mask = cpu_supported_xfeatures_mask();
 
-	xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
-	for (i = 2; i < 64; i++) {
-		if (test_bit(i, (unsigned long *)&xfeatures_mask)) {
-			cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
-			xstate_size += eax;
+	/*
+	 * We are past sanity_check_xfeatures_supported() so we
+	 * know these states are supported.
+	 */
+	xfeatures_mask |= XSTATE_FPSSE;
+
+	for (i = 0; i < ARRAY_SIZE(xfeature_sets); i++) {
+		u64 feature_mask = xfeature_sets[i];
+		int tmp_xstate_size;
+
+		/* Support only the state known to the OS: */
+		if ((XCNTXT_MASK & feature_mask) != feature_mask)
+			continue;
+		/* Only try the states supported by the CPU */
+		if ((cpu_support_mask & feature_mask) != feature_mask)
+			continue;
+
+		/* Enable the potential feature in XCR0 */
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask | feature_mask);
+		/* Ask the CPU how large of an XSAVE buffer we need now */
+		tmp_xstate_size = kernel_xstate_buffer_size();
+
+		if (tmp_xstate_size > xbufsz) {
+			/* reset XCR0 to its previous value */
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+			pr_warn("XSAVE feature 0x%02llx overflows init buf, "
+				"disabling (%d > %d)\n", feature_mask,
+				tmp_xstate_size, xbufsz);
+			continue;
 		}
+		/* "Commit" the feature to the permanent mask */
+		xfeatures_mask |= feature_mask;
+		xstate_size = tmp_xstate_size;
+	}
+	pr_info("x86/fpu: enabled %ld xfeatures (0x%llx), in %d byte buffer\n",
+			hweight64(xfeatures_mask), xfeatures_mask, xstate_size);
+	/* XCR0 should be fully initialized now */
+}
+
+/*
+ * The kernel generally only checks CPUID bits to tell if it
+ * should be using a given feature.  Some features are unusable
+ * unless we are able to enable XSAVE state management for them.
+ * So, we clear the CPUID bits for any features that we did not
+ * enable.
+ */
+void clear_unsupported_xfeature_cpuid_flags(void)
+{
+	/*
+	 * SDM: An operating system must enable its YMM state
+	 * management to support AVX... otherwise, an attempt
+	 * to execute an instruction causes a #UD exception.
+	 */
+	if (!(xfeatures_mask & XSTATE_YMM)) {
+		setup_clear_cpu_cap(X86_FEATURE_AVX);
+		setup_clear_cpu_cap(X86_FEATURE_AVX2);
+	}
+	/*
+	 * SDM: An OS must enable its ZMM and opmask state
+	 * management to support Intel AVX-512
+	 */
+	if (!(xfeatures_mask & XSTATE_OPMASK) ||
+	    !(xfeatures_mask & XSTATE_ZMM_Hi256) ||
+	    !(xfeatures_mask & XSTATE_Hi16_ZMM)) {
+		setup_clear_cpu_cap(X86_FEATURE_AVX512F);
+	}
+	if (!(xfeatures_mask & XSTATE_BNDREGS) ||
+	    !(xfeatures_mask & XSTATE_BNDCSR)) {
+		setup_clear_cpu_cap(X86_FEATURE_MPX);
 	}
 }
 
@@ -319,7 +422,6 @@ static void __init init_xstate_size(void
  */
 void __init fpu__init_system_xstate(void)
 {
-	unsigned int eax, ebx, ecx, edx;
 	static int on_boot_cpu = 1;
 
 	WARN_ON_FPU(!on_boot_cpu);
@@ -335,22 +437,16 @@ void __init fpu__init_system_xstate(void
 		return;
 	}
 
-	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-	xfeatures_mask = eax + ((u64)edx << 32);
-
-	if ((xfeatures_mask & XSTATE_FPSSE) != XSTATE_FPSSE) {
+	if ((cpu_supported_xfeatures_mask() & XSTATE_FPSSE) != XSTATE_FPSSE) {
 		pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n", xfeatures_mask);
 		BUG();
 	}
 
-	/* Support only the state known to the OS: */
-	xfeatures_mask = xfeatures_mask & XCNTXT_MASK;
-
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 
-	/* Recompute the context size for enabled features: */
-	init_xstate_size();
+	enable_supported_xfeatures();
+	clear_unsupported_xfeature_cpuid_flags();
 
 	update_regset_xstate_info(xstate_size, xfeatures_mask);
 	fpu__init_prepare_fx_sw_frame();
diff -puN arch/x86/include/asm/fpu/types.h~xsaves-are-my-friend arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~xsaves-are-my-friend	2015-07-15 15:44:58.285300941 -0700
+++ b/arch/x86/include/asm/fpu/types.h	2015-07-15 16:54:46.641091924 -0700
@@ -159,10 +159,22 @@ struct xstate_header {
 	u64				reserved[6];
 } __attribute__((packed));
 
-/* New processor state extensions should be added here: */
-#define XSTATE_RESERVE			(sizeof(struct ymmh_struct) + \
-					 sizeof(struct lwp_struct)  + \
-					 sizeof(struct mpx_struct)  )
+/*
+ * The largest xsave buffer known today is 2752 bytes on a system
+ * implementing AVX-512.  This includes the 512-byte i387 state
+ * and 64-byte header.  We add a small amount of padding in case
+ * an implementation adds some padding or wastes some space.
+ *
+ * Note, if we overflow this, we will disable some XSTATE
+ * features.
+ *
+ * Also, note that the real size we need is enumerated by
+ * cpuid leaves and can not be known at compile time.
+ */
+#define XSTATE_RESERVE			(2752 + 256 -				\
+					 sizeof(struct xstate_header) - 	\
+					 sizeof(struct fxregs_state))
+
 /*
  * This is our most modern FPU state format, as saved by the XSAVE
  * and restored by the XRSTOR instructions.
_