User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause a core
turbo frequency drop and impact the task running on the sibling CPU.

The XSAVE header contains a state-component bitmap, which allows
software to discover the state of the init optimization used by
XSAVEOPT and XSAVES. Set bits in the bitmap denote the usage of the
components.

The AVX-512 component has 3 states; only the Hi16_ZMM state causes a
notable frequency drop. Add per task Hi16_ZMM state tracking to context
switch.

The tracking turns on the usage flag immediately, but requires 3
consecutive context switches with no usage to clear it. This decay is
required because AVX-512 using tasks could set Hi16_ZMM state back to
the init state themselves.

Signed-off-by: Aubrey Li <aubrey...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Andi Kleen <a...@linux.intel.com>
Cc: Tim Chen <tim.c.c...@linux.intel.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Arjan van de Ven <ar...@linux.intel.com>
---
 arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/fpu/types.h    |  9 +++++++++
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a..f382449 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define HI16ZMM_STATE_DECAY_COUNT 3
+/*
+ * This function is called during context switch to update Hi16_ZMM state
+ */
+static inline void update_hi16zmm_state(struct fpu *fpu)
+{
+	/*
+	 * XSAVE header contains a state-component bitmap(xfeatures),
+	 * which allows software to discover the state of the init
+	 * optimization used by XSAVEOPT and XSAVES.
+	 *
+	 * Hi16_ZMM state(one state of AVX-512 component) is tracked here
+	 * because its usage could cause notable core turbo frequency drop.
+	 *
+	 * AVX512-using tasks could set Hi16_ZMM state back to the init
+	 * state themselves. Thus, this tracking mechanism can miss.
+	 * The decay usage ensures that false-negatives do not immediately
+	 * make a task be considered as not using Hi16_ZMM registers.
+	 */
+	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_Hi16_ZMM)
+		fpu->hi16zmm_usage = HI16ZMM_STATE_DECAY_COUNT;
+	else if (fpu->hi16zmm_usage)
+		fpu->hi16zmm_usage--;
+}
+
 /*
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
@@ -411,6 +436,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
 		copy_xregs_to_kernel(&fpu->state.xsave);
+		update_hi16zmm_state(fpu);
 		return 1;
 	}
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c539..c0c7577 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -303,6 +303,15 @@ struct fpu {
 	unsigned char			initialized;
 
 	/*
+	 * @hi16zmm_usage:
+	 *
+	 * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
+	 * A value of non-zero is used to indicate whether there is valid
+	 * state in these AVX512 registers.
+	 */
+	unsigned char			hi16zmm_usage;
+
+	/*
 	 * @state:
 	 *
 	 * In-memory copy of all FPU registers that we save/restore
-- 
2.7.4