On Thu, May 31, 2018 at 8:08 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>
>> This is the patch I am going to check into GCC 8.
>>
>> --
>> H.J.
>
>> From 9ecbfa1fd04dc4370a9ec4f3d56189cc07aee668 Mon Sep 17 00:00:00 2001
>> From: "H.J. Lu" <hjl.to...@gmail.com>
>> Date: Thu, 17 May 2018 09:52:09 -0700
>> Subject: [PATCH] x86: Re-enable partial_reg_dependency and movx for Haswell
>>
>> r254152 disabled partial_reg_dependency and movx for Haswell and newer
>> Intel processors.  r258972 restored them for skylake-avx512.  For Haswell,
>> movx improves performance.  But partial_reg_stall may be better than
>> partial_reg_dependency in theory.  We will investigate performance impact
>> of partial_reg_stall vs partial_reg_dependency on Haswell for GCC 9.  In
>> the meantime, this patch restores both partial_reg_dependency and movx for
>> Haswell in GCC 8.
>>
>> On Haswell, improvements for EEMBC benchmarks with
>>
>> -mtune-ctrl=movx,partial_reg_dependency -Ofast -march=haswell
>>
>> vs
>>
>> -Ofast -mtune=haswell
>>
>> are
>>
>> automotive
>> =========
>>   aifftr01 (default) - goodperf: Runtime improvement of 2.6% (time).
>>   aiifft01 (default) - goodperf: Runtime improvement of 2.2% (time).
>>
>> networking
>> =========
>>   ip_pktcheckb1m (default) - goodperf: Runtime improvement of 3.8% (time).
>>   ip_pktcheckb2m (default) - goodperf: Runtime improvement of 5.2% (time).
>>   ip_pktcheckb4m (default) - goodperf: Runtime improvement of 4.4% (time).
>>   ip_pktcheckb512k (default) - goodperf: Runtime improvement of 4.2% (time).
>>
>> telecom
>> =========
>>   fft00data_1 (default) - goodperf: Runtime improvement of 8.4% (time).
>>   fft00data_2 (default) - goodperf: Runtime improvement of 8.6% (time).
>>   fft00data_3 (default) - goodperf: Runtime improvement of 9.0% (time).
>
> Thanks for the data.  Why did you commit the patch to the release branch only?
> The patch is OK for mainline too.

I am checking this patch into trunk now.

> I do not have access to the benchmark so I can not check.  Why do we get

From Intel optimization guide:

3.5.2.4 Partial Register Stalls

General purpose registers can be accessed in granularities of bytes, words,
doublewords; 64-bit mode also supports quadword granularity. Referencing a
portion of a register is referred to as a partial register reference.

A partial register stall happens when an instruction refers to a register,
portions of which were previously modified by other instructions. For
example, partial register stalls occurs with a read to AX while previous
instructions stored AL and AH, or a read to EAX while previous instruction
modified AX.

The delay of a partial register stall is small in processors based on Intel
Core and NetBurst microarchitectures, and in Pentium M processor (with CPUID
signature family 6, model 13), Intel Core Solo, and Intel Core Duo
processors. Pentium M processors (CPUID signature with family 6, model 9)
and the P6 family incur a large penalty.

Note that in Intel 64 architecture, an update to the lower 32 bits of a
64 bit integer register is architecturally defined to zero extend the upper
32 bits. While this action may be logically viewed as a 32 bit update, it is
really a 64 bit update (and therefore does not cause a partial stall).

Referencing partial registers frequently produces code sequences with either
false or real dependencies. Example 3-18 demonstrates a series of false and
real dependencies caused by referencing partial registers.

...

When you want to load from memory to a partial register, consider using
MOVZX or MOVSX to avoid the additional merge micro-op penalty.

We have movx, partial_reg_dependency and partial_reg_stall to deal with it.
movx is always good.  But partial_reg_stall is enabled only for i686.  We
need to investigate partial_reg_stall vs partial_reg_dependency on Haswell+.

> the improvements here and how does that behave on skylake+?

This is

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84413

We are working on it.

> honza
>>
>>       PR target/85829
>>       * config/i386/x86-tune.def: Re-enable partial_reg_dependency
>>       and movx for Haswell.
>> ---
>>  gcc/config/i386/x86-tune.def | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
>> index 5649fdcf416..60625668236 100644
>> --- a/gcc/config/i386/x86-tune.def
>> +++ b/gcc/config/i386/x86-tune.def
>> @@ -48,7 +48,7 @@ DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
>>     over partial stores.  For example preffer MOVZBL or MOVQ to load 8bit
>>     value over movb.  */
>>  DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
>> -          m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
>> +          m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_HASWELL
>>            | m_BONNELL | m_SILVERMONT | m_INTEL
>>            | m_KNL | m_KNM | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
>>
>> @@ -84,7 +84,7 @@ DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>>     partial dependencies.  */
>>  DEF_TUNE (X86_TUNE_MOVX, "movx",
>>            m_PPRO | m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
>> -          | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
>> +          | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL | m_HASWELL
>>            | m_GEODE | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
>>
>>  /* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
>> --
>> 2.17.0
>>
>

--
H.J.
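
As a rough illustration of the kind of code the movx and
partial_reg_dependency tunings are about, here is a minimal sketch in C.
The function name sum_bytes is hypothetical; it is not taken from the patch
or from EEMBC. It only contains byte loads that have to be widened to
32 bits, which is where the compiler can choose between a zero-extending
load (MOVZX) and a load into a partial register. How the generated assembly
actually differs depends on the GCC version and options, so treat this as
something to experiment with, not as a reproduction of the benchmark results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: sums N bytes starting at P.
   Each iteration reads an 8-bit value and widens it to 32 bits before
   the add, so every load is a candidate for a zero-extending move.  */
uint32_t
sum_bytes (const uint8_t *p, size_t n)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += p[i];
  return sum;
}
```

One way to look at the effect is to compile this file twice with -S, once
with "-Ofast -mtune=haswell" and once with
"-mtune-ctrl=movx,partial_reg_dependency -Ofast -march=haswell" (the two
option sets compared in the patch description), and diff the assembly.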

From b76a9074e8919f63934e04083e67371a6090e7a0 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.to...@gmail.com>
Date: Thu, 17 May 2018 09:52:09 -0700
Subject: [PATCH] x86: Re-enable partial_reg_dependency and movx for Haswell

r254152 disabled partial_reg_dependency and movx for Haswell and newer
Intel processors.  r258972 restored them for skylake-avx512.  For Haswell,
movx improves performance.  But partial_reg_stall may be better than
partial_reg_dependency in theory.  We will investigate performance impact
of partial_reg_stall vs partial_reg_dependency on Haswell for GCC 9.  In
the meantime, this patch restores both partial_reg_dependency and movx for
Haswell in GCC 8.

On Haswell, improvements for EEMBC benchmarks with

-mtune-ctrl=movx,partial_reg_dependency -Ofast -march=haswell

vs

-Ofast -mtune=haswell

are

automotive
=========
  aifftr01 (default) - goodperf: Runtime improvement of 2.6% (time).
  aiifft01 (default) - goodperf: Runtime improvement of 2.2% (time).

networking
=========
  ip_pktcheckb1m (default) - goodperf: Runtime improvement of 3.8% (time).
  ip_pktcheckb2m (default) - goodperf: Runtime improvement of 5.2% (time).
  ip_pktcheckb4m (default) - goodperf: Runtime improvement of 4.4% (time).
  ip_pktcheckb512k (default) - goodperf: Runtime improvement of 4.2% (time).

telecom
=========
  fft00data_1 (default) - goodperf: Runtime improvement of 8.4% (time).
  fft00data_2 (default) - goodperf: Runtime improvement of 8.6% (time).
  fft00data_3 (default) - goodperf: Runtime improvement of 9.0% (time).

	PR target/85829
	* config/i386/x86-tune.def: Re-enable partial_reg_dependency
	and movx for Haswell.
---
 gcc/config/i386/x86-tune.def | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 77d99340ebe..f95c0701d5d 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -49,7 +49,7 @@ DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
    over partial stores.  For example preffer MOVZBL or MOVQ to load 8bit
    value over movb.  */
 DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
-          m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
+          m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_HASWELL
           | m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_GOLDMONT_PLUS | m_INTEL
           | m_KNL | m_KNM | m_AMD_MULTIPLE | m_SKYLAKE_AVX512 | m_GENERIC)
 
@@ -87,7 +87,7 @@ DEF_TUNE (X86_TUNE_MOVX, "movx",
           m_PPRO | m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
           | m_BONNELL | m_SILVERMONT | m_GOLDMONT | m_KNL | m_KNM | m_INTEL
           | m_GOLDMONT_PLUS | m_GEODE | m_AMD_MULTIPLE | m_SKYLAKE_AVX512
-          | m_GENERIC)
+          | m_HASWELL | m_GENERIC)
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
    full sized loads.  */
--
2.17.0
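
For readers not familiar with x86-tune.def: each DEF_TUNE entry carries a
mask of processors, and the patch adds m_HASWELL to two of those masks.
The following is a simplified sketch of that idea in C, not GCC's actual
implementation. The enum cpu, the M macro, the PRD_* constants and
tune_enabled are all hypothetical names, and the processor list is
deliberately cut down to a few entries.

```c
#include <stdbool.h>

/* Hypothetical, cut-down processor list; GCC's real list is much longer.  */
enum cpu { CPU_SANDYBRIDGE, CPU_HASWELL, CPU_SKYLAKE_AVX512, CPU_GENERIC };

/* One bit per processor, in the spirit of the m_* masks in x86-tune.def.  */
#define M(c) (1u << (c))

enum
{
  /* A partial_reg_dependency-style selector before the patch (cut down).  */
  PRD_BEFORE = M (CPU_SANDYBRIDGE) | M (CPU_SKYLAKE_AVX512) | M (CPU_GENERIC),
  /* After the patch: the same selector with the Haswell bit added.  */
  PRD_AFTER = PRD_BEFORE | M (CPU_HASWELL)
};

/* A tuning is on when its selector contains the bit of the CPU tuned for.  */
bool
tune_enabled (unsigned selector, enum cpu tuned_for)
{
  return (selector & M (tuned_for)) != 0;
}
```

With these definitions, tune_enabled (PRD_BEFORE, CPU_HASWELL) is false and
tune_enabled (PRD_AFTER, CPU_HASWELL) is true, which mirrors what tuning for
Haswell sees before and after the change.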