[gcc r14-10397] i386: Correct AVX10 CPUID emulation
https://gcc.gnu.org/g:74c15cb93b3830fee79f75805329d4299ff4a2f0 commit r14-10397-g74c15cb93b3830fee79f75805329d4299ff4a2f0 Author: Haochen Jiang Date: Tue Jul 9 16:31:02 2024 +0800 i386: Correct AVX10 CPUID emulation AVX10 Documentaion has specified ecx value as 0 for AVX10 version and vector size under 0x24 subleaf. Although for ecx=1, the bits are all reserved for now, we still need to specify ecx as 0 to avoid dirty value in ecx. gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_available_features): Correct AVX10 CPUID emulation to specify ecx value. Diff: --- gcc/common/config/i386/cpuinfo.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h index 017a952a5db0..56427474b7be 100644 --- a/gcc/common/config/i386/cpuinfo.h +++ b/gcc/common/config/i386/cpuinfo.h @@ -1014,10 +1014,10 @@ get_available_features (struct __processor_model *cpu_model, } } - /* Get Advanced Features at level 0x24 (eax = 0x24). */ + /* Get Advanced Features at level 0x24 (eax = 0x24, ecx = 0). */ if (avx10_set && max_cpuid_level >= 0x24) { - __cpuid (0x24, eax, ebx, ecx, edx); + __cpuid_count (0x24, 0, eax, ebx, ecx, edx); version = ebx & 0xff; if (ebx & bit_AVX10_256) switch (version)
[gcc r15-1908] i386: Correct AVX10 CPUID emulation
https://gcc.gnu.org/g:298a576f00c49b8f4529ea2f87b9943a32743250 commit r15-1908-g298a576f00c49b8f4529ea2f87b9943a32743250 Author: Haochen Jiang Date: Tue Jul 9 16:31:02 2024 +0800 i386: Correct AVX10 CPUID emulation AVX10 Documentaion has specified ecx value as 0 for AVX10 version and vector size under 0x24 subleaf. Although for ecx=1, the bits are all reserved for now, we still need to specify ecx as 0 to avoid dirty value in ecx. gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_available_features): Correct AVX10 CPUID emulation to specify ecx value. Diff: --- gcc/common/config/i386/cpuinfo.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h index 936039725ab6..2ae77d335d24 100644 --- a/gcc/common/config/i386/cpuinfo.h +++ b/gcc/common/config/i386/cpuinfo.h @@ -998,10 +998,10 @@ get_available_features (struct __processor_model *cpu_model, } } - /* Get Advanced Features at level 0x24 (eax = 0x24). */ + /* Get Advanced Features at level 0x24 (eax = 0x24, ecx = 0). */ if (avx10_set && max_cpuid_level >= 0x24) { - __cpuid (0x24, eax, ebx, ecx, edx); + __cpuid_count (0x24, 0, eax, ebx, ecx, edx); version = ebx & 0xff; if (ebx & bit_AVX10_256) switch (version)
[gcc r14-10283] testsuite: i386: Require ifunc support in gcc.target/i386/avx10_1-25.c etc.
https://gcc.gnu.org/g:e11a42b8c7ac32f8a1e307f99719a0f9c63813e8 commit r14-10283-ge11a42b8c7ac32f8a1e307f99719a0f9c63813e8 Author: Rainer Orth Date: Tue Jun 4 13:33:46 2024 +0200 testsuite: i386: Require ifunc support in gcc.target/i386/avx10_1-25.c etc. Two new AVX10.1 tests FAIL on Solaris/x86: FAIL: gcc.target/i386/avx10_1-25.c (test for excess errors) FAIL: gcc.target/i386/avx10_1-26.c (test for excess errors) Excess errors: /vol/gcc/src/hg/master/local/gcc/testsuite/gcc.target/i386/avx10_1-25.c:6:9: error: the call requires 'ifunc', which is not supported by this target Fixed by requiring ifunc support. Tested on i386-pc-solaris2.11 and x86_64-pc-linux-gnu. 2024-06-04 Rainer Orth gcc/testsuite: * gcc.target/i386/avx10_1-25.c: Require ifunc support. * gcc.target/i386/avx10_1-26.c: Likewise. Diff: --- gcc/testsuite/gcc.target/i386/avx10_1-25.c | 1 + gcc/testsuite/gcc.target/i386/avx10_1-26.c | 1 + 2 files changed, 2 insertions(+) diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-25.c b/gcc/testsuite/gcc.target/i386/avx10_1-25.c index 73f1b724560..5bd2b88fb08 100644 --- a/gcc/testsuite/gcc.target/i386/avx10_1-25.c +++ b/gcc/testsuite/gcc.target/i386/avx10_1-25.c @@ -1,5 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-O2 -mavx" } */ +/* { dg-require-ifunc "" } */ #include __attribute__((target_clones ("default","avx10.1-256"))) diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-26.c b/gcc/testsuite/gcc.target/i386/avx10_1-26.c index 514ab57a406..cf8c976e21f 100644 --- a/gcc/testsuite/gcc.target/i386/avx10_1-26.c +++ b/gcc/testsuite/gcc.target/i386/avx10_1-26.c @@ -1,5 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-O2 -mavx512f" } */ +/* { dg-require-ifunc "" } */ #include __attribute__((target_clones ("default","avx10.1-512")))
[gcc r14-10271] Add AVX10.1 target_clones support
https://gcc.gnu.org/g:97474ba2075dc3c397bbc2861646561dcfd13386 commit r14-10271-g97474ba2075dc3c397bbc2861646561dcfd13386 Author: Haochen Jiang Date: Mon May 20 15:52:32 2024 +0800 Add AVX10.1 target_clones support Since AVX10 is the first major ISA introduced after AVX-512, we propose to add target_clones support for it. Although AVX10.1-256 won't cover 512-bit part of AVX512F, but since it is only for priority but not for implication, it won't be an issue. gcc/ChangeLog: * common/config/i386/i386-common.cc: Change Granite Rapids series CPU type to P_PROC_AVX10_1_512. * common/config/i386/i386-cpuinfo.h (enum feature_priority): Revise comment part. Add P_AVX10_1_256, P_AVX10_1_512, P_PROC_AVX10_1_512. * common/config/i386/i386-isas.h: Link to avx10.1-256, avx10.1-512. gcc/testsuite/ChangeLog: * gcc.target/i386/avx10_1-25.c: New test. * gcc.target/i386/avx10_1-26.c: Ditto. Diff: --- gcc/common/config/i386/i386-common.cc | 4 ++-- gcc/common/config/i386/i386-cpuinfo.h | 5 - gcc/common/config/i386/i386-isas.h | 4 ++-- gcc/testsuite/gcc.target/i386/avx10_1-25.c | 9 + gcc/testsuite/gcc.target/i386/avx10_1-26.c | 9 + 5 files changed, 26 insertions(+), 5 deletions(-) diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc index 77b154663bc..d578918dfb7 100644 --- a/gcc/common/config/i386/i386-common.cc +++ b/gcc/common/config/i386/i386-common.cc @@ -2273,10 +2273,10 @@ const pta processor_alias_table[] = {"meteorlake", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2}, {"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, PTA_GRANITERAPIDS, -M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F}, +M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX10_1_512}, {"graniterapids-d", PROCESSOR_GRANITERAPIDS_D, CPU_HASWELL, PTA_GRANITERAPIDS_D, M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D), -P_PROC_AVX512F}, +P_PROC_AVX10_1_512}, {"arrowlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE), P_PROC_AVX2}, {"arrowlake-s", PROCESSOR_ARROWLAKE_S, CPU_HASWELL, PTA_ARROWLAKE_S, diff --git a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-cpuinfo.h index 73131657eab..be52ad2c60d 100644 --- a/gcc/common/config/i386/i386-cpuinfo.h +++ b/gcc/common/config/i386/i386-cpuinfo.h @@ -112,7 +112,7 @@ enum processor_subtypes /* Priority of i386 features, greater value is higher priority. This is used to decide the order in which function dispatch must happen. For instance, a version specialized for SSE4.2 should be checked for dispatch - before a version for SSE3, as SSE4.2 implies SSE3. */ + before a version for SSE3. */ enum feature_priority { P_NONE = 0, @@ -148,6 +148,9 @@ enum feature_priority P_AVX512F, P_PROC_AVX512F, P_X86_64_V4, + P_AVX10_1_256, + P_AVX10_1_512, + P_PROC_AVX10_1_512, P_PROC_DYNAMIC }; diff --git a/gcc/common/config/i386/i386-isas.h b/gcc/common/config/i386/i386-isas.h index d6deb9a1522..9c2179a3dd8 100644 --- a/gcc/common/config/i386/i386-isas.h +++ b/gcc/common/config/i386/i386-isas.h @@ -194,6 +194,6 @@ ISA_NAMES_TABLE_START ISA_NAMES_TABLE_ENTRY("apxf", FEATURE_APX_F, P_NONE, "-mapxf") ISA_NAMES_TABLE_ENTRY("usermsr", FEATURE_USER_MSR, P_NONE, "-musermsr") ISA_NAMES_TABLE_ENTRY("avx10.1", FEATURE_AVX10_1_256, P_NONE, "-mavx10.1") - ISA_NAMES_TABLE_ENTRY("avx10.1-256", FEATURE_AVX10_1_256, P_NONE, "-mavx10.1-256") - ISA_NAMES_TABLE_ENTRY("avx10.1-512", FEATURE_AVX10_1_512, P_NONE, "-mavx10.1-512") + ISA_NAMES_TABLE_ENTRY("avx10.1-256", FEATURE_AVX10_1_256, P_AVX10_1_256, "-mavx10.1-256") + ISA_NAMES_TABLE_ENTRY("avx10.1-512", FEATURE_AVX10_1_512, P_AVX10_1_512, "-mavx10.1-512") ISA_NAMES_TABLE_END diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-25.c b/gcc/testsuite/gcc.target/i386/avx10_1-25.c new file mode 100644 index 000..73f1b724560 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx10_1-25.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx" } */ + +#include +__attribute__((target_clones ("default","avx10.1-256"))) +__m256d foo(__m256d a, __m256d b) +{ + return a + b; +} diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-26.c b/gcc/testsuite/gcc.target/i386/avx10_1-26.c new file mode 100644 index 000..514ab57a406 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx10_1-26.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx512f" } */ + +#include +__attribute__((target_clones ("default","avx10.1-512"))) +__m512d foo(__m512d a, __m512d b) +{ + return a + b; +}
[gcc r15-983] Add AVX10.1 target_clones support
https://gcc.gnu.org/g:1f2ca510065a2033bac408eb5a960ef0126f25cc commit r15-983-g1f2ca510065a2033bac408eb5a960ef0126f25cc Author: Haochen Jiang Date: Mon May 20 15:52:32 2024 +0800 Add AVX10.1 target_clones support Since AVX10 is the first major ISA introduced after AVX-512, we propose to add target_clones support for it. Although AVX10.1-256 won't cover 512-bit part of AVX512F, but since it is only for priority but not for implication, it won't be an issue. gcc/ChangeLog: * common/config/i386/i386-common.cc: Change Granite Rapids series CPU type to P_PROC_AVX10_1_512. * common/config/i386/i386-cpuinfo.h (enum feature_priority): Revise comment part. Add P_AVX10_1_256, P_AVX10_1_512, P_PROC_AVX10_1_512. * common/config/i386/i386-isas.h: Link to avx10.1-256, avx10.1-512. gcc/testsuite/ChangeLog: * gcc.target/i386/avx10_1-25.c: New test. * gcc.target/i386/avx10_1-26.c: Ditto. Diff: --- gcc/common/config/i386/i386-common.cc | 4 ++-- gcc/common/config/i386/i386-cpuinfo.h | 5 - gcc/common/config/i386/i386-isas.h | 4 ++-- gcc/testsuite/gcc.target/i386/avx10_1-25.c | 9 + gcc/testsuite/gcc.target/i386/avx10_1-26.c | 9 + 5 files changed, 26 insertions(+), 5 deletions(-) diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc index 895e5fa662d..5d9c188c9c7 100644 --- a/gcc/common/config/i386/i386-common.cc +++ b/gcc/common/config/i386/i386-common.cc @@ -2187,10 +2187,10 @@ const pta processor_alias_table[] = {"meteorlake", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2}, {"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, PTA_GRANITERAPIDS, -M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F}, +M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX10_1_512}, {"graniterapids-d", PROCESSOR_GRANITERAPIDS_D, CPU_HASWELL, PTA_GRANITERAPIDS_D, M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS_D), -P_PROC_AVX512F}, +P_PROC_AVX10_1_512}, {"arrowlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE), P_PROC_AVX2}, {"arrowlake-s", PROCESSOR_ARROWLAKE_S, CPU_HASWELL, PTA_ARROWLAKE_S, diff --git a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-cpuinfo.h index 9edad96d4fd..3ec9e005a6a 100644 --- a/gcc/common/config/i386/i386-cpuinfo.h +++ b/gcc/common/config/i386/i386-cpuinfo.h @@ -110,7 +110,7 @@ enum processor_subtypes /* Priority of i386 features, greater value is higher priority. This is used to decide the order in which function dispatch must happen. For instance, a version specialized for SSE4.2 should be checked for dispatch - before a version for SSE3, as SSE4.2 implies SSE3. */ + before a version for SSE3. */ enum feature_priority { P_NONE = 0, @@ -146,6 +146,9 @@ enum feature_priority P_AVX512F, P_PROC_AVX512F, P_X86_64_V4, + P_AVX10_1_256, + P_AVX10_1_512, + P_PROC_AVX10_1_512, P_PROC_DYNAMIC }; diff --git a/gcc/common/config/i386/i386-isas.h b/gcc/common/config/i386/i386-isas.h index 4b4d4b4af99..2a092f740bb 100644 --- a/gcc/common/config/i386/i386-isas.h +++ b/gcc/common/config/i386/i386-isas.h @@ -184,6 +184,6 @@ ISA_NAMES_TABLE_START ISA_NAMES_TABLE_ENTRY("apxf", FEATURE_APX_F, P_NONE, "-mapxf") ISA_NAMES_TABLE_ENTRY("usermsr", FEATURE_USER_MSR, P_NONE, "-musermsr") ISA_NAMES_TABLE_ENTRY("avx10.1", FEATURE_AVX10_1_256, P_NONE, "-mavx10.1") - ISA_NAMES_TABLE_ENTRY("avx10.1-256", FEATURE_AVX10_1_256, P_NONE, "-mavx10.1-256") - ISA_NAMES_TABLE_ENTRY("avx10.1-512", FEATURE_AVX10_1_512, P_NONE, "-mavx10.1-512") + ISA_NAMES_TABLE_ENTRY("avx10.1-256", FEATURE_AVX10_1_256, P_AVX10_1_256, "-mavx10.1-256") + ISA_NAMES_TABLE_ENTRY("avx10.1-512", FEATURE_AVX10_1_512, P_AVX10_1_512, "-mavx10.1-512") ISA_NAMES_TABLE_END diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-25.c b/gcc/testsuite/gcc.target/i386/avx10_1-25.c new file mode 100644 index 000..73f1b724560 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx10_1-25.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx" } */ + +#include +__attribute__((target_clones ("default","avx10.1-256"))) +__m256d foo(__m256d a, __m256d b) +{ + return a + b; +} diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-26.c b/gcc/testsuite/gcc.target/i386/avx10_1-26.c new file mode 100644 index 000..514ab57a406 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx10_1-26.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx512f" } */ + +#include +__attribute__((target_clones ("default","avx10.1-512"))) +__m512d foo(__m512d a, __m512d b) +{ + return a + b; +}
[gcc r15-888] Align tight loop without considering max skipping bytes.
https://gcc.gnu.org/g:b644126237a1aa8599f767a5e0bbada1d7286f44 commit r15-888-gb644126237a1aa8599f767a5e0bbada1d7286f44 Author: liuhongt Date: Wed May 29 11:14:26 2024 +0800 Align tight loop without considering max skipping bytes. When hot loop is small enough to fix into one cacheline, we should align the loop with ceil_log2 (loop_size) without considering maximum skipp bytes. It will help code prefetch. gcc/ChangeLog: * config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change gen_pad to gen_max_skip_align. (ix86_align_loops): New function. (ix86_reorg): Call ix86_align_loops. * config/i386/i386.md (pad): Rename to .. (max_skip_align): .. this, and accept 2 operands for align and skip. Diff: --- gcc/config/i386/i386.cc | 148 +++- gcc/config/i386/i386.md | 10 ++-- 2 files changed, 153 insertions(+), 5 deletions(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 85d87b9f778..1a0206ab573 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -23146,7 +23146,7 @@ ix86_avoid_jump_mispredicts (void) if (dump_file) fprintf (dump_file, "Padding insn %i by %i bytes!\n", INSN_UID (insn), padsize); - emit_insn_before (gen_pad (GEN_INT (padsize)), insn); + emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn); } } } @@ -23419,6 +23419,150 @@ ix86_split_stlf_stall_load () } } +/* When a hot loop can be fit into one cacheline, + force align the loop without considering the max skip. */ +static void +ix86_align_loops () +{ + basic_block bb; + + /* Don't do this when we don't know cache line size. */ + if (ix86_cost->prefetch_block == 0) +return; + + loop_optimizer_init (AVOID_CFG_MODIFICATIONS); + profile_count count_threshold = cfun->cfg->count_max / param_align_threshold; + FOR_EACH_BB_FN (bb, cfun) +{ + rtx_insn *label = BB_HEAD (bb); + bool has_fallthru = 0; + edge e; + edge_iterator ei; + + if (!LABEL_P (label)) + continue; + + profile_count fallthru_count = profile_count::zero (); + profile_count branch_count = profile_count::zero (); + + FOR_EACH_EDGE (e, ei, bb->preds) + { + if (e->flags & EDGE_FALLTHRU) + has_fallthru = 1, fallthru_count += e->count (); + else + branch_count += e->count (); + } + + if (!fallthru_count.initialized_p () || !branch_count.initialized_p ()) + continue; + + if (bb->loop_father + && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun) + && (has_fallthru + ? (!(single_succ_p (bb) + && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun)) +&& optimize_bb_for_speed_p (bb) +&& branch_count + fallthru_count > count_threshold +&& (branch_count > fallthru_count * param_align_loop_iterations)) + /* In case there'no fallthru for the loop. +Nops inserted won't be executed. */ + : (branch_count > count_threshold +|| (bb->count > bb->prev_bb->count * 10 +&& (bb->prev_bb->count +<= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2) + { + rtx_insn* insn, *end_insn; + HOST_WIDE_INT size = 0; + bool padding_p = true; + basic_block tbb = bb; + unsigned cond_branch_num = 0; + bool detect_tight_loop_p = false; + + for (unsigned int i = 0; i != bb->loop_father->num_nodes; + i++, tbb = tbb->next_bb) + { + /* Only handle continuous cfg layout. */ + if (bb->loop_father != tbb->loop_father) + { + padding_p = false; + break; + } + + FOR_BB_INSNS (tbb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + size += ix86_min_insn_size (insn); + + /* We don't know size of inline asm. +Don't align loop for call. */ + if (asm_noperands (PATTERN (insn)) >= 0 + || CALL_P (insn)) + { + size = -1; + break; + } + } + + if (size == -1 || size > ix86_cost->prefetch_block) + { + padding_p = false; + break; + } + + FOR_EACH_EDGE (e, ei, tbb->succs) + { + /* It could be part of the loop. */ + if (e->dest == bb) + { + detect_tight_loop_p = true; + break; + } + } + + if
[gcc r15-887] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
https://gcc.gnu.org/g:00ed5424b1d4dcccfa187f55205521826794898c commit r15-887-g00ed5424b1d4dcccfa187f55205521826794898c Author: Haochen Jiang Date: Wed May 29 11:13:55 2024 +0800 Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Previously, we use 16:11:8 in generic tune for Intel processors, which lead to cross cache line issue and result in some random performance penalty in benchmarks with small loops commit to commit. After changing to always aligning to 16 bytes, it will somehow solve the issue. gcc/ChangeLog: * config/i386/x86-tune-costs.h (generic_cost): Change from 16:11:8 to 16. Diff: --- gcc/config/i386/x86-tune-costs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 65d7d1f7e42..d34b5cc 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -3758,7 +3758,7 @@ struct processor_costs generic_cost = { generic_memset, COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ COSTS_N_INSNS (2), /* cond_not_taken_branch_cost. */ - "16:11:8", /* Loop alignment. */ + "16",/* Loop alignment. */ "16:11:8", /* Jump alignment. */ "0:0:8", /* Label alignment. */ "16",/* Func alignment. */
[gcc r14-10254] Align tight loop without considering max skipping bytes.
https://gcc.gnu.org/g:b4d4ece0443433cd5c3078cfe03f18429e73b77a commit r14-10254-gb4d4ece0443433cd5c3078cfe03f18429e73b77a Author: liuhongt Date: Wed May 29 11:12:51 2024 +0800 Align tight loop without considering max skipping bytes. When hot loop is small enough to fix into one cacheline, we should align the loop with ceil_log2 (loop_size) without considering maximum skipp bytes. It will help code prefetch. gcc/ChangeLog: * config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change gen_pad to gen_max_skip_align. (ix86_align_loops): New function. (ix86_reorg): Call ix86_align_loops. * config/i386/i386.md (pad): Rename to .. (max_skip_align): .. this, and accept 2 operands for align and skip. Diff: --- gcc/config/i386/i386.cc | 148 +++- gcc/config/i386/i386.md | 10 ++-- 2 files changed, 153 insertions(+), 5 deletions(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index fbd9b4dac2e..984ba37beeb 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -23135,7 +23135,7 @@ ix86_avoid_jump_mispredicts (void) if (dump_file) fprintf (dump_file, "Padding insn %i by %i bytes!\n", INSN_UID (insn), padsize); - emit_insn_before (gen_pad (GEN_INT (padsize)), insn); + emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn); } } } @@ -23408,6 +23408,150 @@ ix86_split_stlf_stall_load () } } +/* When a hot loop can be fit into one cacheline, + force align the loop without considering the max skip. */ +static void +ix86_align_loops () +{ + basic_block bb; + + /* Don't do this when we don't know cache line size. */ + if (ix86_cost->prefetch_block == 0) +return; + + loop_optimizer_init (AVOID_CFG_MODIFICATIONS); + profile_count count_threshold = cfun->cfg->count_max / param_align_threshold; + FOR_EACH_BB_FN (bb, cfun) +{ + rtx_insn *label = BB_HEAD (bb); + bool has_fallthru = 0; + edge e; + edge_iterator ei; + + if (!LABEL_P (label)) + continue; + + profile_count fallthru_count = profile_count::zero (); + profile_count branch_count = profile_count::zero (); + + FOR_EACH_EDGE (e, ei, bb->preds) + { + if (e->flags & EDGE_FALLTHRU) + has_fallthru = 1, fallthru_count += e->count (); + else + branch_count += e->count (); + } + + if (!fallthru_count.initialized_p () || !branch_count.initialized_p ()) + continue; + + if (bb->loop_father + && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun) + && (has_fallthru + ? (!(single_succ_p (bb) + && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun)) +&& optimize_bb_for_speed_p (bb) +&& branch_count + fallthru_count > count_threshold +&& (branch_count > fallthru_count * param_align_loop_iterations)) + /* In case there'no fallthru for the loop. +Nops inserted won't be executed. */ + : (branch_count > count_threshold +|| (bb->count > bb->prev_bb->count * 10 +&& (bb->prev_bb->count +<= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2) + { + rtx_insn* insn, *end_insn; + HOST_WIDE_INT size = 0; + bool padding_p = true; + basic_block tbb = bb; + unsigned cond_branch_num = 0; + bool detect_tight_loop_p = false; + + for (unsigned int i = 0; i != bb->loop_father->num_nodes; + i++, tbb = tbb->next_bb) + { + /* Only handle continuous cfg layout. */ + if (bb->loop_father != tbb->loop_father) + { + padding_p = false; + break; + } + + FOR_BB_INSNS (tbb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + size += ix86_min_insn_size (insn); + + /* We don't know size of inline asm. +Don't align loop for call. */ + if (asm_noperands (PATTERN (insn)) >= 0 + || CALL_P (insn)) + { + size = -1; + break; + } + } + + if (size == -1 || size > ix86_cost->prefetch_block) + { + padding_p = false; + break; + } + + FOR_EACH_EDGE (e, ei, tbb->succs) + { + /* It could be part of the loop. */ + if (e->dest == bb) + { + detect_tight_loop_p = true; + break; + } + } + + if
[gcc r14-10253] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
https://gcc.gnu.org/g:80600352d1282f084900ab444f2d4c83986f2ae5 commit r14-10253-g80600352d1282f084900ab444f2d4c83986f2ae5 Author: Haochen Jiang Date: Wed May 29 11:12:37 2024 +0800 Adjust generic loop alignment from 16:11:8 to 16 for Intel processors Previously, we use 16:11:8 in generic tune for Intel processors, which lead to cross cache line issue and result in some random performance penalty in benchmarks with small loops commit to commit. After changing to always aligning to 16 bytes, it will somehow solve the issue. gcc/ChangeLog: * config/i386/x86-tune-costs.h (generic_cost): Change from 16:11:8 to 16. Diff: --- gcc/config/i386/x86-tune-costs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 65d7d1f7e42..d34b5cc 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -3758,7 +3758,7 @@ struct processor_costs generic_cost = { generic_memset, COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ COSTS_N_INSNS (2), /* cond_not_taken_branch_cost. */ - "16:11:8", /* Loop alignment. */ + "16",/* Loop alignment. */ "16:11:8", /* Jump alignment. */ "0:0:8", /* Label alignment. */ "16",/* Func alignment. */
[gcc r14-10229] i386: Disable ix86_expand_vecop_qihi2 when !TARGET_AVX512BW
https://gcc.gnu.org/g:1ad5c9d524d8fa99773045e75da04ae958012085 commit r14-10229-g1ad5c9d524d8fa99773045e75da04ae958012085 Author: Haochen Jiang Date: Tue May 21 14:10:43 2024 +0800 i386: Disable ix86_expand_vecop_qihi2 when !TARGET_AVX512BW Since vpermq is really slow, we should avoid using it for permutation when vpmovwb is not available (needs AVX512BW) for ix86_expand_vecop_qihi2 and fall back to ix86_expand_vecop_qihi. gcc/ChangeLog: PR target/115069 * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Do not enable the optimization when AVX512BW is not enabled. gcc/testsuite/ChangeLog: PR target/115069 * gcc.target/i386/pr115069.c: New. Diff: --- gcc/config/i386/i386-expand.cc | 7 +++ gcc/testsuite/gcc.target/i386/pr115069.c | 9 + 2 files changed, 16 insertions(+) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 8bb8f21e686..51efe6fdd7d 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -23963,6 +23963,13 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, rtx op1, rtx op2) bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT; bool uns_p = code != ASHIFTRT; + /* Without VPMOVWB (provided by AVX512BW ISA), the expansion uses the + generic permutation to merge the data back into the right place. This + permutation results in VPERMQ, which is slow, so better fall back to + ix86_expand_vecop_qihi. */ + if (!TARGET_AVX512BW) +return false; + if ((qimode == V16QImode && !TARGET_AVX2) || (qimode == V32QImode && (!TARGET_AVX512BW || !TARGET_EVEX512)) /* There are no V64HImode instructions. */ diff --git a/gcc/testsuite/gcc.target/i386/pr115069.c b/gcc/testsuite/gcc.target/i386/pr115069.c new file mode 100644 index 000..50a3e033079 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr115069.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx2" } */ +/* { dg-final { scan-assembler-not "vpermq" } } */ + +typedef char v16qi __attribute__((vector_size(16))); + +v16qi foo (v16qi a, v16qi b) { +return a * b; +}
[gcc r15-764] i386: Disable ix86_expand_vecop_qihi2 when !TARGET_AVX512BW
https://gcc.gnu.org/g:73a167cfa225d5ee7092d41596b9fea1719898ff commit r15-764-g73a167cfa225d5ee7092d41596b9fea1719898ff Author: Haochen Jiang Date: Tue May 21 14:10:43 2024 +0800 i386: Disable ix86_expand_vecop_qihi2 when !TARGET_AVX512BW Since vpermq is really slow, we should avoid using it for permutation when vpmovwb is not available (needs AVX512BW) for ix86_expand_vecop_qihi2 and fall back to ix86_expand_vecop_qihi. gcc/ChangeLog: PR target/115069 * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Do not enable the optimization when AVX512BW is not enabled. gcc/testsuite/ChangeLog: PR target/115069 * gcc.target/i386/pr115069.c: New. Diff: --- gcc/config/i386/i386-expand.cc | 7 +++ gcc/testsuite/gcc.target/i386/pr115069.c | 9 + 2 files changed, 16 insertions(+) diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 7142c0a9d77..ec402a78a09 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -24188,6 +24188,13 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, rtx op1, rtx op2) bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT; bool uns_p = code != ASHIFTRT; + /* Without VPMOVWB (provided by AVX512BW ISA), the expansion uses the + generic permutation to merge the data back into the right place. This + permutation results in VPERMQ, which is slow, so better fall back to + ix86_expand_vecop_qihi. */ + if (!TARGET_AVX512BW) +return false; + if ((qimode == V16QImode && !TARGET_AVX2) || (qimode == V32QImode && (!TARGET_AVX512BW || !TARGET_EVEX512)) /* There are no V64HImode instructions. */ diff --git a/gcc/testsuite/gcc.target/i386/pr115069.c b/gcc/testsuite/gcc.target/i386/pr115069.c new file mode 100644 index 000..50a3e033079 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr115069.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx2" } */ +/* { dg-final { scan-assembler-not "vpermq" } } */ + +typedef char v16qi __attribute__((vector_size(16))); + +v16qi foo (v16qi a, v16qi b) { +return a * b; +}
[gcc r13-8652] i386: Fix array index overflow in pr105354-2.c
https://gcc.gnu.org/g:7425436b5382a04f3eb28c7c7912f4d9a1cad0bd commit r13-8652-g7425436b5382a04f3eb28c7c7912f4d9a1cad0bd Author: Haochen Jiang Date: Fri Apr 26 16:48:29 2024 +0800 i386: Fix array index overflow in pr105354-2.c The array index should not be over 8 for v8hi, or it will fail under -O0 or using -fstack-protector. gcc/testsuite/ChangeLog: PR target/110621 * gcc.target/i386/pr105354-2.c: As mentioned. Diff: --- gcc/testsuite/gcc.target/i386/pr105354-2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.target/i386/pr105354-2.c b/gcc/testsuite/gcc.target/i386/pr105354-2.c index b78b62e1e7e..1c592e84860 100644 --- a/gcc/testsuite/gcc.target/i386/pr105354-2.c +++ b/gcc/testsuite/gcc.target/i386/pr105354-2.c @@ -17,7 +17,7 @@ sse2_test (void) b.a[i] = i + 16; res_ab.a[i] = 0; exp_ab.a[i] = -1; - if (i <= 8) + if (i < 8) { c.a[i] = i; d.a[i] = i + 8;
[gcc r14-10137] i386: Fix array index overflow in pr105354-2.c
https://gcc.gnu.org/g:4a2e55b3ada20fe6457466bb687a66c8d03e056e commit r14-10137-g4a2e55b3ada20fe6457466bb687a66c8d03e056e Author: Haochen Jiang Date: Fri Apr 26 16:48:29 2024 +0800 i386: Fix array index overflow in pr105354-2.c The array index should not be over 8 for v8hi, or it will fail under -O0 or using -fstack-protector. gcc/testsuite/ChangeLog: PR target/110621 * gcc.target/i386/pr105354-2.c: As mentioned. Diff: --- gcc/testsuite/gcc.target/i386/pr105354-2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.target/i386/pr105354-2.c b/gcc/testsuite/gcc.target/i386/pr105354-2.c index b78b62e1e7e..1c592e84860 100644 --- a/gcc/testsuite/gcc.target/i386/pr105354-2.c +++ b/gcc/testsuite/gcc.target/i386/pr105354-2.c @@ -17,7 +17,7 @@ sse2_test (void) b.a[i] = i + 16; res_ab.a[i] = 0; exp_ab.a[i] = -1; - if (i <= 8) + if (i < 8) { c.a[i] = i; d.a[i] = i + 8;
[gcc r14-10104] i386: Fix behavior for both using AVX10.1-256 in options and function attribute
https://gcc.gnu.org/g:d279c9d89b2f6ce89c1eec0ff4b980e9c5f51fd1 commit r14-10104-gd279c9d89b2f6ce89c1eec0ff4b980e9c5f51fd1 Author: Haochen Jiang Date: Wed Apr 24 10:43:18 2024 +0800 i386: Fix behavior for both using AVX10.1-256 in options and function attribute When we are using -mavx10.1-256 in command line and avx10.1-256 in target attribute together, zmm should never be generated. But current GCC will generate zmm since it wrongly enables EVEX512 for non-explicitly set AVX512. This patch will fix that issue. gcc/ChangeLog: * config/i386/i386-options.cc (ix86_valid_target_attribute_tree): Check whether AVX512F is explicitly enabled. gcc/testsuite/ChangeLog: * gcc.target/i386/avx10_1-24.c: New test. Diff: --- gcc/config/i386/i386-options.cc| 1 + gcc/testsuite/gcc.target/i386/avx10_1-24.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index 68a2e1c6910..ac48b5c61c4 100644 --- a/gcc/config/i386/i386-options.cc +++ b/gcc/config/i386/i386-options.cc @@ -1431,6 +1431,7 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args, scenario. */ if ((def->x_ix86_isa_flags2 & OPTION_MASK_ISA2_AVX10_1_256) && (opts->x_ix86_isa_flags & OPTION_MASK_ISA_AVX512F) + && (opts->x_ix86_isa_flags_explicit & OPTION_MASK_ISA_AVX512F) && !(def->x_ix86_isa_flags2_explicit & OPTION_MASK_ISA2_EVEX512) && !(opts->x_ix86_isa_flags2_explicit & OPTION_MASK_ISA2_EVEX512)) opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_EVEX512; diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-24.c b/gcc/testsuite/gcc.target/i386/avx10_1-24.c new file mode 100644 index 000..2e93f041760 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx10_1-24.c @@ -0,0 +1,7 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=x86-64 -mavx10.1" } */ +/* { dg-final { scan-assembler-not "%zmm" } } */ + +typedef float __m512 __attribute__ ((__vector_size__ (64), __may_alias__)); + +void __attribute__((target("avx10.1-256"))) callee256(__m512 *a, __m512 *b) { *a = *b; }
[gcc r13-8641] i386: Fix Sierra Forest auto dispatch
https://gcc.gnu.org/g:d80c9df20ed77a26eb71457679dad2b564c5da60 commit r13-8641-gd80c9df20ed77a26eb71457679dad2b564c5da60 Author: Haochen Jiang Date: Mon Apr 22 16:57:36 2024 +0800 i386: Fix Sierra Forest auto dispatch gcc/ChangeLog: * common/config/i386/i386-common.cc (processor_alias_table): Let Sierra Forest map to CPU_TYPE enum. Diff: --- gcc/common/config/i386/i386-common.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc index 988805a3aed..a8809889360 100644 --- a/gcc/common/config/i386/i386-common.cc +++ b/gcc/common/config/i386/i386-common.cc @@ -2110,7 +2110,7 @@ const pta processor_alias_table[] = {"gracemont", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2}, {"sierraforest", PROCESSOR_SIERRAFOREST, CPU_HASWELL, PTA_SIERRAFOREST, -M_CPU_SUBTYPE (INTEL_SIERRAFOREST), P_PROC_AVX2}, +M_CPU_TYPE (INTEL_SIERRAFOREST), P_PROC_AVX2}, {"grandridge", PROCESSOR_GRANDRIDGE, CPU_HASWELL, PTA_GRANDRIDGE, M_CPU_TYPE (INTEL_GRANDRIDGE), P_PROC_AVX2}, {"knl", PROCESSOR_KNL, CPU_SLM, PTA_KNL,
[gcc r14-10072] i386: Fix Sierra Forest auto dispatch
https://gcc.gnu.org/g:6b5248d15c6d10325c6cbb92a0e0a9eb04e3f122 commit r14-10072-g6b5248d15c6d10325c6cbb92a0e0a9eb04e3f122 Author: Haochen Jiang Date: Mon Apr 22 16:57:36 2024 +0800 i386: Fix Sierra Forest auto dispatch gcc/ChangeLog: * common/config/i386/i386-common.cc (processor_alias_table): Let Sierra Forest map to CPU_TYPE enum. Diff: --- gcc/common/config/i386/i386-common.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc index f814df8385b..77b154663bc 100644 --- a/gcc/common/config/i386/i386-common.cc +++ b/gcc/common/config/i386/i386-common.cc @@ -2302,7 +2302,7 @@ const pta processor_alias_table[] = {"gracemont", PROCESSOR_ALDERLAKE, CPU_HASWELL, PTA_ALDERLAKE, M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2}, {"sierraforest", PROCESSOR_SIERRAFOREST, CPU_HASWELL, PTA_SIERRAFOREST, -M_CPU_SUBTYPE (INTEL_SIERRAFOREST), P_PROC_AVX2}, +M_CPU_TYPE (INTEL_SIERRAFOREST), P_PROC_AVX2}, {"grandridge", PROCESSOR_GRANDRIDGE, CPU_HASWELL, PTA_GRANDRIDGE, M_CPU_TYPE (INTEL_GRANDRIDGE), P_PROC_AVX2}, {"clearwaterforest", PROCESSOR_CLEARWATERFOREST, CPU_HASWELL,
gcc-wwwdocs branch master updated. 033976162ed4745f7f808f14ba62b1c055e35d16
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "gcc-wwwdocs". The branch, master has been updated via 033976162ed4745f7f808f14ba62b1c055e35d16 (commit) from 9e32f911b70a8c2303b9b60679ce337896ccffdd (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log - commit 033976162ed4745f7f808f14ba62b1c055e35d16 Author: Haochen Jiang Date: Fri Apr 12 16:34:48 2024 +0800 Uncomment MCore part title diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index 14301157..8ac08e9a 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -828,7 +828,7 @@ __asm (".global __flmap_lock" "\n\t" - +MCore Bitfields are now signed by default per GCC policy. If you need bitfields to be unsigned, use -funsigned-bitfields. --- Summary of changes: htdocs/gcc-14/changes.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) hooks/post-receive -- gcc-wwwdocs