Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.
On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor wrote: > > On 9/21/21 7:38 PM, Hongtao Liu wrote: > > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor wrote: > ... > > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > index 1d79930cd58..9351f7e7a1a 100644 > > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > @@ -1,7 +1,7 @@ > >/* PR middle-end/91458 - inconsistent warning for writing past the > > end > > of an array member > > { dg-do compile } > > - { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */ > > + { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf > > -fno-tree-vectorize" } */ > > The testcase is large - what part requires this change? Given the > testcase was added for inconsistent warnings do they now become > inconsistent again as we enable vectorization at -O2? > > That said, the testcase adjustments need some explaining - I suppose > you didn't just slap -fno-tree-vectorize to all of those changing > behavior? > > >>> void ga1_ (void) > >>> { > >>> a1_.a[0] = 0; > >>> a1_.a[1] = 1; // { dg-warning > >>> "\\\[-Wstringop-overflow" } > >>> a1_.a[2] = 2; // { dg-warning > >>> "\\\[-Wstringop-overflow" } > >>> > >>> struct A1 a; > >>> a.a[0] = 0; > >>> a.a[1] = 1; // { dg-warning > >>> "\\\[-Wstringop-overflow" } > >>> a.a[2] = 2; // { dg-warning > >>> "\\\[-Wstringop-overflow" } > >>> sink (&a); > >>> } > >>> > >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since > >>> there are 2 accesses, but after enabling vectorization, there's only > >>> one access, so one warning is missing which causes the failure. > > With the stores vectorized, is the warning on the correct line or > does it point to the first store, the one that's in bounds, as > it does with -O3? The latter would be a regression at -O2. For the upper case, It points to the second store which is out of bounds, the third store warning is missing. > > >> > >> I would find it preferable to change the test code over disabling > >> optimizations that are on by default. My concern is that the test > >> would no longer exercise the default behavior. (The same goes for > >> the -fno-ipa-icf option.) > > Hmm, it's a middle-end test, for some backend, it may not do > > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and > > relative cost model). > > Yes, there are quite a few warning tests like that. Their main > purpose is to verify that in common GCC invocations (i.e., without > any special options) warnings are a) issued when expected and b) > not issued when not expected. Otherwise, middle end warnings are > known to have both false positives and false negatives in some > invocations, depending on what optimizations are in effect. > Indiscriminately disabling common optimizations for these large > tests and invoking them under artificial conditions would > compromise this goal and hide the problems. > > If enabling vectorization at -O2 causes regressions in the quality > of diagnostics (as the test failure above indicates seems to be > happening) we should investigate these and open bugs for them so > they can be fixed. We can then tweak the specific failing test > cases to avoid the failures until they are fixed. There are indeed cases of false positives and false negatives .i.e. // Verify warning for access to a definition with an initializer that // initializes the one-element array member. struct A1 a1i_1 = { 0, { 1 } }; void ga1i_1 (void) { a1i_1.a[0] = 0; a1i_1.a[1] = 1; // { dg-warning "\\\[-Wstringop-overflow" } a1i_1.a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" } struct A1 a = { 0, { 1 } }; --- false positive here. a.a[0] = 1; a.a[1] = 2; // { dg-warning "\\\[-Wstringop-overflow" } false negative here. a.a[2] = 3; // { dg-warning "\\\[-Wstringop-overflow" } false negative here. sink (&a); } the last 2 warnings are missing, and there's new warning on the line *struct A1 a = { 0, { 1 } }; > > Martin -- BR, Hongtao
Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.
On Thu, Sep 23, 2021 at 9:48 AM Hongtao Liu wrote: > > On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor wrote: > > > > On 9/21/21 7:38 PM, Hongtao Liu wrote: > > > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor wrote: > > ... > > > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > index 1d79930cd58..9351f7e7a1a 100644 > > > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > @@ -1,7 +1,7 @@ > > >/* PR middle-end/91458 - inconsistent warning for writing past the > > > end > > > of an array member > > > { dg-do compile } > > > - { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */ > > > + { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf > > > -fno-tree-vectorize" } */ > > > > The testcase is large - what part requires this change? Given the > > testcase was added for inconsistent warnings do they now become > > inconsistent again as we enable vectorization at -O2? > > > > That said, the testcase adjustments need some explaining - I suppose > > you didn't just slap -fno-tree-vectorize to all of those changing > > behavior? > > > > >>> void ga1_ (void) > > >>> { > > >>> a1_.a[0] = 0; > > >>> a1_.a[1] = 1; // { dg-warning > > >>> "\\\[-Wstringop-overflow" } > > >>> a1_.a[2] = 2; // { dg-warning > > >>> "\\\[-Wstringop-overflow" } > > >>> > > >>> struct A1 a; > > >>> a.a[0] = 0; > > >>> a.a[1] = 1; // { dg-warning > > >>> "\\\[-Wstringop-overflow" } > > >>> a.a[2] = 2; // { dg-warning > > >>> "\\\[-Wstringop-overflow" } > > >>> sink (&a); > > >>> } > > >>> > > >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since > > >>> there are 2 accesses, but after enabling vectorization, there's only > > >>> one access, so one warning is missing which causes the failure. > > > > With the stores vectorized, is the warning on the correct line or > > does it point to the first store, the one that's in bounds, as > > it does with -O3? The latter would be a regression at -O2. > For the upper case, It points to the second store which is out of > bounds, the third store warning is missing. > > > > >> > > >> I would find it preferable to change the test code over disabling > > >> optimizations that are on by default. My concern is that the test > > >> would no longer exercise the default behavior. (The same goes for > > >> the -fno-ipa-icf option.) > > > Hmm, it's a middle-end test, for some backend, it may not do > > > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and > > > relative cost model). > > > > Yes, there are quite a few warning tests like that. Their main > > purpose is to verify that in common GCC invocations (i.e., without > > any special options) warnings are a) issued when expected and b) > > not issued when not expected. Otherwise, middle end warnings are > > known to have both false positives and false negatives in some > > invocations, depending on what optimizations are in effect. > > Indiscriminately disabling common optimizations for these large > > tests and invoking them under artificial conditions would > > compromise this goal and hide the problems. > > > > If enabling vectorization at -O2 causes regressions in the quality > > of diagnostics (as the test failure above indicates seems to be > > happening) we should investigate these and open bugs for them so > > they can be fixed. We can then tweak the specific failing test > > cases to avoid the failures until they are fixed. > There are indeed cases of false positives and false negatives > .i.e. > // Verify warning for access to a definition with an initializer that > // initializes the one-element array member. > struct A1 a1i_1 = { 0, { 1 } }; > > void ga1i_1 (void) > { > a1i_1.a[0] = 0; > a1i_1.a[1] = 1; // { dg-warning "\\\[-Wstringop-overflow" } > a1i_1.a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" } > > struct A1 a = { 0, { 1 } }; --- false positive here. > a.a[0] = 1; > a.a[1] = 2; // { dg-warning > "\\\[-Wstringop-overflow" } false negative here. > a.a[2] = 3; // { dg-warning > "\\\[-Wstringop-overflow" } false negative here. > sink (&a); > } Similar for * gcc.dg/Warray-bounds-51.c. * gcc.dg/Warray-parameter-3.c * gcc.dg/Wstringop-overflow-14.c * gcc.dg/Wstringop-overflow-21.c So there're 3 situations. 1. All accesses are out of bound, and after vectorization, there are some warnings missing. 2. Part of accesses are inbound, part of accesses are out of bound, and after vectorization, the warning goes from out of bound line to inbound line. 3. All access are out of bound, and after vectoriation, all warning are missing, and goes to a false-positive li
[PATCH v4 0/2] Implement indirect external access
Changes in the v4 patch. 1. Add nodirect_extern_access attribute. Changes in the v3 patch. 1. GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS support has been added to GNU binutils 2.38. But the -z indirect-extern-access linker option is only available for Linux/x86. However, the --max-cache-size=SIZE linker option was also addded within a day. --max-cache-size=SIZE is used to check for GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS support. Changes in the v2 patch. 1. Rename the option to -fdirect-extern-access. --- On systems with copy relocation: * A copy in executable is created for the definition in a shared library at run-time by ld.so. * The copy is referenced by executable and shared libraries. * Executable can access the copy directly. Issues are: * Overhead of a copy, time and space, may be visible at run-time. * Read-only data in the shared library becomes read-write copy in executable at run-time. * Local access to data with the STV_PROTECTED visibility in the shared library must use GOT. On systems without function descriptor, function pointers vary depending on where and how the functions are defined. * If the function is defined in executable, it can be the address of function body. * If the function, including the function with STV_PROTECTED visibility, is defined in the shared library, it can be the address of the PLT entry in executable or shared library. Issues are: * The address of function body may not be used as its function pointer. * ld.so needs to search loaded shared libraries for the function pointer of the function with STV_PROTECTED visibility. Here is a proposal to remove copy relocation and use canonical function pointer: 1. Accesses, including in PIE and non-PIE, to undefined symbols must use GOT. a. Linker may optimize out GOT access if the data is defined in PIE or non-PIE. 2. Read-only data in the shared library remain read-only at run-time 3. Address of global data with the STV_PROTECTED visibility in the shared library is the address of data body. a. Can use IP-relative access. b. May need GOT without IP-relative access. 4. For systems without function descriptor, a. All global function pointers of undefined functions in PIE and non-PIE must use GOT. Linker may optimize out GOT access if the function is defined in PIE or non-PIE. b. Function pointer of functions with the STV_PROTECTED visibility in executable and shared library is the address of function body. i. Can use IP-relative access. ii. May need GOT without IP-relative access. iii. Branches to undefined functions may use PLT. 5. Single global definition marker: Add GNU_PROPERTY_1_NEEDED: #define GNU_PROPERTY_1_NEEDED GNU_PROPERTY_UINT32_OR_LO to indicate the needed properties by the object file. Add GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS: #define GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (1U << 0) to indicate that the object file requires canonical function pointers and cannot be used with copy relocation. This bit should be cleared in executable when there are non-GOT or non-PLT relocations in relocatable input files without this bit set. a. Protected symbol access within the shared library can be treated as local. b. Copy relocation should be disallowed at link-time and run-time. c. GOT function pointer reference is required at link-time and run-time. The indirect external access marker can be used in the following ways: 1. Linker can decide the best way to resolve a relocation against a protected symbol before seeing all relocations against the symbol. 2. Dynamic linker can decide if it is an error to have a copy relocation in executable against the protected symbol in a shared library by checking if the shared library is built with -fno-direct-extern-access. Add a compiler option, -fdirect-extern-access. -fdirect-extern-access is the default. With -fno-direct-extern-access: 1. Always to use GOT to access undefined symbols, including in PIE and non-PIE. This is safe to do and does not break the ABI. 2. In executable and shared library, for symbols with the STV_PROTECTED visibility: a. The address of data symbol is the address of data body. b. For systems without function descriptor, the function pointer is the address of function body. These break the ABI and resulting shared libraries may not be compatible with executables which are not compiled with -fno-direct-extern-access. 3. Generate an indirect external access marker in relocatable objects if supported by linker. H.J. Lu (2): Add -f[no-]direct-extern-access Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE gcc/c-family/c-attribs.c| 34 +++ gcc/common.opt | 4 ++ gcc/config.in | 7 +++ gcc/config/i386/gnu-property.c | 31 -- gcc/config/i386/i386-protos.h | 2 +- gcc/config/i386/i386.c | 64 - gcc/configure
[PATCH v4 2/2] Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE
Generate the marker for -fno-direct-extern-access to indicate that the object file uses GOT to access all external symbols. Access to protected symbols in the resulting shared library is treated as local, which requires canonical function pointers and cannot be used with copy relocation. This marker can be used in the following ways: 1. Linker can decide the best way to resolve a relocation against a protected symbol before seeing all relocations against the symbol. 2. Dynamic linker can decide if it is an error to have a copy relocation in executable against the protected symbol in a shared library by checking if the shared library is built with -fno-direct-extern-access. * configure.ac (HAVE_LD_INDIRECT_EXTERN_ACCESS_SUPPORT): New. Define to 1 if linker supports GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS. * output.h (emit_gnu_property): New. (emit_gnu_property_note): Likewise. * target.def (emit_gnu_property_note): Add a argetm.asm_out hook. * toplev.c (compile_file): Call emit_gnu_property_note before file_end. * varasm.c (emit_gnu_property): New. (emit_gnu_property_note): Likewise. * config.in: Regenerated. * configure: Likewise. * doc/tm.texi: Likewise. * config/i386/gnu-property.c (emit_gnu_property): Removed. (TARGET_ASM_EMIT_GNU_PROPERTY_NOTE): New. * doc/tm.texi.in: Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE. --- gcc/config.in | 7 + gcc/config/i386/gnu-property.c | 31 -- gcc/config/i386/i386.c | 2 ++ gcc/configure | 27 +++ gcc/configure.ac | 23 + gcc/doc/tm.texi| 5 gcc/doc/tm.texi.in | 2 ++ gcc/output.h | 2 ++ gcc/target.def | 8 ++ gcc/toplev.c | 3 +++ gcc/varasm.c | 47 ++ 11 files changed, 126 insertions(+), 31 deletions(-) diff --git a/gcc/config.in b/gcc/config.in index 61cafe4f6c0..8a756aa3541 100644 --- a/gcc/config.in +++ b/gcc/config.in @@ -1690,6 +1690,13 @@ #endif +/* Define to 1 if your linker supports + GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS. */ +#ifndef USED_FOR_TARGET +#undef HAVE_LD_INDIRECT_EXTERN_ACCESS_SUPPORT +#endif + + /* Define if your PowerPC64 linker supports a large TOC. */ #ifndef USED_FOR_TARGET #undef HAVE_LD_LARGE_TOC diff --git a/gcc/config/i386/gnu-property.c b/gcc/config/i386/gnu-property.c index 4ba04403002..9fe8d00132e 100644 --- a/gcc/config/i386/gnu-property.c +++ b/gcc/config/i386/gnu-property.c @@ -24,37 +24,6 @@ along with GCC; see the file COPYING3. If not see #include "output.h" #include "linux-common.h" -static void -emit_gnu_property (unsigned int type, unsigned int data) -{ - int p2align = ptr_mode == SImode ? 2 : 3; - - switch_to_section (get_section (".note.gnu.property", - SECTION_NOTYPE, NULL)); - - ASM_OUTPUT_ALIGN (asm_out_file, p2align); - /* name length. */ - fprintf (asm_out_file, ASM_LONG "1f - 0f\n"); - /* data length. */ - fprintf (asm_out_file, ASM_LONG "4f - 1f\n"); - /* note type: NT_GNU_PROPERTY_TYPE_0. */ - fprintf (asm_out_file, ASM_LONG "5\n"); - fprintf (asm_out_file, "0:\n"); - /* vendor name: "GNU". */ - fprintf (asm_out_file, STRING_ASM_OP "\"GNU\"\n"); - fprintf (asm_out_file, "1:\n"); - ASM_OUTPUT_ALIGN (asm_out_file, p2align); - /* pr_type. */ - fprintf (asm_out_file, ASM_LONG "0x%x\n", type); - /* pr_datasz. */ - fprintf (asm_out_file, ASM_LONG "3f - 2f\n"); - fprintf (asm_out_file, "2:\n"); - fprintf (asm_out_file, ASM_LONG "0x%x\n", data); - fprintf (asm_out_file, "3:\n"); - ASM_OUTPUT_ALIGN (asm_out_file, p2align); - fprintf (asm_out_file, "4:\n"); -} - void file_end_indicate_exec_stack_and_gnu_property (void) { diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9ca1ef512a4..3b678c4d5b6 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -24569,6 +24569,8 @@ ix86_libgcc_floating_mode_supported_p #if !TARGET_MACHO && !TARGET_DLLIMPORT_DECL_ATTRIBUTES # undef TARGET_ASM_RELOC_RW_MASK # define TARGET_ASM_RELOC_RW_MASK ix86_reloc_rw_mask +# undef TARGET_ASM_EMIT_GNU_PROPERTY_NOTE +# define TARGET_ASM_EMIT_GNU_PROPERTY_NOTE emit_gnu_property_note #endif static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED) diff --git a/gcc/configure b/gcc/configure index b3de17009b8..13fe041b0b6 100755 --- a/gcc/configure +++ b/gcc/configure @@ -32172,6 +32172,33 @@ fi { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ld_bndplt_support" >&5 $as_echo "$ld_bndplt_support" >&6; } +# Check if linker supports GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS. +ld_indirect_extern_access=0 +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking linker with GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS" >&5 +$as_echo_n "checking lin
[PATCH v4 1/2] Add -f[no-]direct-extern-access
Add -f[no-]direct-extern-access and nodirect_extern_access attribute. -fdirect-extern-access is the default and always use GOT to access undefined data and function symbols with nodirect_extern_access attribute, including in PIE and non-PIE. With -fno-direct-extern-access: 1. Always use GOT to access undefined data and function symbols, including in PIE and non-PIE. These will avoid copy relocations in executables. This is compatible with existing executables and shared libraries. 2. In executable and shared library, bind symbols with the STV_PROTECTED visibility locally: a. The address of data symbol is the address of data body. b. For systems without function descriptor, the function pointer is the address of function body. c. The resulting shared libraries may not be incompatible with executables which have copy relocations on protected symbols or use executable PLT entries as function addresses for protected functions in shared libraries. 3. Update asm_preferred_eh_data_format to select PC relative EH encoding format with -fno-direct-extern-access to avoid copy relocation. 4. Add ix86_reloc_rw_mask for TARGET_ASM_RELOC_RW_MASK to avoid copy relocation with -fno-direct-extern-access. gcc/ PR target/35513 PR target/100593 * common.opt: Add -fdirect-extern-access. * config/i386/i386-protos.h (ix86_force_load_from_GOT_p): Add a bool argument. * config/i386/i386.c (ix86_force_load_from_GOT_p): Add a bool argument to indicate call operand. Force non-call load from GOT for -fno-direct-extern-access or nodirect_extern_access attribute. (legitimate_pic_address_disp_p): Avoid copy relocation in PIE for -fno-direct-extern-access or nodirect_extern_access attribute. (ix86_print_operand): Pass true to ix86_force_load_from_GOT_p for call operand. (asm_preferred_eh_data_format): Use PC-relative format for -fno-direct-extern-access to avoid copy relocation. Check ptr_mode instead of TARGET_64BIT when selecting DW_EH_PE_sdata4. (ix86_binds_local_p): Don't treat protected data as extern and avoid copy relocation on common symbol with -fno-direct-extern-access or nodirect_extern_access attribute. (ix86_reloc_rw_mask): New to avoid copy relocation for -fno-direct-extern-access. (TARGET_ASM_RELOC_RW_MASK): New. * doc/extend.texi: Document nodirect_extern_access attribute. * doc/invoke.texi: Document -f[no-]direct-extern-access. gcc/c-family/ PR target/35513 PR target/100593 * c-attribs.c (handle_nodirect_extern_access_attribute): New. (c_common_attribute_table): Add nodirect_extern_access. gcc/testsuite/ PR target/35513 PR target/100593 * g++.dg/pr35513-1.C: New file. * g++.dg/pr35513-2.C: Likewise. * gcc.target/i386/pr35513-1a.c: Likewise. * gcc.target/i386/pr35513-1b.c: Likewise. * gcc.target/i386/pr35513-2a.c: Likewise. * gcc.target/i386/pr35513-2b.c: Likewise. * gcc.target/i386/pr35513-3a.c: Likewise. * gcc.target/i386/pr35513-3b.c: Likewise. * gcc.target/i386/pr35513-4a.c: Likewise. * gcc.target/i386/pr35513-4b.c: Likewise. * gcc.target/i386/pr35513-5a.c: Likewise. * gcc.target/i386/pr35513-5b.c: Likewise. * gcc.target/i386/pr35513-6a.c: Likewise. * gcc.target/i386/pr35513-6b.c: Likewise. * gcc.target/i386/pr35513-7a.c: Likewise. * gcc.target/i386/pr35513-7b.c: Likewise. * gcc.target/i386/pr35513-8a.c: Likewise. * gcc.target/i386/pr35513-8b.c: Likewise. * gcc.target/i386/pr35513-9a.c: Likewise. * gcc.target/i386/pr35513-9b.c: Likewise. * gcc.target/i386/pr35513-10a.c: Likewise. * gcc.target/i386/pr35513-10b.c: Likewise. * gcc.target/i386/pr35513-11a.c: Likewise. * gcc.target/i386/pr35513-11b.c: Likewise. * gcc.target/i386/pr35513-12a.c: Likewise. * gcc.target/i386/pr35513-12b.c: Likewise. --- gcc/c-family/c-attribs.c| 34 +++ gcc/common.opt | 4 ++ gcc/config/i386/i386-protos.h | 2 +- gcc/config/i386/i386.c | 62 - gcc/doc/extend.texi | 6 ++ gcc/doc/invoke.texi | 13 + gcc/testsuite/g++.dg/pr35513-1.C| 25 + gcc/testsuite/g++.dg/pr35513-2.C| 53 ++ gcc/testsuite/gcc.target/i386/pr35513-10a.c | 17 ++ gcc/testsuite/gcc.target/i386/pr35513-10b.c | 17 ++ gcc/testsuite/gcc.target/i386/pr35513-11a.c | 17 ++ gcc/testsuite/gcc.target/i386/pr35513-11b.c | 17 ++ gcc/testsuite/gcc.target/i386/pr35513-12a.c | 17 ++ gcc/testsuite/gcc.target/i386/pr35513-12b.c | 17 ++ gcc/testsuite/gcc.target/
Re: [RFC] Don't move cold code out of loop by checking bb count
On 2021/9/22 17:14, Richard Biener wrote: On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo wrote: On 2021/8/26 19:33, Richard Biener wrote: On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo wrote: Hi, On 2021/8/6 20:15, Richard Biener wrote: On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo wrote: There was a patch trying to avoid move cold block out of loop: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". This patch does this profile count check in both gimple LIM move_computations_worker and RTL loop-invariant.c find_invariants_bb, if the loop bb is colder than loop preheader, don't hoist it out of loop. Also, the profile count in loop split pass should be corrected to avoid lim2 and lim4 mismatch behavior, currently, the new loop preheader generated by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will move statement out of loop unexpectely when lim2 didn't move it. This change could fix regression on 544.nab_r from -1.55% to +0.46%. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% on P8LE. Regression and bootstrap tested pass on P8LE, any comments? Thanks. While I'm not familiar with the RTL invariant motion pass the patch there looks reasonable. Note that we should assess the profile quality somehow - I'm not sure how to do that, CCed Honza for that. Thanks. For the GIMPLE part the patch looks quite complicated - but note it probably has to be since LIM performs kind of a "CSE" on loads (and stores for store-motion), so when there are multiple stmts affected by a hoisting decision the biggest block count has to be accounted. Likewise when there are dependent stmts involved that might include conditional stmts (a "PHI"), but the overall cost should be looked at. Currently, The gimple code check two situations with the patch: 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out of loop; 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs couldn't be moved out of loop, also don't move it out of loop to avoid definition not dominates use error. But part 2) is obviously already done. What I tried to say is your heuristic doesn't integrate nicely with the pass but I admitted that it might be a bit difficult to find a place to add this heuristic. There is lim_data->cost which we could bias negatively but then this is a cost that is independent on the hoisting distance. But doing this would work at least for the case where the immediately enclosing loop preheader is hotter than the stmt and with this it would be a patch that's similarly simple as the RTL one. Another possibility is to simply only adjust PHI processing in compute_invariantness, capping movement according to the hotness heuristic. The same could be done for regular stmts there but I'm not sure that will do good in the end since this function is supposed to compute "correctness" (well, it also has the cost stuff), and it's not the place to do overall cost considerations. Thanks. I found that adding a function find_coldest_out_loop and check it in outermost_invariant_loop to find the coldest invariant loop between outermost loop and itself could also reach the purpose. Then the gimple code check is redundant and could be removed. May be I could collect the number of instructions not hoisted with the patch on regression tests and SPEC2017 to do a estimation for "multiple stmts affected" and "overall cost" need to be considered? But it seems move_computations_worker couldn't rollback if we still want to hoist multiple stmts out during the iterations? Now - GIMPLE LIM "costing" is somewhat backward right now and it isn't set up to consider those multiple involved stmts. Plus the store-motion part does not have any cost part (but it depends on previously decided invariant motions). I think the way you implemented the check will cause no hoisting to be performed instead of, say, hoisting to a different loop level only. Possibly shown when you consider a loop nest like for (;;) if (unlikely_cond) for (;;) invariant; we want to hoist 'invariant' but only from the inner loop even if it is invariant also in the outer loop. For this case, theorotically I think the master GCC will optimize it to: invariant; for (;;) if (unlikely_cond) for (;;) ; 'invariant' is moved out of outer loop, but with the patch, it will get: for (;;) if (unlikely_cond) { invariant; for (;;) ; } 'invariant' is *cold* for outer loop, but it is still *ho
Re: [RFC] Don't move cold code out of loop by checking bb count
On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote: On 2021/9/22 17:14, Richard Biener wrote: On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo wrote: On 2021/8/26 19:33, Richard Biener wrote: On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo wrote: Hi, On 2021/8/6 20:15, Richard Biener wrote: On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo wrote: There was a patch trying to avoid move cold block out of loop: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". This patch does this profile count check in both gimple LIM move_computations_worker and RTL loop-invariant.c find_invariants_bb, if the loop bb is colder than loop preheader, don't hoist it out of loop. Also, the profile count in loop split pass should be corrected to avoid lim2 and lim4 mismatch behavior, currently, the new loop preheader generated by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will move statement out of loop unexpectely when lim2 didn't move it. This change could fix regression on 544.nab_r from -1.55% to +0.46%. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% on P8LE. Regression and bootstrap tested pass on P8LE, any comments? Thanks. While I'm not familiar with the RTL invariant motion pass the patch there looks reasonable. Note that we should assess the profile quality somehow - I'm not sure how to do that, CCed Honza for that. Thanks. For the GIMPLE part the patch looks quite complicated - but note it probably has to be since LIM performs kind of a "CSE" on loads (and stores for store-motion), so when there are multiple stmts affected by a hoisting decision the biggest block count has to be accounted. Likewise when there are dependent stmts involved that might include conditional stmts (a "PHI"), but the overall cost should be looked at. Currently, The gimple code check two situations with the patch: 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out of loop; 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs couldn't be moved out of loop, also don't move it out of loop to avoid definition not dominates use error. But part 2) is obviously already done. What I tried to say is your heuristic doesn't integrate nicely with the pass but I admitted that it might be a bit difficult to find a place to add this heuristic. There is lim_data->cost which we could bias negatively but then this is a cost that is independent on the hoisting distance. But doing this would work at least for the case where the immediately enclosing loop preheader is hotter than the stmt and with this it would be a patch that's similarly simple as the RTL one. Another possibility is to simply only adjust PHI processing in compute_invariantness, capping movement according to the hotness heuristic. The same could be done for regular stmts there but I'm not sure that will do good in the end since this function is supposed to compute "correctness" (well, it also has the cost stuff), and it's not the place to do overall cost considerations. Thanks. I found that adding a function find_coldest_out_loop and check it in outermost_invariant_loop to find the coldest invariant loop between outermost loop and itself could also reach the purpose. Then the gimple code check is redundant and could be removed. May be I could collect the number of instructions not hoisted with the patch on regression tests and SPEC2017 to do a estimation for "multiple stmts affected" and "overall cost" need to be considered? But it seems move_computations_worker couldn't rollback if we still want to hoist multiple stmts out during the iterations? Now - GIMPLE LIM "costing" is somewhat backward right now and it isn't set up to consider those multiple involved stmts. Plus the store-motion part does not have any cost part (but it depends on previously decided invariant motions). I think the way you implemented the check will cause no hoisting to be performed instead of, say, hoisting to a different loop level only. Possibly shown when you consider a loop nest like for (;;) if (unlikely_cond) for (;;) invariant; we want to hoist 'invariant' but only from the inner loop even if it is invariant also in the outer loop. For this case, theorotically I think the master GCC will optimize it to: invariant; for (;;) if (unlikely_cond) for (;;) ; 'invariant' is moved out of outer loop, but with the patch, it will get: for (;;) if (unlikely_cond) { invariant; fo
Re: [PATCH][Hashtable 6/6] PR 68303 small size optimization
Ping ? On 16/08/21 9:03 pm, François Dumont wrote: On 17/07/20 2:58 pm, Jonathan Wakely wrote: On 17/11/19 22:31 +0100, François Dumont wrote: This is an implementation of PR 68303. I try to use this idea as much as possible to avoid computation of hash codes. Note that tests are not showing any gain. I guess hash computation must be quite bad to get a benefit from it. So I am only activating it when hash code is not cached and/or when computation is not fast. If the tests don't show any benefit, why bother making the change? I eventually managed to demonstrate this optimization through a performance test case. Does it help the example in the PR? No, the code attached to the PR just show what the user has done to put in place this optim on his side. What I needed was a slow hash code computation compared to the equal operation. I realized that I had to use longer string to achieve this. Moreover making this optim dependant on _Hashtable_traits::__hash_cached was just wrong as we cannot use the cached hash code here as the input is a key instance, not a node. I introduce _Hashtable_hash_traits<_Hash> to offer a single customization point as this optim depends highly on the difference between a hash code computation and a comparison. Maybe I should put it at std namespace scope to ease partial specialization ? Performance test results before the patch: unordered_small_size.cc std::unordered_set: 1st insert 40r 32u 8s 264000112mem 0pf unordered_small_size.cc std::unordered_set: find/erase 22r 22u 0s -191999808mem 0pf unordered_small_size.cc std::unordered_set: 2nd insert 36r 36u 0s 191999776mem 0pf unordered_small_size.cc std::unordered_set: erase key 25r 25u 0s -191999808mem 0pf unordered_small_size.cc std::unordered_set: 1st insert 404r 244u 156s -1989936256mem 0pf unordered_small_size.cc std::unordered_set: find/erase 315r 315u 0s 2061942272mem 0pf unordered_small_size.cc std::unordered_set: 2nd insert 233r 233u 0s -2061942208mem 0pf unordered_small_size.cc std::unordered_set: erase key 299r 298u 0s 2061942208mem 0pf after the patch: unordered_small_size.cc std::unordered_set: 1st insert 41r 33u 7s 264000112mem 0pf unordered_small_size.cc std::unordered_set: find/erase 24r 25u 1s -191999808mem 0pf unordered_small_size.cc std::unordered_set: 2nd insert 34r 34u 0s 191999776mem 0pf unordered_small_size.cc std::unordered_set: erase key 25r 25u 0s -191999808mem 0pf unordered_small_size.cc std::unordered_set: 1st insert 399r 232u 165s -1989936256mem 0pf unordered_small_size.cc std::unordered_set: find/erase 196r 197u 0s 2061942272mem 0pf unordered_small_size.cc std::unordered_set: 2nd insert 221r 222u 0s -2061942208mem 0pf unordered_small_size.cc std::unordered_set: erase key 178r 178u 0s 2061942208mem 0pf libstdc++: Optimize operations on small size hashtable [PR 68303] When hasher is identified as slow and the number of elements is limited in the container use a brute-force loop on those elements to look for a given key using the key_equal functor. For the moment the default threshold below which the container is considered as small is 20. libstdc++-v3/ChangeLog: PR libstdc++/68303 * include/bits/hashtable_policy.h (_Hashtable_hash_traits<_Hash>): New. (_Hash_code_base<>::_M_hash_code(const _Hash_node_value<>&)): New. (_Hashtable_base<>::_M_key_equals): New. (_Hashtable_base<>::_M_equals): Use latter. (_Hashtable_base<>::_M_key_equals_tr): New. (_Hashtable_base<>::_M_equals_tr): Use latter. * include/bits/hashtable.h (_Hashtable<>::__small_size_threshold()): New, use _Hashtable_hash_traits. (_Hashtable<>::find): Loop through elements to look for key if size is lower than __small_size_threshold(). (_Hashtable<>::_M_emplace(true_type, _Args&&...)): Likewise. (_Hashtable<>::_M_insert_unique(_Kt&&, _Args&&, const _NodeGenerator&)): Likewise. (_Hashtable<>::_M_compute_hash_code(const_iterator, const key_type&)): New. (_Hashtable<>::_M_emplace(const_iterator, false_type, _Args&&...)): Use latter. (_Hashtable<>::_M_find_before_node(const key_type&)): New. (_Hashtable<>::_M_erase(true_type, const key_type&)): Use latter. (_Hashtable<>::_M_erase(false_type, const key_type&)): Likewise. * src/c++11/hashtable_c++0x.cc: Include . * testsuite/util/testsuite_performane.h (report_performance): Use 9 width to display memory. * testsuite/performance/23_containers/insert_erase/unordered_small_size.c
[PATCH] wwwdocs: [GCC12] Mention Intel AVX512-FP16.
--- htdocs/gcc-12/changes.html | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html index 81f62fe3..14149212 100644 --- a/htdocs/gcc-12/changes.html +++ b/htdocs/gcc-12/changes.html @@ -165,8 +165,12 @@ a work-in-progress. - - +IA-32/x86-64 + + New ISA extension support for Intel AVX512-FP16 was added to GCC. + AVX512FP16 intrinsics are available via the -mavx512fp16 + compiler switch. + -- 2.18.1
[PATCH 0/7] AVX512FP16: Support bunch of expanders for HFmode and vector HFmodes
xfail are added for testcases related to truncmn2/extendmn2 expanders since V2HF/V4HFmode are not supported yet, they should be removed later. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Newly added runtime testcases passed on sde{-m32,}. Hongyu Wang (5): AVX512FP16: Add expander for smin/maxhf3. AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes AVX512FP16: Add float(uns)?mn2 expander AVX512FP16: add truncmn2/extendmn2 expanders AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes. liuhongt (2): AVX512FP16: Add expander for rint/nearbyinthf2. AVX512FP16: Add expander for fmahf4 gcc/config/i386/i386-expand.c | 2 + gcc/config/i386/i386.md | 62 + gcc/config/i386/sse.md| 259 +++--- .../i386/avx512fp16-vcondmn-minmax.C | 25 ++ .../g++.target/i386/avx512fp16-vcondmn-vec.C | 70 + .../i386/avx512fp16-builtin-minmax-1.c| 35 +++ .../i386/avx512fp16-builtin-round-1.c | 14 + .../gcc.target/i386/avx512fp16-floatvnhf.c| 61 + .../gcc.target/i386/avx512fp16-fma-1.c| 69 + .../i386/avx512fp16-trunc-extendvnhf.c| 55 .../gcc.target/i386/avx512fp16-trunchf.c | 59 .../gcc.target/i386/avx512fp16-truncvnhf.c| 61 + .../i386/avx512fp16-vcondmn-loop-1.c | 70 + .../i386/avx512fp16-vcondmn-loop-2.c | 143 ++ .../gcc.target/i386/avx512fp16-vec_cmpmn.c| 32 +++ .../gcc.target/i386/avx512fp16vl-fma-1.c | 70 + .../i386/avx512fp16vl-fma-vectorize-1.c | 45 +++ 17 files changed, 1100 insertions(+), 32 deletions(-) create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-minmax.C create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-vec.C create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunc-extendvnhf.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunchf.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-truncvnhf.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-2.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vec_cmpmn.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-vectorize-1.c -- 2.27.0
[PATCH 1/7] AVX512FP16: Add expander for rint/nearbyinthf2.
gcc/ChangeLog: * config/i386/i386.md (rinthf2): New expander. (nearbyinthf2): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-builtin-round-1.c: Add new testcase. --- gcc/config/i386/i386.md | 22 +++ .../i386/avx512fp16-builtin-round-1.c | 14 2 files changed, 36 insertions(+) diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 60d877668d5..4b13a59be82 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -18287,6 +18287,17 @@ (define_insn "rintxf2" (set_attr "znver1_decode" "vector") (set_attr "mode" "XF")]) +(define_expand "rinthf2" + [(match_operand:HF 0 "register_operand") + (match_operand:HF 1 "nonimmediate_operand")] + "TARGET_AVX512FP16" +{ + emit_insn (gen_sse4_1_roundhf2 (operands[0], + operands[1], + GEN_INT (ROUND_MXCSR))); + DONE; +}) + (define_expand "rint2" [(use (match_operand:MODEF 0 "register_operand")) (use (match_operand:MODEF 1 "nonimmediate_operand"))] @@ -18320,6 +18331,17 @@ (define_expand "nearbyintxf2" "TARGET_USE_FANCY_MATH_387 && !flag_trapping_math") +(define_expand "nearbyinthf2" + [(match_operand:HF 0 "register_operand") + (match_operand:HF 1 "nonimmediate_operand")] + "TARGET_AVX512FP16" +{ + emit_insn (gen_sse4_1_roundhf2 (operands[0], + operands[1], + GEN_INT (ROUND_MXCSR | ROUND_NO_EXC))); + DONE; +}) + (define_expand "nearbyint2" [(use (match_operand:MODEF 0 "register_operand")) (use (match_operand:MODEF 1 "nonimmediate_operand"))] diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c index 3cab1526967..a1c6636e354 100644 --- a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c @@ -25,7 +25,21 @@ f4 (_Float16 x) return __builtin_roundevenf16 (x); } +_Float16 +f5 (_Float16 x) +{ + return __builtin_rintf16 (x); +} + +_Float16 +f6 (_Float16 x) +{ + return __builtin_nearbyintf16 (x); +} + /* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$11\[^\n\r\]*xmm\[0-9\]" 1 } } */ /* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$10\[^\n\r\]*xmm\[0-9\]" 1 } } */ /* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$9\[^\n\r\]*xmm\[0-9\]" 1 } } */ /* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$8\[^\n\r\]*xmm\[0-9\]" 1 } } */ +/* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$4\[^\n\r\]*xmm\[0-9\]" 1 } } */ +/* { dg-final { scan-assembler-times "vrndscalesh\[ \\t\]+\\\$12\[^\n\r\]*xmm\[0-9\]" 1 } } */ -- 2.27.0
[PATCH 2/7] AVX512FP16: Add expander for fmahf4
gcc/ChangeLog: * config/i386/sse.md (FMAMODEM): extend to handle FP16. (VFH_SF_AVX512VL): Extend to handle HFmode. (VF_SF_AVX512VL): Deleted. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-fma-1.c: New test. * gcc.target/i386/avx512fp16vl-fma-1.c: New test. * gcc.target/i386/avx512fp16vl-fma-vectorize-1.c: New test. --- gcc/config/i386/sse.md| 11 +-- .../gcc.target/i386/avx512fp16-fma-1.c| 69 ++ .../gcc.target/i386/avx512fp16vl-fma-1.c | 70 +++ .../i386/avx512fp16vl-fma-vectorize-1.c | 45 4 files changed, 190 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-vectorize-1.c diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 9079613e829..1ca95984afc 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -4650,7 +4650,11 @@ (define_mode_iterator FMAMODEM (V8SF "TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL") (V4DF "TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL") (V16SF "TARGET_AVX512F") - (V8DF "TARGET_AVX512F")]) + (V8DF "TARGET_AVX512F") + (HF "TARGET_AVX512FP16") + (V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL") + (V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL") + (V32HF "TARGET_AVX512FP16")]) (define_expand "fma4" [(set (match_operand:FMAMODEM 0 "register_operand") @@ -4758,14 +4762,11 @@ (define_insn "*fma_fmadd_" (set_attr "mode" "")]) ;; Suppose AVX-512F as baseline -(define_mode_iterator VF_SF_AVX512VL - [SF V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL") - DF V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")]) - (define_mode_iterator VFH_SF_AVX512VL [(V32HF "TARGET_AVX512FP16") (V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL") (V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL") + (HF "TARGET_AVX512FP16") SF V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL") DF V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")]) diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c b/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c new file mode 100644 index 000..d78d7629838 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c @@ -0,0 +1,69 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -mavx512fp16" } */ + +typedef _Float16 v32hf __attribute__ ((__vector_size__ (64))); + +_Float16 +foo1 (_Float16 a, _Float16 b, _Float16 c) +{ + return a * b + c; +} + +/* { dg-final { scan-assembler-times "vfmadd132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +_Float16 +foo2 (_Float16 a, _Float16 b, _Float16 c) +{ + return -a * b + c; +} + +/* { dg-final { scan-assembler-times "vfnmadd132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +_Float16 +foo3 (_Float16 a, _Float16 b, _Float16 c) +{ + return a * b - c; +} + +/* { dg-final { scan-assembler-times "vfmsub132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +_Float16 +foo4 (_Float16 a, _Float16 b, _Float16 c) +{ + return -a * b - c; +} + +/* { dg-final { scan-assembler-times "vfnmsub132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +v32hf +foo5 (v32hf a, v32hf b, v32hf c) +{ + return a * b + c; +} + +/* { dg-final { scan-assembler-times "vfmadd132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */ + +v32hf +foo6 (v32hf a, v32hf b, v32hf c) +{ + return -a * b + c; +} + +/* { dg-final { scan-assembler-times "vfnmadd132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */ + +v32hf +foo7 (v32hf a, v32hf b, v32hf c) +{ + return a * b - c; +} + +/* { dg-final { scan-assembler-times "vfmsub132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */ + +v32hf +foo8 (v32hf a, v32hf b, v32hf c) +{ + return -a * b - c; +} + +/* { dg-final { scan-assembler-times "vfnmsub132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */ + diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c b/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c new file mode 100644 index 000..1a832f37d6c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c @@ -0,0 +1,70 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -mavx512fp16 -mavx512vl" } */ + +typedef _Float16 v8hf __attribute__ ((__vector_size__ (16))); +typedef _Float16 v16hf __attribute__ ((__vector_size__ (32))); + +v8hf +foo1 (v8hf a, v8hf b, v8hf c) +{ + return a * b + c; +} + +/* { dg-final { scan-assembler-times "vfmadd132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +v8hf +foo2 (v8hf a, v8hf b, v8hf c) +{ + return -a * b + c; +} + +/* { dg-final { scan-assembler-times "vfnmadd132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +v8hf +foo3 (v8hf a, v8hf b, v8hf c) +{ + return a * b - c; +} + +/* { dg-final { scan-assembler-times "vfmsub132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +v8hf +foo4 (v8hf a, v8hf b, v8hf c) +{ + return -a * b - c; +} + +/* { dg-final { scan-assembler-times "vfnmsub132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */ + +v16hf +foo5 (v16hf a, v16hf b, v16hf c) +{ + return a * b + c; +} +
[PATCH 3/7] AVX512FP16: Add expander for smin/maxhf3.
From: Hongyu Wang gcc/ChangeLog: * config/i386/i386.md (hf3): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-builtin-minmax-1.c: New test. --- gcc/config/i386/i386.md | 11 ++ .../i386/avx512fp16-builtin-minmax-1.c| 35 +++ 2 files changed, 46 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 4b13a59be82..a087e557d7f 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -19946,6 +19946,17 @@ (define_insn "3" (set_attr "type" "sseadd") (set_attr "mode" "")]) +(define_insn "hf3" + [(set (match_operand:HF 0 "register_operand" "=v") + (smaxmin:HF + (match_operand:HF 1 "nonimmediate_operand" "%v") + (match_operand:HF 2 "nonimmediate_operand" "vm")))] + "TARGET_AVX512FP16" + "vsh\t{%2, %1, %0|%0, %1, %2}" + [(set_attr "prefix" "evex") + (set_attr "type" "sseadd") + (set_attr "mode" "HF")]) + ;; These versions of the min/max patterns implement exactly the operations ;; min = (op1 < op2 ? op1 : op2) ;; max = (!(op1 < op2) ? op1 : op2) diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c new file mode 100644 index 000..90080e44216 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -mavx512fp16 -mprefer-vector-width=512" } */ + +_Float16 +minf1 (_Float16 a, _Float16 b) +{ + return __builtin_fminf16 (a, b); +} + +void +minf2 (_Float16* __restrict psrc1, _Float16* __restrict psrc2, + _Float16* __restrict pdst) +{ + for (int i = 0; i != 32; i++) +pdst[i] = __builtin_fminf16 (psrc1[i], psrc2[i]); +} + +_Float16 +maxf1 (_Float16 a, _Float16 b) +{ + return __builtin_fmaxf16 (a, b); +} + +void +maxf2 (_Float16* __restrict psrc1, _Float16* __restrict psrc2, + _Float16* __restrict pdst) +{ + for (int i = 0; i != 32; i++) +pdst[i] = __builtin_fmaxf16 (psrc1[i], psrc2[i]); +} + +/* { dg-final { scan-assembler-times "vmaxsh\[^\n\r\]*xmm\[0-9\]" 1 } } */ +/* { dg-final { scan-assembler-times "vmaxph\[^\n\r\]*zmm\[0-9\]" 1 } } */ +/* { dg-final { scan-assembler-times "vminsh\[^\n\r\]*xmm\[0-9\]" 1 } } */ +/* { dg-final { scan-assembler-times "vminph\[^\n\r\]*zmm\[0-9\]" 1 } } */ -- 2.27.0
[PATCH 4/7] AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes
From: Hongyu Wang NB: 64bit/32bit vectorize for HFmode is not supported for now, will adjust this patch when V2HF/V4HF operations supported. gcc/ChangeLog: * config/i386/i386.md (fix_trunchf2): New expander. (fixuns_trunchfhi2): Likewise. (*fixuns_trunchfsi2zext): New define_insn. * config/i386/sse.md (ssePHmodelower): New mode_attr. (fix_trunc2): New expander for same element vector fix_truncate. (fix_trunc2): Likewise for V4HF to V4SI/V4DI fix_truncate. (fix_truncv2hfv2di2): Likeise for V2HF to V2DI fix_truncate. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-trunchf.c: New test. * gcc.target/i386/avx512fp16-truncvnhf.c: Ditto. --- gcc/config/i386/i386.md | 29 + gcc/config/i386/sse.md| 43 + .../gcc.target/i386/avx512fp16-trunchf.c | 59 ++ .../gcc.target/i386/avx512fp16-truncvnhf.c| 61 +++ 4 files changed, 192 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunchf.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-truncvnhf.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index a087e557d7f..c6279e620c9 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4810,6 +4810,16 @@ (define_expand "fix_truncdi2" } }) +(define_insn "fix_trunchf2" + [(set (match_operand:SWI48 0 "register_operand" "=r") + (any_fix:SWI48 + (match_operand:HF 1 "nonimmediate_operand" "vm")))] + "TARGET_AVX512FP16" + "vcvttsh2si\t{%1, %0|%0, %1}" + [(set_attr "type" "sseicvt") + (set_attr "prefix" "evex") + (set_attr "mode" "")]) + ;; Signed conversion to SImode. (define_expand "fix_truncxfsi2" @@ -4917,6 +4927,17 @@ (define_insn "fixuns_truncsi2_avx512f" (set_attr "prefix" "evex") (set_attr "mode" "SI")]) +(define_insn "*fixuns_trunchfsi2zext" + [(set (match_operand:DI 0 "register_operand" "=r") + (zero_extend:DI + (unsigned_fix:SI + (match_operand:HF 1 "nonimmediate_operand" "vm"] + "TARGET_64BIT && TARGET_AVX512FP16" + "vcvttsh2usi\t{%1, %k0|%k0, %1}" + [(set_attr "type" "sseicvt") + (set_attr "prefix" "evex") + (set_attr "mode" "SI")]) + (define_insn "*fixuns_truncsi2_avx512f_zext" [(set (match_operand:DI 0 "register_operand" "=r") (zero_extend:DI @@ -4949,6 +4970,14 @@ (define_insn_and_split "*fixuns_trunc_1" ;; Without these patterns, we'll try the unsigned SI conversion which ;; is complex for SSE, rather than the signed SI conversion, which isn't. +(define_expand "fixuns_trunchfhi2" + [(set (match_dup 2) + (fix:SI (match_operand:HF 1 "nonimmediate_operand"))) + (set (match_operand:HI 0 "nonimmediate_operand") + (subreg:HI (match_dup 2) 0))] + "TARGET_AVX512FP16" + "operands[2] = gen_reg_rtx (SImode);") + (define_expand "fixuns_trunchi2" [(set (match_dup 2) (fix:SI (match_operand:MODEF 1 "nonimmediate_operand"))) diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 1ca95984afc..f8a5f197f3c 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -1034,6 +1034,13 @@ (define_mode_attr ssePHmode (V8DI "V8HF") (V4DI "V8HF") (V2DI "V8HF") (V8DF "V8HF") (V16SF "V16HF") (V8SF "V8HF")]) +;; Mapping of vector modes to vector hf modes of same element. +(define_mode_attr ssePHmodelower + [(V32HI "v32hf") (V16HI "v16hf") (V8HI "v8hf") + (V16SI "v16hf") (V8SI "v8hf") (V4SI "v4hf") + (V8DI "v8hf") (V4DI "v4hf") (V2DI "v2hf") + (V8DF "v8hf") (V16SF "v16hf") (V8SF "v8hf")]) + ;; Mapping of vector modes to packed single mode of the same size (define_mode_attr ssePSmode [(V16SI "V16SF") (V8DF "V16SF") @@ -6175,6 +6182,12 @@ (define_insn "avx512fp16_vcvtsi2sh" (set_attr "prefix" "evex") (set_attr "mode" "HF")]) +(define_expand "fix_trunc2" + [(set (match_operand:VI2H_AVX512VL 0 "register_operand") + (any_fix:VI2H_AVX512VL + (match_operand: 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16") + (define_insn "avx512fp16_fix_trunc2" [(set (match_operand:VI2H_AVX512VL 0 "register_operand" "=v") (any_fix:VI2H_AVX512VL @@ -6185,6 +6198,21 @@ (define_insn "avx512fp16_fix_trunc2")]) +(define_expand "fix_truncv4hf2" + [(set (match_operand:VI4_128_8_256 0 "register_operand") + (any_fix:VI4_128_8_256 + (match_operand:V4HF 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + if (!MEM_P (operands[1])) +{ + operands[1] = lowpart_subreg (V8HFmode, operands[1], V4HFmode); + emit_insn (gen_avx512fp16_fix_trunc2 (operands[0], + operands[1])); + DONE; +} +}) + (define_insn "avx512fp16_fix_trunc2" [(set (match_operand:VI4_128_8_256 0 "register_operand" "=v") (any_fix:VI4_128_8_256 @@ -6207,6 +6235,21 @@ (define_insn "*avx512fp16_fix_tru
[PATCH 5/7] AVX512FP16: Add float(uns)?mn2 expander
From: Hongyu Wang gcc/ChangeLog: * config/i386/sse.md (float2): New expander. (avx512fp16_vcvt2ph_): Rename to ... (floatv4hf2): ... this, and drop constraints. (avx512fp16_vcvtqq2ph_v2di): Rename to ... (floatv2div2hf2): ... this, and likewise. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-floatvnhf.c: New test. --- gcc/config/i386/sse.md| 46 +++--- .../gcc.target/i386/avx512fp16-floatvnhf.c| 61 +++ 2 files changed, 99 insertions(+), 8 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index f8a5f197f3c..66062dc3bcf 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6006,6 +6006,12 @@ (define_insn "avx512fp16_vcvtph2_< (set_attr "prefix" "evex") (set_attr "mode" "")]) +(define_expand "float2" + [(set (match_operand: 0 "register_operand") + (any_float: + (match_operand:VI2H_AVX512VL 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16") + (define_insn "avx512fp16_vcvt2ph_" [(set (match_operand: 0 "register_operand" "=v") (any_float: @@ -6016,11 +6022,23 @@ (define_insn "avx512fp16_vcvt2ph_")]) -(define_expand "avx512fp16_vcvt2ph_" - [(set (match_operand:V8HF 0 "register_operand" "=v") +(define_expand "floatv4hf2" + [(set (match_operand:V4HF 0 "register_operand") + (any_float:V4HF + (match_operand:VI4_128_8_256 1 "vector_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + operands[0] = lowpart_subreg (V8HFmode, operands[0], V4HFmode); + emit_insn (gen_avx512fp16_floatv4hf2 (operands[0], + operands[1])); + DONE; +}) + +(define_expand "avx512fp16_floatv4hf2" + [(set (match_operand:V8HF 0 "register_operand") (vec_concat:V8HF - (any_float:V4HF (match_operand:VI4_128_8_256 1 "vector_operand" "vm")) - (match_dup 2)))] + (any_float:V4HF (match_operand:VI4_128_8_256 1 "vector_operand")) + (match_dup 2)))] "TARGET_AVX512FP16 && TARGET_AVX512VL" "operands[2] = CONST0_RTX (V4HFmode);") @@ -6079,11 +6097,23 @@ (define_insn "*avx512fp16_vcvt2ph__mask_1" (set_attr "prefix" "evex") (set_attr "mode" "")]) -(define_expand "avx512fp16_vcvtqq2ph_v2di" - [(set (match_operand:V8HF 0 "register_operand" "=v") +(define_expand "floatv2div2hf2" + [(set (match_operand:V2HF 0 "register_operand") + (any_float:V2HF + (match_operand:V2DI 1 "vector_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + operands[0] = lowpart_subreg (V8HFmode, operands[0], V2HFmode); + emit_insn (gen_avx512fp16_floatv2div2hf2 (operands[0], + operands[1])); + DONE; +}) + +(define_expand "avx512fp16_floatv2div2hf2" + [(set (match_operand:V8HF 0 "register_operand") (vec_concat:V8HF - (any_float:V2HF (match_operand:V2DI 1 "vector_operand" "vm")) - (match_dup 2)))] + (any_float:V2HF (match_operand:V2DI 1 "vector_operand")) + (match_dup 2)))] "TARGET_AVX512FP16 && TARGET_AVX512VL" "operands[2] = CONST0_RTX (V6HFmode);") diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c b/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c new file mode 100644 index 000..112ac3e74d5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c @@ -0,0 +1,61 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-slp-vectorize -mprefer-vector-width=512" } */ + +extern long long di[8]; +extern unsigned long long udi[8]; +extern int si[16]; +extern unsigned int usi[16]; +extern short hi[32]; +extern unsigned short uhi[32]; +extern _Float16 hf[32]; + +#define DO_PRAGMA(X) _Pragma(#X) + +#define FLOATHFVV(size, mode) \ + void __attribute__ ((noinline, noclone)) \ +float##v##size##mode##v##size##hf () \ +{\ + int i; \ + DO_PRAGMA (GCC unroll size) \ + for (i = 0; i < size; i++) \ +hf[i] = (_Float16) mode[i]; \ +} + +FLOATHFVV(32, hi) +FLOATHFVV(16, hi) +FLOATHFVV(8, hi) +FLOATHFVV(16, si) +FLOATHFVV(8, si) +FLOATHFVV(4, si) +FLOATHFVV(8, di) +FLOATHFVV(4, di) +FLOATHFVV(2, di) + +FLOATHFVV(32, uhi) +FLOATHFVV(16, uhi) +FLOATHFVV(8, uhi) +FLOATHFVV(16, usi) +FLOATHFVV(8, usi) +FLOATHFVV(4, usi) +FLOATHFVV(8, udi) +FLOATHFVV(4, udi) +FLOATHFVV(2, udi) + +/* { dg-final { scan-assembler-times "vcvtqq2phz\[ \\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ +/* { dg-final { scan-assembler-times "vcvtuqq2phz\[ \\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ +/* { dg-final { scan-assembler-times "vcvtqq2phy\[ \\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times "vcvtuqq2phy\[ \\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } }
[PATCH 6/7] AVX512FP16: add truncmn2/extendmn2 expanders
From: Hongyu Wang gcc/ChangeLog: * config/i386/sse.md (extend2): New expander. (extendv4hf2): Likewise. (extendv2hfv2df2): Likewise. (trunc2): Likewise. (avx512fp16_vcvt2ph_): Rename to ... (truncv4hf2): ... this, and drop constraints. (avx512fp16_vcvtpd2ph_v2df): Rename to ... (truncv2dfv2hf2): ... this, and likewise. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-trunc-extendvnhf.c: New test. --- gcc/config/i386/sse.md| 75 +-- .../i386/avx512fp16-trunc-extendvnhf.c| 55 ++ 2 files changed, 123 insertions(+), 7 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunc-extendvnhf.c diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 66062dc3bcf..a48c8e8bede 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -6328,6 +6328,12 @@ (define_mode_attr ph2pssuffix [(V16SF "x") (V8SF "x") (V4SF "x") (V8DF "") (V4DF "") (V2DF "")]) +(define_expand "extend2" + [(set (match_operand:VF48H_AVX512VL 0 "register_operand") + (float_extend:VF48H_AVX512VL + (match_operand: 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16") + (define_insn "avx512fp16_float_extend_ph2" [(set (match_operand:VF48H_AVX512VL 0 "register_operand" "=v") (float_extend:VF48H_AVX512VL @@ -6338,6 +6344,21 @@ (define_insn "avx512fp16_float_extend_ph2" (set_attr "prefix" "evex") (set_attr "mode" "")]) +(define_expand "extendv4hf2" + [(set (match_operand:VF4_128_8_256 0 "register_operand") + (float_extend:VF4_128_8_256 + (match_operand:V4HF 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + if (!MEM_P (operands[1])) +{ + operands[1] = lowpart_subreg (V8HFmode, operands[1], V4HFmode); + emit_insn (gen_avx512fp16_float_extend_ph2 +(operands[0], operands[1])); + DONE; +} +}) + (define_insn "avx512fp16_float_extend_ph2" [(set (match_operand:VF4_128_8_256 0 "register_operand" "=v") (float_extend:VF4_128_8_256 @@ -6360,6 +6381,21 @@ (define_insn "*avx512fp16_float_extend_ph2_load" (set_attr "prefix" "evex") (set_attr "mode" "")]) +(define_expand "extendv2hfv2df2" + [(set (match_operand:V2DF 0 "register_operand") + (float_extend:V2DF + (match_operand:V2HF 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + if (!MEM_P (operands[1])) +{ + operands[1] = lowpart_subreg (V8HFmode, operands[1], V2HFmode); + emit_insn (gen_avx512fp16_float_extend_phv2df2 +(operands[0], operands[1])); + DONE; +} +}) + (define_insn "avx512fp16_float_extend_phv2df2" [(set (match_operand:V2DF 0 "register_operand" "=v") (float_extend:V2DF @@ -6382,6 +6418,12 @@ (define_insn "*avx512fp16_float_extend_phv2df2_load" (set_attr "prefix" "evex") (set_attr "mode" "TI")]) +(define_expand "trunc2" + [(set (match_operand: 0 "register_operand") + (float_truncate: + (match_operand:VF48H_AVX512VL 1 "nonimmediate_operand")))] + "TARGET_AVX512FP16") + (define_insn "avx512fp16_vcvt2ph_" [(set (match_operand: 0 "register_operand" "=v") (float_truncate: @@ -6392,11 +6434,21 @@ (define_insn "avx512fp16_vcvt2ph_" (set_attr "prefix" "evex") (set_attr "mode" "")]) -(define_expand "avx512fp16_vcvt2ph_" - [(set (match_operand:V8HF 0 "register_operand" "=v") +(define_expand "truncv4hf2" + [(set (match_operand:V4HF 0 "register_operand") + (float_truncate:V4HF (match_operand:VF4_128_8_256 1 "vector_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + operands[0] = lowpart_subreg (V8HFmode, operands[0], V4HFmode); + emit_insn (gen_avx512fp16_truncv4hf2 (operands[0], operands[1])); + DONE; +}) + +(define_expand "avx512fp16_truncv4hf2" + [(set (match_operand:V8HF 0 "register_operand") (vec_concat:V8HF (float_truncate:V4HF - (match_operand:VF4_128_8_256 1 "vector_operand" "vm")) + (match_operand:VF4_128_8_256 1 "vector_operand")) (match_dup 2)))] "TARGET_AVX512FP16 && TARGET_AVX512VL" "operands[2] = CONST0_RTX (V4HFmode);") @@ -6461,11 +6513,20 @@ (define_insn "*avx512fp16_vcvt2ph__mask_1" (set_attr "prefix" "evex") (set_attr "mode" "")]) -(define_expand "avx512fp16_vcvtpd2ph_v2df" - [(set (match_operand:V8HF 0 "register_operand" "=v") +(define_expand "truncv2dfv2hf2" + [(set (match_operand:V2HF 0 "register_operand") + (float_truncate:V2HF (match_operand:V2DF 1 "vector_operand")))] + "TARGET_AVX512FP16 && TARGET_AVX512VL" +{ + operands[0] = lowpart_subreg (V8HFmode, operands[0], V2HFmode); + emit_insn (gen_avx512fp16_truncv2dfv2hf2 (operands[0], operands[1])); + DONE; +}) + +(define_expand "avx512fp16_truncv2dfv2hf2" + [(set (match_operand:V8HF 0 "register_operand") (vec_concat:V8HF - (float_truncate:V2HF -
[PATCH 7/7] AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes.
From: Hongyu Wang gcc/ChangeLog: * config/i386/i386-expand.c (ix86_use_mask_cmp_p): Enable HFmode mask_cmp. * config/i386/sse.md (sseintvecmodelower): Add HF vector modes. (_store_mask): Extend to support HF vector modes. (vec_cmp): Likewise. (vcond_mask_): Likewise. (vcond): New expander. (vcond): Likewise. (vcond): Likewise. (vcondu): Likewise. gcc/testsuite/ChangeLog: * g++.target/i386/avx512fp16-vcondmn-vec.C: New test. * g++.target/i386/avx512fp16-vcondmn-minmax.C: Ditto. * gcc.target/i386/avx512fp16-vcondmn-loop-1.c: Ditto. * gcc.target/i386/avx512fp16-vcondmn-loop-2.c: Ditto. * gcc.target/i386/avx512fp16-vec_cmpmn.c: Ditto. --- gcc/config/i386/i386-expand.c | 2 + gcc/config/i386/sse.md| 84 -- .../i386/avx512fp16-vcondmn-minmax.C | 25 +++ .../g++.target/i386/avx512fp16-vcondmn-vec.C | 70 + .../i386/avx512fp16-vcondmn-loop-1.c | 70 + .../i386/avx512fp16-vcondmn-loop-2.c | 143 ++ .../gcc.target/i386/avx512fp16-vec_cmpmn.c| 32 7 files changed, 414 insertions(+), 12 deletions(-) create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-minmax.C create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-vec.C create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-1.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-2.c create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vec_cmpmn.c diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index dbbf5e34656..94ac303585e 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -3638,6 +3638,8 @@ ix86_use_mask_cmp_p (machine_mode mode, machine_mode cmp_mode, return false; else if (vector_size == 64) return true; + else if (GET_MODE_INNER (cmp_mode) == HFmode) +return true; /* When op_true is NULL, op_false must be NULL, or vice versa. */ gcc_assert (!op_true == !op_false); diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index a48c8e8bede..084fc7f4693 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -989,9 +989,9 @@ (define_mode_attr sseintvecmode2 (V16HF "OI") (V8HF "TI")]) (define_mode_attr sseintvecmodelower - [(V16SF "v16si") (V8DF "v8di") - (V8SF "v8si") (V4DF "v4di") - (V4SF "v4si") (V2DF "v2di") + [(V32HF "v32hi") (V16SF "v16si") (V8DF "v8di") + (V16HF "v16hi") (V8SF "v8si") (V4DF "v4di") + (V8HF "v8hi") (V4SF "v4si") (V2DF "v2di") (V8SI "v8si") (V4DI "v4di") (V4SI "v4si") (V2DI "v2di") (V16HI "v16hi") (V8HI "v8hi") @@ -1568,9 +1568,9 @@ (define_insn "_store_mask" (set_attr "mode" "")]) (define_insn "_store_mask" - [(set (match_operand:VI12_AVX512VL 0 "memory_operand" "=m") - (vec_merge:VI12_AVX512VL - (match_operand:VI12_AVX512VL 1 "register_operand" "v") + [(set (match_operand:VI12HF_AVX512VL 0 "memory_operand" "=m") + (vec_merge:VI12HF_AVX512VL + (match_operand:VI12HF_AVX512VL 1 "register_operand" "v") (match_dup 0) (match_operand: 2 "register_operand" "Yk")))] "TARGET_AVX512BW" @@ -3810,8 +3810,8 @@ (define_insn "_comi" (define_expand "vec_cmp" [(set (match_operand: 0 "register_operand") (match_operator: 1 "" - [(match_operand:V48_AVX512VL 2 "register_operand") - (match_operand:V48_AVX512VL 3 "nonimmediate_operand")]))] + [(match_operand:V48H_AVX512VL 2 "register_operand") + (match_operand:V48H_AVX512VL 3 "nonimmediate_operand")]))] "TARGET_AVX512F" { bool ok = ix86_expand_mask_vec_cmp (operands[0], GET_CODE (operands[1]), @@ -4018,6 +4018,51 @@ (define_expand "vcond" DONE; }) +(define_expand "vcond" + [(set (match_operand:VF_AVX512FP16VL 0 "register_operand") + (if_then_else:VF_AVX512FP16VL + (match_operator 3 "" + [(match_operand:VF_AVX512FP16VL 4 "vector_operand") +(match_operand:VF_AVX512FP16VL 5 "vector_operand")]) + (match_operand:VF_AVX512FP16VL 1 "general_operand") + (match_operand:VF_AVX512FP16VL 2 "general_operand")))] + "TARGET_AVX512FP16" +{ + bool ok = ix86_expand_fp_vcond (operands); + gcc_assert (ok); + DONE; +}) + +(define_expand "vcond" + [(set (match_operand:VF_AVX512FP16VL 0 "register_operand") + (if_then_else:VF_AVX512FP16VL + (match_operator 3 "" + [(match_operand: 4 "vector_operand") +(match_operand: 5 "vector_operand")]) + (match_operand:VF_AVX512FP16VL 1 "general_operand") + (match_operand:VF_AVX512FP16VL 2 "general_operand")))] + "TARGET_AVX512FP16" +{ + bool ok = ix86_expand_int_vcond (operands); + gcc_assert (ok); + DONE; +}) + +(define_expand "vcond" + [(set (match_operand: 0 "register_operand") + (if_then_else: + (match_op
Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.
On Thu, 23 Sep 2021, Hongtao Liu wrote: > On Thu, Sep 23, 2021 at 9:48 AM Hongtao Liu wrote: > > > > On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor wrote: > > > > > > On 9/21/21 7:38 PM, Hongtao Liu wrote: > > > > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor wrote: > > > ... > > > > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > > index 1d79930cd58..9351f7e7a1a 100644 > > > > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c > > > > @@ -1,7 +1,7 @@ > > > >/* PR middle-end/91458 - inconsistent warning for writing past > > > > the end > > > > of an array member > > > > { dg-do compile } > > > > - { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */ > > > > + { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf > > > > -fno-tree-vectorize" } */ > > > > > > The testcase is large - what part requires this change? Given the > > > testcase was added for inconsistent warnings do they now become > > > inconsistent again as we enable vectorization at -O2? > > > > > > That said, the testcase adjustments need some explaining - I suppose > > > you didn't just slap -fno-tree-vectorize to all of those changing > > > behavior? > > > > > > >>> void ga1_ (void) > > > >>> { > > > >>> a1_.a[0] = 0; > > > >>> a1_.a[1] = 1; // { dg-warning > > > >>> "\\\[-Wstringop-overflow" } > > > >>> a1_.a[2] = 2; // { dg-warning > > > >>> "\\\[-Wstringop-overflow" } > > > >>> > > > >>> struct A1 a; > > > >>> a.a[0] = 0; > > > >>> a.a[1] = 1; // { dg-warning > > > >>> "\\\[-Wstringop-overflow" } > > > >>> a.a[2] = 2; // { dg-warning > > > >>> "\\\[-Wstringop-overflow" } > > > >>> sink (&a); > > > >>> } > > > >>> > > > >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since > > > >>> there are 2 accesses, but after enabling vectorization, there's only > > > >>> one access, so one warning is missing which causes the failure. > > > > > > With the stores vectorized, is the warning on the correct line or > > > does it point to the first store, the one that's in bounds, as > > > it does with -O3? The latter would be a regression at -O2. > > For the upper case, It points to the second store which is out of > > bounds, the third store warning is missing. > > > > > > >> > > > >> I would find it preferable to change the test code over disabling > > > >> optimizations that are on by default. My concern is that the test > > > >> would no longer exercise the default behavior. (The same goes for > > > >> the -fno-ipa-icf option.) > > > > Hmm, it's a middle-end test, for some backend, it may not do > > > > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and > > > > relative cost model). > > > > > > Yes, there are quite a few warning tests like that. Their main > > > purpose is to verify that in common GCC invocations (i.e., without > > > any special options) warnings are a) issued when expected and b) > > > not issued when not expected. Otherwise, middle end warnings are > > > known to have both false positives and false negatives in some > > > invocations, depending on what optimizations are in effect. > > > Indiscriminately disabling common optimizations for these large > > > tests and invoking them under artificial conditions would > > > compromise this goal and hide the problems. > > > > > > If enabling vectorization at -O2 causes regressions in the quality > > > of diagnostics (as the test failure above indicates seems to be > > > happening) we should investigate these and open bugs for them so > > > they can be fixed. We can then tweak the specific failing test > > > cases to avoid the failures until they are fixed. > > There are indeed cases of false positives and false negatives > > .i.e. > > // Verify warning for access to a definition with an initializer that > > // initializes the one-element array member. > > struct A1 a1i_1 = { 0, { 1 } }; > > > > void ga1i_1 (void) > > { > > a1i_1.a[0] = 0; > > a1i_1.a[1] = 1; // { dg-warning "\\\[-Wstringop-overflow" } > > a1i_1.a[2] = 2; // { dg-warning "\\\[-Wstringop-overflow" } > > > > struct A1 a = { 0, { 1 } }; --- false positive here. > > a.a[0] = 1; > > a.a[1] = 2; // { dg-warning > > "\\\[-Wstringop-overflow" } false negative here. > > a.a[2] = 3; // { dg-warning > > "\\\[-Wstringop-overflow" } false negative here. > > sink (&a); > > } > Similar for > * gcc.dg/Warray-bounds-51.c. > * gcc.dg/Warray-parameter-3.c > * gcc.dg/Wstringop-overflow-14.c > * gcc.dg/Wstringop-overflow-21.c > > So there're 3 situations. > 1. All accesses are out of bound, and after vectorization, there are > some warnings missing. > 2