Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-22 Thread Hongtao Liu via Gcc-patches
On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor  wrote:
>
> On 9/21/21 7:38 PM, Hongtao Liu wrote:
> > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor  wrote:
> ...
> > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c 
> > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > index 1d79930cd58..9351f7e7a1a 100644
> > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > @@ -1,7 +1,7 @@
> >/* PR middle-end/91458 - inconsistent warning for writing past the 
> > end
> >   of an array member
> >   { dg-do compile }
> > -   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */
> > +   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf 
> > -fno-tree-vectorize" } */
> 
>  The testcase is large - what part requires this change?  Given the
>  testcase was added for inconsistent warnings do they now become
>  inconsistent again as we enable vectorization at -O2?
> 
>  That said, the testcase adjustments need some explaining - I suppose
>  you didn't just slap -fno-tree-vectorize to all of those changing
>  behavior?
> 
> >>> void ga1_ (void)
> >>> {
> >>> a1_.a[0] = 0;
> >>> a1_.a[1] = 1; // { dg-warning 
> >>> "\\\[-Wstringop-overflow" }
> >>> a1_.a[2] = 2; // { dg-warning 
> >>> "\\\[-Wstringop-overflow" }
> >>>
> >>> struct A1 a;
> >>> a.a[0] = 0;
> >>> a.a[1] = 1;   // { dg-warning 
> >>> "\\\[-Wstringop-overflow" }
> >>> a.a[2] = 2;   // { dg-warning 
> >>> "\\\[-Wstringop-overflow" }
> >>> sink (&a);
> >>> }
> >>>
> >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since
> >>> there are 2 accesses, but after enabling vectorization, there's only
> >>> one access, so one warning is missing which causes the failure.
>
> With the stores vectorized, is the warning on the correct line or
> does it point to the first store, the one that's in bounds, as
> it does with -O3?  The latter would be a regression at -O2.
For the upper case, It points to the second store which is out of
bounds, the third store warning is missing.
>
> >>
> >> I would find it preferable to change the test code over disabling
> >> optimizations that are on by default.  My concern is that the test
> >> would no longer exercise the default behavior.  (The same goes for
> >> the -fno-ipa-icf option.)
> > Hmm, it's a middle-end test, for some backend, it may not do
> > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and
> > relative cost model).
>
> Yes, there are quite a few warning tests like that.  Their main
> purpose is to verify that in common GCC invocations (i.e., without
> any special options) warnings are a) issued when expected and b)
> not issued when not expected.  Otherwise, middle end warnings are
> known to have both false positives and false negatives in some
> invocations, depending on what optimizations are in effect.
> Indiscriminately disabling common optimizations for these large
> tests and invoking them under artificial conditions would
> compromise this goal and hide the problems.
>
> If enabling vectorization at -O2 causes regressions in the quality
> of diagnostics (as the test failure above indicates seems to be
> happening) we should investigate these and open bugs for them so
> they can be fixed.  We can then tweak the specific failing test
> cases to avoid the failures until they are fixed.
There are indeed cases of false positives and false negatives
.i.e.
// Verify warning for access to a definition with an initializer that
// initializes the one-element array member.
struct A1 a1i_1 = { 0, { 1 } };

void ga1i_1 (void)
{
  a1i_1.a[0] = 0;
  a1i_1.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" }
  a1i_1.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" }

  struct A1 a = { 0, { 1 } }; --- false positive here.
  a.a[0] = 1;
  a.a[1] = 2;   // { dg-warning
"\\\[-Wstringop-overflow" } false negative here.
  a.a[2] = 3;   // { dg-warning
"\\\[-Wstringop-overflow" } false negative here.
  sink (&a);
}

the last 2 warnings are missing, and there's new warning on the line
*struct A1 a = { 0, { 1 } };
>
> Martin



-- 
BR,
Hongtao


Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-22 Thread Hongtao Liu via Gcc-patches
On Thu, Sep 23, 2021 at 9:48 AM Hongtao Liu  wrote:
>
> On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor  wrote:
> >
> > On 9/21/21 7:38 PM, Hongtao Liu wrote:
> > > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor  wrote:
> > ...
> > > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c 
> > > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > index 1d79930cd58..9351f7e7a1a 100644
> > > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > @@ -1,7 +1,7 @@
> > >/* PR middle-end/91458 - inconsistent warning for writing past the 
> > > end
> > >   of an array member
> > >   { dg-do compile }
> > > -   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */
> > > +   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf 
> > > -fno-tree-vectorize" } */
> > 
> >  The testcase is large - what part requires this change?  Given the
> >  testcase was added for inconsistent warnings do they now become
> >  inconsistent again as we enable vectorization at -O2?
> > 
> >  That said, the testcase adjustments need some explaining - I suppose
> >  you didn't just slap -fno-tree-vectorize to all of those changing
> >  behavior?
> > 
> > >>> void ga1_ (void)
> > >>> {
> > >>> a1_.a[0] = 0;
> > >>> a1_.a[1] = 1; // { dg-warning 
> > >>> "\\\[-Wstringop-overflow" }
> > >>> a1_.a[2] = 2; // { dg-warning 
> > >>> "\\\[-Wstringop-overflow" }
> > >>>
> > >>> struct A1 a;
> > >>> a.a[0] = 0;
> > >>> a.a[1] = 1;   // { dg-warning 
> > >>> "\\\[-Wstringop-overflow" }
> > >>> a.a[2] = 2;   // { dg-warning 
> > >>> "\\\[-Wstringop-overflow" }
> > >>> sink (&a);
> > >>> }
> > >>>
> > >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since
> > >>> there are 2 accesses, but after enabling vectorization, there's only
> > >>> one access, so one warning is missing which causes the failure.
> >
> > With the stores vectorized, is the warning on the correct line or
> > does it point to the first store, the one that's in bounds, as
> > it does with -O3?  The latter would be a regression at -O2.
> For the upper case, It points to the second store which is out of
> bounds, the third store warning is missing.
> >
> > >>
> > >> I would find it preferable to change the test code over disabling
> > >> optimizations that are on by default.  My concern is that the test
> > >> would no longer exercise the default behavior.  (The same goes for
> > >> the -fno-ipa-icf option.)
> > > Hmm, it's a middle-end test, for some backend, it may not do
> > > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and
> > > relative cost model).
> >
> > Yes, there are quite a few warning tests like that.  Their main
> > purpose is to verify that in common GCC invocations (i.e., without
> > any special options) warnings are a) issued when expected and b)
> > not issued when not expected.  Otherwise, middle end warnings are
> > known to have both false positives and false negatives in some
> > invocations, depending on what optimizations are in effect.
> > Indiscriminately disabling common optimizations for these large
> > tests and invoking them under artificial conditions would
> > compromise this goal and hide the problems.
> >
> > If enabling vectorization at -O2 causes regressions in the quality
> > of diagnostics (as the test failure above indicates seems to be
> > happening) we should investigate these and open bugs for them so
> > they can be fixed.  We can then tweak the specific failing test
> > cases to avoid the failures until they are fixed.
> There are indeed cases of false positives and false negatives
> .i.e.
> // Verify warning for access to a definition with an initializer that
> // initializes the one-element array member.
> struct A1 a1i_1 = { 0, { 1 } };
>
> void ga1i_1 (void)
> {
>   a1i_1.a[0] = 0;
>   a1i_1.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" }
>   a1i_1.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" }
>
>   struct A1 a = { 0, { 1 } }; --- false positive here.
>   a.a[0] = 1;
>   a.a[1] = 2;   // { dg-warning
> "\\\[-Wstringop-overflow" } false negative here.
>   a.a[2] = 3;   // { dg-warning
> "\\\[-Wstringop-overflow" } false negative here.
>   sink (&a);
> }
Similar for
* gcc.dg/Warray-bounds-51.c.
* gcc.dg/Warray-parameter-3.c
* gcc.dg/Wstringop-overflow-14.c
* gcc.dg/Wstringop-overflow-21.c

So there're 3 situations.
1. All accesses are out of bound, and after vectorization, there are
some warnings missing.
2. Part of accesses are inbound, part of accesses are out of bound,
and after vectorization, the warning goes from out of bound line to
inbound line.
3. All access are out of bound, and after vectoriation, all warning
are missing, and goes to a false-positive li

[PATCH v4 0/2] Implement indirect external access

2021-09-22 Thread H.J. Lu via Gcc-patches
Changes in the v4 patch.

1. Add nodirect_extern_access attribute.

Changes in the v3 patch.

1. GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS support has been added to
GNU binutils 2.38.  But the -z indirect-extern-access linker option is
only available for Linux/x86.  However, the --max-cache-size=SIZE linker
option was also addded within a day.  --max-cache-size=SIZE is used to
check for GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS support.

Changes in the v2 patch.

1. Rename the option to -fdirect-extern-access.

---
On systems with copy relocation:
* A copy in executable is created for the definition in a shared library
at run-time by ld.so.
* The copy is referenced by executable and shared libraries.
* Executable can access the copy directly.

Issues are:
* Overhead of a copy, time and space, may be visible at run-time.
* Read-only data in the shared library becomes read-write copy in
executable at run-time.
* Local access to data with the STV_PROTECTED visibility in the shared
library must use GOT.

On systems without function descriptor, function pointers vary depending
on where and how the functions are defined.
* If the function is defined in executable, it can be the address of
function body.
* If the function, including the function with STV_PROTECTED visibility,
is defined in the shared library, it can be the address of the PLT entry
in executable or shared library.

Issues are:
* The address of function body may not be used as its function pointer.
* ld.so needs to search loaded shared libraries for the function pointer
of the function with STV_PROTECTED visibility.

Here is a proposal to remove copy relocation and use canonical function
pointer:

1. Accesses, including in PIE and non-PIE, to undefined symbols must
use GOT.
  a. Linker may optimize out GOT access if the data is defined in PIE or
  non-PIE.
2. Read-only data in the shared library remain read-only at run-time
3. Address of global data with the STV_PROTECTED visibility in the shared
library is the address of data body.
  a. Can use IP-relative access.
  b. May need GOT without IP-relative access.
4. For systems without function descriptor,
  a. All global function pointers of undefined functions in PIE and
  non-PIE must use GOT.  Linker may optimize out GOT access if the
  function is defined in PIE or non-PIE.
  b. Function pointer of functions with the STV_PROTECTED visibility in
  executable and shared library is the address of function body.
   i. Can use IP-relative access.
   ii. May need GOT without IP-relative access.
   iii. Branches to undefined functions may use PLT.
5. Single global definition marker:

Add GNU_PROPERTY_1_NEEDED:

#define GNU_PROPERTY_1_NEEDED GNU_PROPERTY_UINT32_OR_LO

to indicate the needed properties by the object file.

Add GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS:

#define GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (1U << 0)

to indicate that the object file requires canonical function pointers and
cannot be used with copy relocation.  This bit should be cleared in
executable when there are non-GOT or non-PLT relocations in relocatable
input files without this bit set.

  a. Protected symbol access within the shared library can be treated as
  local.
  b. Copy relocation should be disallowed at link-time and run-time.
  c. GOT function pointer reference is required at link-time and run-time.

The indirect external access marker can be used in the following ways:

1. Linker can decide the best way to resolve a relocation against a
protected symbol before seeing all relocations against the symbol.
2. Dynamic linker can decide if it is an error to have a copy relocation
in executable against the protected symbol in a shared library by checking
if the shared library is built with -fno-direct-extern-access.

Add a compiler option, -fdirect-extern-access. -fdirect-extern-access is
the default.  With -fno-direct-extern-access:

1. Always to use GOT to access undefined symbols, including in PIE and
non-PIE.  This is safe to do and does not break the ABI.
2. In executable and shared library, for symbols with the STV_PROTECTED
visibility:
  a. The address of data symbol is the address of data body.
  b. For systems without function descriptor, the function pointer is
  the address of function body.
These break the ABI and resulting shared libraries may not be compatible
with executables which are not compiled with -fno-direct-extern-access.
3. Generate an indirect external access marker in relocatable objects if
supported by linker.

H.J. Lu (2):
  Add -f[no-]direct-extern-access
  Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE

 gcc/c-family/c-attribs.c| 34 +++
 gcc/common.opt  |  4 ++
 gcc/config.in   |  7 +++
 gcc/config/i386/gnu-property.c  | 31 --
 gcc/config/i386/i386-protos.h   |  2 +-
 gcc/config/i386/i386.c  | 64 -
 gcc/configure 

[PATCH v4 2/2] Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE

2021-09-22 Thread H.J. Lu via Gcc-patches
Generate the marker for -fno-direct-extern-access to indicate that the
object file uses GOT to access all external symbols.  Access to protected
symbols in the resulting shared library is treated as local, which requires
canonical function pointers and cannot be used with copy relocation.

This marker can be used in the following ways:

1. Linker can decide the best way to resolve a relocation against a
protected symbol before seeing all relocations against the symbol.
2. Dynamic linker can decide if it is an error to have a copy relocation
in executable against the protected symbol in a shared library by checking
if the shared library is built with -fno-direct-extern-access.

* configure.ac (HAVE_LD_INDIRECT_EXTERN_ACCESS_SUPPORT): New.
Define to 1 if linker supports
GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS.
* output.h (emit_gnu_property): New.
(emit_gnu_property_note): Likewise.
* target.def (emit_gnu_property_note): Add a argetm.asm_out hook.
* toplev.c (compile_file): Call emit_gnu_property_note before
file_end.
* varasm.c (emit_gnu_property): New.
(emit_gnu_property_note): Likewise.
* config.in: Regenerated.
* configure: Likewise.
* doc/tm.texi: Likewise.
* config/i386/gnu-property.c (emit_gnu_property): Removed.
(TARGET_ASM_EMIT_GNU_PROPERTY_NOTE): New.
* doc/tm.texi.in: Add TARGET_ASM_EMIT_GNU_PROPERTY_NOTE.
---
 gcc/config.in  |  7 +
 gcc/config/i386/gnu-property.c | 31 --
 gcc/config/i386/i386.c |  2 ++
 gcc/configure  | 27 +++
 gcc/configure.ac   | 23 +
 gcc/doc/tm.texi|  5 
 gcc/doc/tm.texi.in |  2 ++
 gcc/output.h   |  2 ++
 gcc/target.def |  8 ++
 gcc/toplev.c   |  3 +++
 gcc/varasm.c   | 47 ++
 11 files changed, 126 insertions(+), 31 deletions(-)

diff --git a/gcc/config.in b/gcc/config.in
index 61cafe4f6c0..8a756aa3541 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -1690,6 +1690,13 @@
 #endif
 
 
+/* Define to 1 if your linker supports
+   GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS. */
+#ifndef USED_FOR_TARGET
+#undef HAVE_LD_INDIRECT_EXTERN_ACCESS_SUPPORT
+#endif
+
+
 /* Define if your PowerPC64 linker supports a large TOC. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_LD_LARGE_TOC
diff --git a/gcc/config/i386/gnu-property.c b/gcc/config/i386/gnu-property.c
index 4ba04403002..9fe8d00132e 100644
--- a/gcc/config/i386/gnu-property.c
+++ b/gcc/config/i386/gnu-property.c
@@ -24,37 +24,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "linux-common.h"
 
-static void
-emit_gnu_property (unsigned int type, unsigned int data)
-{
-  int p2align = ptr_mode == SImode ? 2 : 3;
-
-  switch_to_section (get_section (".note.gnu.property",
- SECTION_NOTYPE, NULL));
-
-  ASM_OUTPUT_ALIGN (asm_out_file, p2align);
-  /* name length.  */
-  fprintf (asm_out_file, ASM_LONG "1f - 0f\n");
-  /* data length.  */
-  fprintf (asm_out_file, ASM_LONG "4f - 1f\n");
-  /* note type: NT_GNU_PROPERTY_TYPE_0.  */
-  fprintf (asm_out_file, ASM_LONG "5\n");
-  fprintf (asm_out_file, "0:\n");
-  /* vendor name: "GNU".  */
-  fprintf (asm_out_file, STRING_ASM_OP "\"GNU\"\n");
-  fprintf (asm_out_file, "1:\n");
-  ASM_OUTPUT_ALIGN (asm_out_file, p2align);
-  /* pr_type.  */
-  fprintf (asm_out_file, ASM_LONG "0x%x\n", type);
-  /* pr_datasz.  */
-  fprintf (asm_out_file, ASM_LONG "3f - 2f\n");
-  fprintf (asm_out_file, "2:\n");
-  fprintf (asm_out_file, ASM_LONG "0x%x\n", data);
-  fprintf (asm_out_file, "3:\n");
-  ASM_OUTPUT_ALIGN (asm_out_file, p2align);
-  fprintf (asm_out_file, "4:\n");
-}
-
 void
 file_end_indicate_exec_stack_and_gnu_property (void)
 {
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9ca1ef512a4..3b678c4d5b6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24569,6 +24569,8 @@ ix86_libgcc_floating_mode_supported_p
 #if !TARGET_MACHO && !TARGET_DLLIMPORT_DECL_ATTRIBUTES
 # undef TARGET_ASM_RELOC_RW_MASK
 # define TARGET_ASM_RELOC_RW_MASK ix86_reloc_rw_mask
+# undef TARGET_ASM_EMIT_GNU_PROPERTY_NOTE
+# define TARGET_ASM_EMIT_GNU_PROPERTY_NOTE emit_gnu_property_note
 #endif
 
 static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
diff --git a/gcc/configure b/gcc/configure
index b3de17009b8..13fe041b0b6 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -32172,6 +32172,33 @@ fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ld_bndplt_support" >&5
 $as_echo "$ld_bndplt_support" >&6; }
 
+# Check if linker supports GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS.
+ld_indirect_extern_access=0
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking linker with 
GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS" >&5
+$as_echo_n "checking lin

[PATCH v4 1/2] Add -f[no-]direct-extern-access

2021-09-22 Thread H.J. Lu via Gcc-patches
Add -f[no-]direct-extern-access and nodirect_extern_access attribute.
-fdirect-extern-access is the default and always use GOT to access
undefined data and function symbols with nodirect_extern_access attribute,
including in PIE and non-PIE.  With -fno-direct-extern-access:

1. Always use GOT to access undefined data and function symbols,
   including in PIE and non-PIE.  These will avoid copy relocations
   in executables.  This is compatible with existing executables and
   shared libraries.
2. In executable and shared library, bind symbols with the STV_PROTECTED
   visibility locally:
   a. The address of data symbol is the address of data body.
   b. For systems without function descriptor, the function pointer is
  the address of function body.
   c. The resulting shared libraries may not be incompatible with
  executables which have copy relocations on protected symbols or
  use executable PLT entries as function addresses for protected
  functions in shared libraries.
3. Update asm_preferred_eh_data_format to select PC relative EH encoding
format with -fno-direct-extern-access to avoid copy relocation.
4. Add ix86_reloc_rw_mask for TARGET_ASM_RELOC_RW_MASK to avoid copy
relocation with -fno-direct-extern-access.

gcc/

PR target/35513
PR target/100593
* common.opt: Add -fdirect-extern-access.
* config/i386/i386-protos.h (ix86_force_load_from_GOT_p): Add a
bool argument.
* config/i386/i386.c (ix86_force_load_from_GOT_p): Add a bool
argument to indicate call operand.  Force non-call load
from GOT for -fno-direct-extern-access or nodirect_extern_access
attribute.
(legitimate_pic_address_disp_p): Avoid copy relocation in PIE
for -fno-direct-extern-access or nodirect_extern_access attribute.
(ix86_print_operand): Pass true to ix86_force_load_from_GOT_p
for call operand.
(asm_preferred_eh_data_format): Use PC-relative format for
-fno-direct-extern-access to avoid copy relocation.  Check
ptr_mode instead of TARGET_64BIT when selecting DW_EH_PE_sdata4.
(ix86_binds_local_p): Don't treat protected data as extern and
avoid copy relocation on common symbol with
-fno-direct-extern-access or nodirect_extern_access attribute.
(ix86_reloc_rw_mask): New to avoid copy relocation for
-fno-direct-extern-access.
(TARGET_ASM_RELOC_RW_MASK): New.
* doc/extend.texi: Document nodirect_extern_access attribute.
* doc/invoke.texi: Document -f[no-]direct-extern-access.

gcc/c-family/

PR target/35513
PR target/100593
* c-attribs.c (handle_nodirect_extern_access_attribute): New.
(c_common_attribute_table): Add nodirect_extern_access.

gcc/testsuite/

PR target/35513
PR target/100593
* g++.dg/pr35513-1.C: New file.
* g++.dg/pr35513-2.C: Likewise.
* gcc.target/i386/pr35513-1a.c: Likewise.
* gcc.target/i386/pr35513-1b.c: Likewise.
* gcc.target/i386/pr35513-2a.c: Likewise.
* gcc.target/i386/pr35513-2b.c: Likewise.
* gcc.target/i386/pr35513-3a.c: Likewise.
* gcc.target/i386/pr35513-3b.c: Likewise.
* gcc.target/i386/pr35513-4a.c: Likewise.
* gcc.target/i386/pr35513-4b.c: Likewise.
* gcc.target/i386/pr35513-5a.c: Likewise.
* gcc.target/i386/pr35513-5b.c: Likewise.
* gcc.target/i386/pr35513-6a.c: Likewise.
* gcc.target/i386/pr35513-6b.c: Likewise.
* gcc.target/i386/pr35513-7a.c: Likewise.
* gcc.target/i386/pr35513-7b.c: Likewise.
* gcc.target/i386/pr35513-8a.c: Likewise.
* gcc.target/i386/pr35513-8b.c: Likewise.
* gcc.target/i386/pr35513-9a.c: Likewise.
* gcc.target/i386/pr35513-9b.c: Likewise.
* gcc.target/i386/pr35513-10a.c: Likewise.
* gcc.target/i386/pr35513-10b.c: Likewise.
* gcc.target/i386/pr35513-11a.c: Likewise.
* gcc.target/i386/pr35513-11b.c: Likewise.
* gcc.target/i386/pr35513-12a.c: Likewise.
* gcc.target/i386/pr35513-12b.c: Likewise.
---
 gcc/c-family/c-attribs.c| 34 +++
 gcc/common.opt  |  4 ++
 gcc/config/i386/i386-protos.h   |  2 +-
 gcc/config/i386/i386.c  | 62 -
 gcc/doc/extend.texi |  6 ++
 gcc/doc/invoke.texi | 13 +
 gcc/testsuite/g++.dg/pr35513-1.C| 25 +
 gcc/testsuite/g++.dg/pr35513-2.C| 53 ++
 gcc/testsuite/gcc.target/i386/pr35513-10a.c | 17 ++
 gcc/testsuite/gcc.target/i386/pr35513-10b.c | 17 ++
 gcc/testsuite/gcc.target/i386/pr35513-11a.c | 17 ++
 gcc/testsuite/gcc.target/i386/pr35513-11b.c | 17 ++
 gcc/testsuite/gcc.target/i386/pr35513-12a.c | 17 ++
 gcc/testsuite/gcc.target/i386/pr35513-12b.c | 17 ++
 gcc/testsuite/gcc.target/

Re: [RFC] Don't move cold code out of loop by checking bb count

2021-09-22 Thread Xionghu Luo via Gcc-patches




On 2021/9/22 17:14, Richard Biener wrote:

On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo  wrote:




On 2021/8/26 19:33, Richard Biener wrote:

On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo  wrote:


Hi,

On 2021/8/6 20:15, Richard Biener wrote:

On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo  wrote:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

This patch does this profile count check in both gimple LIM
move_computations_worker and RTL loop-invariant.c find_invariants_bb,
if the loop bb is colder than loop preheader, don't hoist it out of
loop.

Also, the profile count in loop split pass should be corrected to avoid
lim2 and lim4 mismatch behavior, currently, the new loop preheader generated
by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
move statement out of loop unexpectely when lim2 didn't move it.  This
change could fix regression on 544.nab_r from -1.55% to +0.46%.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.


While I'm not familiar with the RTL invariant motion pass the patch there
looks reasonable.  Note that we should assess the profile quality
somehow - I'm not sure how to do that, CCed Honza for that.


Thanks.



For the GIMPLE part the patch looks quite complicated - but note it
probably has to be since LIM performs kind of a "CSE" on loads
(and stores for store-motion), so when there are multiple stmts
affected by a hoisting decision the biggest block count has to be
accounted.  Likewise when there are dependent stmts involved
that might include conditional stmts (a "PHI"), but the overall
cost should be looked at.


Currently, The gimple code check two situations with the patch:
1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
of loop;
2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
couldn't be moved out of loop, also don't move it out of loop to avoid 
definition
not dominates use error.


But part 2) is obviously already done.  What I tried to say is your heuristic
doesn't integrate nicely with the pass but I admitted that it might be a bit
difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively but then this is
a cost that is independent on the hoisting distance.  But doing this would
work at least for the case where the immediately enclosing loop preheader
is hotter than the stmt and with this it would be a patch that's similarly
simple as the RTL one.

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness
heuristic.  The same could be done for regular stmts there but I'm
not sure that will do good in the end since this function is supposed
to compute "correctness" (well, it also has the cost stuff), and it's
not the place to do overall cost considerations.


Thanks.  I found that adding a function find_coldest_out_loop and check it in
outermost_invariant_loop to find the coldest invariant loop between outermost
loop and itself could also reach the purpose.  Then the gimple code check is
redundant and could be removed.




May be I could collect the number of instructions not hoisted with the patch
on regression tests and SPEC2017 to do a estimation for "multiple stmts 
affected"
and "overall cost" need to be considered?  But it seems move_computations_worker
couldn't rollback if we still want to hoist multiple stmts out during the 
iterations?



Now - GIMPLE LIM "costing" is somewhat backward right now
and it isn't set up to consider those multiple involved stmts.  Plus
the store-motion part does not have any cost part (but it depends
on previously decided invariant motions).

I think the way you implemented the check will cause no hoisting
to be performed instead of, say, hoisting to a different loop level
only.  Possibly shown when you consider a loop nest like

 for (;;)
   if (unlikely_cond)
 for (;;)
invariant;

we want to hoist 'invariant' but only from the inner loop even if it
is invariant also in the outer loop.



For this case, theorotically I think the master GCC will optimize it to:

invariant;
for (;;)
  if (unlikely_cond)
for (;;)
   ;

'invariant' is moved out of outer loop, but with the patch, it will get:

for (;;)
  if (unlikely_cond)
{
  invariant;
  for (;;)
 ;
}

'invariant' is *cold* for outer loop, but it is still *ho

Re: [RFC] Don't move cold code out of loop by checking bb count

2021-09-22 Thread Xionghu Luo via Gcc-patches




On 2021/9/23 10:13, Xionghu Luo via Gcc-patches wrote:



On 2021/9/22 17:14, Richard Biener wrote:

On Thu, Sep 9, 2021 at 3:56 AM Xionghu Luo  wrote:




On 2021/8/26 19:33, Richard Biener wrote:
On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo  
wrote:


Hi,

On 2021/8/6 20:15, Richard Biener wrote:
On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo  
wrote:


There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower 
execution

frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

This patch does this profile count check in both gimple LIM
move_computations_worker and RTL loop-invariant.c 
find_invariants_bb,

if the loop bb is colder than loop preheader, don't hoist it out of
loop.

Also, the profile count in loop split pass should be corrected to 
avoid
lim2 and lim4 mismatch behavior, currently, the new loop 
preheader generated
by loop_version is set to "[count: 0]:", then lim4 after lsplt 
pass will
move statement out of loop unexpectely when lim2 didn't move it.  
This

change could fix regression on 544.nab_r from -1.55% to +0.46%.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

Regression and bootstrap tested pass on P8LE, any comments?  Thanks.


While I'm not familiar with the RTL invariant motion pass the 
patch there

looks reasonable.  Note that we should assess the profile quality
somehow - I'm not sure how to do that, CCed Honza for that.


Thanks.



For the GIMPLE part the patch looks quite complicated - but note it
probably has to be since LIM performs kind of a "CSE" on loads
(and stores for store-motion), so when there are multiple stmts
affected by a hoisting decision the biggest block count has to be
accounted.  Likewise when there are dependent stmts involved
that might include conditional stmts (a "PHI"), but the overall
cost should be looked at.


Currently, The gimple code check two situations with the patch:
1) The statement or PHI‘s BB is *colder* then preheader, don't move 
it out

of loop;
2) The statement or PHI's BB is *hotter* then preheader, but any of 
it's rhs
couldn't be moved out of loop, also don't move it out of loop to 
avoid definition

not dominates use error.


But part 2) is obviously already done.  What I tried to say is your 
heuristic
doesn't integrate nicely with the pass but I admitted that it might 
be a bit

difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively but then this is
a cost that is independent on the hoisting distance.  But doing this 
would
work at least for the case where the immediately enclosing loop 
preheader
is hotter than the stmt and with this it would be a patch that's 
similarly

simple as the RTL one.

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness
heuristic.  The same could be done for regular stmts there but I'm
not sure that will do good in the end since this function is supposed
to compute "correctness" (well, it also has the cost stuff), and it's
not the place to do overall cost considerations.


Thanks.  I found that adding a function find_coldest_out_loop and 
check it in
outermost_invariant_loop to find the coldest invariant loop between 
outermost
loop and itself could also reach the purpose.  Then the gimple code 
check is

redundant and could be removed.



May be I could collect the number of instructions not hoisted with 
the patch
on regression tests and SPEC2017 to do a estimation for "multiple 
stmts affected"
and "overall cost" need to be considered?  But it seems 
move_computations_worker
couldn't rollback if we still want to hoist multiple stmts out 
during the iterations?




Now - GIMPLE LIM "costing" is somewhat backward right now
and it isn't set up to consider those multiple involved stmts.  Plus
the store-motion part does not have any cost part (but it depends
on previously decided invariant motions).

I think the way you implemented the check will cause no hoisting
to be performed instead of, say, hoisting to a different loop level
only.  Possibly shown when you consider a loop nest like

 for (;;)
   if (unlikely_cond)
 for (;;)
    invariant;

we want to hoist 'invariant' but only from the inner loop even if it
is invariant also in the outer loop.



For this case, theorotically I think the master GCC will optimize 
it to:


    invariant;
    for (;;)
  if (unlikely_cond)
    for (;;)
   ;

'invariant' is moved out of outer loop, but with the patch, it will 
get:


    for (;;)
  if (unlikely_cond)
    {
  invariant;
  fo

Re: [PATCH][Hashtable 6/6] PR 68303 small size optimization

2021-09-22 Thread François Dumont via Gcc-patches

Ping ?

On 16/08/21 9:03 pm, François Dumont wrote:

On 17/07/20 2:58 pm, Jonathan Wakely wrote:

On 17/11/19 22:31 +0100, François Dumont wrote:

This is an implementation of PR 68303.

I try to use this idea as much as possible to avoid computation of 
hash codes.


Note that tests are not showing any gain. I guess hash computation 
must be quite bad to get a benefit from it. So I am only activating 
it when hash code is not cached and/or when computation is not fast.


If the tests don't show any benefit, why bother making the change?


I eventually managed to demonstrate this optimization through a 
performance test case.




Does it help the example in the PR?


No, the code attached to the PR just show what the user has done to 
put in place this optim on his side.


What I needed was a slow hash code computation compared to the equal 
operation. I realized that I had to use longer string to achieve this.


Moreover making this optim dependant on 
_Hashtable_traits::__hash_cached was just wrong as we cannot use the 
cached hash code here as the input is a key instance, not a node.


I introduce _Hashtable_hash_traits<_Hash> to offer a single 
customization point as this optim depends highly on the difference 
between a hash code computation and a comparison. Maybe I should put 
it at std namespace scope to ease partial specialization ?


Performance test results before the patch:

unordered_small_size.cc      std::unordered_set: 1st insert      
40r   32u    8s 264000112mem    0pf
unordered_small_size.cc      std::unordered_set: find/erase      
22r   22u    0s -191999808mem    0pf
unordered_small_size.cc      std::unordered_set: 2nd insert      
36r   36u    0s 191999776mem    0pf
unordered_small_size.cc      std::unordered_set: erase key       
25r   25u    0s -191999808mem    0pf
unordered_small_size.cc      std::unordered_set: 1st insert    
 404r  244u  156s -1989936256mem    0pf
unordered_small_size.cc      std::unordered_set: find/erase    
 315r  315u    0s 2061942272mem    0pf
unordered_small_size.cc      std::unordered_set: 2nd insert    
 233r  233u    0s -2061942208mem    0pf
unordered_small_size.cc      std::unordered_set: erase key     
 299r  298u    0s 2061942208mem    0pf


after the patch:

unordered_small_size.cc      std::unordered_set: 1st insert      
41r   33u    7s 264000112mem    0pf
unordered_small_size.cc      std::unordered_set: find/erase      
24r   25u    1s -191999808mem    0pf
unordered_small_size.cc      std::unordered_set: 2nd insert      
34r   34u    0s 191999776mem    0pf
unordered_small_size.cc      std::unordered_set: erase key       
25r   25u    0s -191999808mem    0pf
unordered_small_size.cc      std::unordered_set: 1st insert    
 399r  232u  165s -1989936256mem    0pf
unordered_small_size.cc      std::unordered_set: find/erase    
 196r  197u    0s 2061942272mem    0pf
unordered_small_size.cc      std::unordered_set: 2nd insert    
 221r  222u    0s -2061942208mem    0pf
unordered_small_size.cc      std::unordered_set: erase key     
 178r  178u    0s 2061942208mem    0pf


    libstdc++: Optimize operations on small size hashtable [PR 68303]

    When hasher is identified as slow and the number of elements is 
limited in the
    container use a brute-force loop on those elements to look for a 
given key using
    the key_equal functor. For the moment the default threshold below 
which the

    container is considered as small is 20.

    libstdc++-v3/ChangeLog:

    PR libstdc++/68303
    * include/bits/hashtable_policy.h
    (_Hashtable_hash_traits<_Hash>): New.
    (_Hash_code_base<>::_M_hash_code(const 
_Hash_node_value<>&)): New.

    (_Hashtable_base<>::_M_key_equals): New.
    (_Hashtable_base<>::_M_equals): Use latter.
    (_Hashtable_base<>::_M_key_equals_tr): New.
    (_Hashtable_base<>::_M_equals_tr): Use latter.
    * include/bits/hashtable.h
    (_Hashtable<>::__small_size_threshold()): New, use 
_Hashtable_hash_traits.
    (_Hashtable<>::find): Loop through elements to look for 
key if size is lower

    than __small_size_threshold().
    (_Hashtable<>::_M_emplace(true_type, _Args&&...)): Likewise.
    (_Hashtable<>::_M_insert_unique(_Kt&&, _Args&&, const 
_NodeGenerator&)): Likewise.
(_Hashtable<>::_M_compute_hash_code(const_iterator, const key_type&)): 
New.
    (_Hashtable<>::_M_emplace(const_iterator, false_type, 
_Args&&...)): Use latter.

    (_Hashtable<>::_M_find_before_node(const key_type&)): New.
    (_Hashtable<>::_M_erase(true_type, const key_type&)): Use 
latter.
    (_Hashtable<>::_M_erase(false_type, const key_type&)): 
Likewise.
    * src/c++11/hashtable_c++0x.cc: Include 
.

    * testsuite/util/testsuite_performane.h
    (report_performance): Use 9 width to display memory.
    * 
testsuite/performance/23_containers/insert_erase/unordered_small_size.c

[PATCH] wwwdocs: [GCC12] Mention Intel AVX512-FP16.

2021-09-22 Thread liuhongt via Gcc-patches
---
 htdocs/gcc-12/changes.html | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index 81f62fe3..14149212 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.html
@@ -165,8 +165,12 @@ a work-in-progress.
   
 
 
-
-
+IA-32/x86-64
+
+  New ISA extension support for Intel AVX512-FP16 was added to GCC.
+  AVX512FP16 intrinsics are available via the -mavx512fp16
+  compiler switch.
+
 
 
 
-- 
2.18.1



[PATCH 0/7] AVX512FP16: Support bunch of expanders for HFmode and vector HFmodes

2021-09-22 Thread liuhongt via Gcc-patches
  xfail are added for testcases related to truncmn2/extendmn2 expanders since 
V2HF/V4HFmode
are not supported yet, they should be removed later.

  Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
  Newly added runtime testcases passed on sde{-m32,}.

Hongyu Wang (5):
  AVX512FP16: Add expander for smin/maxhf3.
  AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes
  AVX512FP16: Add float(uns)?mn2 expander
  AVX512FP16: add truncmn2/extendmn2 expanders
  AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes.

liuhongt (2):
  AVX512FP16: Add expander for rint/nearbyinthf2.
  AVX512FP16: Add expander for fmahf4

 gcc/config/i386/i386-expand.c |   2 +
 gcc/config/i386/i386.md   |  62 +
 gcc/config/i386/sse.md| 259 +++---
 .../i386/avx512fp16-vcondmn-minmax.C  |  25 ++
 .../g++.target/i386/avx512fp16-vcondmn-vec.C  |  70 +
 .../i386/avx512fp16-builtin-minmax-1.c|  35 +++
 .../i386/avx512fp16-builtin-round-1.c |  14 +
 .../gcc.target/i386/avx512fp16-floatvnhf.c|  61 +
 .../gcc.target/i386/avx512fp16-fma-1.c|  69 +
 .../i386/avx512fp16-trunc-extendvnhf.c|  55 
 .../gcc.target/i386/avx512fp16-trunchf.c  |  59 
 .../gcc.target/i386/avx512fp16-truncvnhf.c|  61 +
 .../i386/avx512fp16-vcondmn-loop-1.c  |  70 +
 .../i386/avx512fp16-vcondmn-loop-2.c  | 143 ++
 .../gcc.target/i386/avx512fp16-vec_cmpmn.c|  32 +++
 .../gcc.target/i386/avx512fp16vl-fma-1.c  |  70 +
 .../i386/avx512fp16vl-fma-vectorize-1.c   |  45 +++
 17 files changed, 1100 insertions(+), 32 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-minmax.C
 create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-vec.C
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunc-extendvnhf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunchf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-truncvnhf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vec_cmpmn.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-vectorize-1.c

-- 
2.27.0



[PATCH 1/7] AVX512FP16: Add expander for rint/nearbyinthf2.

2021-09-22 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

* config/i386/i386.md (rinthf2): New expander.
(nearbyinthf2): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-builtin-round-1.c: Add new testcase.
---
 gcc/config/i386/i386.md   | 22 +++
 .../i386/avx512fp16-builtin-round-1.c | 14 
 2 files changed, 36 insertions(+)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 60d877668d5..4b13a59be82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -18287,6 +18287,17 @@ (define_insn "rintxf2"
(set_attr "znver1_decode" "vector")
(set_attr "mode" "XF")])
 
+(define_expand "rinthf2"
+  [(match_operand:HF 0 "register_operand")
+   (match_operand:HF 1 "nonimmediate_operand")]
+  "TARGET_AVX512FP16"
+{
+  emit_insn (gen_sse4_1_roundhf2 (operands[0],
+ operands[1],
+ GEN_INT (ROUND_MXCSR)));
+  DONE;
+})
+
 (define_expand "rint2"
   [(use (match_operand:MODEF 0 "register_operand"))
(use (match_operand:MODEF 1 "nonimmediate_operand"))]
@@ -18320,6 +18331,17 @@ (define_expand "nearbyintxf2"
   "TARGET_USE_FANCY_MATH_387
&& !flag_trapping_math")
 
+(define_expand "nearbyinthf2"
+  [(match_operand:HF 0 "register_operand")
+   (match_operand:HF 1 "nonimmediate_operand")]
+  "TARGET_AVX512FP16"
+{
+  emit_insn (gen_sse4_1_roundhf2 (operands[0],
+ operands[1],
+ GEN_INT (ROUND_MXCSR | ROUND_NO_EXC)));
+  DONE;
+})
+
 (define_expand "nearbyint2"
   [(use (match_operand:MODEF 0 "register_operand"))
(use (match_operand:MODEF 1 "nonimmediate_operand"))]
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c 
b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c
index 3cab1526967..a1c6636e354 100644
--- a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-round-1.c
@@ -25,7 +25,21 @@ f4 (_Float16 x)
   return __builtin_roundevenf16 (x);
 }
 
+_Float16
+f5 (_Float16 x)
+{
+  return __builtin_rintf16 (x);
+}
+
+_Float16
+f6 (_Float16 x)
+{
+  return __builtin_nearbyintf16 (x);
+}
+
 /* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$11\[^\n\r\]*xmm\[0-9\]" 1 } } */
 /* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$10\[^\n\r\]*xmm\[0-9\]" 1 } } */
 /* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$9\[^\n\r\]*xmm\[0-9\]" 1 } } */
 /* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$8\[^\n\r\]*xmm\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$4\[^\n\r\]*xmm\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "vrndscalesh\[ 
\\t\]+\\\$12\[^\n\r\]*xmm\[0-9\]" 1 } } */
-- 
2.27.0



[PATCH 2/7] AVX512FP16: Add expander for fmahf4

2021-09-22 Thread liuhongt via Gcc-patches
gcc/ChangeLog:

* config/i386/sse.md (FMAMODEM): extend to handle FP16.
(VFH_SF_AVX512VL): Extend to handle HFmode.
(VF_SF_AVX512VL): Deleted.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-fma-1.c: New test.
* gcc.target/i386/avx512fp16vl-fma-1.c: New test.
* gcc.target/i386/avx512fp16vl-fma-vectorize-1.c: New test.
---
 gcc/config/i386/sse.md| 11 +--
 .../gcc.target/i386/avx512fp16-fma-1.c| 69 ++
 .../gcc.target/i386/avx512fp16vl-fma-1.c  | 70 +++
 .../i386/avx512fp16vl-fma-vectorize-1.c   | 45 
 4 files changed, 190 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-vectorize-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9079613e829..1ca95984afc 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -4650,7 +4650,11 @@ (define_mode_iterator FMAMODEM
(V8SF "TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL")
(V4DF "TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL")
(V16SF "TARGET_AVX512F")
-   (V8DF "TARGET_AVX512F")])
+   (V8DF "TARGET_AVX512F")
+   (HF "TARGET_AVX512FP16")
+   (V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V32HF "TARGET_AVX512FP16")])
 
 (define_expand "fma4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
@@ -4758,14 +4762,11 @@ (define_insn "*fma_fmadd_"
(set_attr "mode" "")])
 
 ;; Suppose AVX-512F as baseline
-(define_mode_iterator VF_SF_AVX512VL
-  [SF V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL")
-   DF V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")])
-
 (define_mode_iterator VFH_SF_AVX512VL
   [(V32HF "TARGET_AVX512FP16")
(V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (HF "TARGET_AVX512FP16")
SF V16SF (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL")
DF V8DF (V4DF "TARGET_AVX512VL") (V2DF "TARGET_AVX512VL")])
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c
new file mode 100644
index 000..d78d7629838
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-fma-1.c
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mavx512fp16" } */
+
+typedef _Float16 v32hf __attribute__ ((__vector_size__ (64)));
+
+_Float16
+foo1 (_Float16 a, _Float16 b, _Float16 c)
+{
+  return a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfmadd132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */
+
+_Float16
+foo2 (_Float16 a, _Float16 b, _Float16 c)
+{
+  return -a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmadd132sh\[^\n\r\]*xmm\[0-9\]" 1 } } 
*/
+
+_Float16
+foo3 (_Float16 a, _Float16 b, _Float16 c)
+{
+  return a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfmsub132sh\[^\n\r\]*xmm\[0-9\]" 1 } } */
+
+_Float16
+foo4 (_Float16 a, _Float16 b, _Float16 c)
+{
+  return -a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmsub132sh\[^\n\r\]*xmm\[0-9\]" 1 } } 
*/
+
+v32hf
+foo5 (v32hf a, v32hf b, v32hf c)
+{
+  return a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfmadd132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */
+
+v32hf
+foo6 (v32hf a, v32hf b, v32hf c)
+{
+  return -a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmadd132ph\[^\n\r\]*zmm\[0-9\]" 1 } } 
*/
+
+v32hf
+foo7 (v32hf a, v32hf b, v32hf c)
+{
+  return a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfmsub132ph\[^\n\r\]*zmm\[0-9\]" 1 } } */
+
+v32hf
+foo8 (v32hf a, v32hf b, v32hf c)
+{
+  return -a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmsub132ph\[^\n\r\]*zmm\[0-9\]" 1 } } 
*/
+
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c
new file mode 100644
index 000..1a832f37d6c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16vl-fma-1.c
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mavx512fp16 -mavx512vl" } */
+
+typedef _Float16 v8hf __attribute__ ((__vector_size__ (16)));
+typedef _Float16 v16hf __attribute__ ((__vector_size__ (32)));
+
+v8hf
+foo1 (v8hf a, v8hf b, v8hf c)
+{
+  return a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfmadd132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */
+
+v8hf
+foo2 (v8hf a, v8hf b, v8hf c)
+{
+  return -a * b + c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmadd132ph\[^\n\r\]*xmm\[0-9\]" 1 } } 
*/
+
+v8hf
+foo3 (v8hf a, v8hf b, v8hf c)
+{
+  return a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfmsub132ph\[^\n\r\]*xmm\[0-9\]" 1 } } */
+
+v8hf
+foo4 (v8hf a, v8hf b, v8hf c)
+{
+  return -a * b - c;
+}
+
+/* { dg-final { scan-assembler-times "vfnmsub132ph\[^\n\r\]*xmm\[0-9\]" 1 } } 
*/
+
+v16hf
+foo5 (v16hf a, v16hf b, v16hf c)
+{
+  return a * b + c;
+}
+

[PATCH 3/7] AVX512FP16: Add expander for smin/maxhf3.

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang 

gcc/ChangeLog:

* config/i386/i386.md (hf3): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-builtin-minmax-1.c: New test.
---
 gcc/config/i386/i386.md   | 11 ++
 .../i386/avx512fp16-builtin-minmax-1.c| 35 +++
 2 files changed, 46 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 4b13a59be82..a087e557d7f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19946,6 +19946,17 @@ (define_insn "3"
(set_attr "type" "sseadd")
(set_attr "mode" "")])
 
+(define_insn "hf3"
+  [(set (match_operand:HF 0 "register_operand" "=v")
+   (smaxmin:HF
+ (match_operand:HF 1 "nonimmediate_operand" "%v")
+ (match_operand:HF 2 "nonimmediate_operand" "vm")))]
+  "TARGET_AVX512FP16"
+  "vsh\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "prefix" "evex")
+   (set_attr "type" "sseadd")
+   (set_attr "mode" "HF")])
+
 ;; These versions of the min/max patterns implement exactly the operations
 ;;   min = (op1 < op2 ? op1 : op2)
 ;;   max = (!(op1 < op2) ? op1 : op2)
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c 
b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c
new file mode 100644
index 000..90080e44216
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-builtin-minmax-1.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mavx512fp16 -mprefer-vector-width=512" } */
+
+_Float16
+minf1 (_Float16 a, _Float16 b)
+{
+  return __builtin_fminf16 (a, b);
+}
+
+void
+minf2 (_Float16* __restrict psrc1, _Float16* __restrict psrc2,
+   _Float16* __restrict pdst)
+{
+  for (int i = 0; i != 32; i++)
+pdst[i] = __builtin_fminf16 (psrc1[i], psrc2[i]);
+}
+
+_Float16
+maxf1 (_Float16 a, _Float16 b)
+{
+  return __builtin_fmaxf16 (a, b);
+}
+
+void
+maxf2 (_Float16* __restrict psrc1, _Float16* __restrict psrc2,
+   _Float16* __restrict pdst)
+{
+  for (int i = 0; i != 32; i++)
+pdst[i] = __builtin_fmaxf16 (psrc1[i], psrc2[i]);
+}
+
+/* { dg-final { scan-assembler-times "vmaxsh\[^\n\r\]*xmm\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "vmaxph\[^\n\r\]*zmm\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "vminsh\[^\n\r\]*xmm\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "vminph\[^\n\r\]*zmm\[0-9\]" 1 } } */
-- 
2.27.0



[PATCH 4/7] AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang 

NB: 64bit/32bit vectorize for HFmode is not supported for now, will
adjust this patch when V2HF/V4HF operations supported.

gcc/ChangeLog:

* config/i386/i386.md (fix_trunchf2): New expander.
(fixuns_trunchfhi2): Likewise.
(*fixuns_trunchfsi2zext): New define_insn.
* config/i386/sse.md (ssePHmodelower): New mode_attr.
(fix_trunc2):
New expander for same element vector fix_truncate.
(fix_trunc2):
Likewise for V4HF to V4SI/V4DI fix_truncate.
(fix_truncv2hfv2di2):
Likeise for V2HF to V2DI fix_truncate.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-trunchf.c: New test.
* gcc.target/i386/avx512fp16-truncvnhf.c: Ditto.
---
 gcc/config/i386/i386.md   | 29 +
 gcc/config/i386/sse.md| 43 +
 .../gcc.target/i386/avx512fp16-trunchf.c  | 59 ++
 .../gcc.target/i386/avx512fp16-truncvnhf.c| 61 +++
 4 files changed, 192 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunchf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-truncvnhf.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index a087e557d7f..c6279e620c9 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4810,6 +4810,16 @@ (define_expand "fix_truncdi2"
}
 })
 
+(define_insn "fix_trunchf2"
+  [(set (match_operand:SWI48 0 "register_operand" "=r")
+   (any_fix:SWI48
+ (match_operand:HF 1 "nonimmediate_operand" "vm")))]
+  "TARGET_AVX512FP16"
+  "vcvttsh2si\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sseicvt")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "")])
+
 ;; Signed conversion to SImode.
 
 (define_expand "fix_truncxfsi2"
@@ -4917,6 +4927,17 @@ (define_insn "fixuns_truncsi2_avx512f"
(set_attr "prefix" "evex")
(set_attr "mode" "SI")])
 
+(define_insn "*fixuns_trunchfsi2zext"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (zero_extend:DI
+ (unsigned_fix:SI
+   (match_operand:HF 1 "nonimmediate_operand" "vm"]
+  "TARGET_64BIT && TARGET_AVX512FP16"
+  "vcvttsh2usi\t{%1, %k0|%k0, %1}"
+  [(set_attr "type" "sseicvt")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "SI")])
+
 (define_insn "*fixuns_truncsi2_avx512f_zext"
   [(set (match_operand:DI 0 "register_operand" "=r")
(zero_extend:DI
@@ -4949,6 +4970,14 @@ (define_insn_and_split "*fixuns_trunc_1"
 ;; Without these patterns, we'll try the unsigned SI conversion which
 ;; is complex for SSE, rather than the signed SI conversion, which isn't.
 
+(define_expand "fixuns_trunchfhi2"
+  [(set (match_dup 2)
+   (fix:SI (match_operand:HF 1 "nonimmediate_operand")))
+   (set (match_operand:HI 0 "nonimmediate_operand")
+   (subreg:HI (match_dup 2) 0))]
+  "TARGET_AVX512FP16"
+  "operands[2] = gen_reg_rtx (SImode);")
+
 (define_expand "fixuns_trunchi2"
   [(set (match_dup 2)
(fix:SI (match_operand:MODEF 1 "nonimmediate_operand")))
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 1ca95984afc..f8a5f197f3c 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1034,6 +1034,13 @@ (define_mode_attr ssePHmode
(V8DI "V8HF") (V4DI "V8HF") (V2DI "V8HF")
(V8DF "V8HF") (V16SF "V16HF") (V8SF "V8HF")])
 
+;; Mapping of vector modes to vector hf modes of same element.
+(define_mode_attr ssePHmodelower
+  [(V32HI "v32hf") (V16HI "v16hf") (V8HI "v8hf")
+   (V16SI "v16hf") (V8SI "v8hf") (V4SI "v4hf")
+   (V8DI "v8hf") (V4DI "v4hf") (V2DI "v2hf")
+   (V8DF "v8hf") (V16SF "v16hf") (V8SF "v8hf")])
+
 ;; Mapping of vector modes to packed single mode of the same size
 (define_mode_attr ssePSmode
   [(V16SI "V16SF") (V8DF "V16SF")
@@ -6175,6 +6182,12 @@ (define_insn 
"avx512fp16_vcvtsi2sh"
(set_attr "prefix" "evex")
(set_attr "mode" "HF")])
 
+(define_expand "fix_trunc2"
+  [(set (match_operand:VI2H_AVX512VL 0 "register_operand")
+   (any_fix:VI2H_AVX512VL
+ (match_operand: 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16")
+
 (define_insn 
"avx512fp16_fix_trunc2"
   [(set (match_operand:VI2H_AVX512VL 0 "register_operand" "=v")
(any_fix:VI2H_AVX512VL
@@ -6185,6 +6198,21 @@ (define_insn 
"avx512fp16_fix_trunc2")])
 
+(define_expand "fix_truncv4hf2"
+  [(set (match_operand:VI4_128_8_256 0 "register_operand")
+   (any_fix:VI4_128_8_256
+ (match_operand:V4HF 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  if (!MEM_P (operands[1]))
+{
+  operands[1] = lowpart_subreg (V8HFmode, operands[1], V4HFmode);
+  emit_insn (gen_avx512fp16_fix_trunc2 (operands[0],
+   operands[1]));
+  DONE;
+}
+})
+
 (define_insn "avx512fp16_fix_trunc2"
   [(set (match_operand:VI4_128_8_256 0 "register_operand" "=v")
(any_fix:VI4_128_8_256
@@ -6207,6 +6235,21 @@ (define_insn 
"*avx512fp16_fix_tru

[PATCH 5/7] AVX512FP16: Add float(uns)?mn2 expander

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang 

gcc/ChangeLog:

* config/i386/sse.md (float2):
New expander.
(avx512fp16_vcvt2ph_):
Rename to ...
(floatv4hf2): ... this, and drop constraints.
(avx512fp16_vcvtqq2ph_v2di): Rename to ...
(floatv2div2hf2): ... this, and likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-floatvnhf.c: New test.
---
 gcc/config/i386/sse.md| 46 +++---
 .../gcc.target/i386/avx512fp16-floatvnhf.c| 61 +++
 2 files changed, 99 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f8a5f197f3c..66062dc3bcf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6006,6 +6006,12 @@ (define_insn 
"avx512fp16_vcvtph2_<
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
+(define_expand "float2"
+  [(set (match_operand: 0 "register_operand")
+   (any_float:
+ (match_operand:VI2H_AVX512VL 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16")
+
 (define_insn 
"avx512fp16_vcvt2ph_"
   [(set (match_operand: 0 "register_operand" "=v")
(any_float:
@@ -6016,11 +6022,23 @@ (define_insn 
"avx512fp16_vcvt2ph_")])
 
-(define_expand "avx512fp16_vcvt2ph_"
-  [(set (match_operand:V8HF 0 "register_operand" "=v")
+(define_expand "floatv4hf2"
+  [(set (match_operand:V4HF 0 "register_operand")
+   (any_float:V4HF
+ (match_operand:VI4_128_8_256 1 "vector_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  operands[0] = lowpart_subreg (V8HFmode, operands[0], V4HFmode);
+  emit_insn (gen_avx512fp16_floatv4hf2 (operands[0],
+ operands[1]));
+  DONE;
+})
+
+(define_expand "avx512fp16_floatv4hf2"
+  [(set (match_operand:V8HF 0 "register_operand")
(vec_concat:V8HF
-   (any_float:V4HF (match_operand:VI4_128_8_256 1 "vector_operand" 
"vm"))
-   (match_dup 2)))]
+ (any_float:V4HF (match_operand:VI4_128_8_256 1 "vector_operand"))
+ (match_dup 2)))]
   "TARGET_AVX512FP16 && TARGET_AVX512VL"
   "operands[2] = CONST0_RTX (V4HFmode);")
 
@@ -6079,11 +6097,23 @@ (define_insn 
"*avx512fp16_vcvt2ph__mask_1"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
-(define_expand "avx512fp16_vcvtqq2ph_v2di"
-  [(set (match_operand:V8HF 0 "register_operand" "=v")
+(define_expand "floatv2div2hf2"
+  [(set (match_operand:V2HF 0 "register_operand")
+   (any_float:V2HF
+ (match_operand:V2DI 1 "vector_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  operands[0] = lowpart_subreg (V8HFmode, operands[0], V2HFmode);
+  emit_insn (gen_avx512fp16_floatv2div2hf2 (operands[0],
+   operands[1]));
+  DONE;
+})
+
+(define_expand "avx512fp16_floatv2div2hf2"
+  [(set (match_operand:V8HF 0 "register_operand")
(vec_concat:V8HF
-   (any_float:V2HF (match_operand:V2DI 1 "vector_operand" "vm"))
-   (match_dup 2)))]
+ (any_float:V2HF (match_operand:V2DI 1 "vector_operand"))
+ (match_dup 2)))]
   "TARGET_AVX512FP16 && TARGET_AVX512VL"
   "operands[2] = CONST0_RTX (V6HFmode);")
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c 
b/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c
new file mode 100644
index 000..112ac3e74d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512fp16-floatvnhf.c
@@ -0,0 +1,61 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-slp-vectorize 
-mprefer-vector-width=512" } */
+
+extern long long di[8];
+extern unsigned long long udi[8];
+extern int si[16];
+extern unsigned int usi[16];
+extern short hi[32];
+extern unsigned short uhi[32];
+extern _Float16 hf[32];
+
+#define DO_PRAGMA(X) _Pragma(#X)
+
+#define FLOATHFVV(size, mode)  \
+  void __attribute__ ((noinline, noclone))  \
+float##v##size##mode##v##size##hf ()   \
+{\
+  int i;  \
+  DO_PRAGMA (GCC unroll size)  \
+  for (i = 0; i < size; i++)  \
+hf[i] = (_Float16) mode[i];  \
+}
+
+FLOATHFVV(32, hi)
+FLOATHFVV(16, hi)
+FLOATHFVV(8, hi)
+FLOATHFVV(16, si)
+FLOATHFVV(8, si)
+FLOATHFVV(4, si)
+FLOATHFVV(8, di)
+FLOATHFVV(4, di)
+FLOATHFVV(2, di)
+
+FLOATHFVV(32, uhi)
+FLOATHFVV(16, uhi)
+FLOATHFVV(8, uhi)
+FLOATHFVV(16, usi)
+FLOATHFVV(8, usi)
+FLOATHFVV(4, usi)
+FLOATHFVV(8, udi)
+FLOATHFVV(4, udi)
+FLOATHFVV(2, udi)
+
+/* { dg-final { scan-assembler-times "vcvtqq2phz\[ 
\\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vcvtuqq2phz\[ 
\\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vcvtqq2phy\[ 
\\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times "vcvtuqq2phy\[ 
\\t\]+\[^\{\n\]*\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } }

[PATCH 6/7] AVX512FP16: add truncmn2/extendmn2 expanders

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang 

gcc/ChangeLog:

* config/i386/sse.md (extend2):
New expander.
(extendv4hf2): Likewise.
(extendv2hfv2df2): Likewise.
(trunc2): Likewise.
(avx512fp16_vcvt2ph_): Rename to ...
(truncv4hf2): ... this, and drop constraints.
(avx512fp16_vcvtpd2ph_v2df): Rename to ...
(truncv2dfv2hf2): ... this, and likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fp16-trunc-extendvnhf.c: New test.
---
 gcc/config/i386/sse.md| 75 +--
 .../i386/avx512fp16-trunc-extendvnhf.c| 55 ++
 2 files changed, 123 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-trunc-extendvnhf.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 66062dc3bcf..a48c8e8bede 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6328,6 +6328,12 @@ (define_mode_attr ph2pssuffix
   [(V16SF "x") (V8SF "x") (V4SF "x")
(V8DF "") (V4DF "") (V2DF "")])
 
+(define_expand "extend2"
+  [(set (match_operand:VF48H_AVX512VL 0 "register_operand")
+   (float_extend:VF48H_AVX512VL
+ (match_operand: 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16")
+
 (define_insn "avx512fp16_float_extend_ph2"
   [(set (match_operand:VF48H_AVX512VL 0 "register_operand" "=v")
(float_extend:VF48H_AVX512VL
@@ -6338,6 +6344,21 @@ (define_insn 
"avx512fp16_float_extend_ph2"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
+(define_expand "extendv4hf2"
+  [(set (match_operand:VF4_128_8_256 0 "register_operand")
+   (float_extend:VF4_128_8_256
+ (match_operand:V4HF 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  if (!MEM_P (operands[1]))
+{
+  operands[1] = lowpart_subreg (V8HFmode, operands[1], V4HFmode);
+  emit_insn (gen_avx512fp16_float_extend_ph2
+(operands[0], operands[1]));
+  DONE;
+}
+})
+
 (define_insn "avx512fp16_float_extend_ph2"
   [(set (match_operand:VF4_128_8_256 0 "register_operand" "=v")
(float_extend:VF4_128_8_256
@@ -6360,6 +6381,21 @@ (define_insn 
"*avx512fp16_float_extend_ph2_load"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
+(define_expand "extendv2hfv2df2"
+  [(set (match_operand:V2DF 0 "register_operand")
+   (float_extend:V2DF
+ (match_operand:V2HF 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  if (!MEM_P (operands[1]))
+{
+  operands[1] = lowpart_subreg (V8HFmode, operands[1], V2HFmode);
+  emit_insn (gen_avx512fp16_float_extend_phv2df2
+(operands[0], operands[1]));
+  DONE;
+}
+})
+
 (define_insn "avx512fp16_float_extend_phv2df2"
   [(set (match_operand:V2DF 0 "register_operand" "=v")
(float_extend:V2DF
@@ -6382,6 +6418,12 @@ (define_insn 
"*avx512fp16_float_extend_phv2df2_load"
(set_attr "prefix" "evex")
(set_attr "mode" "TI")])
 
+(define_expand "trunc2"
+  [(set (match_operand: 0 "register_operand")
+   (float_truncate:
+ (match_operand:VF48H_AVX512VL 1 "nonimmediate_operand")))]
+  "TARGET_AVX512FP16")
+
 (define_insn "avx512fp16_vcvt2ph_"
   [(set (match_operand: 0 "register_operand" "=v")
(float_truncate:
@@ -6392,11 +6434,21 @@ (define_insn 
"avx512fp16_vcvt2ph_"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
-(define_expand "avx512fp16_vcvt2ph_"
-  [(set (match_operand:V8HF 0 "register_operand" "=v")
+(define_expand "truncv4hf2"
+  [(set (match_operand:V4HF 0 "register_operand")
+   (float_truncate:V4HF (match_operand:VF4_128_8_256 1 "vector_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  operands[0] = lowpart_subreg (V8HFmode, operands[0], V4HFmode);
+  emit_insn (gen_avx512fp16_truncv4hf2 (operands[0], operands[1]));
+  DONE;
+})
+
+(define_expand "avx512fp16_truncv4hf2"
+  [(set (match_operand:V8HF 0 "register_operand")
(vec_concat:V8HF
(float_truncate:V4HF
- (match_operand:VF4_128_8_256 1 "vector_operand" "vm"))
+ (match_operand:VF4_128_8_256 1 "vector_operand"))
(match_dup 2)))]
   "TARGET_AVX512FP16 && TARGET_AVX512VL"
   "operands[2] = CONST0_RTX (V4HFmode);")
@@ -6461,11 +6513,20 @@ (define_insn 
"*avx512fp16_vcvt2ph__mask_1"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
-(define_expand "avx512fp16_vcvtpd2ph_v2df"
-  [(set (match_operand:V8HF 0 "register_operand" "=v")
+(define_expand "truncv2dfv2hf2"
+  [(set (match_operand:V2HF 0 "register_operand")
+   (float_truncate:V2HF (match_operand:V2DF 1 "vector_operand")))]
+  "TARGET_AVX512FP16 && TARGET_AVX512VL"
+{
+  operands[0] = lowpart_subreg (V8HFmode, operands[0], V2HFmode);
+  emit_insn (gen_avx512fp16_truncv2dfv2hf2 (operands[0], operands[1]));
+  DONE;
+})
+
+(define_expand "avx512fp16_truncv2dfv2hf2"
+  [(set (match_operand:V8HF 0 "register_operand")
(vec_concat:V8HF
- (float_truncate:V2HF
-

[PATCH 7/7] AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes.

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang 

gcc/ChangeLog:

* config/i386/i386-expand.c (ix86_use_mask_cmp_p): Enable
HFmode mask_cmp.
* config/i386/sse.md (sseintvecmodelower): Add HF vector modes.
(_store_mask): Extend to support HF vector modes.
(vec_cmp): Likewise.
(vcond_mask_): Likewise.
(vcond): New expander.
(vcond): Likewise.
(vcond): Likewise.
(vcondu): Likewise.

gcc/testsuite/ChangeLog:

* g++.target/i386/avx512fp16-vcondmn-vec.C: New test.
* g++.target/i386/avx512fp16-vcondmn-minmax.C: Ditto.
* gcc.target/i386/avx512fp16-vcondmn-loop-1.c: Ditto.
* gcc.target/i386/avx512fp16-vcondmn-loop-2.c: Ditto.
* gcc.target/i386/avx512fp16-vec_cmpmn.c: Ditto.
---
 gcc/config/i386/i386-expand.c |   2 +
 gcc/config/i386/sse.md|  84 --
 .../i386/avx512fp16-vcondmn-minmax.C  |  25 +++
 .../g++.target/i386/avx512fp16-vcondmn-vec.C  |  70 +
 .../i386/avx512fp16-vcondmn-loop-1.c  |  70 +
 .../i386/avx512fp16-vcondmn-loop-2.c  | 143 ++
 .../gcc.target/i386/avx512fp16-vec_cmpmn.c|  32 
 7 files changed, 414 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-minmax.C
 create mode 100644 gcc/testsuite/g++.target/i386/avx512fp16-vcondmn-vec.C
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vcondmn-loop-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vec_cmpmn.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index dbbf5e34656..94ac303585e 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -3638,6 +3638,8 @@ ix86_use_mask_cmp_p (machine_mode mode, machine_mode 
cmp_mode,
 return false;
   else if (vector_size == 64)
 return true;
+  else if (GET_MODE_INNER (cmp_mode) == HFmode)
+return true;
 
   /* When op_true is NULL, op_false must be NULL, or vice versa.  */
   gcc_assert (!op_true == !op_false);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index a48c8e8bede..084fc7f4693 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -989,9 +989,9 @@ (define_mode_attr sseintvecmode2
(V16HF "OI") (V8HF "TI")])
 
 (define_mode_attr sseintvecmodelower
-  [(V16SF "v16si") (V8DF "v8di")
-   (V8SF "v8si") (V4DF "v4di")
-   (V4SF "v4si") (V2DF "v2di")
+  [(V32HF "v32hi") (V16SF "v16si") (V8DF "v8di")
+   (V16HF "v16hi") (V8SF "v8si") (V4DF "v4di")
+   (V8HF "v8hi") (V4SF "v4si") (V2DF "v2di")
(V8SI "v8si") (V4DI "v4di")
(V4SI "v4si") (V2DI "v2di")
(V16HI "v16hi") (V8HI "v8hi")
@@ -1568,9 +1568,9 @@ (define_insn "_store_mask"
(set_attr "mode" "")])
 
 (define_insn "_store_mask"
-  [(set (match_operand:VI12_AVX512VL 0 "memory_operand" "=m")
-   (vec_merge:VI12_AVX512VL
- (match_operand:VI12_AVX512VL 1 "register_operand" "v")
+  [(set (match_operand:VI12HF_AVX512VL 0 "memory_operand" "=m")
+   (vec_merge:VI12HF_AVX512VL
+ (match_operand:VI12HF_AVX512VL 1 "register_operand" "v")
  (match_dup 0)
  (match_operand: 2 "register_operand" "Yk")))]
   "TARGET_AVX512BW"
@@ -3810,8 +3810,8 @@ (define_insn "_comi"
 (define_expand "vec_cmp"
   [(set (match_operand: 0 "register_operand")
(match_operator: 1 ""
- [(match_operand:V48_AVX512VL 2 "register_operand")
-  (match_operand:V48_AVX512VL 3 "nonimmediate_operand")]))]
+ [(match_operand:V48H_AVX512VL 2 "register_operand")
+  (match_operand:V48H_AVX512VL 3 "nonimmediate_operand")]))]
   "TARGET_AVX512F"
 {
   bool ok = ix86_expand_mask_vec_cmp (operands[0], GET_CODE (operands[1]),
@@ -4018,6 +4018,51 @@ (define_expand "vcond"
   DONE;
 })
 
+(define_expand "vcond"
+  [(set (match_operand:VF_AVX512FP16VL 0 "register_operand")
+   (if_then_else:VF_AVX512FP16VL
+ (match_operator 3 ""
+   [(match_operand:VF_AVX512FP16VL 4 "vector_operand")
+(match_operand:VF_AVX512FP16VL 5 "vector_operand")])
+ (match_operand:VF_AVX512FP16VL 1 "general_operand")
+ (match_operand:VF_AVX512FP16VL 2 "general_operand")))]
+  "TARGET_AVX512FP16"
+{
+  bool ok = ix86_expand_fp_vcond (operands);
+  gcc_assert (ok);
+  DONE;
+})
+
+(define_expand "vcond"
+  [(set (match_operand:VF_AVX512FP16VL 0 "register_operand")
+   (if_then_else:VF_AVX512FP16VL
+ (match_operator 3 ""
+   [(match_operand: 4 "vector_operand")
+(match_operand: 5 "vector_operand")])
+ (match_operand:VF_AVX512FP16VL 1 "general_operand")
+ (match_operand:VF_AVX512FP16VL 2 "general_operand")))]
+  "TARGET_AVX512FP16"
+{
+  bool ok = ix86_expand_int_vcond (operands);
+  gcc_assert (ok);
+  DONE;
+})
+
+(define_expand "vcond"
+  [(set (match_operand: 0 "register_operand")
+   (if_then_else:
+ (match_op

Re: [PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-22 Thread Richard Biener via Gcc-patches
On Thu, 23 Sep 2021, Hongtao Liu wrote:

> On Thu, Sep 23, 2021 at 9:48 AM Hongtao Liu  wrote:
> >
> > On Wed, Sep 22, 2021 at 10:21 PM Martin Sebor  wrote:
> > >
> > > On 9/21/21 7:38 PM, Hongtao Liu wrote:
> > > > On Mon, Sep 20, 2021 at 4:13 AM Martin Sebor  wrote:
> > > ...
> > > > diff --git a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c 
> > > > b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > > index 1d79930cd58..9351f7e7a1a 100644
> > > > --- a/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > > +++ b/gcc/testsuite/c-c++-common/Wstringop-overflow-2.c
> > > > @@ -1,7 +1,7 @@
> > > >/* PR middle-end/91458 - inconsistent warning for writing past 
> > > > the end
> > > >   of an array member
> > > >   { dg-do compile }
> > > > -   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf" } */
> > > > +   { dg-options "-O2 -Wall -Wno-array-bounds -fno-ipa-icf 
> > > > -fno-tree-vectorize" } */
> > > 
> > >  The testcase is large - what part requires this change?  Given the
> > >  testcase was added for inconsistent warnings do they now become
> > >  inconsistent again as we enable vectorization at -O2?
> > > 
> > >  That said, the testcase adjustments need some explaining - I suppose
> > >  you didn't just slap -fno-tree-vectorize to all of those changing
> > >  behavior?
> > > 
> > > >>> void ga1_ (void)
> > > >>> {
> > > >>> a1_.a[0] = 0;
> > > >>> a1_.a[1] = 1; // { dg-warning 
> > > >>> "\\\[-Wstringop-overflow" }
> > > >>> a1_.a[2] = 2; // { dg-warning 
> > > >>> "\\\[-Wstringop-overflow" }
> > > >>>
> > > >>> struct A1 a;
> > > >>> a.a[0] = 0;
> > > >>> a.a[1] = 1;   // { dg-warning 
> > > >>> "\\\[-Wstringop-overflow" }
> > > >>> a.a[2] = 2;   // { dg-warning 
> > > >>> "\\\[-Wstringop-overflow" }
> > > >>> sink (&a);
> > > >>> }
> > > >>>
> > > >>> It's supposed to be 2 warning for a.a[1] = 1 and a.a[2] = 1 since
> > > >>> there are 2 accesses, but after enabling vectorization, there's only
> > > >>> one access, so one warning is missing which causes the failure.
> > >
> > > With the stores vectorized, is the warning on the correct line or
> > > does it point to the first store, the one that's in bounds, as
> > > it does with -O3?  The latter would be a regression at -O2.
> > For the upper case, It points to the second store which is out of
> > bounds, the third store warning is missing.
> > >
> > > >>
> > > >> I would find it preferable to change the test code over disabling
> > > >> optimizations that are on by default.  My concern is that the test
> > > >> would no longer exercise the default behavior.  (The same goes for
> > > >> the -fno-ipa-icf option.)
> > > > Hmm, it's a middle-end test, for some backend, it may not do
> > > > vectorization(it depends on TARGET_VECTOR_MODE_SUPPORTED_P and
> > > > relative cost model).
> > >
> > > Yes, there are quite a few warning tests like that.  Their main
> > > purpose is to verify that in common GCC invocations (i.e., without
> > > any special options) warnings are a) issued when expected and b)
> > > not issued when not expected.  Otherwise, middle end warnings are
> > > known to have both false positives and false negatives in some
> > > invocations, depending on what optimizations are in effect.
> > > Indiscriminately disabling common optimizations for these large
> > > tests and invoking them under artificial conditions would
> > > compromise this goal and hide the problems.
> > >
> > > If enabling vectorization at -O2 causes regressions in the quality
> > > of diagnostics (as the test failure above indicates seems to be
> > > happening) we should investigate these and open bugs for them so
> > > they can be fixed.  We can then tweak the specific failing test
> > > cases to avoid the failures until they are fixed.
> > There are indeed cases of false positives and false negatives
> > .i.e.
> > // Verify warning for access to a definition with an initializer that
> > // initializes the one-element array member.
> > struct A1 a1i_1 = { 0, { 1 } };
> >
> > void ga1i_1 (void)
> > {
> >   a1i_1.a[0] = 0;
> >   a1i_1.a[1] = 1;   // { dg-warning "\\\[-Wstringop-overflow" }
> >   a1i_1.a[2] = 2;   // { dg-warning "\\\[-Wstringop-overflow" }
> >
> >   struct A1 a = { 0, { 1 } }; --- false positive here.
> >   a.a[0] = 1;
> >   a.a[1] = 2;   // { dg-warning
> > "\\\[-Wstringop-overflow" } false negative here.
> >   a.a[2] = 3;   // { dg-warning
> > "\\\[-Wstringop-overflow" } false negative here.
> >   sink (&a);
> > }
> Similar for
> * gcc.dg/Warray-bounds-51.c.
> * gcc.dg/Warray-parameter-3.c
> * gcc.dg/Wstringop-overflow-14.c
> * gcc.dg/Wstringop-overflow-21.c
> 
> So there're 3 situations.
> 1. All accesses are out of bound, and after vectorization, there are
> some warnings missing.
> 2

<    1   2