[PATCH 4/4] xtensa: Improve constant synthesis for both integer and floating-point

2022-06-09 Thread Takayuki 'January June' Suwa via Gcc-patches

This patch revises the previous implementation of constant synthesis.

First, changed to use define_split machine description pattern and to run
after reload pass, in order not to interfere some optimizations such as
the loop invariant motion.

Second, not only integer but also floating-point constants are now subject to processing.

Third, several new synthesis patterns are added for when the constant cannot
fit into a "MOVI Ax, simm12" instruction, but:

I.   can be represented as a power of two minus one (e.g. 32767, 65535 or
 0x7fffUL)
   => "MOVI(.N) Ax, -1" + "SRLI Ax, Ax, 1 ... 31" (or "EXTUI")
II.  is between -34816 and 34559
   => "MOVI(.N) Ax, -2048 ... 2047" + "ADDMI Ax, Ax, -32768 ... 32512"
III. (existing case) can fit into a signed 12-bit value once the trailing
 zero bits are stripped
   => "MOVI(.N) Ax, -2048 ... 2047" + "SLLI Ax, Ax, 1 ... 31"

The above sequences consist of 5 or 6 bytes and have a latency of 2 clock
cycles, in contrast with "L32R Ax, " (3 bytes and one clock latency, but may
suffer an additional one-clock pipeline stall and an implementation-specific
InstRAM/ROM access penalty) plus 4 bytes of constant value.

In addition, 3-instruction synthesis patterns (8 or 9 bytes, 3 clock latency)
are also provided when optimizing for speed and the L32R instruction has a
considerable access penalty:

IV.  2-instruction synthesis (any of I ... III) followed by
 "SLLI Ax, Ax, 1 ... 31"
V.   2-instruction synthesis followed by either "ADDX[248] Ax, Ax, Ax"
 or "SUBX8 Ax, Ax, Ax" (multiplying by 3, 5, 7 or 9); a small
 illustrative example follows.

gcc/ChangeLog:

* config/xtensa/xtensa-protos.h (xtensa_constantsynth):
New prototype.
* config/xtensa/xtensa.cc (xtensa_emit_constantsynth,
xtensa_constantsynth_2insn, xtensa_constantsynth_rtx_SLLI,
xtensa_constantsynth_rtx_ADDSUBX, xtensa_constantsynth):
New backend functions that implement the above-mentioned logic.
(xtensa_emit_move_sequence): Revert the previous changes.
* config/xtensa/xtensa.md (): New split patterns for integer
and floating-point, as the frontend part.

gcc/testsuite/ChangeLog:

* gcc.target/xtensa/constsynth_2insns.c: New.
* gcc.target/xtensa/constsynth_3insns.c: Ditto.
* gcc.target/xtensa/constsynth_double.c: Ditto.
---
 gcc/config/xtensa/xtensa-protos.h |   1 +
 gcc/config/xtensa/xtensa.cc   | 144 --
 gcc/config/xtensa/xtensa.md   |  50 ++
 .../gcc.target/xtensa/constsynth_2insns.c |  44 ++
 .../gcc.target/xtensa/constsynth_3insns.c |  24 +++
 .../gcc.target/xtensa/constsynth_double.c |  11 ++
 6 files changed, 258 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/xtensa/constsynth_2insns.c
 create mode 100644 gcc/testsuite/gcc.target/xtensa/constsynth_3insns.c
 create mode 100644 gcc/testsuite/gcc.target/xtensa/constsynth_double.c

diff --git a/gcc/config/xtensa/xtensa-protos.h b/gcc/config/xtensa/xtensa-protos.h
index 30e4b54394a..c2fd750cd3a 100644
--- a/gcc/config/xtensa/xtensa-protos.h
+++ b/gcc/config/xtensa/xtensa-protos.h
@@ -44,6 +44,7 @@ extern int xtensa_expand_block_move (rtx *);
 extern int xtensa_expand_block_set_unrolled_loop (rtx *);
 extern int xtensa_expand_block_set_small_loop (rtx *);
 extern void xtensa_split_operand_pair (rtx *, machine_mode);
+extern int xtensa_constantsynth (rtx, HOST_WIDE_INT);
 extern int xtensa_emit_move_sequence (rtx *, machine_mode);
 extern rtx xtensa_copy_incoming_a7 (rtx);
 extern void xtensa_expand_nonlocal_goto (rtx *);
diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index 1769e43c7b5..2febea0eb3d 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -1037,6 +1037,134 @@ xtensa_split_operand_pair (rtx operands[4], machine_mode mode)

 }


+/* Try to emit insns to load srcval (that cannot fit into signed 12-bit)
+   into dst with synthesizing a such constant value from a sequence of
+   load-immediate / arithmetic ones, instead of a L32R instruction
+   (plus a constant in litpool).  */
+
+static void
+xtensa_emit_constantsynth (rtx dst, enum rtx_code code,
+  HOST_WIDE_INT imm0, HOST_WIDE_INT imm1,
+  rtx (*gen_op)(rtx, HOST_WIDE_INT),
+  HOST_WIDE_INT imm2)
+{
+  if (REG_P (dst))
+{
+  emit_move_insn (dst, GEN_INT (imm0));
+  emit_move_insn (dst, gen_rtx_fmt_ee (code, SImode,
+  dst, GEN_INT (imm1)));
+  if (gen_op)
+   emit_move_insn (dst, gen_op (dst, imm2));
+}
+  else
+{
+  rtx r = gen_reg_rtx (SImode);
+
+  emit_move_insn (r, GEN_INT (imm0));
+  emit_move_insn (r, gen_rtx_fmt_ee (code, SImode,
+r, GEN_INT (imm1)));
+  emit_move_insn (dst, gen_op ? gen_op (r, imm2) : r);
+}
+}
+
+static int
+xtensa_constantsynth_2insn (rtx dst, HOST_WIDE_INT srcval,
+ 

[PATCH 2/4] xtensa: Consider the Loop Option when setmemsi is expanded to small loop

2022-06-09 Thread Takayuki 'January June' Suwa via Gcc-patches

It now applies to an aligned block of almost any size under such circumstances (i.e. when the Loop Option is configured and active).
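
As a rough illustration (a hypothetical snippet, not one of the existing
test cases): with the Loop Option configured and optimization enabled, an
aligned block set like the one below, which the old code rejected (400
bytes is neither <= 127 nor a multiple of 256), may now be expanded inline
as a small zero-overhead loop instead of a call to memset:

#include <string.h>

int buf[100];   /* 4-byte aligned, 400 bytes in total */

void
clear_buf (void)
{
  /* setmemsi candidate; with TARGET_LOOPS and -O2 this may now expand
     to a MOVI/LOOP/S32I/ADDI sequence rather than calling memset.  */
  memset (buf, 0, sizeof buf);
}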

gcc/ChangeLog:

* config/xtensa/xtensa.cc (xtensa_expand_block_set_small_loop):
Pass through the block length / loop count conditions if
zero-overhead looping is configured and active.
---
 gcc/config/xtensa/xtensa.cc | 65 +
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index c7b54babc37..616ced3ed38 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -1483,7 +1483,7 @@ xtensa_expand_block_set_unrolled_loop (rtx *operands)
 int
 xtensa_expand_block_set_small_loop (rtx *operands)
 {
-  HOST_WIDE_INT bytes, value, align;
+  HOST_WIDE_INT bytes, value, align, count;
   int expand_len, funccall_len;
   rtx x, dst, end, reg;
   machine_mode unit_mode;
@@ -1503,17 +1503,25 @@ xtensa_expand_block_set_small_loop (rtx *operands)
   /* Totally-aligned block only.  */
   if (bytes % align != 0)
 return 0;
+  count = bytes / align;

-  /* If 4-byte aligned, small loop substitution is almost optimal, thus
- limited to only offset to the end address for ADDI/ADDMI instruction.  */
-  if (align == 4
-  && ! (bytes <= 127 || (bytes <= 32512 && bytes % 256 == 0)))
-return 0;
+  /* If the Loop Option (zero-overhead looping) is configured and active,
+ almost no restrictions about the length of the block.  */
+  if (! (TARGET_LOOPS && optimize))
+{
+  /* If 4-byte aligned, small loop substitution is almost optimal,
+thus limited to only offset to the end address for ADDI/ADDMI
+instruction.  */
+  if (align == 4
+ && ! (bytes <= 127 || (bytes <= 32512 && bytes % 256 == 0)))
+   return 0;

-  /* If no 4-byte aligned, loop count should be treated as the constraint.  */
-  if (align != 4
-  && bytes / align > ((optimize > 1 && !optimize_size) ? 8 : 15))
-return 0;
+  /* If no 4-byte aligned, loop count should be treated as the
+constraint.  */
+  if (align != 4
+ && count > ((optimize > 1 && !optimize_size) ? 8 : 15))
+   return 0;
+}

   /* Insn expansion: holding the init value.
  Either MOV(.N) or L32R w/litpool.  */
@@ -1523,16 +1531,33 @@ xtensa_expand_block_set_small_loop (rtx *operands)
 expand_len = TARGET_DENSITY ? 2 : 3;
   else
 expand_len = 3 + 4;
-  /* Insn expansion: Either ADDI(.N) or ADDMI for the end address.  */
-  expand_len += bytes > 127 ? 3
-   : (TARGET_DENSITY && bytes <= 15) ? 2 : 3;
-
-  /* Insn expansion: the loop body and branch instruction.
- For store, one of S8I, S16I or S32I(.N).
- For advance, ADDI(.N).
- For branch, BNE.  */
-  expand_len += (TARGET_DENSITY && align == 4 ? 2 : 3)
-   + (TARGET_DENSITY ? 2 : 3) + 3;
+  if (TARGET_LOOPS && optimize) /* zero-overhead looping */
+{
+  /* Insn translation: Either MOV(.N) or L32R w/litpool for the
+loop count.  */
+  expand_len += xtensa_simm12b (count) ? xtensa_sizeof_MOVI (count)
+  : 3 + 4;
+  /* Insn translation: LOOP, the zero-overhead looping setup
+instruction.  */
+  expand_len += 3;
+  /* Insn expansion: the loop body instructions.
+   For store, one of S8I, S16I or S32I(.N).
+   For advance, ADDI(.N).  */
+  expand_len += (TARGET_DENSITY && align == 4 ? 2 : 3)
+   + (TARGET_DENSITY ? 2 : 3);
+}
+  else /* NO zero-overhead looping */
+{
+  /* Insn expansion: Either ADDI(.N) or ADDMI for the end address.  */
+  expand_len += bytes > 127 ? 3
+   : (TARGET_DENSITY && bytes <= 15) ? 2 : 3;
+  /* Insn expansion: the loop body and branch instruction.
+   For store, one of S8I, S16I or S32I(.N).
+   For advance, ADDI(.N).
+   For branch, BNE.  */
+  expand_len += (TARGET_DENSITY && align == 4 ? 2 : 3)
+   + (TARGET_DENSITY ? 2 : 3) + 3;
+}

   /* Function call: preparing two arguments.  */
   funccall_len = xtensa_sizeof_MOVI (value);
--
2.20.1


[PATCH 3/4] xtensa: Improve instruction cost estimation and suggestion

2022-06-09 Thread Takayuki 'January June' Suwa via Gcc-patches

This patch implements a new target-specific relative RTL insn cost function,
because the default cost estimation is suboptimal, and fixes several "length"
insn attributes (on which the cost estimation relies).

It also introduces a new machine-dependent option "-mextra-l32r-costs=" that
tells the compiler the implementation-specific InstRAM/ROM access penalty for
the L32R instruction (in clock-cycle units, 0 by default).
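
A minimal sketch of the kind of code the division-cost change targets (a
hypothetical example, not part of the patch): with the libcall cost doubled
when optimizing for speed, dividing by a compile-time constant is more
likely to be expanded as a multiply-high sequence than as a call into
libgcc:

/* Hypothetical example; whether the multiply-based expansion is chosen
   still depends on the configured core and on options such as -O2 and,
   once wired up, -mextra-l32r-costs=N.  */
unsigned int
div10 (unsigned int x)
{
  return x / 10;   /* candidate for constant division by multiplication */
}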

gcc/ChangeLog:

* config/xtensa/xtensa.cc (xtensa_rtx_costs): Correct wrong case
for ABS and NEG, add missing case for BSWAP and CLRSB, and
double the costs for integer divisions using libfuncs if
optimizing for speed, in order to take advantage of fast constant
division by multiplication.
(TARGET_INSN_COST): New macro definition.
(xtensa_is_insn_L32R_p, xtensa_insn_cost): New functions for
calculating the relative costs of RTL insns, for both speed and
size.
* config/xtensa/xtensa.md (return, nop, trap): Correct values of
the attribute "length" that depends on TARGET_DENSITY.
(define_asm_attributes, blockage, frame_blockage): Add missing
attributes.
* config/xtensa/xtensa.opt (-mextra-l32r-costs=): New machine-
dependent option; for now it is only preparatory work.
---
 gcc/config/xtensa/xtensa.cc  | 116 ---
 gcc/config/xtensa/xtensa.md  |  29 ++---
 gcc/config/xtensa/xtensa.opt |   4 ++
 3 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index 616ced3ed38..1769e43c7b5 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -55,6 +55,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "dumpfile.h"
 #include "hw-doloop.h"
 #include "rtl-iter.h"
+#include "insn-attr.h"

 /* This file should be included last.  */
 #include "target-def.h"
@@ -134,6 +135,7 @@ static unsigned int xtensa_multibss_section_type_flags (tree, const char *,

 static section *xtensa_select_rtx_section (machine_mode, rtx,
   unsigned HOST_WIDE_INT);
 static bool xtensa_rtx_costs (rtx, machine_mode, int, int, int *, bool);
+static int xtensa_insn_cost (rtx_insn *, bool);
 static int xtensa_register_move_cost (machine_mode, reg_class_t,
  reg_class_t);
 static int xtensa_memory_move_cost (machine_mode, reg_class_t, bool);
@@ -212,6 +214,8 @@ static rtx xtensa_delegitimize_address (rtx);
 #define TARGET_MEMORY_MOVE_COST xtensa_memory_move_cost
 #undef TARGET_RTX_COSTS
 #define TARGET_RTX_COSTS xtensa_rtx_costs
+#undef TARGET_INSN_COST
+#define TARGET_INSN_COST xtensa_insn_cost
 #undef TARGET_ADDRESS_COST
 #define TARGET_ADDRESS_COST hook_int_rtx_mode_as_bool_0

@@ -3933,7 +3937,7 @@ xtensa_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,

 static bool
 xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,
  int opno ATTRIBUTE_UNUSED,
- int *total, bool speed ATTRIBUTE_UNUSED)
+ int *total, bool speed)
 {
   int code = GET_CODE (x);

@@ -4021,9 +4025,14 @@ xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,

   return true;

 case CLZ:
+case CLRSB:
   *total = COSTS_N_INSNS (TARGET_NSA ? 1 : 50);
   return true;

+case BSWAP:
+  *total = COSTS_N_INSNS (mode == HImode ? 3 : 5);
+  return true;
+
 case NOT:
   *total = COSTS_N_INSNS (mode == DImode ? 3 : 2);
   return true;
@@ -4047,13 +4056,16 @@ xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,

   return true;

 case ABS:
+case NEG:
   {
if (mode == SFmode)
  *total = COSTS_N_INSNS (TARGET_HARD_FLOAT ? 1 : 50);
else if (mode == DFmode)
  *total = COSTS_N_INSNS (50);
-   else
+   else if (mode == DImode)
  *total = COSTS_N_INSNS (4);
+   else
+ *total = COSTS_N_INSNS (1);
return true;
   }

@@ -4069,10 +4081,6 @@ xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,

return true;
   }

-case NEG:
-  *total = COSTS_N_INSNS (mode == DImode ? 4 : 2);
-  return true;
-
 case MULT:
   {
if (mode == SFmode)
@@ -4112,11 +4120,11 @@ xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,

 case UMOD:
   {
if (mode == DImode)
- *total = COSTS_N_INSNS (50);
+ *total = COSTS_N_INSNS (speed ? 100 : 50);
else if (TARGET_DIV32)
  *total = COSTS_N_INSNS (32);
else
- *total = COSTS_N_INSNS (50);
+ *total = COSTS_N_INSNS (speed ? 100 : 50);
return true;
   }

@@ -4149,6 +4157,98 @@ xtensa_rtx_costs (rtx x, machine_mode mode, int outer_code,

 }
 }

+static bool
+xtensa_is_insn_L32R_p(const rtx_insn *insn)
+{
+  rtx x = PATTERN (insn);
+
+  if (GET_CODE (x) == SET)
+{
+  x = XEXP (x, 1);
+  if (GET_CODE (x)

[PATCH 1/4] xtensa: Tweak some widen multiplications

2022-06-09 Thread Takayuki 'January June' Suwa via Gcc-patches

umulsidi3 is faster than umuldi3 even as a library call, and is also a
prerequisite for fast constant division by multiplication.
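
For reference, the widening multiply that umulsidi3 implements corresponds
to the usual C idiom below (a hypothetical snippet, not part of the patch);
with TARGET_MUL32_HIGH it expands to a low/high multiply pair, otherwise to
a single call to __umulsidi3 rather than a full 64-bit multiplication:

/* Hypothetical example of a widening unsigned multiply.  */
unsigned long long
widen_mul (unsigned int a, unsigned int b)
{
  return (unsigned long long) a * b;   /* umulsidi3 candidate */
}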

gcc/ChangeLog:

* config/xtensa/xtensa.md (mulsidi3, umulsidi3):
Split into separate signed and unsigned expanders, in order to use
the "__umulsidi3" libcall only for the unsigned case.
(mulhisi3): Merge into one by using code iterator.
(mulsidi3, mulhisi3, umulhisi3): Remove.
---
 gcc/config/xtensa/xtensa.md | 56 +
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/gcc/config/xtensa/xtensa.md b/gcc/config/xtensa/xtensa.md
index 8ff6f9a95fe..33cbd546de3 100644
--- a/gcc/config/xtensa/xtensa.md
+++ b/gcc/config/xtensa/xtensa.md
@@ -224,20 +224,42 @@
 
 ;; Multiplication.

-(define_expand "mulsidi3"
+(define_expand "mulsidi3"
   [(set (match_operand:DI 0 "register_operand")
-   (mult:DI (any_extend:DI (match_operand:SI 1 "register_operand"))
-(any_extend:DI (match_operand:SI 2 "register_operand"))))]
+   (mult:DI (sign_extend:DI (match_operand:SI 1 "register_operand"))
+(sign_extend:DI (match_operand:SI 2 "register_operand"))))]
   "TARGET_MUL32_HIGH"
 {
   rtx temp = gen_reg_rtx (SImode);
   emit_insn (gen_mulsi3 (temp, operands[1], operands[2]));
-  emit_insn (gen_mulsi3_highpart (gen_highpart (SImode, operands[0]),
-operands[1], operands[2]));
+  emit_insn (gen_mulsi3_highpart (gen_highpart (SImode, operands[0]),
+ operands[1], operands[2]));
   emit_insn (gen_movsi (gen_lowpart (SImode, operands[0]), temp));
   DONE;
 })

+(define_expand "umulsidi3"
+  [(set (match_operand:DI 0 "register_operand")
+   (mult:DI (zero_extend:DI (match_operand:SI 1 "register_operand"))
+(zero_extend:DI (match_operand:SI 2 "register_operand"))))]
+  ""
+{
+  if (TARGET_MUL32_HIGH)
+{
+  rtx temp = gen_reg_rtx (SImode);
+  emit_insn (gen_mulsi3 (temp, operands[1], operands[2]));
+  emit_insn (gen_umulsi3_highpart (gen_highpart (SImode, operands[0]),
+  operands[1], operands[2]));
+  emit_insn (gen_movsi (gen_lowpart (SImode, operands[0]), temp));
+}
+  else
+emit_library_call_value (gen_rtx_SYMBOL_REF (Pmode, "__umulsidi3"),
+operands[0], LCT_NORMAL, DImode,
+operands[1], SImode,
+operands[2], SImode);
+   DONE;
+})
+
 (define_insn "mulsi3_highpart"
   [(set (match_operand:SI 0 "register_operand" "=a")
(truncate:SI
@@ -261,30 +283,16 @@
(set_attr "mode"  "SI")
(set_attr "length""3")])

-(define_insn "mulhisi3"
-  [(set (match_operand:SI 0 "register_operand" "=C,A")
-   (mult:SI (sign_extend:SI
- (match_operand:HI 1 "register_operand" "%r,r"))
-(sign_extend:SI
- (match_operand:HI 2 "register_operand" "r,r"))))]
-  "TARGET_MUL16 || TARGET_MAC16"
-  "@
-   mul16s\t%0, %1, %2
-   mul.aa.ll\t%1, %2"
-  [(set_attr "type"  "mul16,mac16")
-   (set_attr "mode"  "SI")
-   (set_attr "length""3,3")])
-
-(define_insn "umulhisi3"
+(define_insn "mulhisi3"
   [(set (match_operand:SI 0 "register_operand" "=C,A")
-   (mult:SI (zero_extend:SI
+   (mult:SI (any_extend:SI
  (match_operand:HI 1 "register_operand" "%r,r"))
-(zero_extend:SI
+(any_extend:SI
  (match_operand:HI 2 "register_operand" "r,r"))))]
   "TARGET_MUL16 || TARGET_MAC16"
   "@
-   mul16u\t%0, %1, %2
-   umul.aa.ll\t%1, %2"
+   mul16\t%0, %1, %2
+   mul.aa.ll\t%1, %2"
   [(set_attr "type"  "mul16,mac16")
(set_attr "mode"  "SI")
(set_attr "length""3,3")])
--
2.20.1


Re: [PATCH] or1k: Add support for a little-endian target variant

2022-06-09 Thread Samuel Holland
Hi Stafford,

On 6/9/22 6:29 AM, Stafford Horne wrote:
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index c5064dd37666..0c3a09dfe810 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -2866,6 +2866,11 @@ or1k*-*-*)
>>  done
>>  TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>>  
>> +case ${target} in
>> +or1k*le*-*)
> 
> Should this be just or1kle*-*?

I wasn't sure what the order of "le" and "nd" would be if both were present.

>> +tm_defines="${tm_defines} TARGET_LITTLE_ENDIAN_DEFAULT=1"
>> +;;
>> +esac
>>  case ${target} in
>>  or1k*-*-linux*)
>>  tm_file="${tm_file} gnu-user.h linux.h glibc-stdint.h"
> 
> 
>> diff --git a/gcc/config/or1k/or1k.opt b/gcc/config/or1k/or1k.opt
>> index 8a66832a99b1..497f259faae9 100644
>> --- a/gcc/config/or1k/or1k.opt
>> +++ b/gcc/config/or1k/or1k.opt
>> @@ -24,6 +24,14 @@
>>  HeaderInclude
>>  config/or1k/or1k-opts.h
>>  
>> +mbig-endian
>> +Target Report RejectNegative Mask(BIG_ENDIAN)
>> +Use big-endian byte order.
>> +
>> +mlittle-endian
>> +Target Report RejectNegative InverseMask(BIG_ENDIAN, LITTLE_ENDIAN)
>> +Use little-endian byte order.
>> +
> 
> We should explain what the default is in the docs.
> 
> Can you also document in: gcc/doc/invoke.texi

Yes, I will do that for v2.

Regards,
Samuel


Re: [PATCH v4, rs6000] Implemented f[min/max]_optab by xs[min/max]dp [PR103605]

2022-06-09 Thread HAO CHEN GUI via Gcc-patches



On 9/6/2022 11:07 PM, Segher Boessenkool wrote:
> Ah, good.  Should we then have an assert that there is no fast-math if
> we ever get the rtl fmin/fmax stuff?

Sure, I will add a condition for it. Thanks a lot.
Gui Haochen


Re: [PATCH 4/4] xtensa: Add clrsbsi2 insn pattern

2022-06-09 Thread Max Filippov via Gcc-patches
On Sun, May 29, 2022 at 4:00 AM Takayuki 'January June' Suwa
 wrote:
>
>  > (clrsb:m x)
>  > Represents the number of redundant leading sign bits in x, represented
>  > as an integer of mode m, starting at the most significant bit position.
>
> This explanation is just what the NSA instruction (not ever emitted before)
> calculates in Xtensa ISA.
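[A worked illustration of the definition above, not part of the quoted patch: for SImode, __builtin_clrsb (1) == 30, __builtin_clrsb (0) == 31 and __builtin_clrsb (-1) == 31.]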
>
> gcc/ChangeLog:
>
> * config/xtensa/xtensa.md (clrsbsi2): New insn pattern.
>
> libgcc/ChangeLog:
>
> * config/xtensa/lib1funcs.S (__clrsbsi2): New function.
> * config/xtensa/t-xtensa (LIB1ASMFUNCS): Add _clrsbsi2.
> ---
>   gcc/config/xtensa/xtensa.md  | 12 +++-
>   libgcc/config/xtensa/lib1funcs.S | 23 +++
>   libgcc/config/xtensa/t-xtensa|  2 +-
>   3 files changed, 35 insertions(+), 2 deletions(-)

Regtested for target=xtensa-linux-uclibc, no new regressions.
Committed to master.

-- 
Thanks.
-- Max


Re: [PATCH 3/4] xtensa: Optimize '(~x & y)' to '((x & y) ^ y)'

2022-06-09 Thread Max Filippov via Gcc-patches
On Sun, May 29, 2022 at 4:00 AM Takayuki 'January June' Suwa
 wrote:
>
> In Xtensa ISA, there is no single machine instruction that calculates unary
> bitwise negation.
>
> gcc/ChangeLog:
>
> * config/xtensa/xtensa.md (*andsi3_bitcmpl):
> New insn_and_split pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/xtensa/check_zero_byte.c: New.
> ---
>   gcc/config/xtensa/xtensa.md   | 20 +++
>   .../gcc.target/xtensa/check_zero_byte.c   |  9 +
>   2 files changed, 29 insertions(+)
>   create mode 100644 gcc/testsuite/gcc.target/xtensa/check_zero_byte.c

Regtested for target=xtensa-linux-uclibc, no new regressions.
Committed to master.

-- 
Thanks.
-- Max


Re: [PATCH 2/4] xtensa: Make one_cmplsi2 optimizer-friendly

2022-06-09 Thread Max Filippov via Gcc-patches
On Sun, May 29, 2022 at 4:00 AM Takayuki 'January June' Suwa
 wrote:
>
> In Xtensa ISA, there is no single machine instruction that calculates unary
> bitwise negation.  But a few optimizers assume that bitwise negation can be
> done by a single insn.
>
> As a result, '((x < 0) ? ~x : x)' cannot be optimized to '(x ^ (x >> 31))'
> ever before, for example.
>
> This patch relaxes such limitation, by putting the insn expansion off till
> the split pass.
>
> gcc/ChangeLog:
>
> * config/xtensa/xtensa.md (one_cmplsi2):
> Rearrange as an insn_and_split pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/xtensa/one_cmpl_abs.c: New.
> ---
>   gcc/config/xtensa/xtensa.md   | 26 +--
>   .../gcc.target/xtensa/one_cmpl_abs.c  |  9 +++
>   2 files changed, 27 insertions(+), 8 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.target/xtensa/one_cmpl_abs.c

Regtested for target=xtensa-linux-uclibc, no new regressions.
Committed to master.

-- 
Thanks.
-- Max


Re: [PATCH 1/4] xtensa: Implement bswaphi2 insn pattern

2022-06-09 Thread Max Filippov via Gcc-patches
On Sun, May 29, 2022 at 4:00 AM Takayuki 'January June' Suwa
 wrote:
>
> This patch adds bswaphi2 insn pattern that is one instruction less than the
> default expansion.
>
> gcc/ChangeLog:
>
> * config/xtensa/xtensa.md (bswaphi2): New insn pattern.
> ---
>   gcc/config/xtensa/xtensa.md | 10 ++
>   1 file changed, 10 insertions(+)

Regtested for target=xtensa-linux-uclibc, no new regressions.
Committed to master.

-- 
Thanks.
-- Max


New Swedish PO file for 'gcc' (version 12.1.0)

2022-06-09 Thread Translation Project Robot
Hello, gentle maintainer.

This is a message from the Translation Project robot.

A revised PO file for textual domain 'gcc' has been submitted
by the Swedish team of translators.  The file is available at:

https://translationproject.org/latest/gcc/sv.po

(This file, 'gcc-12.1.0.sv.po', has just now been sent to you in
a separate email.)

All other PO files for your package are available in:

https://translationproject.org/latest/gcc/

Please consider including all of these in your next release, whether
official or a pretest.

Whenever you have a new distribution with a new version number ready,
containing a newer POT file, please send the URL of that distribution
tarball to the address below.  The tarball may be just a pretest or a
snapshot, it does not even have to compile.  It is just used by the
translators when they need some extra translation context.

The following HTML page has been updated:

https://translationproject.org/domain/gcc.html

If any question arises, please contact the translation coordinator.

Thank you for all your work,

The Translation Project robot, in the
name of your translation coordinator.




[PING][PATCH][WIP] have configure probe prefix for gmp/mpfr/mpc [PR44425]

2022-06-09 Thread Eric Gallager via Gcc-patches
Hi, I'd like to ping this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596126.html
(cc-ing the build machinery maintainers listed in MAINTAINERS this time)

On Thu, Jun 2, 2022 at 11:53 AM Eric Gallager  wrote:
>
> So, I'm working on fixing PR bootstrap/44425, and have this patch to
> have the top-level configure script check in the value passed to
> `--prefix=` when looking for gmp/mpfr/mpc. It "works" (in that
> configuring with just `--prefix=` and none of
> `--with-gmp=`/`--with-mpfr=`/`--with-mpc=` now works where it failed
> before), but unfortunately it results in a bunch of duplicated
> `-I`/`-L` flags stuck in ${gmplibs} and ${gmpinc}... is that
> acceptable or should I try another approach?
> Eric


patch-configure.diff
Description: Binary data


[PATCH] rs6000: Delete FP_ISA3

2022-06-09 Thread Segher Boessenkool
FP_ISA3 is exactly the same as SFDF, just a less obvious name.  So,
let's delete it.

Tested, committed, the works.


Segher


2022-06-09  Segher Boessenkool  

* config/rs6000/rs6000.md (FP_ISA3): Delete.
(float2): Rename to...
(float2): ... this.  Adjust.
(*float2_internal): Rename to...
(*float2_internal): ... this.  Adjust.
(floatuns2): Rename to...
(floatuns2): ... this.  Adjust.
(*floatuns2_internal): Rename to...
(*floatuns2_internal): ... this.  Adjust.
---
 gcc/config/rs6000/rs6000.md | 31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 1c125f07e895..c55ee7e171a3 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -619,9 +619,6 @@ (define_mode_iterator FLOAT128 [(KF "TARGET_FLOAT128_TYPE")
 (define_mode_iterator SIGNBIT [(KF "FLOAT128_VECTOR_P (KFmode)")
   (TF "FLOAT128_VECTOR_P (TFmode)")])
 
-; Iterator for ISA 3.0 supported floating point types
-(define_mode_iterator FP_ISA3 [SF DF])
-
 ; Which isa is needed for those float instructions?
 (define_mode_attr Fisa [(SF "p8v")  (DF "*") (DI "*")])
 
@@ -6012,9 +6009,9 @@ (define_insn_and_split "*floatunssidf2_internal"
 ;; the vector registers, rather then loading up a GPR, doing a sign/zero
 ;; extension and then a direct move.
 
-(define_expand "float2"
-  [(parallel [(set (match_operand:FP_ISA3 0 "vsx_register_operand")
-  (float:FP_ISA3
+(define_expand "float2"
+  [(parallel [(set (match_operand:SFDF 0 "vsx_register_operand")
+  (float:SFDF
(match_operand:QHI 1 "input_operand")))
  (clobber (match_scratch:DI 2))
  (clobber (match_scratch:DI 3))
@@ -6025,9 +6022,9 @@ (define_expand "float2"
 operands[1] = rs6000_force_indexed_or_indirect_mem (operands[1]);
 })
 
-(define_insn_and_split "*float2_internal"
-  [(set (match_operand:FP_ISA3 0 "vsx_register_operand" "=wa,wa,wa")
-   (float:FP_ISA3
+(define_insn_and_split "*float2_internal"
+  [(set (match_operand:SFDF 0 "vsx_register_operand" "=wa,wa,wa")
+   (float:SFDF
 (match_operand:QHI 1 "reg_or_indexed_operand" "v,r,Z")))
(clobber (match_scratch:DI 2 "=v,wa,v"))
(clobber (match_scratch:DI 3 "=X,r,X"))
@@ -6061,14 +6058,14 @@ (define_insn_and_split "*float2_internal"
   emit_insn (gen_extenddi2 (di, tmp));
 }
 
-  emit_insn (gen_floatdi2 (result, di));
+  emit_insn (gen_floatdi2 (result, di));
   DONE;
 }
   [(set_attr "isa" "p9v,*,p9v")])
 
-(define_expand "floatuns2"
-  [(parallel [(set (match_operand:FP_ISA3 0 "vsx_register_operand")
-  (unsigned_float:FP_ISA3
+(define_expand "floatuns2"
+  [(parallel [(set (match_operand:SFDF 0 "vsx_register_operand")
+  (unsigned_float:SFDF
(match_operand:QHI 1 "input_operand")))
  (clobber (match_scratch:DI 2))
  (clobber (match_scratch:DI 3))])]
@@ -6078,9 +6075,9 @@ (define_expand "floatuns2"
 operands[1] = rs6000_force_indexed_or_indirect_mem (operands[1]);
 })
 
-(define_insn_and_split "*floatuns2_internal"
-  [(set (match_operand:FP_ISA3 0 "vsx_register_operand" "=wa,wa,wa")
-   (unsigned_float:FP_ISA3
+(define_insn_and_split "*floatuns2_internal"
+  [(set (match_operand:SFDF 0 "vsx_register_operand" "=wa,wa,wa")
+   (unsigned_float:SFDF
 (match_operand:QHI 1 "reg_or_indexed_operand" "v,r,Z")))
(clobber (match_scratch:DI 2 "=v,wa,wa"))
(clobber (match_scratch:DI 3 "=X,r,X"))]
@@ -6107,7 +6104,7 @@ (define_insn_and_split "*floatuns2_internal"
}
 }
 
-  emit_insn (gen_floatdi2 (result, di));
+  emit_insn (gen_floatdi2 (result, di));
   DONE;
 }
   [(set_attr "isa" "p9v,*,p9v")])
-- 
1.8.3.1



Re: [PATCH 2/1] c++: optimize specialization of templated member functions

2022-06-09 Thread Patrick Palka via Gcc-patches
On Thu, 9 Jun 2022, Jason Merrill wrote:

> On 6/9/22 09:00, Patrick Palka wrote:
> > This performs one of the optimizations added by the previous
> > patch to lookup_template_class, to instantiate_template as well.
> > (For the libstdc++ ranges tests this optimization appears to be
> > effective around 30% of the time, i.e. 30% of the time context of 'tmpl'
> > is non-dependent while the context of 'gen_tmpl' is dependent.)
> 
> If this is a significant optimization, how about doing it in tsubst_aggr_type
> rather than its callers?

I'm not sure how we'd do this optimization in tsubst_aggr_type?

I haven't observed any significant time/memory improvements based on my
limited benchmarking, but I can imagine for deeply nested templates it
could be significant.  And avoiding redundant work should hopefully help
streamline debugging I suppose.

> 
> > gcc/cp/ChangeLog:
> > 
> > * pt.cc (instantiate_template): Don't substitute the context
> > of the most general template if that of the partially
> > instantiated template is non-dependent.
> > ---
> >   gcc/cp/pt.cc | 10 --
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
> > index e021c254872..208daad298a 100644
> > --- a/gcc/cp/pt.cc
> > +++ b/gcc/cp/pt.cc
> > @@ -21661,8 +21661,14 @@ instantiate_template (tree tmpl, tree orig_args,
> > tsubst_flags_t complain)
> >   ++processing_template_decl;
> > if (DECL_CLASS_SCOPE_P (gen_tmpl))
> >   {
> > -  tree ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
> > -  complain, gen_tmpl, true);
> > +  tree ctx;
> > +  if (!uses_template_parms (DECL_CONTEXT (tmpl)))
> > +   /* If the context of the partially instantiated template is already
> > +  non-dependent, then we might as well use it.  */
> > +   ctx = DECL_CONTEXT (tmpl);
> > +  else
> > +   ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
> > +   complain, gen_tmpl, true);
> > push_nested_class (ctx);
> >   }
> >   
> 
> 



Re: [PATCH] c++: optimize specialization of nested class templates

2022-06-09 Thread Patrick Palka via Gcc-patches
On Thu, 9 Jun 2022, Jason Merrill wrote:

> On 6/8/22 14:21, Patrick Palka wrote:
> > When substituting a class template specialization, tsubst_aggr_type
> > substitutes the TYPE_CONTEXT before passing it to lookup_template_class.
> > This appears to be unnecessary, however, because the initial value
> > of lookup_template_class's context parameter is unused outside of the
> > IDENTIFIER_NODE case, and l_t_c performs its own substitution of the
> > context, anyway.  So this patch removes the redundant substitution in
> > tsubst_aggr_type.  Doing so causes us to ICE on template/nested5.C
> > because during lookup_template_class for A::C::D with T=E and S=S,
> > we substitute and complete the context A::C with T=E, which in turn
> > registers the desired dependent specialization of D for us and we end up
> > trying to register it again.  This patch fixes this by checking the
> > specializations table again after completion of the context.
> > 
> > This patch also implements a couple of other optimizations:
> > 
> >* In lookup_template_class, if the context of the partially
> >  instantiated template is already non-dependent, then we could
> >  reuse that instead of substituting the context of the most
> >  general template.
> >* When substituting the TYPE_DECL for an injected-class-name
> >  in tsubst_decl, we can avoid substituting its TREE_TYPE and
> >  DECL_TI_ARGS.
> > 
> > Together these optimizations improve memory usage for the range-v3
> > testcase test/view/split.cc by about 5%.  The improvement is probably
> > more significant when dealing with deeply nested class templates.
> > 
> > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
> > trunk?
> > 
> > gcc/cp/ChangeLog:
> > 
> > * pt.cc (lookup_template_class): Remove dead stores to
> > context parameter.  Don't substitute the context of the
> > most general template if that of the partially instantiated
> > template is non-dependent.  Check the specializations table
> > again after completing the context of a nested dependent
> > specialization.
> > (tsubst_aggr_type) : Don't substitute
> > TYPE_CONTEXT or pass it to lookup_template_class.
> > (tsubst_decl) : Avoid substituting the
> > TREE_TYPE and DECL_TI_ARGS when DECL_SELF_REFERENCE_P.
> > ---
> >   gcc/cp/pt.cc | 69 +++-
> >   1 file changed, 41 insertions(+), 28 deletions(-)
> > 
> > diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
> > index 59b94317e88..28023d60684 100644
> > --- a/gcc/cp/pt.cc
> > +++ b/gcc/cp/pt.cc
> > @@ -9840,8 +9840,6 @@ lookup_template_class (tree d1, tree arglist, tree
> > in_decl, tree context,
> >   if (context)
> > pop_decl_namespace ();
> > }
> > -  if (templ)
> > -   context = DECL_CONTEXT (templ);
> >   }
> > else if (TREE_CODE (d1) == TYPE_DECL && MAYBE_CLASS_TYPE_P (TREE_TYPE
> > (d1)))
> >   {
> > @@ -9868,7 +9866,6 @@ lookup_template_class (tree d1, tree arglist, tree
> > in_decl, tree context,
> >   {
> > templ = d1;
> > d1 = DECL_NAME (templ);
> > -  context = DECL_CONTEXT (templ);
> >   }
> > else if (DECL_TEMPLATE_TEMPLATE_PARM_P (d1))
> >   {
> > @@ -10059,8 +10056,25 @@ lookup_template_class (tree d1, tree arglist, tree
> > in_decl, tree context,
> > context = DECL_CONTEXT (gen_tmpl);
> > if (context && TYPE_P (context))
> > {
> > - context = tsubst_aggr_type (context, arglist, complain, in_decl,
> > true);
> > - context = complete_type (context);
> > + if (!uses_template_parms (DECL_CONTEXT (templ)))
> > +   /* If the context of the partially instantiated template is
> > +  already non-dependent, then we might as well use it.  */
> > +   context = DECL_CONTEXT (templ);
> > + else
> > +   {
> > + context = tsubst_aggr_type (context, arglist, complain, in_decl,
> > true);
> > + context = complete_type (context);
> > + if (is_dependent_type && arg_depth > 1)
> > +   {
> > + /* If this is a dependent nested specialization such as
> > +A::B, then completion of A might have
> > +registered this specialization of B for us, so check
> > +the table again (33959).  */
> > + entry = type_specializations->find_with_hash (&elt, hash);
> > + if (entry)
> > +   return entry->spec;
> > +   }
> > +   }
> > }
> > else
> > context = tsubst (context, arglist, complain, in_decl);
> > @@ -13711,25 +13725,12 @@ tsubst_aggr_type (tree t,
> > if (TYPE_TEMPLATE_INFO (t) && uses_template_parms (t))
> > {
> >   tree argvec;
> > - tree context;
> >   tree r;
> >   /* In "sizeof(X)" we need to evaluate "I".  */
> >   cp_evaluated ev;
> >   -   /* First, determine the context for the type we are looking
> > -up.  */
> > - 

[r13-1021 Regression] FAIL: gcc.target/i386/pr84101.c scan-tree-dump-not slp2 "optimized: basic block" on Linux/x86_64

2022-06-09 Thread skpandey--- via Gcc-patches
On Linux/x86_64,

269edf4e5e6ab489730038f7e3495550623179fe is the first bad commit
commit 269edf4e5e6ab489730038f7e3495550623179fe
Author: Cui,Lili 
Date:   Wed Jun 8 11:25:57 2022 +0800

Update {skylake,icelake,alderlake}_cost to add a bit preference to vector 
store.

caused

FAIL: gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c scan-tree-dump-not 
slp2 "basic block part vectorized"
FAIL: gcc.target/i386/pr84101.c scan-tree-dump-not slp2 "optimized: basic block"

with GCC configured with

../../gcc/configure 
--prefix=/local/skpandey/gccwork/toolwork/gcc-bisect-master/master/r13-1021/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="x86_64-costmodel-vect.exp=gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
 --target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr84101.c --target_board='unix{-m32\ 
-march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at skpgkp2 at gmail dot com)


Re: [committed] openmp: Add support for HBW or large capacity or interleaved memory through the libmemkind.so library

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Thu, Jun 09, 2022 at 06:07:20PM +0100, Richard Sandiford wrote:
> Dunno if this has already been reported, but I'm getting:
> 
> .../libgomp/config/linux/allocator.c:36:10: fatal error: 
> ../../../allocator.c: No such file or directory
>36 | #include "../../../allocator.c"
>   |  ^~
> 
> Should there be one less "../"?

Ouch, you're right.
I'm configuring with ../configure, dunno if that is the reason why it
happened to work for me.

Fixed up now, sorry.

2022-06-09  Jakub Jelinek  

* config/linux/allocator.c: Fix up #include directive.

--- libgomp/config/linux/allocator.c.jj
+++ libgomp/config/linux/allocator.c
@@ -33,4 +33,4 @@
 #define LIBGOMP_USE_MEMKIND
 #endif
 
-#include "../../../allocator.c"
+#include "../../allocator.c"


Jakub



Re: [committed] openmp: Add support for HBW or large capacity or interleaved memory through the libmemkind.so library

2022-06-09 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek via Gcc-patches  writes:
> Hi!
>
> This patch adds support for dlopening libmemkind.so on Linux and uses it
> for some kinds of allocations (but not yet e.g. pinned memory).
>
> Bootstrapped/regtested on x86_64-linux and i686-linux (with libmemkind
> around) and compile tested with LIBGOMP_USE_MEMKIND undefined, committed
> to trunk.
>
> 2022-06-09  Jakub Jelinek  
>
>   * allocator.c: Include dlfcn.h if LIBGOMP_USE_MEMKIND is defined.
>   (enum gomp_memkind_kind): New type.
>   (struct omp_allocator_data): Add memkind field if LIBGOMP_USE_MEMKIND
>   is defined.
>   (struct gomp_memkind_data): New type.
>   (memkind_data, memkind_data_once): New variables.
>   (gomp_init_memkind, gomp_get_memkind): New functions.
>   (omp_init_allocator): Initialize data.memkind, don't fail for
>   omp_high_bw_mem_space if libmemkind supports it.
>   (omp_aligned_alloc, omp_free, omp_aligned_calloc, omp_realloc): Add
>   memkind support of LIBGOMP_USE_MEMKIND is defined.
>   * config/linux/allocator.c: New file.

Dunno if this has already been reported, but I'm getting:

.../libgomp/config/linux/allocator.c:36:10: fatal error: ../../../allocator.c: 
No such file or directory
   36 | #include "../../../allocator.c"
  |  ^~

Should there be one less "../"?

Richard

> --- libgomp/allocator.c.jj2022-06-08 08:21:03.099446883 +0200
> +++ libgomp/allocator.c   2022-06-08 13:41:45.647133610 +0200
> @@ -31,9 +31,28 @@
>  #include "libgomp.h"
>  #include 
>  #include 
> +#ifdef LIBGOMP_USE_MEMKIND
> +#include 
> +#endif
>  
>  #define omp_max_predefined_alloc omp_thread_mem_alloc
>  
> +enum gomp_memkind_kind
> +{
> +  GOMP_MEMKIND_NONE = 0,
> +#define GOMP_MEMKIND_KINDS \
> +  GOMP_MEMKIND_KIND (HBW_INTERLEAVE),\
> +  GOMP_MEMKIND_KIND (HBW_PREFERRED), \
> +  GOMP_MEMKIND_KIND (DAX_KMEM_ALL),  \
> +  GOMP_MEMKIND_KIND (DAX_KMEM),  \
> +  GOMP_MEMKIND_KIND (INTERLEAVE),\
> +  GOMP_MEMKIND_KIND (DEFAULT)
> +#define GOMP_MEMKIND_KIND(kind) GOMP_MEMKIND_##kind
> +  GOMP_MEMKIND_KINDS,
> +#undef GOMP_MEMKIND_KIND
> +  GOMP_MEMKIND_COUNT
> +};
> +
>  struct omp_allocator_data
>  {
>omp_memspace_handle_t memspace;
> @@ -46,6 +65,9 @@ struct omp_allocator_data
>unsigned int fallback : 8;
>unsigned int pinned : 1;
>unsigned int partition : 7;
> +#ifdef LIBGOMP_USE_MEMKIND
> +  unsigned int memkind : 8;
> +#endif
>  #ifndef HAVE_SYNC_BUILTINS
>gomp_mutex_t lock;
>  #endif
> @@ -59,13 +81,95 @@ struct omp_mem_header
>void *pad;
>  };
>  
> +struct gomp_memkind_data
> +{
> +  void *memkind_handle;
> +  void *(*memkind_malloc) (void *, size_t);
> +  void *(*memkind_calloc) (void *, size_t, size_t);
> +  void *(*memkind_realloc) (void *, void *, size_t);
> +  void (*memkind_free) (void *, void *);
> +  int (*memkind_check_available) (void *);
> +  void **kinds[GOMP_MEMKIND_COUNT];
> +};
> +
> +#ifdef LIBGOMP_USE_MEMKIND
> +static struct gomp_memkind_data *memkind_data;
> +static pthread_once_t memkind_data_once = PTHREAD_ONCE_INIT;
> +
> +static void
> +gomp_init_memkind (void)
> +{
> +  void *handle = dlopen ("libmemkind.so", RTLD_LAZY);
> +  struct gomp_memkind_data *data;
> +  int i;
> +  static const char *kinds[] = {
> +NULL,
> +#define GOMP_MEMKIND_KIND(kind) "MEMKIND_" #kind
> +GOMP_MEMKIND_KINDS
> +#undef GOMP_MEMKIND_KIND
> +  };
> +
> +  data = calloc (1, sizeof (struct gomp_memkind_data));
> +  if (data == NULL)
> +{
> +  if (handle)
> + dlclose (handle);
> +  return;
> +}
> +  if (!handle)
> +{
> +  __atomic_store_n (&memkind_data, data, MEMMODEL_RELEASE);
> +  return;
> +}
> +  data->memkind_handle = handle;
> +  data->memkind_malloc
> += (__typeof (data->memkind_malloc)) dlsym (handle, "memkind_malloc");
> +  data->memkind_calloc
> += (__typeof (data->memkind_calloc)) dlsym (handle, "memkind_calloc");
> +  data->memkind_realloc
> += (__typeof (data->memkind_realloc)) dlsym (handle, "memkind_realloc");
> +  data->memkind_free
> += (__typeof (data->memkind_free)) dlsym (handle, "memkind_free");
> +  data->memkind_check_available
> += (__typeof (data->memkind_check_available))
> +  dlsym (handle, "memkind_check_available");
> +  if (data->memkind_malloc
> +  && data->memkind_calloc
> +  && data->memkind_realloc
> +  && data->memkind_free
> +  && data->memkind_check_available)
> +for (i = 1; i < GOMP_MEMKIND_COUNT; ++i)
> +  {
> + data->kinds[i] = (void **) dlsym (handle, kinds[i]);
> + if (data->kinds[i] && data->memkind_check_available (*data->kinds[i]))
> +   data->kinds[i] = NULL;
> +  }
> +  __atomic_store_n (&memkind_data, data, MEMMODEL_RELEASE);
> +}
> +
> +static struct gomp_memkind_data *
> +gomp_get_memkind (void)
> +{
> +  struct gomp_memkind_data *data
> += __atomic_load_n (&memkind_data, MEMMODEL_ACQUIRE);
> +  if (data)
> 

Re: [PATCH] c++: optimize specialization of nested class templates

2022-06-09 Thread Jason Merrill via Gcc-patches

On 6/8/22 14:21, Patrick Palka wrote:

When substituting a class template specialization, tsubst_aggr_type
substitutes the TYPE_CONTEXT before passing it to lookup_template_class.
This appears to be unnecessary, however, because the initial value
of lookup_template_class's context parameter is unused outside of the
IDENTIFIER_NODE case, and l_t_c performs its own substitution of the
context, anyway.  So this patch removes the redundant substitution in
tsubst_aggr_type.  Doing so causes us to ICE on template/nested5.C
because during lookup_template_class for A::C::D with T=E and S=S,
we substitute and complete the context A::C with T=E, which in turn
registers the desired dependent specialization of D for us and we end up
trying to register it again.  This patch fixes this by checking the
specializations table again after completion of the context.

This patch also implements a couple of other optimizations:

   * In lookup_template_class, if the context of the partially
 instantiated template is already non-dependent, then we could
 reuse that instead of substituting the context of the most
 general template.
   * When substituting the TYPE_DECL for an injected-class-name
 in tsubst_decl, we can avoid substituting its TREE_TYPE and
 DECL_TI_ARGS.

Together these optimizations improve memory usage for the range-v3
testcase test/view/split.cc by about 5%.  The improvement is probably
more significant when dealing with deeply nested class templates.

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
trunk?

gcc/cp/ChangeLog:

* pt.cc (lookup_template_class): Remove dead stores to
context parameter.  Don't substitute the context of the
most general template if that of the partially instantiated
template is non-dependent.  Check the specializations table
again after completing the context of a nested dependent
specialization.
(tsubst_aggr_type) : Don't substitute
TYPE_CONTEXT or pass it to lookup_template_class.
(tsubst_decl) : Avoid substituting the
TREE_TYPE and DECL_TI_ARGS when DECL_SELF_REFERENCE_P.
---
  gcc/cp/pt.cc | 69 +++-
  1 file changed, 41 insertions(+), 28 deletions(-)

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 59b94317e88..28023d60684 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -9840,8 +9840,6 @@ lookup_template_class (tree d1, tree arglist, tree 
in_decl, tree context,
  if (context)
pop_decl_namespace ();
}
-  if (templ)
-   context = DECL_CONTEXT (templ);
  }
else if (TREE_CODE (d1) == TYPE_DECL && MAYBE_CLASS_TYPE_P (TREE_TYPE (d1)))
  {
@@ -9868,7 +9866,6 @@ lookup_template_class (tree d1, tree arglist, tree 
in_decl, tree context,
  {
templ = d1;
d1 = DECL_NAME (templ);
-  context = DECL_CONTEXT (templ);
  }
else if (DECL_TEMPLATE_TEMPLATE_PARM_P (d1))
  {
@@ -10059,8 +10056,25 @@ lookup_template_class (tree d1, tree arglist, tree 
in_decl, tree context,
context = DECL_CONTEXT (gen_tmpl);
if (context && TYPE_P (context))
{
- context = tsubst_aggr_type (context, arglist, complain, in_decl, 
true);
- context = complete_type (context);
+ if (!uses_template_parms (DECL_CONTEXT (templ)))
+   /* If the context of the partially instantiated template is
+  already non-dependent, then we might as well use it.  */
+   context = DECL_CONTEXT (templ);
+ else
+   {
+ context = tsubst_aggr_type (context, arglist, complain, in_decl, 
true);
+ context = complete_type (context);
+ if (is_dependent_type && arg_depth > 1)
+   {
+ /* If this is a dependent nested specialization such as
+A::B, then completion of A might have
+registered this specialization of B for us, so check
+the table again (33959).  */
+ entry = type_specializations->find_with_hash (&elt, hash);
+ if (entry)
+   return entry->spec;
+   }
+   }
}
else
context = tsubst (context, arglist, complain, in_decl);
@@ -13711,25 +13725,12 @@ tsubst_aggr_type (tree t,
if (TYPE_TEMPLATE_INFO (t) && uses_template_parms (t))
{
  tree argvec;
- tree context;
  tree r;
  
  	  /* In "sizeof(X)" we need to evaluate "I".  */

  cp_evaluated ev;
  
-	  /* First, determine the context for the type we are looking

-up.  */
- context = TYPE_CONTEXT (t);
- if (context && TYPE_P (context))
-   {
- context = tsubst_aggr_type (context, args, complain,
- in_decl, /*entering_scope=*/1);
- /* If context is a nested class inside a class template,
-  

Re: [PATCH 2/1] c++: optimize specialization of templated member functions

2022-06-09 Thread Jason Merrill via Gcc-patches

On 6/9/22 09:00, Patrick Palka wrote:

This performs one of the optimizations added by the previous
patch to lookup_template_class, to instantiate_template as well.
(For the libstdc++ ranges tests this optimization appears to be
effective around 30% of the time, i.e. 30% of the time context of 'tmpl'
is non-dependent while the context of 'gen_tmpl' is dependent.)


If this is a significant optimization, how about doing it in 
tsubst_aggr_type rather than its callers?



gcc/cp/ChangeLog:

* pt.cc (instantiate_template): Don't substitute the context
of the most general template if that of the partially
instantiated template is non-dependent.
---
  gcc/cp/pt.cc | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index e021c254872..208daad298a 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -21661,8 +21661,14 @@ instantiate_template (tree tmpl, tree orig_args, 
tsubst_flags_t complain)
  ++processing_template_decl;
if (DECL_CLASS_SCOPE_P (gen_tmpl))
  {
-  tree ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
-  complain, gen_tmpl, true);
+  tree ctx;
+  if (!uses_template_parms (DECL_CONTEXT (tmpl)))
+   /* If the context of the partially instantiated template is already
+  non-dependent, then we might as well use it.  */
+   ctx = DECL_CONTEXT (tmpl);
+  else
+   ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
+   complain, gen_tmpl, true);
push_nested_class (ctx);
  }
  




Re: [PATCH v4, rs6000] Implemented f[min/max]_optab by xs[min/max]dp [PR103605]

2022-06-09 Thread Segher Boessenkool
On Thu, Jun 09, 2022 at 09:24:00AM +0800, HAO CHEN GUI wrote:
> On 8/6/2022 下午 9:24, Segher Boessenkool wrote:
> > But it regresses the code quality generated with -ffast-math (because
> > the new unspecs aren't optimised like standard rtl is).  This can be
> > follow-up work of course -- and the best direction is to make fmin/fmax
> > generic, even!  :-)
> 
> fmin/max will be folded to MIN/MAX_EXPR when fast-math is set. So the
> behavior doesn't change when fast-math is set.

Ah, good.  Should we then have an assert that there is no fast-math if
we ever get the rtl fmin/fmax stuff?


Segher


Re: [committed] RISC-V: Use a tab rather than space with FSFLAGS

2022-06-09 Thread Kito Cheng via Gcc-patches
Thanks :)


On Thu, Jun 9, 2022 at 9:35 PM Maciej W. Rozycki  wrote:
>
> Consistently use a tab rather than a space as the separator between the
> assembly instruction mnemonic and its operand with FSFLAGS instructions
> produced with the unordered FP comparison RTL insns.
>
> gcc/
> * config/riscv/riscv.md
> (*f_quiet4_default)
> (*f_quiet4_snan): Emit a tab
> rather than space with FSFLAGS.
> ---
> Hi,
>
>  Committed as obvious.
>
>   Maciej
> ---
>  gcc/config/riscv/riscv.md |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> gcc-riscv-fcmp-tab.diff
> Index: gcc/gcc/config/riscv/riscv.md
> ===
> --- gcc.orig/gcc/config/riscv/riscv.md
> +++ gcc/gcc/config/riscv/riscv.md
> @@ -2344,7 +2344,7 @@
>   QUIET_COMPARISON))
>  (clobber (match_scratch:X 3 "=&r"))]
>"TARGET_HARD_FLOAT && ! HONOR_SNANS (mode)"
> -  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags %3"
> +  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags\t%3"
>[(set_attr "type" "fcmp")
> (set_attr "mode" "")
> (set (attr "length") (const_int 12))])
> @@ -2357,7 +2357,7 @@
>   QUIET_COMPARISON))
>  (clobber (match_scratch:X 3 "=&r"))]
>"TARGET_HARD_FLOAT && HONOR_SNANS (mode)"
> -  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags 
> %3\n\tfeq.\tzero,%1,%2"
> +  
> "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags\t%3\n\tfeq.\tzero,%1,%2"
>[(set_attr "type" "fcmp")
> (set_attr "mode" "")
> (set (attr "length") (const_int 16))])


Re: [PATCH] RISC-V: Reset the length to the default of 4 for FP comparisons

2022-06-09 Thread Kito Cheng via Gcc-patches
LGTM, *f_quiet4_default and
*f_quiet4_snan have set their own
lengths, and the only user of this setting is
*cstore4, but apparently the length is 4 for that,
not 8.

Thanks!

On Thu, Jun 9, 2022 at 9:36 PM Maciej W. Rozycki  wrote:
>
> The default length for floating-point compare operations is overridden
> to 8, however the FEQ.fmt, FLT.fmt, FLE.fmt machine instructions and
> FGE.fmt, FGT.fmt assembly idioms the relevant RTL insns produce are all
> 4 bytes long each.  And all the floating-point compare RTL insns that
> produce multiple machine instructions explicitly set their lengths.
>
> Remove the override then, letting the default of 4 apply for the single
> instruction case.
>
> gcc/
> * config/riscv/riscv.md (length): Remove the explicit setting
> for "fcmp".
> ---
> Hi,
>
>  So for:
>
> int
> feq (float x, float y)
> {
>   return x == y;
> }
>
> we get:
>
> .globl  feq
> .type   feq, @function
> feq:
> feq.s   a0,fa0,fa1  # 15[c=4 l=8]  *cstoresfdi4
> ret # 24[c=0 l=4]  simple_return
> .size   feq, .-feq
>
> which is obviously wrong given:
>
> Disassembly of section .text:
>
>  :
>0:   a0b52553feq.s   a0,fa0,fa1
>4:   8082ret
>
> (hmm tabs are odd here too, but that's a binutils issue).  I note that the
> override has always been there since the RISC-V port landed, so I take it
> it's a missed leftover from an earlier situation.
>
>  With the change in place we instead get:
>
> .globl  feq
> .type   feq, @function
> feq:
> feq.s   a0,fa0,fa1  # 15[c=4 l=4]  *cstoresfdi4
> ret # 24[c=0 l=4]  simple_return
> .size   feq, .-feq
>
> which I find so relieving.
>
>  No regressions in the testsuite (and I haven't checked how it affects
> instruction scheduling, especially with `-Os', but I think it's obviously
> correct really).  OK to apply?
>
>   Maciej
> ---
>  gcc/config/riscv/riscv.md |2 --
>  1 file changed, 2 deletions(-)
>
> gcc-riscv-fcmp-length.diff
> Index: gcc/gcc/config/riscv/riscv.md
> ===
> --- gcc.orig/gcc/config/riscv/riscv.md
> +++ gcc/gcc/config/riscv/riscv.md
> @@ -231,8 +231,6 @@
>
>   (eq_attr "got" "load") (const_int 8)
>
> - (eq_attr "type" "fcmp") (const_int 8)
> -
>   ;; SHIFT_SHIFTs are decomposed into two separate instructions.
>   (eq_attr "move_type" "shift_shift")
> (const_int 8)


Re: [PATCH v2 01/11] OpenMP 5.0: Clause ordering for OpenMP 5.0 (topological sorting by base pointer)

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Wed, Jun 08, 2022 at 04:00:39PM +0100, Julian Brown wrote:
> > I think big question is if we do want to do this map clause reordering
> > before processing the  omp target etc. clauses, or after (during
> > gimplify_adjust_omp_clauses, when clauses from the implicit mappings
> > are added too and especially with the declare mapper expansions),
> > or both before and after.
> 
> The existing code constrains us a bit here, unless we want to
> completely rewrite it!
> 
> We can only do sorting on clauses before gimplification, otherwise the
> "structural" matching of the parsed syntax of base pointers inside other
> clauses on the directive, etc. will certainly fail.
> 
> (Semi-relatedly, I asked this on the omp-lang mailing list:
> 
>   "When we have mappings that represent base pointers, and other
>   mappings that use those base pointers, the former must be ordered to
>   take place before the latter -- but should we determine that relation
>   purely syntactically? How about if we write e.g. "p->" on one vs.
>   "(*p)." on the other?"
> 
> but no reply...)
> 
> So, this is fine for sorting explicit mapping clauses. When planning
> the approach I've used for "declare mapper" support, I wrote this (in
> an internal email):
> 
> "At the moment, gimplifying OMP workshare regions proceeds in three
> phases:
> 
>  1. Clauses are processed (gimplify_scan_omp_clauses), creating
> records of mapped variables in a splay tree, with associated flags.
> 
>  2. The body of the workshare region is processed (gimplified),
> augmenting the same splay tree with information about variables
> which are used implicitly (and maybe also modifying the "explicit"
> mappings from the first step).
> 
>  3. The clauses are modified based on the results of the second stage
> (gimplify_adjust_omp_clauses). E.g. clauses are removed that refer
> to variables that aren't actually used in the region, or new
> clauses created for implicitly-referenced variables without mapping
> clauses on the construct.
> 
> The problem with this with regards to mappers is that the "expanded"
> mappers should undergo some of the processing we currently perform
> during phase 1 (struct sibling list handling, and so on), but we don't
> know which variables are implicitly referenced until phase 2.
> 
> [description of a plan that didn't work removed]
> 
> So the new plan is to do:
> 
> phase 1  (scan original clauses)
> phase 2  (scan workshare body)
> phase 1  (use variables from "2" to instantiate mappers, and process
>   new clauses only. Prepend new list to original clauses)
> phase 3  (as before)
> 
> I was concerned that this would upset the sorting code -- but I think
> actually, as long as implicitly-created clauses are inserted at the
> front of the clause list, there can't be a case where a pointer base is
> mapped after a use of that base. If that assumption turns out to be
> wrong, then things might get a little more complicated."
> 
> ...and so far, the plan seems to be working out. The assumption, to
> state it in other words, is that an implicitly-added map clause *cannot*
> have a dependency on an explicit map clause, in terms of relying on a
> base pointer in that explicit clause, by construction.

I don't think there is any need to add extra phases, but we can move
some code from gimplify_scan_omp_clauses to gimplify_adjust_omp_clauses.
What must be done in gimplify_scan_omp_clauses is stuff that will or
could affect the gimplification of the region's body, in that phase 2
we want to know say that some variable was privatized explicitly or
explicitly mapped or none of that, so we can based on that decide if we
should note implicit data sharing or implicit mapping etc.
But e.g. the sorting of the OMP_CLAUSE_MAP clauses is something that can
IMHO be deferred until we have all those clauses; it is probably done
in gimplify_scan_omp_clauses right now only because the sorting, at least
initially, was needed only for struct mapping (map (tofrom: a.b, a.c, a.d.e,
a.d.f)), which could appear only explicitly, not implicitly; implicit mapping
would only map the whole var.
But declare mapper changes this substantially, declare mapper can add
similar mappings even from the implicit maps.
So, I think we should keep in phase 1 for OMP_CLAUSE_MAP only the stuff that
perhaps gimplifies some expressions used in those and puts records about
them into splay trees and sorting and ideally some kind of merging of
adjacent mappings can be done only when we have even the implicit
mappings all collected (so that would be after
  splay_tree_foreach (ctx->variables, gimplify_adjust_omp_clauses_1, &data);
finishes).

Jakub



Re: [Patch] OpenMP: Move omp requires checks to libgomp

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Thu, Jun 09, 2022 at 02:46:34PM +0200, Tobias Burnus wrote:
> On 09.06.22 13:40, Jakub Jelinek via Gcc-patches wrote:
> > On Wed, Jun 08, 2022 at 05:56:02AM +0200, Tobias Burnus wrote:
> > > + && lookup_attribute ("omp declare target",
> > > +  DECL_ATTRIBUTES (current_function_decl)))
> > > +   omp_requires_mask
> > > + = (enum omp_requires) (omp_requires_mask | 
> > > OMP_REQUIRES_TARGET_USED);
> > I must admit it is unclear what the
> > "must appear lexically before any device constructs or device routines."
> > restriction actually means for device routines.
> > Is that lexically before definition of such device routines, or even their
> > declarations?
> I have similar issues – also for Fortran (and C++) module use. Hence, I
> had filled https://github.com/OpenMP/spec/issues/3240 (not publicly
> accessible); I added your issues to the list.
> > The above patch snippet is I believe for function definitions that were
> > marked as declare target before the definition somehow (another decl for
> > it merged with the new one or in between the begin/end).  And is true
> > even for device_type (host), to rule that out you'd need to check for
> > "omp declare target host" attribute not being present.
> > I'm not against the above snippet perhaps adjusted for device_type(host),
> > but IMHO we want clarifications from omp-lang
> How to proceed for now? And is 'omp_is_initial_device()' on the host a
> device function or not? It can be hard-coded to 'true' ...

If it is from me, I bet it was because of that (mis)understanding that
device routines are device-related runtime API calls.
I'd suggest only marking in the patch what is clear (which is device
constructs) and deferring the rest until it is clarified.

> > For Fortran, is the above mostly not needed because requires need to be in
> > the specification part and device constructs are executable and appear in
> > the part after it?  Do we allow requires in BLOCK's specification part?
> We don't allow it in BLOCK – but there are issues related to USE-ing
> modules, cf. OpenMP issue.

Ack.

> In terms of parsing, it makes no difference – contrary to
> 'unified_shared_memory', where the parser could decide not to add
> implicit mapping, the compiler part is not affected by API calls.

Yeah.  So perhaps on the standard side we should just keep the
lexically before device constructs (and metadirective/declare variant
device related resolution) in the restriction, but say that TUs
that have device constructs and device runtime APIs (or whatever is agreed)
imply that requires mask must be the same in all of them.

> > Shouldn't the vars in that section be const, so that it is a read-only
> > section?
> > 
> > Is unsigned_type_node what we want (say wouldn't be just unsigned_char_node
> > be enough, currently we just need 3 bits).
> 
> Probably - that would be 8 bits, leaving 5 spare. I have not checked what
> Andrew et al. do with the pinned-memory support by -f, but
> that will likely use only 1 to 3 bits, if any.

If it is SHF_MERGE, even 16-bit or 32-bit wouldn't be the end of the world,
or if it is in LTO streamed out stuff, we can use a bitpack for it...

> > Also, wonder if for HAVE_GAS_SHF_MERGE && flag_merge_constants
> > we shouldn't try to make that section mergeable.  If it goes away during
> > linking and is replaced by something, then it doesn't matter, but otherwise,
> > as we don't record which TU had what flags, all we care about is that
> > there were some TUs which used device construct/routines (and device APIs?)
> > and used bitmask 7, other TUs that used bitmask 3 and others that used
> > bitmask 4.
> (maybe – I am not sure about this, either.)
> > @@ -442,6 +463,14 @@ omp_finish_file (void)
> >   }
> > else
> >   {
> > +#ifndef ACCEL_COMPILER
> > +  if (flag_openmp
> > +   && (omp_requires_mask & OMP_REQUIRES_TARGET_USED)
> > +   && (omp_requires_mask & (OMP_REQUIRES_UNIFIED_ADDRESS
> > +| OMP_REQUIRES_UNIFIED_SHARED_MEMORY
> > +| OMP_REQUIRES_REVERSE_OFFLOAD)))
> > + sorry ("OpenMP device offloading is not supported for this target");
> > +#endif
> > I don't understand this snippet.  Without named sections on the host,
> > I bet we simply don't support offloading at all,
> > the record_offload_symbol target hook is only non-trivially defined
> > for nvptx and nvptx isn't typical host for OpenMP offloading,
> > because we don't remember it anywhere.
> 
> I thought that would address your: "This probably needs to sorry if the
> target doesn't support named sections. We probably don't support LTO in
> that case either though."

But sorry means we will fail to compile it.  Perhaps
inform would be better, but then we don't complain (warn/inform)
if no offloading targets are configured.  And, presence of requires
unified*/reverse_offload  as the reason for the diagnostics rather than
say presence of declare targe

[ping][vect-patterns] Refactor widen_plus/widen_minus as internal_fns

2022-06-09 Thread Joel Hutton via Gcc-patches
> Before I make any changes, I'd like to check we're all on the same page.
> 
> richi, are you ok with the gimple_build function, perhaps with a different
> name if you are concerned with overloading? we could use gimple_ch_build
> or gimple_code_helper_build?
> 
> Similarly are you ok with the use of gimple_extract_op? I would lean towards
> using it as it is cleaner, but I don't have strong feelings.
> 
> Joel

Ping. Just looking for some confirmation before I rework this patch. It would 
be good to get some agreement on this as Tamar is blocked on this patch.

Joel



> -Original Message-
> From: Joel Hutton
> Sent: 07 June 2022 10:02
> To: Richard Sandiford 
> Cc: Richard Biener ; gcc-patches@gcc.gnu.org
> Subject: RE: [ping][vect-patterns] Refactor widen_plus/widen_minus as
> internal_fns
> 
> Thanks Richard,
> 
> > I thought the potential problem with the above is that gimple_build is
> > a folding interface, so in principle it's allowed to return an
> > existing SSA_NAME set by an existing statement (or even a constant).
> > I think in this context we do need to force a new statement to be created.
> 
> Before I make any changes, I'd like to check we're all on the same page.
> 
> richi, are you ok with the gimple_build function, perhaps with a different
> name if you are concerned with overloading? we could use gimple_ch_build
> or gimple_code_helper_build?
> 
> Similarly are you ok with the use of gimple_extract_op? I would lean towards
> using it as it is cleaner, but I don't have strong feelings.
> 
> Joel
> 
> > -Original Message-
> > From: Richard Sandiford 
> > Sent: 07 June 2022 09:18
> > To: Joel Hutton 
> > Cc: Richard Biener ; gcc-patches@gcc.gnu.org
> > Subject: Re: [ping][vect-patterns] Refactor widen_plus/widen_minus as
> > internal_fns
> >
> > Joel Hutton  writes:
> > >> > Patches attached. They already incorporated the .cc rename, now
> > >> > rebased to be after the change to tree.h
> > >>
> > >> @@ -1412,8 +1412,7 @@ vect_recog_widen_op_pattern (vec_info
> *vinfo,
> > >>2, oprnd, half_type, unprom, vectype);
> > >>
> > >>tree var = vect_recog_temp_ssa_var (itype, NULL);
> > >> -  gimple *pattern_stmt = gimple_build_assign (var, wide_code,
> > >> - oprnd[0], oprnd[1]);
> > >> +  gimple *pattern_stmt = gimple_build (var, wide_code, oprnd[0],
> > >> oprnd[1]);
> > >>
> > >>
> > >> you should be able to do without the new gimple_build overload by
> > >> using
> > >>
> > >>gimple_seq stmts = NULL;
> > >>gimple_build (&stmts, wide_code, itype, oprnd[0], oprnd[1]);
> > >>gimple *pattern_stmt = gimple_seq_last_stmt (stmts);
> > >>
> > >> because 'gimple_build' is an existing API.
> > >
> > > Done.
> > >
> > > The gimple_build overload was at the request of Richard Sandiford, I
> > assume removing it is ok with you Richard S?
> > > From Richard Sandiford:
> > >> For example, I think we should hide this inside a new:
> > >>
> > >>   gimple_build (var, wide_code, oprnd[0], oprnd[1]);
> > >>
> > >> that works directly on code_helper, similarly to the new
> > >> code_helper gimple_build interfaces.
> >
> > I thought the potential problem with the above is that gimple_build is
> > a folding interface, so in principle it's allowed to return an
> > existing SSA_NAME set by an existing statement (or even a constant).
> > I think in this context we do need to force a new statement to be created.
> >
> > Of course, the hope is that there wouldn't still be such folding
> > opportunities at this stage, but I don't think it's guaranteed
> > (especially with options fuzzing).
> >
> > Since I was mentioned :-) ...
> >
> > Could you run the patch through contrib/check_GNU_style.py?
> > There seem to be a few long lines.
> >
> > > +  if (res_op.code.is_tree_code ())
> >
> > Do you need this is_tree_code ()?  These comparisons…
> >
> > > +  {
> > > +  widen_arith = (code == WIDEN_PLUS_EXPR
> > > +  || code == WIDEN_MINUS_EXPR
> > > +  || code == WIDEN_MULT_EXPR
> > > +  || code == WIDEN_LSHIFT_EXPR);
> >
> > …ought to be safe unconditionally.
> >
> > > + }
> > > +  else
> > > +  widen_arith = false;
> > > +
> > > +  if (!widen_arith
> > > +  && !CONVERT_EXPR_CODE_P (code)
> > > +  && code != FIX_TRUNC_EXPR
> > > +  && code != FLOAT_EXPR)
> > > +return false;
> > >
> > >/* Check types of lhs and rhs.  */
> > > -  scalar_dest = gimple_assign_lhs (stmt);
> > > +  scalar_dest = gimple_get_lhs (stmt);
> > >lhs_type = TREE_TYPE (scalar_dest);
> > >vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> > >
> > > @@ -4938,10 +4951,14 @@ vectorizable_conversion (vec_info *vinfo,
> > >
> > >if (op_type == binary_op)
> > >  {
> > > -  gcc_assert (code == WIDEN_MULT_EXPR || code ==
> > WIDEN_LSHIFT_EXPR
> > > -   || code == WIDEN_PLUS_EXPR || code ==
> > WIDEN_MINUS_EXPR);
> > > +  gcc_assert (code == WIDEN_MULT_EXPR
> > > +   |

[PATCH] RISC-V: Split unordered FP comparisons into individual RTL insns

2022-06-09 Thread Maciej W. Rozycki
We have unordered FP comparisons implemented as RTL insns that produce 
multiple machine instructions.  Such RTL insns are hard to match with a 
processor pipeline description and additionally there is a redundant 
SNEZ instruction produced on the result of these comparisons even though 
the FLT.fmt and FLE.fmt machine instructions already produce either 0 or 
1, e.g.:

long
flt (double x, double y)
{
  return __builtin_isless (x, y);
}

with `-O2 -fno-finite-math-only -fno-signaling-nans' gets compiled to:

.globl  flt
.type   flt, @function
flt:
frflags a5
flt.d   a0,fa0,fa1
fsflags a5
snez    a0,a0
ret
.size   flt, .-flt

because the middle end can't see through the UNSPEC operation that unordered 
FP comparisons have been defined in terms of.

These instructions are only produced via an expander already, so change 
the expander to emit individual RTL insns for each machine instruction 
in the ultimate sequence produced rather than deferring to a 
single RTL insn producing the whole sequence at once.

gcc/
* config/riscv/riscv.md (UNSPECV_FSNVSNAN): New constant.
(QUIET_PATTERN): New int attribute.
(f_quiet4): Emit the intended 
RTL insns entirely within the preparation statements.
(*f_quiet4_default)
(*f_quiet4_snan): Remove 
insns.
(*riscv_fsnvsnan2): New insn.

gcc/testsuite/
* gcc.target/riscv/fle-ieee.c: New test.
* gcc.target/riscv/fle-snan.c: New test.
* gcc.target/riscv/fle.c: New test.
* gcc.target/riscv/flef-ieee.c: New test.
* gcc.target/riscv/flef-snan.c: New test.
* gcc.target/riscv/flef.c: New test.
* gcc.target/riscv/flt-ieee.c: New test.
* gcc.target/riscv/flt-snan.c: New test.
* gcc.target/riscv/flt.c: New test.
* gcc.target/riscv/fltf-ieee.c: New test.
* gcc.target/riscv/fltf-snan.c: New test.
* gcc.target/riscv/fltf.c: New test.
---
Hi,

 I think it is a step in the right direction, however ultimately I think 
we ought to actually tell GCC about the IEEE exception flags, so that the 
compiler can track data dependencies and we do not have to resort to 
UNSPECs which the compiler cannot see through.  E.g. for a piece of code 
like:

long
fltlt (double x, double y, double z)
{
  return __builtin_isless (x, y) + __builtin_isless (x, z);
}

(using an addition here for clarity because for a logical operation even 
more horror is produced) we get:

.globl  fltlt
.type   fltlt, @function
fltlt:
frflags a5  # 8 [c=4 l=4]  riscv_frflags
flt.d   a0,fa0,fa1  # 9 [c=4 l=4]  *cstoredfdi4
fsflags a5  # 10[c=0 l=4]  riscv_fsflags
frflags a4  # 16[c=4 l=4]  riscv_frflags
flt.d   a5,fa0,fa2  # 17[c=4 l=4]  *cstoredfdi4
fsflags a4  # 18[c=0 l=4]  riscv_fsflags
addw    a0,a0,a5        # 30[c=8 l=4]  *addsi3_extended/0
ret # 40[c=0 l=4]  simple_return
.size   fltlt, .-fltlt

where the middle FSFLAGS/FRFLAGS pair makes no sense of course and is a 
waste of both space and cycles.

 I'm yet running some benchmarking to see if the use of UNSPEC_VOLATILEs 
makes any noticeable performance difference, but I suspect it does not as 
the compiler could not do much about the original multiple-instruction 
single RTL insns anyway.

 No regressions with the GCC (with and w/o `-fsignaling-nans') and glibc 
testsuites (as per commit 1fcbfb00fc67 ("RISC-V: Fix -fsignaling-nans for 
glibc testsuite.")).  OK to apply?

  Maciej
---
 gcc/config/riscv/riscv.md  |   67 +++--
 gcc/testsuite/gcc.target/riscv/fle-ieee.c  |   12 +
 gcc/testsuite/gcc.target/riscv/fle-snan.c  |   12 +
 gcc/testsuite/gcc.target/riscv/fle.c   |   12 +
 gcc/testsuite/gcc.target/riscv/flef-ieee.c |   12 +
 gcc/testsuite/gcc.target/riscv/flef-snan.c |   12 +
 gcc/testsuite/gcc.target/riscv/flef.c  |   12 +
 gcc/testsuite/gcc.target/riscv/flt-ieee.c  |   12 +
 gcc/testsuite/gcc.target/riscv/flt-snan.c  |   12 +
 gcc/testsuite/gcc.target/riscv/flt.c   |   12 +
 gcc/testsuite/gcc.target/riscv/fltf-ieee.c |   12 +
 gcc/testsuite/gcc.target/riscv/fltf-snan.c |   12 +
 gcc/testsuite/gcc.target/riscv/fltf.c  |   12 +
 13 files changed, 179 insertions(+), 32 deletions(-)

gcc-riscv-fcmp-split.diff
Index: gcc/gcc/config/riscv/riscv.md
===
--- gcc.orig/gcc/config/riscv/riscv.md
+++ gcc/gcc/config/riscv/riscv.md
@@ -57,6 +57,7 @@
   ;; Floating-point unspecs.
   UNSPECV_FRFLAGS
   UNSPECV_FSFLAGS
+  UNSPECV_FSNVSNAN
 
   ;; Interrupt handler instructions.
   UNSPECV_MRET
@@ -360,6 +361,7 @@
 ;; Iterator and attributes for quiet comparisons.
 (define_int_iterator QUIET_COMPARISON [UNSPEC_FLT_QUIET UNSPEC_F

[PATCH] RISC-V: Reset the length to the default of 4 for FP comparisons

2022-06-09 Thread Maciej W. Rozycki
The default length for floating-point compare operations is overridden 
to 8, however the FEQ.fmt, FLT.fmt, FLE.fmt machine instructions and 
FGE.fmt, FGT.fmt assembly idioms the relevant RTL insns produce are all 
4 bytes long each.  And all the floating-point compare RTL insns that 
produce multiple machine instructions explicitly set their lengths.

Remove the override then, letting the default of 4 apply for the single 
instruction case.

gcc/
* config/riscv/riscv.md (length): Remove the explicit setting 
for "fcmp".
---
Hi,

 So for:

int
feq (float x, float y)
{
  return x == y;
}

we get:

.globl  feq
.type   feq, @function
feq:
feq.s   a0,fa0,fa1  # 15[c=4 l=8]  *cstoresfdi4
ret # 24[c=0 l=4]  simple_return
.size   feq, .-feq

which is obviously wrong given:

Disassembly of section .text:

0000000000000000 <feq>:
   0:   a0b52553                feq.s   a0,fa0,fa1
   4:   8082                    ret

(hmm tabs are odd here too, but that's a binutils issue).  I note that the 
override has always been there since the RISC-V port landed, so I take it 
it's a missed leftover from an earlier situation.

 With the change in place we instead get:

.globl  feq
.type   feq, @function
feq:
feq.s   a0,fa0,fa1  # 15[c=4 l=4]  *cstoresfdi4
ret # 24[c=0 l=4]  simple_return
.size   feq, .-feq

which I find so relieving.

 No regressions in the testsuite (and I haven't checked how it affects 
instruction scheduling, especially with `-Os', but I think it's obviously 
correct really).  OK to apply?

  Maciej
---
 gcc/config/riscv/riscv.md |2 --
 1 file changed, 2 deletions(-)

gcc-riscv-fcmp-length.diff
Index: gcc/gcc/config/riscv/riscv.md
===
--- gcc.orig/gcc/config/riscv/riscv.md
+++ gcc/gcc/config/riscv/riscv.md
@@ -231,8 +231,6 @@
 
  (eq_attr "got" "load") (const_int 8)
 
- (eq_attr "type" "fcmp") (const_int 8)
-
  ;; SHIFT_SHIFTs are decomposed into two separate instructions.
  (eq_attr "move_type" "shift_shift")
(const_int 8)


[committed] RISC-V: Use a tab rather than space with FSFLAGS

2022-06-09 Thread Maciej W. Rozycki
Consistently use a tab rather than a space as the separator between the 
assembly instruction mnemonic and its operand with FSFLAGS instructions 
produced with the unordered FP comparison RTL insns.

gcc/
* config/riscv/riscv.md 
(*f_quiet4_default)
(*f_quiet4_snan): Emit a tab 
rather than space with FSFLAGS.
---
Hi,

 Committed as obvious.

  Maciej
---
 gcc/config/riscv/riscv.md |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

gcc-riscv-fcmp-tab.diff
Index: gcc/gcc/config/riscv/riscv.md
===
--- gcc.orig/gcc/config/riscv/riscv.md
+++ gcc/gcc/config/riscv/riscv.md
@@ -2344,7 +2344,7 @@
  QUIET_COMPARISON))
 (clobber (match_scratch:X 3 "=&r"))]
   "TARGET_HARD_FLOAT && ! HONOR_SNANS (mode)"
-  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags %3"
+  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags\t%3"
   [(set_attr "type" "fcmp")
(set_attr "mode" "")
(set (attr "length") (const_int 12))])
@@ -2357,7 +2357,7 @@
  QUIET_COMPARISON))
 (clobber (match_scratch:X 3 "=&r"))]
   "TARGET_HARD_FLOAT && HONOR_SNANS (mode)"
-  "frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags 
%3\n\tfeq.\tzero,%1,%2"
+  
"frflags\t%3\n\tf.\t%0,%1,%2\n\tfsflags\t%3\n\tfeq.\tzero,%1,%2"
   [(set_attr "type" "fcmp")
(set_attr "mode" "")
(set (attr "length") (const_int 16))])


c++: Better module initializer code

2022-06-09 Thread Nathan Sidwell

Every module interface needs to emit a global initializer, but it
might have nothing to init.  In those cases, there's no need for any
idempotency boolean to be emitted.
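
For instance (a hypothetical interface unit, only to illustrate the case):

// Nothing here needs dynamic initialization and nothing is imported, so
// the mandated global initializer is empty and the 'already run' boolean
// can be omitted.
export module m;
export inline int add1 (int x) { return x + 1; }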

nathan

--
Nathan Sidwell

From 227ffed7dbbdffeeb5bc013852d61a97aa468c62 Mon Sep 17 00:00:00 2001
From: Nathan Sidwell 
Date: Wed, 8 Jun 2022 11:25:14 -0700
Subject: [PATCH] c++: Better module initializer code

Every module interface needs to emit a global initializer, but it
might have nothing to init.  In those cases, there's no need for any
idempotency boolean to be emitted.

	gcc/cp
	* cp-tree.h (module_initializer_kind): Replace with ...
	(module_global_init_needed, module_has_import_inits): ...
	these.
	* decl2.cc (start_objects): Add has_body parm.  Reorganize
	module initializer creation.
	(generate_ctor_or_dtor_function): Adjust.
	(c_parse_final_cleanups): Adjust.
	(vtv_start_verification_constructor_init_function): Adjust.
	* module.cc (module_initializer_kind): Replace with ...
	(module_global_init_needed, module_has_import_inits): ...
	these.

	gcc/testsuite/
	* g++.dg/modules/init-2_a.C: Check no idempotency.
	* g++.dg/modules/init-2_b.C: Check idempotency.
---
 gcc/cp/cp-tree.h|  3 ++-
 gcc/cp/decl2.cc | 32 +
 gcc/cp/module.cc| 23 +-
 gcc/testsuite/g++.dg/modules/init-2_a.C |  2 ++
 gcc/testsuite/g++.dg/modules/init-2_b.C |  2 ++
 5 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index 3d8a08b8dd7..a5d93282167 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -7179,7 +7179,8 @@ extern module_state *get_module (tree name, module_state *parent = NULL,
  bool partition = false);
 extern bool module_may_redeclare (tree decl);
 
-extern int module_initializer_kind ();
+extern bool module_global_init_needed ();
+extern bool module_has_import_inits ();
 extern void module_add_import_initializers ();
 
 /* Where the namespace-scope decl was originally declared.  */
diff --git a/gcc/cp/decl2.cc b/gcc/cp/decl2.cc
index bfb6a32e3b6..9de9a7a4f8a 100644
--- a/gcc/cp/decl2.cc
+++ b/gcc/cp/decl2.cc
@@ -55,7 +55,7 @@ int raw_dump_id;
  
 extern cpp_reader *parse_in;
 
-static tree start_objects (bool, unsigned);
+static tree start_objects (bool, unsigned, bool);
 static tree finish_objects (bool, unsigned, tree);
 static tree start_partial_init_fini_fn (bool, unsigned, unsigned);
 static void finish_partial_init_fini_fn (tree);
@@ -3848,15 +3848,13 @@ generate_tls_wrapper (tree fn)
 /* Start a global constructor or destructor function.  */
 
 static tree
-start_objects (bool initp, unsigned priority)
+start_objects (bool initp, unsigned priority, bool has_body)
 {
-  int module_init = 0;
-
-  if (priority == DEFAULT_INIT_PRIORITY && initp)
-module_init = module_initializer_kind ();
-
+  bool default_init = initp && priority == DEFAULT_INIT_PRIORITY;
+  bool is_module_init = default_init && module_global_init_needed ();
   tree name = NULL_TREE;
-  if (module_init > 0)
+
+  if (is_module_init)
 name = mangle_module_global_init (0);
   else
 {
@@ -3880,7 +3878,7 @@ start_objects (bool initp, unsigned priority)
   tree fntype =	build_function_type (void_type_node, void_list_node);
   tree fndecl = build_lang_decl (FUNCTION_DECL, name, fntype);
   DECL_CONTEXT (fndecl) = FROB_CONTEXT (global_namespace);
-  if (module_init > 0)
+  if (is_module_init)
 {
   SET_DECL_ASSEMBLER_NAME (fndecl, name);
   TREE_PUBLIC (fndecl) = true;
@@ -3905,8 +3903,10 @@ start_objects (bool initp, unsigned priority)
 
   tree body = begin_compound_stmt (BCS_FN_BODY);
 
-  if (module_init > 0)
+  bool has_import_inits = default_init && module_has_import_inits ();
+  if (is_module_init && (has_import_inits || has_body))
 {
+  // If the function is going to be empty, don't emit idempotency.
   // 'static bool __in_chrg = false;
   // if (__inchrg) return;
   // __inchrg = true
@@ -3930,7 +3930,7 @@ start_objects (bool initp, unsigned priority)
   finish_expr_stmt (assign);
 }
 
-  if (module_init)
+  if (has_import_inits)
 module_add_import_initializers ();
 
   return body;
@@ -4321,7 +4321,7 @@ generate_ctor_or_dtor_function (bool initp, unsigned priority,
 {
   input_location = locus;
 
-  tree body = start_objects (initp, priority);
+  tree body = start_objects (initp, priority, bool (fns));
 
   /* To make sure dynamic construction doesn't access globals from other
  compilation units where they might not be yet constructed, for
@@ -4359,7 +4359,9 @@ generate_ctor_or_dtor_function (bool initp, unsigned priority,
   if (initp && (flag_sanitize & SANITIZE_ADDRESS))
 finish_expr_stmt (asan_dynamic_init_call (/*after_p=*/true));
 
-  /* Close out the function.  */
+  /* Close out the function, and arrange for it to be called at init
+ or fini time.  (Even module initializer functions need this, as
+ we cannot guarantee the module

[PATCH 2/1] c++: optimize specialization of templated member functions

2022-06-09 Thread Patrick Palka via Gcc-patches
This applies one of the optimizations added to lookup_template_class by
the previous patch to instantiate_template as well.
(For the libstdc++ ranges tests this optimization appears to be
effective around 30% of the time, i.e. 30% of the time the context of 'tmpl'
is non-dependent while the context of 'gen_tmpl' is dependent.)
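
(As a hypothetical sketch of that situation, where 'gen_tmpl' is the most
general template and 'tmpl' the partial instantiation:

template<typename T>
struct outer
{
  template<typename U> void f (U);  // gen_tmpl: context outer<T> is dependent
};

void
use (outer<int> &o)
{
  o.f (42);  // tmpl: outer<int>::f, whose context outer<int> is already
             // non-dependent, so it can be reused instead of substituting
             // into the dependent context of the most general template
}
)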

gcc/cp/ChangeLog:

* pt.cc (instantiate_template): Don't substitute the context
of the most general template if that of the partially
instantiated template is non-dependent.
---
 gcc/cp/pt.cc | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index e021c254872..208daad298a 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -21661,8 +21661,14 @@ instantiate_template (tree tmpl, tree orig_args, 
tsubst_flags_t complain)
 ++processing_template_decl;
   if (DECL_CLASS_SCOPE_P (gen_tmpl))
 {
-  tree ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
-  complain, gen_tmpl, true);
+  tree ctx;
+  if (!uses_template_parms (DECL_CONTEXT (tmpl)))
+   /* If the context of the partially instantiated template is already
+  non-dependent, then we might as well use it.  */
+   ctx = DECL_CONTEXT (tmpl);
+  else
+   ctx = tsubst_aggr_type (DECL_CONTEXT (gen_tmpl), targ_ptr,
+   complain, gen_tmpl, true);
   push_nested_class (ctx);
 }
 
-- 
2.36.1.363.g9c897eef06



Re: [PATCH] c++: Fix up ICE on __builtin_shufflevector constexpr evaluation [PR105871]

2022-06-09 Thread Jason Merrill via Gcc-patches

On 6/8/22 02:08, Jakub Jelinek wrote:

Hi!

As the following testcase shows, BIT_FIELD_REF result doesn't have to have
just integral type, it can also have vector type.  And in that case
cxx_eval_bit_field_ref just ICEs on it because it is unprepared for that
case: it creates the initial value with build_int_cst (sure, that one could
easily be replaced with build_zero_cst) and then expects it can come up with
the final value through shifts, ands and ors, but that doesn't work for
vectors.

We already call fold_ternary if whole is a VECTOR_CST, this patch does the
same if the result doesn't have integral type.  And, there is no guarantee
fold_ternary will succeed and the callers certainly don't expect NULL
being returned, so it also diagnoses those as non-constant and returns
original t in that case.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?


OK.


2022-06-08  Jakub Jelinek  

PR c++/105871
* constexpr.cc (cxx_eval_bit_field_ref): For BIT_FIELD_REF with
non-integral result type use fold_ternary too like for BIT_FIELD_REFs
from VECTOR_CST.  If fold_ternary returns NULL, diagnose non-constant
expression, set *non_constant_p and return t, instead of returning
NULL.

* g++.dg/pr105871.C: New test.

--- gcc/cp/constexpr.cc.jj  2022-06-03 11:20:13.0 +0200
+++ gcc/cp/constexpr.cc 2022-06-07 13:43:13.157127740 +0200
@@ -4198,9 +4198,16 @@ cxx_eval_bit_field_ref (const constexpr_
if (*non_constant_p)
  return t;
  
-  if (TREE_CODE (whole) == VECTOR_CST)

-return fold_ternary (BIT_FIELD_REF, TREE_TYPE (t), whole,
-TREE_OPERAND (t, 1), TREE_OPERAND (t, 2));
+  if (TREE_CODE (whole) == VECTOR_CST || !INTEGRAL_TYPE_P (TREE_TYPE (t)))
+{
+  if (tree r = fold_ternary (BIT_FIELD_REF, TREE_TYPE (t), whole,
+TREE_OPERAND (t, 1), TREE_OPERAND (t, 2)))
+   return r;
+  if (!ctx->quiet)
+   error ("%qE is not a constant expression", orig_whole);
+  *non_constant_p = true;
+  return t;
+}
  
start = TREE_OPERAND (t, 2);

istart = tree_to_shwi (start);
--- gcc/testsuite/g++.dg/pr105871.C.jj  2022-06-07 13:56:02.743241969 +0200
+++ gcc/testsuite/g++.dg/pr105871.C 2022-06-07 13:56:29.042975525 +0200
@@ -0,0 +1,12 @@
+// PR c++/105871
+// { dg-do compile }
+// { dg-options "-Wno-psabi" }
+
+typedef __attribute__((__vector_size__ ( 1))) unsigned char U;
+typedef __attribute__((__vector_size__ (16))) unsigned char V;
+
+U
+foo (void)
+{
+  return __builtin_shufflevector ((U){}, (V){}, 0);
+}

Jakub





Re: [Patch] OpenMP: Move omp requires checks to libgomp

2022-06-09 Thread Tobias Burnus

On 09.06.22 13:40, Jakub Jelinek via Gcc-patches wrote:

On Wed, Jun 08, 2022 at 05:56:02AM +0200, Tobias Burnus wrote:

+ && lookup_attribute ("omp declare target",
+  DECL_ATTRIBUTES (current_function_decl)))
+   omp_requires_mask
+ = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);

I must admit it is unclear what the
"must appear lexically before any device constructs or device routines."
restriction actually means for device routines.
Is that lexically before definition of such device routines, or even their
declarations?

I have similar issues – also for Fortran (and C++) module use. Hence, I
had filled https://github.com/OpenMP/spec/issues/3240 (not publicly
accessible); I added your issues to the list.

The above patch snippet is I believe for function definitions that were
marked as declare target before the definition somehow (another decl for
it merged with the new one or in between the begin/end).  And is true
even for device_type (host), to rule that out you'd need to check for
"omp declare target host" attribute not being present.
I'm not against the above snippet perhaps adjusted for device_type(host),
but IMHO we want clarifications from omp-lang

How to proceed for now? And is 'omp_is_initial_device()' on the host a
device function or not? It can be hard-coded to 'true' ...

[...]
target update is also a device construct and the above snippet hasn't been
added for it, ditto for interop which we don't implement yet.
But, my preference would be instead of adding these snippets to
c_parser_omp_target_{data,enter_data,exit_data,update} etc. move it from
c_parser_omp_target to c_parser_omp_all_clauses:
   if (flag_openmp
   && (mask & (OMP_CLAUSE_MASK_1 << PRAGMA_OMP_CLAUSE_DEVICE)) != 0)
 omp_requires_mask
   = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);
(somewhere at the start of the function), because the definition of device
constructs is exactly like that:
"device construct An OpenMP construct that accepts the device clause."


Makes sense.

[C++ cases]


Ditto.
For Fortran, is the above mostly not needed because requires need to be in
the specification part and device constructs are executable and appear in
the part after it?  Do we allow requires in BLOCK's specification part?

We don't allow it in BLOCK – but there are issues related to USE-ing
modules, cf. OpenMP issue.

--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -3644,6 +3644,9 @@ gimplify_call_expr (tree *expr_p, gimple_seq *pre_p, bool 
want_value)
+  if (fndecl && flag_openmp && omp_runtime_api_call (fndecl, true))
+omp_requires_mask
+  = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);

I'm sure device APIs were discussed, but I must be blind and I can't find it
in either 5.0, 5.1 or 5.2.  All I see is device constructs or device routines
in those places where I'd also look for device related OpenMP runtime
library APIs.  Though, if some routine calls omp_get_num_devices (),
certainly the library at that point needs to know
reverse_offload/unified_shared_memory/etc. requires because that determines
how many devices it has.  So, what have I missed (aka on which place in the
standard the above snippet is based on)?


It is based on your review comments from last year ("Something I miss in
the patch is that for the device API calls") plus what requires some
device initialization. But otherwise, I also did not see it.

In terms of parsing, it makes no difference – contrary to
'unified_shared_memory', where the parser could decide not to add
implicit mapping, the compiler part is not affected by API calls.

I cannot really make up my mind whether it should be required in this
case or not. Maybe, it is not needed.


+ const char *requires_section = ".gnu.gomp_requires";

+  tree maskvar = build_decl (UNKNOWN_LOCATION, VAR_DECL,
+ get_identifier (".gomp_requires_mask"),
+ unsigned_type_node);
+  SET_DECL_ALIGN (maskvar, TYPE_ALIGN (unsigned_type_node));

Don't we want also DECL_USER_ALIGN (maskvar) = 1; so that
we never try to increase its alignment?

Probably yes.

Is it an allocated section, or should it be better non-allocated and then
dealt with by mkoffload?

Shouldn't the vars in that section be const, so that it is a read-only
section?

Is unsigned_type_node what we want (say wouldn't be just unsigned_char_node
be enough, currently we just need 3 bits).


Probably - that would be 8 bits, leaving 5 spare. I have not checked what
Andrew et al. do with the pinned-memory support by -f, but
that will likely use only 1 to 3 bits, if any.


Also, wonder if for HAVE_GAS_SHF_MERGE && flag_merge_constants
we shouldn't try to make that section mergeable.  If it goes away during
linking and is replaced by something, then it doesn't matter, but otherwise,
as we don't record which TU had what flags, all we care about is that
there were s

Re: [PATCH, OpenMP, v4] Implement uses_allocators clause for target regions

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Thu, Jun 09, 2022 at 02:21:13PM +0800, Chung-Lin Tang wrote:
> @@ -15651,6 +15653,213 @@ c_parser_omp_clause_allocate (c_parser *parser, 
> tree list)
>return nl;
>  }
>  
> +/* OpenMP 5.0:
> +   uses_allocators ( allocator-list )
> +
> +   allocator-list:
> +   allocator
> +   allocator , allocator-list
> +   allocator ( traits-array )
> +   allocator ( traits-array ) , allocator-list
> +
> +   OpenMP 5.2:
> +
> +   uses_allocators ( modifier : allocator-list )

Please drop the -list above.

> +   uses_allocators ( modifier , modifier : allocator-list )

and here too.

> +  struct item_tok
> +  {
> +location_t loc;
> +tree id;
> +item_tok (void) : loc (UNKNOWN_LOCATION), id (NULL_TREE) {}
> +  };
> +  struct item { item_tok name, arg; };
> +  auto_vec *modifiers = NULL, *allocators = NULL;
> +  auto_vec *cur_list = new auto_vec (4);

I was hoping you'd drop all this.
See https://gcc.gnu.org/r13-1002
for implementation (both C and C++ FE) of something very similar,
the only difference there is that in the case of linear clause, it is
looking for
val
ref
uval
step ( whatever )
followed by , or )
(and ref and uval not in the C FE),
while you are looking for
memspace ( whatever )
traits ( whatever )
followed by : or by , (in case of , repeat).
But in both cases you can actually use the same parser APIs
for raw token pre-parsing to just compute whether it is the modifier
syntax or not and set bool has_modifiers based on that (when you
come across probably-valid syntax followed by CPP_COLON).
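
(For illustration only - hypothetical names, and whether GCC accepts this
exact spelling of course depends on the patch under review:

#include <omp.h>

void
f (void)
{
  omp_alloctrait_t t[] = { { omp_atk_alignment, 64 } };
  omp_allocator_handle_t my_alloc;

  /* 5.0 form: allocator optionally followed by ( traits-array ).  */
  #pragma omp target uses_allocators (my_alloc (t))
    ;

  /* 5.2 form: modifier(s), then ':', then the allocator; the raw-token
     pre-parse only needs to spot the CPP_COLON to pick this branch.  */
  #pragma omp target uses_allocators (memspace (omp_default_mem_space), traits (t) : my_alloc)
    ;
}
)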

Jakub



Re: [Patch] OpenMP: Handle ancestor:1 with discover_declare_target

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Thu, Jun 09, 2022 at 12:47:17PM +0200, Tobias Burnus wrote:
> Another minor step to getting reverse offloading to work ...
> 
> OK for mainline?
> 
> Tobias
> 
> PS: As attached, this patch is a stand-alone patch, which fails
> due to the requires sorry (see dg-prune-output).
> With the requires patch, it should fail with the next sorry:
>   'ancestor' not yet supported"
> (For Fortran, it currently fails already in the FE as the
> sorry prevents generating tree code. Thus, there cannot be
> a Fortran check until the requires patch is in.)

> OpenMP: Handle ancestor:1 with discover_declare_target
> 
> gcc/
>   * omp-offload.cc (omp_discover_declare_target_tgt_fn_r,
>   omp_discover_declare_target_fn_r): Don't walk reverse-offload
>   target regions.
> 
> gcc/testsuite/
>   * c-c++-common/gomp/reverse-offload-1.c: New.

LGTM, thanks.

Jakub



Re: [committed] openmp: Add support for HBW or large capacity or interleaved memory through the libmemkind.so library

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Thu, Jun 09, 2022 at 12:11:28PM +0200, Thomas Schwinge wrote:
> On 2022-06-09T10:19:03+0200, Jakub Jelinek via Gcc-patches 
>  wrote:
> > This patch adds support for dlopening libmemkind.so
> 
> Instead of 'dlopen'ing literally 'libmemkind.so':
> 
> > --- libgomp/allocator.c.jj2022-06-08 08:21:03.099446883 +0200
> > +++ libgomp/allocator.c   2022-06-08 13:41:45.647133610 +0200
> 
> > +  void *handle = dlopen ("libmemkind.so", RTLD_LAZY);
> 
> ..., shouldn't this instead 'dlopen' 'libmemkind.so.0'?  At least for
> Debian/Ubuntu, the latter ('libmemkind.so.0') is shipped in the "library"
> package:

I agree and I actually noticed it too right before committing, but I thought
I'd investigate and tweak incrementally because "libmemkind.so"
is what I've actually tested (it is what llvm libomp uses).

> 
> $ apt-file list libmemkind0 | grep -F libmemkind.so
> libmemkind0: /usr/lib/x86_64-linux-gnu/libmemkind.so.0
> libmemkind0: /usr/lib/x86_64-linux-gnu/libmemkind.so.0.0.1
> 
> ..., but the former ('libmemkind.so') only in the "development" package:
> 
> $ apt-file list libmemkind-dev | grep -F libmemkind.so
> libmemkind-dev: /usr/lib/x86_64-linux-gnu/libmemkind.so
> 
> ..., which users of GCC/libgomp shouldn't have to care about.

Similarly in Fedora memkind package provides just
/usr/lib64/libautohbw.so.0
/usr/lib64/libautohbw.so.0.0.0
/usr/lib64/libmemkind.so.0
/usr/lib64/libmemkind.so.0.0.1
/usr/lib64/libmemtier.so.0
/usr/lib64/libmemtier.so.0.0.0
and
/usr/lib64/libautohbw.so
/usr/lib64/libmemkind.so
/usr/lib64/libmemtier.so
comes from memkind-devel.
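
For illustration, the incremental tweak could be as simple as preferring the
versioned soname and only falling back to the unversioned name (a sketch
only, not the actual libgomp code):

#include <dlfcn.h>

static void *
open_memkind (void)
{
  /* The runtime packages ship libmemkind.so.0; the unversioned name only
     comes with the -devel/-dev packages.  */
  void *handle = dlopen ("libmemkind.so.0", RTLD_LAZY);
  if (handle == NULL)
    handle = dlopen ("libmemkind.so", RTLD_LAZY);
  return handle;
}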

> Any plans about test cases for this?  (Not trivial, I suppose?)

That is the hard part.
All the testing I've done so far were for atv_interleaved:
#include 

int
main ()
{
  omp_alloctrait_t traits[3]
= { { omp_atk_alignment, 64 },
{ omp_atk_fallback, omp_atv_null_fb },
{ omp_atk_partition, omp_atv_interleaved } };
  omp_allocator_handle_t a;

  a = omp_init_allocator (omp_default_mem_space, 3, traits);
  if (a == omp_null_allocator)
return 1;
  void *p = omp_alloc (128, a);
  if (!p)
return 2;
  void *q = omp_realloc (p, 256, a, a);
  if (!q)
return 3;
  void *r = omp_calloc (1, 512, a);
  if (!r)
return 4;
  omp_free (q, a);
  omp_free (r, a);
  return 0;
}
because that is something that works even on my devel WS, though
in the testcase one doesn't figure out if memkind was actually available and
whether the memory was indeed interleaved or not, just that it works
(I could certainly also store some data and read them back after realloc,
and also test one without omp_atk_alignment which effectively prevents
memkind_realloc from being called and uses allocation + deallocation), but
that is it.  I've actually stepped through in the debugger to verify
memkind_* is called...

Now for HBW memory, some googling around and brief look at the memkind
source shows that it probably supports just Intel Xeon Phi HBW memory,
I'm waiting for access to such a box right now but it might take a few days.

For the DAX stuff, I admit I don't know what exactly it is (what kind of hw
it needs).

> > --- libgomp/config/linux/allocator.c.jj   2022-06-08 08:58:23.197078191 
> > +0200
> > +++ libgomp/config/linux/allocator.c  2022-06-08 09:39:15.108410730 +0200
> > @@ -0,0 +1,36 @@
> 
> > +#define _GNU_SOURCE
> > +#include "libgomp.h"
> > +#if defined(PLUGIN_SUPPORT) && defined(LIBGOMP_USE_PTHREADS)
> > +#define LIBGOMP_USE_MEMKIND
> > +#endif
> > +
> > +#include "../../../allocator.c"
> 
> Given this use of 'PLUGIN_SUPPORT' (and thus 'dlopen' etc.) for something
> different than libgomp plugins (offloading), might move 'DL_LIBS',
> 'PLUGIN_SUPPORT' from 'libgomp/plugin/configfrag.ac' into
> 'libgomp/configure.ac', and 'libgomp_la_LIBADD += $(DL_LIBS)' from
> 'libgomp/plugin/Makefrag.am' into 'libgomp/Makefile.am'.

Maybe, but libgomp/plugin/configfrag.ac is included unconditionally
and the memkind support is some kind of plugin too, just not offloading
plugin, but allocator plugin...
Didn't want to spend too much time on it and PLUGIN_SUPPORT
is right now solely about dlsym exists and -ldl works and has been added.

Jakub



Re: [Patch] OpenMP: Move omp requires checks to libgomp

2022-06-09 Thread Jakub Jelinek via Gcc-patches
On Wed, Jun 08, 2022 at 05:56:02AM +0200, Tobias Burnus wrote:
> gcc/c/ChangeLog:
> 
>   * c-parser.cc (c_parser_declaration_or_fndef): Set
>   OMP_REQUIRES_TARGET_USED in omp_requires_mask if function has
>   "omp declare target" attribute.
>   (c_parser_omp_target_data): Set OMP_REQUIRES_TARGET_USED in
>   omp_requires_mask.
>   (c_parser_omp_target_enter_data): Likewise.
>   (c_parser_omp_target_exit_data): Likewise.
>   (c_parser_omp_requires): Remove sorry.
> 
> gcc/cp/ChangeLog:
> 
>   * parser.cc (cp_parser_simple_declaration): Set
>   OMP_REQUIRES_TARGET_USED in omp_requires_mask if function has
>   "omp declare target" attribute.
>   (cp_parser_omp_target_data): Set OMP_REQUIRES_TARGET_USED in
>   omp_requires_mask.
>   (cp_parser_omp_target_enter_data): Likewise.
>   (cp_parser_omp_target_exit_data): Likewise.
>   (cp_parser_omp_requires): Remove sorry.
> 
> gcc/fortran/ChangeLog:
> 
>   * openmp.cc (gfc_match_omp_requires): Remove "not implemented yet".
>   * parse.cc: Include "tree.h" and "omp-general.h".
>   (gfc_parse_file): Add code to merge omp_requires to omp_requires_mask.
> 
> gcc/ChangeLog:
> 
>   * omp-general.h (omp_runtime_api_call): New prototype.
>   * omp-general.cc (omp_runtime_api_call): Added device_api_only arg
>   and moved from ...
>   * omp-low.cc (omp_runtime_api_call): ... here.
>   (scan_omp_1_stmt): Update call.
>   * gimplify.cc (gimplify_call_expr): Call omp_runtime_api_call.
>   * omp-offload.cc (omp_finish_file): Add code to create OpenMP requires
>   mask variable in .gnu.gomp_requires section, if needed.
> 
> include/ChangeLog:
> 
>   * gomp-constants.h (GOMP_REQUIRES_UNIFIED_ADDRESS,
>   GOMP_REQUIRES_UNIFIED_SHARED_MEMORY,
>   GOMP_REQUIRES_REVERSE_OFFLOAD): New.
> 
> libgcc/ChangeLog:
> 
>   * offloadstuff.c (__requires_mask_table, __requires_mask_table_end):
>   New symbols to mark start and end of the .gnu.gomp_requires section.
> 
> 
> libgomp/ChangeLog:
> 
>   * libgomp-plugin.h (GOMP_OFFLOAD_get_num_devices): Add
>   omp_requires_mask arg.
>   * plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Likewise;
>   return -1 when device available but omp_requires_mask != 0.
>   * plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices): Likewise.
>   * oacc-host.c (host_get_num_devices, host_openacc_get_property):
>   Update call.
>   * oacc-init.c (resolve_device, acc_init_1, acc_shutdown_1,
>   goacc_attach_host_thread_to_device, acc_get_num_devices,
>   acc_set_device_num, get_property_any): Likewise.
>   * target.c: (__requires_mask_table, __requires_mask_table_end):
>   Declare weak extern symbols.
>   (gomp_requires_to_name): New.
>   (gomp_target_init): Add code to check .gnu.gomp_requires section
>   mask values for inconsistencies; warn when requirements make an
>   existing device unsupported.
>   * testsuite/libgomp.c-c++-common/requires-1-aux.c: New test.
>   * testsuite/libgomp.c-c++-common/requires-1.c: New test.
>   * testsuite/libgomp.c-c++-common/requires-2-aux.c: New test.
>   * testsuite/libgomp.c-c++-common/requires-2.c: New test.
> 
> liboffloadmic/ChangeLog:
> 
>   * plugin/libgomp-plugin-intelmic.cpp (GOMP_OFFLOAD_get_num_devices):
>   Return -1 when device available but omp_requires_mask != 0.
> 
> gcc/testsuite/ChangeLog:
> 
>   * c-c++-common/gomp/requires-4.c: Update dg-*.
>   * c-c++-common/gomp/target-device-ancestor-2.c: Likewise.
>   * c-c++-common/gomp/target-device-ancestor-3.c: Likewise.
>   * c-c++-common/gomp/target-device-ancestor-4.c: Likewise.
>   * c-c++-common/gomp/target-device-ancestor-5.c: Likewise.
>   * gfortran.dg/gomp/target-device-ancestor-3.f90: Likewise.
>   * gfortran.dg/gomp/target-device-ancestor-4.f90: Likewise.
>   * gfortran.dg/gomp/target-device-ancestor-2.f90: Likewise. Move post-FE
>   checks to ...
>   * gfortran.dg/gomp/target-device-ancestor-2a.f90: ... this new file.

> +  if (flag_openmp
> + && lookup_attribute ("omp declare target",
> +  DECL_ATTRIBUTES (current_function_decl)))
> +   omp_requires_mask
> + = (enum omp_requires) (omp_requires_mask | 
> OMP_REQUIRES_TARGET_USED);

I must admit it is unclear what the
"must appear lexically before any device constructs or device routines."
restriction actually means for device routines.
Is that lexically before definition of such device routines, or even their
declarations?

It wouldn't surprise me if some library packages eventually started adding
declare target directives in some headers around external declarations;
should that be the point after which we don't allow requires directives?
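(Purely hypothetical illustration of what such a header might look like:

/* some-vendor-math.h */
#pragma omp declare target
extern double vendor_dot (const double *, const double *, int);
#pragma omp end declare target

included into a TU before its requires directive.)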

On the other side, for the definitions, we don't need to know when parsing
the definition whether it is a device routine.

void
foo (void)
{
}
#pragma omp declar

Re: [PATCH] or1k: Add support for a little-endian target variant

2022-06-09 Thread Stafford Horne via Gcc-patches
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index c5064dd37666..0c3a09dfe810 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -2866,6 +2866,11 @@ or1k*-*-*)
>   done
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>  
> + case ${target} in
> + or1k*le*-*)

Should this be just or1kle*-*?

> + tm_defines="${tm_defines} TARGET_LITTLE_ENDIAN_DEFAULT=1"
> + ;;
> + esac
>   case ${target} in
>   or1k*-*-linux*)
>   tm_file="${tm_file} gnu-user.h linux.h glibc-stdint.h"


> diff --git a/gcc/config/or1k/or1k.opt b/gcc/config/or1k/or1k.opt
> index 8a66832a99b1..497f259faae9 100644
> --- a/gcc/config/or1k/or1k.opt
> +++ b/gcc/config/or1k/or1k.opt
> @@ -24,6 +24,14 @@
>  HeaderInclude
>  config/or1k/or1k-opts.h
>  
> +mbig-endian
> +Target Report RejectNegative Mask(BIG_ENDIAN)
> +Use big-endian byte order.
> +
> +mlittle-endian
> +Target Report RejectNegative InverseMask(BIG_ENDIAN, LITTLE_ENDIAN)
> +Use little-endian byte order.
> +

We should explain what the default is in the docs.

Can you also document this in gcc/doc/invoke.texi?

This looks good, thank you.

-Stafford


[Patch] OpenMP: Handle ancestor:1 with discover_declare_target

2022-06-09 Thread Tobias Burnus

Another minor step to getting reverse offloading to work ...

OK for mainline?

Tobias

PS: As attached, this patch is a stand-alone patch, which fails
due to the requires sorry (see dg-prune-output).
With the requires patch, it should fail with the next sorry:
  'ancestor' not yet supported"
(For Fortran, it currently fails already in the FE as the
sorry prevents generating tree code. Thus, there cannot be
a Fortran check until the requires patch is in.)
OpenMP: Handle ancestor:1 with discover_declare_target

gcc/
	* omp-offload.cc (omp_discover_declare_target_tgt_fn_r,
	omp_discover_declare_target_fn_r): Don't walk reverse-offload
	target regions.

gcc/testsuite/
	* c-c++-common/gomp/reverse-offload-1.c: New.

diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index ad4e772015e..fcbe6cf83d8 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -268,12 +268,12 @@ omp_discover_declare_target_tgt_fn_r (tree *tp, int *walk_subtrees, void *data)
 }
   else if (TYPE_P (*tp))
 *walk_subtrees = 0;
-  /* else if (TREE_CODE (*tp) == OMP_TARGET)
-   {
-	 if (tree dev = omp_find_clause (OMP_TARGET_CLAUSES (*tp)))
-	   if (OMP_DEVICE_ANCESTOR (dev))
-	 *walk_subtrees = 0;
-   } */
+  else if (TREE_CODE (*tp) == OMP_TARGET)
+{
+  tree c = omp_find_clause (OMP_CLAUSES (*tp), OMP_CLAUSE_DEVICE);
+  if (c && OMP_CLAUSE_DEVICE_ANCESTOR (c))
+	*walk_subtrees = 0;
+}
   return NULL_TREE;
 }
 
@@ -284,10 +284,11 @@ omp_discover_declare_target_fn_r (tree *tp, int *walk_subtrees, void *data)
 {
   if (TREE_CODE (*tp) == OMP_TARGET)
 {
-  /* And not OMP_DEVICE_ANCESTOR.  */
-  walk_tree_without_duplicates (&OMP_TARGET_BODY (*tp),
-omp_discover_declare_target_tgt_fn_r,
-data);
+  tree c = omp_find_clause (OMP_CLAUSES (*tp), OMP_CLAUSE_DEVICE);
+  if (!c || !OMP_CLAUSE_DEVICE_ANCESTOR (c))
+	walk_tree_without_duplicates (&OMP_TARGET_BODY (*tp),
+  omp_discover_declare_target_tgt_fn_r,
+  data);
   *walk_subtrees = 0;
 }
   else if (TYPE_P (*tp))
diff --git a/gcc/testsuite/c-c++-common/gomp/reverse-offload-1.c b/gcc/testsuite/c-c++-common/gomp/reverse-offload-1.c
new file mode 100644
index 000..9a3fa5230f8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/gomp/reverse-offload-1.c
@@ -0,0 +1,93 @@
+/* { dg-additional-options "-fdump-tree-omplower" } */
+
+/* { dg-final { scan-tree-dump-times "omp declare target\[^ \]" 3 "omplower" } }  */
+
+/* { dg-final { scan-tree-dump-times "__attribute__\\(\\(omp declare target\\)\\)\[\n\r\]*int called_in_target1" 1 "omplower" } }  */
+/* { dg-final { scan-tree-dump-times "__attribute__\\(\\(omp declare target\\)\\)\[\n\r\]*int called_in_target2" 1 "omplower" } }  */
+/* { dg-final { scan-tree-dump-times "__attribute__\\(\\(omp declare target, omp declare target block\\)\\)\[\n\r\]*void tg_fn" 1 "omplower" } }  */
+
+/* { dg-prune-output "'reverse_offload' clause on 'requires' directive not supported yet" } */
+
+#pragma omp requires reverse_offload
+
+extern int add_3 (int);
+
+static int global_var = 5;
+
+void
+check_offload (int *x, int *y)
+{
+  *x = add_3 (*x);
+  *y = add_3 (*y);
+}
+
+int
+called_in_target1 ()
+{
+  return 42;
+}
+
+int
+called_in_target2 ()
+{
+  return -6;
+}
+
+#pragma omp declare target
+void
+tg_fn (int *x, int *y)
+{
+  int x2 = *x, y2 = *y;
+  if (x2 != 2 || y2 != 3)
+__builtin_abort ();
+  x2 = x2 + 2 + called_in_target1 ();
+  y2 = y2 + 7;
+
+  #pragma omp target device(ancestor : 1) map(tofrom: x2)
+check_offload(&x2, &y2);
+
+  if (x2 != 2+2+3+42 || y2 != 3 + 7)
+__builtin_abort ();
+  *x = x2, *y = y2;
+}
+#pragma omp end declare target
+
+void
+my_func (int *x, int *y)
+{
+  if (global_var != 5)
+__builtin_abort ();
+  global_var = 242;
+  *x = 2*add_3(*x);
+  *y = 3*add_3(*y);
+}
+
+int
+main ()
+{
+  #pragma omp target
+  {
+ int x = 2, y = 3;
+ tg_fn (&x, &y);
+  }
+
+  #pragma omp target
+  {
+ int x = -2, y = -1;
+ x += called_in_target2 ();
+ #pragma omp target device ( ancestor:1 ) firstprivate(y) map(tofrom:x)
+ {
+   if (x != -2-6 || y != -1)
+ __builtin_abort ();
+   my_func (&x, &y);
+   if (x != 2*(3-2) || y != 3*(3-1))
+ __builtin_abort ();
+ }
+ if (x != 2*(3-2) || y != -1)
+   __builtin_abort ();
+  }
+
+  if (global_var != 242)
+__builtin_abort ();
+  return 0;
+}


Re: [committed][nvptx] Add march-map

2022-06-09 Thread Thomas Schwinge
Hi Tom!

On 2022-03-29T14:03:22+0200, Tom de Vries via Gcc-patches 
 wrote:
> Say we have an sm_50 board, and we want to run a benchmark using the highest
> possible march setting.
>
> Currently there's march=sm_30, march=sm_35, march=sm_53, but no march=sm_50.
>
> So, we'd need to pick march=sm_35.
>
> Likewise, for a test script that handles multiple boards, we'd need a mapping
> from native board sm_xx to march, which might have to be updated with newer
> gcc releases.

ACK.

> Add an option march-map, such that we can just specify march-map=sm_50, and
> let the compiler map this to the appropriate march.

So, I understand that the idea is that users should use
'-march-map=[...]' instead of '-misa=[...]' or its alias '-march=[...]',
because the former ('-march-map=[...]') will always Do The Right Thing:
pick the best available SM level for GCC/nvptx code generation (like
you've said: it may change with GCC releases, and users then don't have to
change their recipes), and it'll never error out, in contrast to what
'-misa=[...]' or its alias '-march=[...]' do when the requested architecture
isn't directly supported:

xgcc: error: unrecognized argument in option ‘-misa=sm_50’
xgcc: note: valid arguments to ‘-misa=’ are: sm_30 sm_35 sm_53 sm_70 sm_75 
sm_80; did you mean ‘sm_30’?

My question, though, is: why did you add a new option name '-march-map'
instead of directly using '-march' for that (instead of "[nvptx] Add
march alias for misa", added on the same day)?  Would you accept a patch
that: (a) un-aliases '-march' -> '-misa', and (b) renames '-march-map' to
'-march', and (c) sets up a '-march-map' -> '-march' alias for backwards
compatibility (if so desired)?  Regarding (a), (b), in my opinion,
there's no backwards compatibility issue there: the "new '-march'" will
simply accept more options than the "old '-march'" did.  Regarding (c),
I'd even drop the user-visible '-march-map' option completely; I suppose
nobody's really used that by now?


Grüße
 Thomas


> The option is implemented as a list of aliases, such that we have a somewhat
> lengthy (17 lines in total):
> ...
> $ gcc --help=target
>   ...
>   -march-map=sm_30Same as -misa=sm_30.
>   -march-map=sm_32Same as -misa=sm_30.
>   ...
>   -march-map=sm_87Same as -misa=sm_80.
>   -march-map=sm_90Same as -misa=sm_80.
> ...
>
> This implementation was chosen in the hope that it'll be easier if
> we end up with some misa multilib.
>
> It would be nice to have the mapping list generated from an updated
> nvptx-sm.def, but for now it's spelled out in nvptx.opt.
>
> Tested on nvptx.
>
> Committed to trunk.
>
> Thanks,
> - Tom
>
> [nvptx] Add march-map
>
> gcc/ChangeLog:
>
> 2022-03-29  Tom de Vries  
>
>   PR target/104714
>   * config/nvptx/nvptx.opt (march-map=*): Add aliases.
>
> gcc/testsuite/ChangeLog:
>
> 2022-03-29  Tom de Vries  
>
>   PR target/104714
>   * gcc.target/nvptx/march-map.c: New test.
>
> ---
>  gcc/config/nvptx/nvptx.opt | 51 
> ++
>  gcc/testsuite/gcc.target/nvptx/march-map.c |  5 +++
>  2 files changed, 56 insertions(+)
>
> diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
> index b5d0170e9e9..58eddeeabf4 100644
> --- a/gcc/config/nvptx/nvptx.opt
> +++ b/gcc/config/nvptx/nvptx.opt
> @@ -60,6 +60,57 @@ march=
>  Target RejectNegative Joined Alias(misa=)
>  Alias:
>
> +march-map=sm_30
> +Target RejectNegative Alias(misa=,sm_30)
> +
> +march-map=sm_32
> +Target RejectNegative Alias(misa=,sm_30)
> +
> +march-map=sm_35
> +Target RejectNegative Alias(misa=,sm_35)
> +
> +march-map=sm_37
> +Target RejectNegative Alias(misa=,sm_35)
> +
> +march-map=sm_50
> +Target RejectNegative Alias(misa=,sm_35)
> +
> +march-map=sm_52
> +Target RejectNegative Alias(misa=,sm_35)
> +
> +march-map=sm_53
> +Target RejectNegative Alias(misa=,sm_53)
> +
> +march-map=sm_60
> +Target RejectNegative Alias(misa=,sm_53)
> +
> +march-map=sm_61
> +Target RejectNegative Alias(misa=,sm_53)
> +
> +march-map=sm_62
> +Target RejectNegative Alias(misa=,sm_53)
> +
> +march-map=sm_70
> +Target RejectNegative Alias(misa=,sm_70)
> +
> +march-map=sm_72
> +Target RejectNegative Alias(misa=,sm_70)
> +
> +march-map=sm_75
> +Target RejectNegative Alias(misa=,sm_75)
> +
> +march-map=sm_80
> +Target RejectNegative Alias(misa=,sm_80)
> +
> +march-map=sm_86
> +Target RejectNegative Alias(misa=,sm_80)
> +
> +march-map=sm_87
> +Target RejectNegative Alias(misa=,sm_80)
> +
> +march-map=sm_90
> +Target RejectNegative Alias(misa=,sm_80)
> +
>  Enum
>  Name(ptx_version) Type(int)
>  Known PTX ISA versions (for use with the -mptx= option):
> diff --git a/gcc/testsuite/gcc.target/nvptx/march-map.c 
> b/gcc/testsuite/gcc.target/nvptx/march-map.c
> new file mode 100644
> index 000..00838e55fc0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/nvptx/march-map.c
> @@ -0,0 +1,5 @@
> +/* { dg-options "-march-map=sm_50" } */
> +
> +#include "main.c"
> +
> +/* { dg-final { s

RE: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Stubbs, Andrew
> For example, it's documented that 'cuMemHostAlloc',
>  api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b
> 9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".

OK, interesting. I had not seen this, but I think it confirms that the 
performance difference comes from within CUDA, and that regular locked memory 
does not gain much on its own.

> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?

Yes, this is the case.

> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
>  api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b5422
> 3>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case, too; presumably again using this
> interface likely circumvents any "annoying" 'ulimit' limitations?)
> 
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
> 
> ..., but maybe I'm missing some crucial "detail" here?

I'm investigating this stuff for the AMD USM implementation as well right now. 
It might be a good way to handle static and stack data too. Or not.
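
For reference, the register-existing-memory path discussed above would look
roughly like this (a sketch using only documented CUDA driver API calls;
error handling and cuMemHostUnregister omitted):

#include <stdlib.h>
#include <cuda.h>

void *
alloc_registered (size_t size)
{
  /* Ordinary host allocation ...  */
  void *p = malloc (size);
  /* ... then page-lock and register it with the driver, so transfers can
     take the same accelerated path as cuMemHostAlloc'd memory.  */
  if (p != NULL
      && cuMemHostRegister (p, size, CU_MEMHOSTREGISTER_PORTABLE) != CUDA_SUCCESS)
    {
      free (p);
      p = NULL;
    }
  return p;
}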

Andrew


RE: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Stubbs, Andrew
> The question is only what to do with 'requires unified_shared_memory' –
> and a non-multi-device allocator.

The compiler emits an error at compile time if you attempt to use both 
-foffload-memory=pinned and USM, because they’re not compatible. You're fine to 
use both explicit allocators in the same program, but the "pinnedness" of USM 
allocations is a matter for CUDA to care about (cuMemAllocManaged) and has 
nothing to do with this discussion.

The OpenMP pinned memory feature is intended to accelerate normal mappings, as 
far as I can tell.
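
For context, the explicit-allocator route simply requests the pinned trait
from a standard allocator (sketch only):

omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
omp_allocator_handle_t a = omp_init_allocator (omp_default_mem_space, 1, traits);
void *p = omp_alloc (1024, a);  /* pinned host memory, meant to speed up
                                   map/update transfers of normal mappings */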

Andrew


Re: [committed] openmp: Add support for HBW or large capacity or interleaved memory through the libmemkind.so library

2022-06-09 Thread Thomas Schwinge
Hi Jakub!

On 2022-06-09T10:19:03+0200, Jakub Jelinek via Gcc-patches 
 wrote:
> This patch adds support for dlopening libmemkind.so

Instead of 'dlopen'ing literally 'libmemkind.so':

> --- libgomp/allocator.c.jj2022-06-08 08:21:03.099446883 +0200
> +++ libgomp/allocator.c   2022-06-08 13:41:45.647133610 +0200

> +  void *handle = dlopen ("libmemkind.so", RTLD_LAZY);

..., shouldn't this instead 'dlopen' 'libmemkind.so.0'?  At least for
Debian/Ubuntu, the latter ('libmemkind.so.0') is shipped in the "library"
package:

$ apt-file list libmemkind0 | grep -F libmemkind.so
libmemkind0: /usr/lib/x86_64-linux-gnu/libmemkind.so.0
libmemkind0: /usr/lib/x86_64-linux-gnu/libmemkind.so.0.0.1

..., but the former ('libmemkind.so') only in the "development" package:

$ apt-file list libmemkind-dev | grep -F libmemkind.so
libmemkind-dev: /usr/lib/x86_64-linux-gnu/libmemkind.so

..., which users of GCC/libgomp shouldn't have to care about.
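
A minimal sketch of the suggested probing order (illustrative only; the
helper name is made up and this is not the committed code), which would keep
the runtime package sufficient:

#include <dlfcn.h>

/* Prefer the versioned SONAME shipped in the runtime ("library") package;
   fall back to the unversioned name from the development package.  */
static void *
gomp_open_libmemkind (void)
{
  void *handle = dlopen ("libmemkind.so.0", RTLD_LAZY);
  if (handle == NULL)
    handle = dlopen ("libmemkind.so", RTLD_LAZY);
  return handle;
}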


Any plans about test cases for this?  (Not trivial, I suppose?)

Or, at least some 'gomp_debug' logging, what's happening behind the
scenes?


> --- libgomp/config/linux/allocator.c.jj   2022-06-08 08:58:23.197078191 +0200
> +++ libgomp/config/linux/allocator.c  2022-06-08 09:39:15.108410730 +0200
> @@ -0,0 +1,36 @@

> +#define _GNU_SOURCE
> +#include "libgomp.h"
> +#if defined(PLUGIN_SUPPORT) && defined(LIBGOMP_USE_PTHREADS)
> +#define LIBGOMP_USE_MEMKIND
> +#endif
> +
> +#include "../../../allocator.c"

Given this use of 'PLUGIN_SUPPORT' (and thus 'dlopen' etc.) for something
different from libgomp plugins (offloading), we might move 'DL_LIBS' and
'PLUGIN_SUPPORT' from 'libgomp/plugin/configfrag.ac' into
'libgomp/configure.ac', and 'libgomp_la_LIBADD += $(DL_LIBS)' from
'libgomp/plugin/Makefrag.am' into 'libgomp/Makefile.am'.


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Tobias Burnus

On 09.06.22 11:38, Thomas Schwinge wrote:

On 2022-06-07T13:28:33+0100, Andrew Stubbs  wrote:

On 07/06/2022 13:10, Jakub Jelinek wrote:

On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:

The memory pinned via the mlock call does not give the expected performance
boost. I had not expected that it would do much in my test setup, given that
the machine has a lot of RAM and my benchmarks are small, but others have
tried more and on varying machines and architectures.

I don't understand why there should be any expected performance boost (at
least not unless the machine starts swapping out pages),
{ omp_atk_pinned, true } is solely about the requirement that the memory
can't be swapped out.

It seems like it takes a faster path through the NVidia drivers. [...]


I think this conflates two parts:

* User-defined allocators in general – there CUDA does not make much
sense and without unified-shared memory, it will always be inaccessible
on the device (w/o explicit/implicit mapping).

* Memory which is supposed to be accessible both on the host and on the
device. That's most obvious when explicitly allocating memory to be accessible
on both – it is less clear cut when just creating an allocator with
unified-shared memory, as it is not clear whether the memory is only used on
the host (e.g. with host-based thread parallelization) – or whether it is also
relevant for the device.

Currently, the user has no means to express the intent that it should be
accessible on both the host and one/several devices, except for 'omp
requires unified_shared_memory'.

The next OpenMP version will likely provide a means to create an
allocator which permits this →
https://github.com/OpenMP/spec/issues/1843 (not publicly available;
the slides in the last comment are slightly outdated).

 * * *

The question is only what to do with 'requires unified_shared_memory' –
and a non-multi-device allocator.

Probably: unified_shared_memory or no nvptx device: just use mlock.
Otherwise (i.e. both nvptx device and (unified_shared_memory or a
multi-device-allocator)), use the CUDA one.

For the latter, I think Thomas' remarks are helpful.

Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: aarch64: Fix bitfield alignment in param passing [PR105549]

2022-06-09 Thread Christophe Lyon via Gcc-patches




On 6/8/22 15:19, Richard Sandiford wrote:

Christophe Lyon  writes:

On 6/7/22 19:44, Richard Sandiford wrote:

Christophe Lyon via Gcc-patches  writes:

While working on enabling DFP for AArch64, I noticed new failures in
gcc.dg/compat/struct-layout-1.exp (t028) which were not actually
caused by the handling of DFP types. These tests are generated during
'make check', and enabling DFP changed what gets generated (I am not
sure whether new non-DFP tests are generated, or whether existing ones
are generated differently; the tests in question are huge and difficult
to compare).

Anyway, I reduced the problem to what I attach at the end of the new
gcc.target/aarch64/aapcs64/va_arg-17.c test and rewrote it in the same
scheme as other va_arg* AArch64 tests.  Richard Sandiford further
reduced this to a non-vararg function, added as a second testcase.

This is a tough case mixing bitfields and alignment, where
aarch64_function_arg_alignment did not follow what its descriptive
comment says: we want to use the natural alignment of the bitfield
type only if the user didn't override the alignment for the bitfield
itself.

The fix is thus very small, and this patch adds two new tests
(va_arg-17.c and pr105549.c). va_arg-17.c contains the reduced
offending testcase from struct-layout-1.exp for reference.

We also take the opportunity to fix the comment above
aarch64_function_arg_alignment, since the value of the abi_break
parameter was changed in a previous commit and no longer matches the
description.

2022-06-02  Christophe Lyon  

gcc/
PR target/105549
* config/aarch64/aarch64.cc (aarch64_function_arg_alignment):
Check DECL_USER_ALIGN for bitfield.

gcc/testsuite/
PR target/105549
* gcc.target/aarch64/aapcs64/va_arg-17.c: New.
* gcc.target/aarch64/pr105549.c: New.


### Attachment also inlined for ease of reply###


diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 40fc5e633992036a2c06867857a681792178ef00..2c6ccce7cb5dc32097d24514ee525729efb6b7ff 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -7262,9 +7262,9 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, 
machine_mode mode,
   /* Given MODE and TYPE of a function argument, return the alignment in
  bits.  The idea is to suppress any stronger alignment requested by
  the user and opt for the natural alignment (specified in AAPCS64 \S
-   4.1).  ABI_BREAK is set to true if the alignment was incorrectly
-   calculated in versions of GCC prior to GCC-9.  This is a helper
-   function for local use only.  */
+   4.1).  ABI_BREAK is set to the old alignment if the alignment was
+   incorrectly calculated in versions of GCC prior to GCC-9.  This is
+   a helper function for local use only.  */
   
   static unsigned int

   aarch64_function_arg_alignment (machine_mode mode, const_tree type,
@@ -7304,7 +7304,10 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
   "s" contains only one Fundamental Data Type (the int field)
   but gains 8-byte alignment and size thanks to "e".  */
alignment = std::max (alignment, DECL_ALIGN (field));
-   if (DECL_BIT_FIELD_TYPE (field))
+
+   /* Take bit-field type's alignment into account only if the
+  user didn't override this field's alignment.  */
+   if (DECL_BIT_FIELD_TYPE (field) && !DECL_USER_ALIGN (field))


I think we need to check DECL_PACKED instead.  On its own, an alignment
attribute on the field can only increase alignment, not decrease it.
In contrast, the packed attribute effectively forces the alignment to
1 byte, so has an effect even without an alignment attribute.  Adding an
explicit alignment on top can then increase the alignment from 1 to any
value (bigger or smaller than the original underlying type).


Right, but the comment before aarch64_function_arg_alignment says:

"The idea is to suppress any stronger alignment requested by the user
and opt for the natural alignment (specified in AAPCS64 \S 4.1)"

When using DECL_PACKED, wouldn't we check the opposite of this (i.e. that
the user requested a smaller alignment)?  I mean we'd not "suppress
stronger alignment", since such cases do not have DECL_PACKED?


I think "stronger alignment" here means "greater alignment" rather
than "less alignment".  But in these examples we're dealing with
alignments of the fields.  I think that part is OK, and that the
intention is to ignore any greater alignment specified at the structure
level, independently of the fields.

In other words, if field list X occupies 16 bytes, then S1 and S2
below should be handled in the same way as far as register assignment
is concerned:

   struct S1 { X };
   struct S2 { X } __attribute__((aligned(16)));

The idea is that structures are just a sequence of fields/members
and don't have any "magic" properties beyond that.
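
For concreteness, a hypothetical pair of cases (illustrative only, not the
generated struct-layout-1.exp reproducer) that the DECL_USER_ALIGN vs.
DECL_PACKED distinction has to tell apart:

/* The user raised the bit-field's alignment: DECL_USER_ALIGN is set, and
   the aligned attribute can only increase the field's alignment.  */
struct user_aligned_bf
{
  unsigned int b : 15 __attribute__ ((aligned (8)));
  unsigned long long tail;
};

/* The user packed the bit-field: DECL_PACKED is set, and the field's
   alignment is forced down to 1 byte even without an aligned attribute.  */
struct packed_bf
{
  unsigned long long b : 15 __attribute__ ((packed));
  char tail;
};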


However I'm not sure which part of the ABI is mentioned in

Re: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Thomas Schwinge
Hi!

I'm not all too familiar with the "newish" CUDA Driver API, but maybe the
following is useful still:

On 2022-06-07T13:28:33+0100, Andrew Stubbs  wrote:
> On 07/06/2022 13:10, Jakub Jelinek wrote:
>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>> Following some feedback from users of the OG11 branch I think I need to
>>> withdraw this patch, for now.
>>>
>>> The memory pinned via the mlock call does not give the expected performance
>>> boost. I had not expected that it would do much in my test setup, given that
>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>> tried more and on varying machines and architectures.
>>
>> I don't understand why there should be any expected performance boost (at
>> least not unless the machine starts swapping out pages),
>> { omp_atk_pinned, true } is solely about the requirement that the memory
>> can't be swapped out.
>
> It seems like it takes a faster path through the NVidia drivers. This is
> a black box, for me, but that seems like a plausible explanation. The
> results are different on x86_64 and powerpc hosts (such as the Summit
> supercomputer).

For example, it's documented that 'cuMemHostAlloc',
,
"Allocates page-locked host memory".  The crucial thing, though, what
makes this different from 'malloc' plus 'mlock' is, that "The driver
tracks the virtual memory ranges allocated with this function and
automatically accelerates calls to functions such as cuMemcpyHtoD().
Since the memory can be accessed directly by the device, it can be read
or written with much higher bandwidth than pageable memory obtained with
functions such as malloc()".

Similar, for example, for 'cuMemAllocHost',
.

This, to me, would explain why "the mlock call does not give the expected
performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
with 'mlock' you're missing the "tracks the virtual memory ranges"
aspect.

Also, by means of the Nvidia Driver allocating the memory, I suppose
using this interface likely circumvents any "annoying" 'ulimit'
limitations?  I get this impression, because documentation continues
stating that "Allocating excessive amounts of memory with
cuMemAllocHost() may degrade system performance, since it reduces the
amount of memory available to the system for paging.  As a result, this
function is best used sparingly to allocate staging areas for data
exchange between host and device".

>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>> pinned using the Cuda API to get the performance boost.
>>
>> For performance boost of what kind of code?
>> I don't understand how Cuda API could be useful (or can be used at all) if
>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>> obviously, but then mlock isn't available, so sure, if there is something
>> CUDA can provide for that case, nice).
>
> This is specifically for NVPTX offload, of course, but then that's what
> our customer is paying for.
>
> The expectation, from users, is that memory pinning will give the
> benefits specific to the active device. We can certainly make that
> happen when there is only one (flavour of) offload device present. I had
> hoped it could be one way for all, but it looks like not.

Aren't there CUDA Driver interfaces for that?  That is:

>>> I had not done this
>>> this because it was difficult to resolve the code abstraction
>>> difficulties and anyway the implementation was supposed to be device
>>> independent, but it seems we need a specific pinning mechanism for each
>>> device.

If not directly *allocating and registering* such memory via
'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
*register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
:
"Page-locks the memory range specified [...] and maps it for the
device(s) [...].  This memory range also is added to the same tracking
mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
manual 'mlock'ing involved in that case, too; presumably again using this
interface likely circumvents any "annoying" 'ulimit' limitations?)

Such a *register* abstraction can then be implemented by all the libgomp
offloading plugins: they just call the respective
CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
memory.

..., but maybe I'm missing some crucial "detail" here?
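
To make that concrete, a rough sketch of such a *register* hook (the hook
itself and its name are hypothetical; only cuMemHostRegister and
cuMemHostUnregister are actual CUDA Driver API functions):

#include <cuda.h>
#include <stdbool.h>
#include <stddef.h>

/* Page-lock an existing host allocation and add it to the driver's
   tracking, instead of allocating the memory through the driver.  */
static bool
nvptx_register_pinned (void *ptr, size_t size)
{
  /* Flags == 0: plain page-locked memory; copies such as cuMemcpyHtoD are
     still accelerated because the driver now tracks the range.  */
  return cuMemHostRegister (ptr, size, 0) == CUDA_SUCCESS;
}

static bool
nvptx_unregister_pinned (void *ptr)
{
  return cuMemHostUnregister (ptr) == CUDA_SUCCESS;
}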


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift

[PING][Patch] OpenMP, libgomp, gimple: omp_get_max_teams, omp_set_num_teams, and omp_{gs}et_teams_thread_limit on offload devices

2022-06-09 Thread Marcel Vollweiler

Hi,

I’d like to ping the patch for the OpenMP runtime routines omp_get_max_teams,
omp_set_num_teams, and omp_{gs}et_teams_thread_limit on offload devices:

https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593260.html

This patch builds on the following patch which is currently in revision/review:
- [PATCH] OpenMP, libgomp: Environment variable syntax extension.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588728.html

As several technical details will change anyway due to the revision of the
environment variable extension patch, a complete review does not make sense yet
from my point of view. However, I wondered whether a "rough" review of the main
approach/idea is feasible, so that any necessary changes could be included in
the revision that is needed anyway.

Thanks
Marcel
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


RE: [PATCH]AArch64 relax predicate on load structure load instructions

2022-06-09 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Sandiford 
> Sent: Thursday, June 9, 2022 9:22 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; Kyrylo Tkachov
> ; rguent...@suse.de;
> ro...@nextmovesoftware.com
> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> instructions
> 
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Wednesday, June 8, 2022 3:36 PM
> >> To: Tamar Christina 
> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> ; Marcus Shawcroft
> >> ; Kyrylo Tkachov
> ;
> >> rguent...@suse.de; ro...@eyesopen.com
> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> >> instructions
> >>
> >> Tamar Christina  writes:
> >> >> -Original Message-
> >> >> From: Richard Sandiford 
> >> >> Sent: Wednesday, June 8, 2022 11:31 AM
> >> >> To: Tamar Christina 
> >> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> ; Marcus Shawcroft
> >> >> ; Kyrylo Tkachov
> >> 
> >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> >> >> instructions
> >> >>
> >> >> Tamar Christina  writes:
> >> >> > Hi All,
> >> >> >
> >> >> > At some point in time we started lowering the ld1r instructions
> >> >> > in
> >> gimple.
> >> >> >
> >> >> > That is:
> >> >> >
> >> >> > uint8x8_t f1(const uint8_t *in) {
> >> >> > return vld1_dup_u8(&in[1]);
> >> >> > }
> >> >> >
> >> >> > generates at gimple:
> >> >> >
> >> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
> >> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> >> >> >
> >> >> > Which is good, but we then generate:
> >> >> >
> >> >> > f1:
> >> >> >   ldr b0, [x0, 1]
> >> >> >   dup v0.8b, v0.b[0]
> >> >> >   ret
> >> >> >
> >> >> > instead of ld1r.
> >> >> >
> >> >> > The reason for this is because the load instructions have a too
> >> >> > restrictive predicate on them which causes combine not to be
> >> >> > able to combine the instructions due to the predicate only
> >> >> > accepting simple
> >> >> addressing modes.
> >> >> >
> >> >> > This patch relaxes the predicate to accept any memory operand
> >> >> > and relies on LRA to legitimize the address when it needs to as
> >> >> > the constraint still only allows the simple addressing mode.
> >> >> > Reload is always able to legitimize to these.
> >> >> >
> >> >> > Secondly since we are now actually generating more ld1r it
> >> >> > became clear that the lane instructions suffer from a similar issue.
> >> >> >
> >> >> > i.e.
> >> >> >
> >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
> >> >> > float32x4_t dup = vld1q_dup_f32(&in[1]);
> >> >> > return vfmaq_laneq_f32 (a, a, dup, 1); }
> >> >> >
> >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
> >> >> >
> >> >> > The reason for this is similar to the ld1r issue.  The predicate
> >> >> > is too restrictive in only acception register operands but not memory.
> >> >> >
> >> >> > This relaxes it to accept register and/or memory while leaving
> >> >> > the constraint to only accept registers.  This will have LRA
> >> >> > generate a reload if needed forcing the memory to registers
> >> >> > using the standard
> >> >> patterns.
> >> >> >
> >> >> > These two changes allow combine and reload to generate the right
> >> >> sequences.
> >> >> >
> >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >> >>
> >> >> This is going against the general direction of travel, which is to
> >> >> make the instruction's predicates and conditions enforce the
> >> >> constraints as much as possible (making optimistic assumptions
> >> >> about
> >> pseudo registers).
> >> >>
> >> >> The RA *can* deal with things like:
> >> >>
> >> >>   (match_operand:M N "general_operand" "r")
> >> >>
> >> >> but it's best avoided, for a few reasons:
> >> >>
> >> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
> >> >> registers.  This can make the allocation of those temporaries
> >> >> suboptimal but (more importantly) it might require other
> >> >> previously-allocated registers to be spilled late due to the
> >> >> unexpected increase in register pressure.
> >> >>
> >> >> (2) It ends up hiding instructions from the pre-RA optimisers.
> >> >>
> >> >> (3) It can also prevent combine opportunities (as well as create them),
> >> >> unless the loose predicates in an insn I are propagated to all
> >> >> patterns that might result from combining I with something else.
> >> >>
> >> >> It sounds like the first problem (not generating ld1r) could be
> >> >> fixed by (a) combining aarch64_simd_dup and
> >> *aarch64_simd_ld1r,
> >> >> so that the register and memory alternatives are in the same
> >> >> pattern and (b) using the merged instruction(s) to implement the
> >> >> vec_duplicate
> >> optab.
> >> >> Target-independent code should then make the address satisfy the
> >> >> predicate, simplifying the addres

[PATCH] c++: Add support for __real__/__imag__ modifications in constant expressions [PR88174]

2022-06-09 Thread Jakub Jelinek via Gcc-patches
Hi!

We claim we support P0415R1 (constexpr complex), but e.g.
#include 

constexpr bool
foo ()
{
  std::complex a (1.0, 2.0);
  a += 3.0;
  a.real (6.0);
  return a.real () == 6.0 && a.imag () == 2.0;
}

static_assert (foo ());

fails with
test.C:12:20: error: non-constant condition for static assertion
   12 | static_assert (foo ());
  |^~
test.C:12:20:   in ‘constexpr’ expansion of ‘foo()’
test.C:8:10:   in ‘constexpr’ expansion of 
‘a.std::complex::real(6.0e+0)’
test.C:12:20: error: modification of ‘__real__ 
a.std::complex::_M_value’ is not a constant expression

The problem is we don't handle REALPART_EXPR and IMAGPART_EXPR
in cxx_eval_store_expression.
The following patch attempts to support it (with the requirement
that those are the outermost expressions; ARRAY_REF/COMPONENT_REF
etc. are just not possible on the result of these, and BIT_FIELD_REF
would be theoretically possible if trying to extract some bits
from one part of a complex int, but I don't see how it could appear
in the FE trees).

For these references, the code handles the value being a COMPLEX_CST,
a COMPLEX_EXPR, or an empty CONSTRUCTOR_NO_CLEARING CONSTRUCTOR (what we
use to represent uninitialized values for C++20 and later); it starts by
rewriting the value to a COMPLEX_EXPR, so that we can freely adjust the
individual parts and later on possibly optimize it back to a COMPLEX_CST
if both halves are constant.
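
As a further illustration (a hand-written sketch in the spirit of the new
constexpr-complex1.C test, not copied from it), direct __real__/__imag__
stores like the following should now be accepted during constant evaluation:

constexpr double
bar ()
{
  _Complex double c = 1.0;   // starts as (1.0, 0.0)
  __real__ c = 3.0;          // REALPART_EXPR store
  __imag__ c = 4.0;          // IMAGPART_EXPR store
  return __real__ c + __imag__ c;
}

static_assert (bar () == 7.0, "");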

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2022-06-09  Jakub Jelinek  

PR c++/88174
* constexpr.cc (cxx_eval_store_expression): Handle REALPART_EXPR
and IMAGPART_EXPR.

* g++.dg/cpp1y/constexpr-complex1.C: New test.

--- gcc/cp/constexpr.cc.jj  2022-06-08 08:21:02.973448193 +0200
+++ gcc/cp/constexpr.cc 2022-06-08 17:13:04.986040449 +0200
@@ -5707,6 +5707,20 @@ cxx_eval_store_expression (const constex
  }
  break;
 
+   case REALPART_EXPR:
+ gcc_assert (probe == target);
+ vec_safe_push (refs, integer_zero_node);
+ vec_safe_push (refs, TREE_TYPE (probe));
+ probe = TREE_OPERAND (probe, 0);
+ break;
+
+   case IMAGPART_EXPR:
+ gcc_assert (probe == target);
+ vec_safe_push (refs, integer_one_node);
+ vec_safe_push (refs, TREE_TYPE (probe));
+ probe = TREE_OPERAND (probe, 0);
+ break;
+
default:
  if (evaluated)
object = probe;
@@ -5749,6 +5763,8 @@ cxx_eval_store_expression (const constex
   auto_vec index_pos_hints;
   bool activated_union_member_p = false;
   bool empty_base = false;
+  int complex_part = -1;
+  tree *complex_expr = NULL;
   while (!refs->is_empty ())
 {
   if (*valp == NULL_TREE)
@@ -5785,14 +5801,36 @@ cxx_eval_store_expression (const constex
  *valp = ary_ctor;
}
 
-  /* If the value of object is already zero-initialized, any new ctors for
-subobjects will also be zero-initialized.  */
-  no_zero_init = CONSTRUCTOR_NO_CLEARING (*valp);
-
   enum tree_code code = TREE_CODE (type);
   tree reftype = refs->pop();
   tree index = refs->pop();
 
+  if (code == COMPLEX_TYPE)
+   {
+ if (TREE_CODE (*valp) == COMPLEX_CST)
+   *valp = build2 (COMPLEX_EXPR, type, TREE_REALPART (*valp),
+   TREE_IMAGPART (*valp));
+ else if (TREE_CODE (*valp) == CONSTRUCTOR
+  && CONSTRUCTOR_NELTS (*valp) == 0
+  && CONSTRUCTOR_NO_CLEARING (*valp))
+   {
+ tree r = build_constructor (reftype, NULL);
+ CONSTRUCTOR_NO_CLEARING (r) = 1;
+ *valp = build2 (COMPLEX_EXPR, type, r, r);
+   }
+ gcc_assert (TREE_CODE (*valp) == COMPLEX_EXPR);
+ complex_expr = valp;
+ valp = &TREE_OPERAND (*valp, index != integer_zero_node);
+ gcc_checking_assert (refs->is_empty ());
+ type = reftype;
+ complex_part = index != integer_zero_node;
+ break;
+   }
+
+  /* If the value of object is already zero-initialized, any new ctors for
+subobjects will also be zero-initialized.  */
+  no_zero_init = CONSTRUCTOR_NO_CLEARING (*valp);
+
   if (code == RECORD_TYPE && is_empty_field (index))
/* Don't build a sub-CONSTRUCTOR for an empty base or field, as they
   have no data and might have an offset lower than previously declared
@@ -5946,6 +5984,24 @@ cxx_eval_store_expression (const constex
= get_or_insert_ctor_field (*valp, indexes[i], index_pos_hints[i]);
  valp = &cep->value;
}
+  if (complex_part != -1)
+   {
+ if (TREE_CODE (*valp) == COMPLEX_CST)
+   *valp = build2 (COMPLEX_EXPR, TREE_TYPE (*valp),
+   TREE_REALPART (*valp),
+   TREE_IMAGPART (*valp));
+ else if (TREE_CODE (*valp) == CONSTRUCTOR
+  && CONSTRUCTOR_NELTS (*valp) == 0
+  && CON

Re: [PATCH]AArch64 relax predicate on load structure load instructions

2022-06-09 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Wednesday, June 8, 2022 3:36 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; Kyrylo Tkachov
>> ; rguent...@suse.de; ro...@eyesopen.com
>> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
>> instructions
>> 
>> Tamar Christina  writes:
>> >> -Original Message-
>> >> From: Richard Sandiford 
>> >> Sent: Wednesday, June 8, 2022 11:31 AM
>> >> To: Tamar Christina 
>> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> ; Marcus Shawcroft
>> >> ; Kyrylo Tkachov
>> 
>> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
>> >> instructions
>> >>
>> >> Tamar Christina  writes:
>> >> > Hi All,
>> >> >
>> >> > At some point in time we started lowering the ld1r instructions in
>> gimple.
>> >> >
>> >> > That is:
>> >> >
>> >> > uint8x8_t f1(const uint8_t *in) {
>> >> > return vld1_dup_u8(&in[1]);
>> >> > }
>> >> >
>> >> > generates at gimple:
>> >> >
>> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
>> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
>> >> >
>> >> > Which is good, but we then generate:
>> >> >
>> >> > f1:
>> >> > ldr b0, [x0, 1]
>> >> > dup v0.8b, v0.b[0]
>> >> > ret
>> >> >
>> >> > instead of ld1r.
>> >> >
>> >> > The reason for this is because the load instructions have a too
>> >> > restrictive predicate on them which causes combine not to be able
>> >> > to combine the instructions due to the predicate only accepting
>> >> > simple
>> >> addressing modes.
>> >> >
>> >> > This patch relaxes the predicate to accept any memory operand and
>> >> > relies on LRA to legitimize the address when it needs to as the
>> >> > constraint still only allows the simple addressing mode.  Reload is
>> >> > always able to legitimize to these.
>> >> >
>> >> > Secondly since we are now actually generating more ld1r it became
>> >> > clear that the lane instructions suffer from a similar issue.
>> >> >
>> >> > i.e.
>> >> >
>> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
>> >> > float32x4_t dup = vld1q_dup_f32(&in[1]);
>> >> > return vfmaq_laneq_f32 (a, a, dup, 1); }
>> >> >
>> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
>> >> >
>> >> > The reason for this is similar to the ld1r issue.  The predicate is
>> >> > too restrictive in only acception register operands but not memory.
>> >> >
>> >> > This relaxes it to accept register and/or memory while leaving the
>> >> > constraint to only accept registers.  This will have LRA generate a
>> >> > reload if needed forcing the memory to registers using the standard
>> >> patterns.
>> >> >
>> >> > These two changes allow combine and reload to generate the right
>> >> sequences.
>> >> >
>> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >>
>> >> This is going against the general direction of travel, which is to
>> >> make the instruction's predicates and conditions enforce the
>> >> constraints as much as possible (making optimistic assumptions about
>> pseudo registers).
>> >>
>> >> The RA *can* deal with things like:
>> >>
>> >>   (match_operand:M N "general_operand" "r")
>> >>
>> >> but it's best avoided, for a few reasons:
>> >>
>> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
>> >> registers.  This can make the allocation of those temporaries
>> >> suboptimal but (more importantly) it might require other
>> >> previously-allocated registers to be spilled late due to the
>> >> unexpected increase in register pressure.
>> >>
>> >> (2) It ends up hiding instructions from the pre-RA optimisers.
>> >>
>> >> (3) It can also prevent combine opportunities (as well as create them),
>> >> unless the loose predicates in an insn I are propagated to all
>> >> patterns that might result from combining I with something else.
>> >>
>> >> It sounds like the first problem (not generating ld1r) could be fixed
>> >> by (a) combining aarch64_simd_dup and
>> *aarch64_simd_ld1r,
>> >> so that the register and memory alternatives are in the same pattern
>> >> and (b) using the merged instruction(s) to implement the vec_duplicate
>> optab.
>> >> Target-independent code should then make the address satisfy the
>> >> predicate, simplifying the address where necessary.
>> >>
>> >
>> > I think I am likely missing something here. I would assume that you
>> > wanted to use the optab to split the addressing off from the mem
>> > expression so the combined insn matches.
>> >
>> > But in that case, why do you need to combine the two instructions?
>> > I've tried and it doesn't work since the vec_duplicate optab doesn't
>> > see the mem as op1, because in gimple the mem is not part of the
>> duplicate.
>> >
>> > So you still just see:
>> >
>>  dbgrtx (ops[1].value)
>> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
>> >
>> > As the operand as the argument to the 

[committed] doc: Fix up -Waddress documentation

2022-06-09 Thread Jakub Jelinek via Gcc-patches
Hi!

When looking up the -Waddress documentation due to some PR that mentioned it,
I've noticed some typos and thus I'm fixing them.

Bootstrapped/regtested on x86_64-linux and i686-linux, committed to trunk
as obvious.

2022-06-09  Jakub Jelinek  

* doc/invoke.texi (-Waddress): Fix a typo in small example.
Fix typos inptr_t -> intptr_t and uinptr_t -> uintptr_t.

--- gcc/doc/invoke.texi.jj  2022-06-03 11:20:13.155071628 +0200
+++ gcc/doc/invoke.texi 2022-06-08 18:05:17.205340980 +0200
@@ -8901,7 +8901,7 @@ such as in
 void f (void);
 void g (void)
 @{
-  if (!func)   // warning: expression evaluates to false
+  if (!f)   // warning: expression evaluates to false
 abort ();
 @}
 @end smallexample
@@ -8927,7 +8927,7 @@ weak symbols), so their use in a conditi
 parentheses in a function call or a missing dereference in an array
 expression.  The subset of the warning for object pointers can be
 suppressed by casting the pointer operand to an integer type such
-as @code{inptr_t} or @code{uinptr_t}.
+as @code{intptr_t} or @code{uintptr_t}.
 Comparisons against string literals result in unspecified behavior
 and are not portable, and suggest the intent was to call @code{strcmp}.
 The warning is suppressed if the suspicious expression is the result

Jakub



[committed] openmp: Add support for HBW or large capacity or interleaved memory through the libmemkind.so library

2022-06-09 Thread Jakub Jelinek via Gcc-patches
Hi!

This patch adds support for dlopening libmemkind.so on Linux and uses it
for some kinds of allocations (but not yet e.g. pinned memory).
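
For reference, a minimal user-level example (illustrative only, not part of
the patch's testsuite) of an allocation that can now be served through
libmemkind's high-bandwidth-memory kind when libmemkind.so can be dlopened:

#include <omp.h>
#include <stdlib.h>

int
main (void)
{
  /* Before this patch, requesting omp_high_bw_mem_space failed; now it
     succeeds when libmemkind is available.  */
  omp_allocator_handle_t a
    = omp_init_allocator (omp_high_bw_mem_space, 0, NULL);
  double *p = (double *) omp_alloc (1024 * sizeof (double), a);
  if (p == NULL)
    abort ();
  omp_free (p, a);
  omp_destroy_allocator (a);
  return 0;
}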

Bootstrapped/regtested on x86_64-linux and i686-linux (with libmemkind
around) and compile tested with LIBGOMP_USE_MEMKIND undefined, committed
to trunk.

2022-06-09  Jakub Jelinek  

* allocator.c: Include dlfcn.h if LIBGOMP_USE_MEMKIND is defined.
(enum gomp_memkind_kind): New type.
(struct omp_allocator_data): Add memkind field if LIBGOMP_USE_MEMKIND
is defined.
(struct gomp_memkind_data): New type.
(memkind_data, memkind_data_once): New variables.
(gomp_init_memkind, gomp_get_memkind): New functions.
(omp_init_allocator): Initialize data.memkind, don't fail for
omp_high_bw_mem_space if libmemkind supports it.
(omp_aligned_alloc, omp_free, omp_aligned_calloc, omp_realloc): Add
memkind support if LIBGOMP_USE_MEMKIND is defined.
* config/linux/allocator.c: New file.

--- libgomp/allocator.c.jj  2022-06-08 08:21:03.099446883 +0200
+++ libgomp/allocator.c 2022-06-08 13:41:45.647133610 +0200
@@ -31,9 +31,28 @@
 #include "libgomp.h"
 #include 
 #include 
+#ifdef LIBGOMP_USE_MEMKIND
+#include 
+#endif
 
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
+enum gomp_memkind_kind
+{
+  GOMP_MEMKIND_NONE = 0,
+#define GOMP_MEMKIND_KINDS \
+  GOMP_MEMKIND_KIND (HBW_INTERLEAVE),  \
+  GOMP_MEMKIND_KIND (HBW_PREFERRED),   \
+  GOMP_MEMKIND_KIND (DAX_KMEM_ALL),\
+  GOMP_MEMKIND_KIND (DAX_KMEM),\
+  GOMP_MEMKIND_KIND (INTERLEAVE),  \
+  GOMP_MEMKIND_KIND (DEFAULT)
+#define GOMP_MEMKIND_KIND(kind) GOMP_MEMKIND_##kind
+  GOMP_MEMKIND_KINDS,
+#undef GOMP_MEMKIND_KIND
+  GOMP_MEMKIND_COUNT
+};
+
 struct omp_allocator_data
 {
   omp_memspace_handle_t memspace;
@@ -46,6 +65,9 @@ struct omp_allocator_data
   unsigned int fallback : 8;
   unsigned int pinned : 1;
   unsigned int partition : 7;
+#ifdef LIBGOMP_USE_MEMKIND
+  unsigned int memkind : 8;
+#endif
 #ifndef HAVE_SYNC_BUILTINS
   gomp_mutex_t lock;
 #endif
@@ -59,13 +81,95 @@ struct omp_mem_header
   void *pad;
 };
 
+struct gomp_memkind_data
+{
+  void *memkind_handle;
+  void *(*memkind_malloc) (void *, size_t);
+  void *(*memkind_calloc) (void *, size_t, size_t);
+  void *(*memkind_realloc) (void *, void *, size_t);
+  void (*memkind_free) (void *, void *);
+  int (*memkind_check_available) (void *);
+  void **kinds[GOMP_MEMKIND_COUNT];
+};
+
+#ifdef LIBGOMP_USE_MEMKIND
+static struct gomp_memkind_data *memkind_data;
+static pthread_once_t memkind_data_once = PTHREAD_ONCE_INIT;
+
+static void
+gomp_init_memkind (void)
+{
+  void *handle = dlopen ("libmemkind.so", RTLD_LAZY);
+  struct gomp_memkind_data *data;
+  int i;
+  static const char *kinds[] = {
+NULL,
+#define GOMP_MEMKIND_KIND(kind) "MEMKIND_" #kind
+GOMP_MEMKIND_KINDS
+#undef GOMP_MEMKIND_KIND
+  };
+
+  data = calloc (1, sizeof (struct gomp_memkind_data));
+  if (data == NULL)
+{
+  if (handle)
+   dlclose (handle);
+  return;
+}
+  if (!handle)
+{
+  __atomic_store_n (&memkind_data, data, MEMMODEL_RELEASE);
+  return;
+}
+  data->memkind_handle = handle;
+  data->memkind_malloc
+= (__typeof (data->memkind_malloc)) dlsym (handle, "memkind_malloc");
+  data->memkind_calloc
+= (__typeof (data->memkind_calloc)) dlsym (handle, "memkind_calloc");
+  data->memkind_realloc
+= (__typeof (data->memkind_realloc)) dlsym (handle, "memkind_realloc");
+  data->memkind_free
+= (__typeof (data->memkind_free)) dlsym (handle, "memkind_free");
+  data->memkind_check_available
+= (__typeof (data->memkind_check_available))
+  dlsym (handle, "memkind_check_available");
+  if (data->memkind_malloc
+  && data->memkind_calloc
+  && data->memkind_realloc
+  && data->memkind_free
+  && data->memkind_check_available)
+for (i = 1; i < GOMP_MEMKIND_COUNT; ++i)
+  {
+   data->kinds[i] = (void **) dlsym (handle, kinds[i]);
+   if (data->kinds[i] && data->memkind_check_available (*data->kinds[i]))
+ data->kinds[i] = NULL;
+  }
+  __atomic_store_n (&memkind_data, data, MEMMODEL_RELEASE);
+}
+
+static struct gomp_memkind_data *
+gomp_get_memkind (void)
+{
+  struct gomp_memkind_data *data
+= __atomic_load_n (&memkind_data, MEMMODEL_ACQUIRE);
+  if (data)
+return data;
+  pthread_once (&memkind_data_once, gomp_init_memkind);
+  return __atomic_load_n (&memkind_data, MEMMODEL_ACQUIRE);
+}
+#endif
+
 omp_allocator_handle_t
 omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
const omp_alloctrait_t traits[])
 {
   struct omp_allocator_data data
 = { memspace, 1, ~(uintptr_t) 0, 0, 0, omp_atv_contended, omp_atv_all,
-   omp_atv_default_mem_fb, omp_atv_false, omp_atv_environment };
+   omp_atv_default_mem_fb, omp_atv_false, omp_atv_environment,
+#ifdef LIBGOMP_USE_MEMKIND
+   

[PATCH]middle-end Use subregs to expand COMPLEX_EXPR to set the lowpart.

2022-06-09 Thread Tamar Christina via Gcc-patches
Hi All,

When lowering COMPLEX_EXPR we currently emit two VEC_EXTRACTs.  One for the
lowpart and one for the highpart.

The problem with this is that in RTL the lvalue of the RTX is the only thing
tying the two instructions together.

This means that e.g. combine is unable to try to combine the two instructions
for setting the lowpart and highpart.

For ISAs that have bit extract instructions we can eliminate one of the extracts
if, and only if, we're setting the entire complex number.

This patch changes the expand code, when we're setting the entire complex number,
to generate a subreg for the lowpart instead of a vec_extract.

This allows us to optimize sequences such as:

_Complex int f(int a, int b) {
_Complex int t = a + b * 1i;
return t;
}

from:

f:
bfi x2, x0, 0, 32
bfi x2, x1, 32, 32
mov x0, x2
ret

into:

f:
bfi x0, x1, 32, 32
ret

I have also confirmed the codegen for x86_64 did not change.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* emit-rtl.cc (validate_subreg): Accept subregs of complex modes.
* expr.cc (emit_move_complex_parts): Emit subreg of lowpart if possible.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/complex-init.C: New test.

--- inline copy of patch -- 
diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
index 
f4404d7abe33b565358b7f609a91114c75ecf4e7..15ffca2ffe986bca56c1fae9381bd33f5d6b012d
 100644
--- a/gcc/emit-rtl.cc
+++ b/gcc/emit-rtl.cc
@@ -947,9 +947,11 @@ validate_subreg (machine_mode omode, machine_mode imode,
   && GET_MODE_INNER (omode) == GET_MODE_INNER (imode))
 ;
   /* Subregs involving floating point modes are not allowed to
- change size.  Therefore (subreg:DI (reg:DF) 0) is fine, but
+ change size unless it's an insert into a complex mode.
+ Therefore (subreg:DI (reg:DF) 0) and (subreg:CS (reg:SF) 0) are fine, but
  (subreg:SI (reg:DF) 0) isn't.  */
-  else if (FLOAT_MODE_P (imode) || FLOAT_MODE_P (omode))
+  else if ((FLOAT_MODE_P (imode) || FLOAT_MODE_P (omode))
+  && !COMPLEX_MODE_P (omode))
 {
   if (! (known_eq (isize, osize)
 /* LRA can use subreg to store a floating point value in
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 
5f7142b975ada2cd8b00663d35ba1e0004b8e28d..fce672c236fdbc4d40adb6e2614c234c02a61933
 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -3740,7 +3740,17 @@ emit_move_complex_parts (rtx x, rtx y)
   && REG_P (x) && !reg_overlap_mentioned_p (x, y))
 emit_clobber (x);
 
-  write_complex_part (x, read_complex_part (y, false), false);
+  /* If we're writing the entire value using a concat into a register
+ then emit the lower part as a simple mov followed by an insert
+ into the top part.  */
+  if (GET_CODE (y) == CONCAT && !reload_completed && REG_P (x))
+{
+  rtx val = XEXP (y, false);
+  rtx dest = lowpart_subreg (GET_MODE (val), x, GET_MODE (x));
+  emit_move_insn (dest, val);
+}
+  else
+write_complex_part (x, read_complex_part (y, false), false);
   write_complex_part (x, read_complex_part (y, true), true);
 
   return get_last_insn ();
diff --git a/gcc/testsuite/g++.target/aarch64/complex-init.C 
b/gcc/testsuite/g++.target/aarch64/complex-init.C
new file mode 100644
index 
..497cc4bca3e2c59da95c871ceb5cc96216fc302d
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/complex-init.C
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+/*
+** _Z1fii:
+** ...
+** bfi x0, x1, 32, 32
+** ret
+** ...
+*/
+_Complex int f(int a, int b) {
+_Complex int t = a + b * 1i;
+return t;
+}
+
+/*
+** _Z2f2ii:
+** ...
+** bfi x0, x1, 32, 32
+** ret
+** ...
+*/
+_Complex int f2(int a, int b) {
+_Complex int t = {a, b};
+return t;
+}
+
+/*
+** _Z12f_convolutedii:
+** ...
+** bfi x0, x1, 32, 32
+** ret
+** ...
+*/
+_Complex int f_convoluted(int a, int b) {
+_Complex int t = (_Complex int)a;
+__imag__ t = b;
+return t;
+}





RE: [PATCH]AArch64 relax predicate on load structure load instructions

2022-06-09 Thread Tamar Christina via Gcc-patches

> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, June 8, 2022 3:36 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; Kyrylo Tkachov
> ; rguent...@suse.de; ro...@eyesopen.com
> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> instructions
> 
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Wednesday, June 8, 2022 11:31 AM
> >> To: Tamar Christina 
> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> ; Marcus Shawcroft
> >> ; Kyrylo Tkachov
> 
> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> >> instructions
> >>
> >> Tamar Christina  writes:
> >> > Hi All,
> >> >
> >> > At some point in time we started lowering the ld1r instructions in
> gimple.
> >> >
> >> > That is:
> >> >
> >> > uint8x8_t f1(const uint8_t *in) {
> >> > return vld1_dup_u8(&in[1]);
> >> > }
> >> >
> >> > generates at gimple:
> >> >
> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> >> >
> >> > Which is good, but we then generate:
> >> >
> >> > f1:
> >> >  ldr b0, [x0, 1]
> >> >  dup v0.8b, v0.b[0]
> >> >  ret
> >> >
> >> > instead of ld1r.
> >> >
> >> > The reason for this is because the load instructions have a too
> >> > restrictive predicate on them which causes combine not to be able
> >> > to combine the instructions due to the predicate only accepting
> >> > simple
> >> addressing modes.
> >> >
> >> > This patch relaxes the predicate to accept any memory operand and
> >> > relies on LRA to legitimize the address when it needs to as the
> >> > constraint still only allows the simple addressing mode.  Reload is
> >> > always able to legitimize to these.
> >> >
> >> > Secondly since we are now actually generating more ld1r it became
> >> > clear that the lane instructions suffer from a similar issue.
> >> >
> >> > i.e.
> >> >
> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
> >> > float32x4_t dup = vld1q_dup_f32(&in[1]);
> >> > return vfmaq_laneq_f32 (a, a, dup, 1); }
> >> >
> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
> >> >
> >> > The reason for this is similar to the ld1r issue.  The predicate is
> >> > too restrictive in only acception register operands but not memory.
> >> >
> >> > This relaxes it to accept register and/or memory while leaving the
> >> > constraint to only accept registers.  This will have LRA generate a
> >> > reload if needed forcing the memory to registers using the standard
> >> patterns.
> >> >
> >> > These two changes allow combine and reload to generate the right
> >> sequences.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >>
> >> This is going against the general direction of travel, which is to
> >> make the instruction's predicates and conditions enforce the
> >> constraints as much as possible (making optimistic assumptions about
> pseudo registers).
> >>
> >> The RA *can* deal with things like:
> >>
> >>   (match_operand:M N "general_operand" "r")
> >>
> >> but it's best avoided, for a few reasons:
> >>
> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
> >> registers.  This can make the allocation of those temporaries
> >> suboptimal but (more importantly) it might require other
> >> previously-allocated registers to be spilled late due to the
> >> unexpected increase in register pressure.
> >>
> >> (2) It ends up hiding instructions from the pre-RA optimisers.
> >>
> >> (3) It can also prevent combine opportunities (as well as create them),
> >> unless the loose predicates in an insn I are propagated to all
> >> patterns that might result from combining I with something else.
> >>
> >> It sounds like the first problem (not generating ld1r) could be fixed
> >> by (a) combining aarch64_simd_dup and
> *aarch64_simd_ld1r,
> >> so that the register and memory alternatives are in the same pattern
> >> and (b) using the merged instruction(s) to implement the vec_duplicate
> optab.
> >> Target-independent code should then make the address satisfy the
> >> predicate, simplifying the address where necessary.
> >>
> >
> > I think I am likely missing something here. I would assume that you
> > wanted to use the optab to split the addressing off from the mem
> > expression so the combined insn matches.
> >
> > But in that case, why do you need to combine the two instructions?
> > I've tried and it doesn't work since the vec_duplicate optab doesn't
> > see the mem as op1, because in gimple the mem is not part of the
> duplicate.
> >
> > So you still just see:
> >
>  dbgrtx (ops[1].value)
> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
> >
> > As the operand as the argument to the dup is just an SSA_NAME.
> 
> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would come from a
> constructor rather than a vec_duplicate_expr, so we don't get the usua

Re: [PATCH] Update {skylake,icelake,alderlake}_cost to add a bit preference to vector store.

2022-06-09 Thread Hongtao Liu via Gcc-patches
On Wed, Jun 8, 2022 at 11:44 AM Cui, Lili  wrote:
>
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: Monday, June 6, 2022 1:25 PM
> > To: H.J. Lu 
> > Cc: Cui, Lili ; Liu, Hongtao ; 
> > GCC
> > Patches 
> > Subject: Re: [PATCH] Update {skylake,icelake,alderlake}_cost to add a bit
> > preference to vector store.
> > >
> > > Should we add some tests to verify improvements?
> > We can take pr99881.c as a unit test.
> >
> > Ok for the trunk.
> > >
> > > --
> > > H.J.
> >
> Hi hongtao,
>
> 1. I added test case pr105493.c for 525.x264_r. For 538.imagick_r we have
> pr99881.c.
> 2. I changed the dg-final check in pr105638.c due to code generation changes.
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk?
Thanks, committed to trunk.
>
> Thanks,
> Lili.
>
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao