Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing scaling bound

2019-05-08 Thread Richard Biener
On Sun, May 5, 2019 at 8:05 AM bin.cheng  wrote:
>
> Hmm, mis-attached the old version patch.  Here is the updated one.

OK (if still needed)

Richard.

> Thanks,
> bin
>
> --
> Sender:bin.cheng 
> Sent At:2019 May 5 (Sun.) 13:54
> Recipient:Richard Biener 
> Cc:GCC Patches 
> Subject:Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing 
> scaling bound
>
>
> > --
> > Sender:Richard Biener 
> > Sent At:2019 Apr. 29 (Mon.) 20:01
> > Recipient:bin.cheng 
> > Cc:GCC Patches ; mliska 
> > Subject:Re: [PATCH PR90240][RFC]Avoid scaling cost overflow by introducing 
> > scaling bound
> >
> >
> > On Sat, Apr 27, 2019 at 6:13 AM bin.cheng  
> > wrote:
> > >
> > > > Hi,
> > >
> > > This is the draft patch avoiding scaling cost overflow by introducing a 
> > > scaling bound
> > > in IVOPTs.  For now the bound is 20, and the scaling factor will be 
> > > further scaled with
> > > respect to this bound.  For example, scaling factors like 1, 1000 and 
> > > 2000 (max) would be
> > > scaled to 1, 10 and 20 correspondingly.
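> > >
> > > For illustration, a minimal integer-only sketch of the rescaling (the
> > > names mirror the patch; clamping to a minimum of 1 is an assumption,
> > > to match the example above):
> > >
> > >   #include <stdint.h>
> > >
> > >   #define COST_SCALING_FACTOR_BOUND 20
> > >
> > >   /* Map block frequency BFREQ to a factor in
> > >      [1, COST_SCALING_FACTOR_BOUND], where MAX_FREQ is the largest
> > >      block frequency in the loop.  Algebraically this is
> > >      (bfreq / lfreq) * BOUND * (lfreq / max_freq), with no host
> > >      floating point.  */
> > >   static int
> > >   scale_factor (int64_t bfreq, int64_t max_freq)
> > >   {
> > >     int factor = (int) (bfreq * COST_SCALING_FACTOR_BOUND / max_freq);
> > >     return factor > 0 ? factor : 1;
> > >   }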
> > >
> > > Hi Martin, I remember you introduced comp_cost/cost_scaling to improve 
> > > one case
> > > in SPEC2017.  Unfortunately I don't have access to the benchmark now; 
> > > could you help
> > > verify whether this patch causes a performance issue on it?  Thanks.
> > >
> > > Bootstrapped and tested on x86_64; this is for further comment/discussion.
> >
> > + float factor = 1.0f * bfreq / lfreq;
> > + if (adjust_factor_p)
> > +   {
> > + factor *= COST_SCALING_FACTOR_BOUND;
> > + factor = factor * lfreq / max_freq;
> > +   }
> > + body[i]->aux = (void *)(intptr_t)(int) factor;
> >
> > float computations on the host shouldn't be done.
> Fixed.
> >
> > I wonder if changing comp_cost::cost to int64_t would help instead?  See 
> > also my
> > suggestion to use gmp.
> Next patch for PR90078 changes the cost to int64_t.
> >
> > Otherwise the approach looks sane to me - can we then even assert that
> > no overflows
> > happen and thus INFTY only appears if we manually set such cost?
> Next patch for PR90078 changes to assert on cost overflow.
>
> Attachment is the updated patch.  Bootstrap/test on x86_64 along with PR90078 
> patch.
> Is it OK?
>
> Thanks,
> bin
> 2019-05-05  Bin Cheng  
>
> PR tree-optimization/90240
> * tree-ssa-loop-ivopts.c (get_scaled_computation_cost_at): Scale cost
> with respect to scaling factor pre-computed for each basic block.
> (try_improve_iv_set): Return bool if best_cost equals the iv_ca cost.
> (find_optimal_iv_set_1): Free iv_ca set if it has infinite_cost.
> (COST_SCALING_FACTOR_BOUND, determine_scaling_factor): New.
> (tree_ssa_iv_optimize_loop): Call determine_scaling_factor.  Extend
> live range for array of loop's basic blocks.  Cleanup aux field of
> loop's basic blocks.
>
> 2018-05-05  Bin Cheng  
>
> PR tree-optimization/90240
> * gfortran.dg/graphite/pr90240.f: New test.


Re: Fixing ifcvt issue as exposed by BZ89430

2019-05-08 Thread Richard Biener
On Thu, Feb 28, 2019 at 1:26 PM JiangNing OS
 wrote:
>
> To solve BZ89430, the following changes are needed:
>
> (1) The code below in noce_try_cmove_arith needs to be fixed.
>
>   /* ??? We could handle this if we knew that a load from A or B could
>  not trap or fault.  This is also true if we've already loaded
>  from the address along the path from ENTRY.  */
>   else if (may_trap_or_fault_p (a) || may_trap_or_fault_p (b))
> return FALSE;
>
> Finding a dominating memory access can help decide whether the memory 
> access following the conditional move is valid or not (see the sketch 
> below).
> * If there is a dominating memory write to the same memory address in test_bb 
> as the one from x=a, it is always safe.
> * When the dominating memory access to the same memory address in test_bb is 
> a read, a load out of x=a is always safe; a store out of x=a is still safe 
> if the memory is on the stack.
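>
> For illustration, a minimal example the dominating-write rule covers
> (hypothetical code, not from the patch):
>
>   /* The conditional store to *p cannot trap or fault because the same
>      address was already written unconditionally in the dominating
>      block, so ifcvt may turn it into an unconditional store.  */
>   void
>   f (int *p, int a, int cond)
>   {
>     *p = 0;      /* dominating write */
>     if (cond)
>       *p = a;    /* safe to speculate */
>   }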
>
> Besides, there is a bug around rearranging the then_bb and else_bb layout, 
> which needs to be fixed.
>
> (2) The fix from https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02998.html is 
> an overkill: if the write target of the conditional move is a memory 
> access, it exits early.
>
>   if (!set_b && MEM_P (orig_x))
> /* We want to avoid store speculation to avoid cases like
>      if (pthread_mutex_trylock(mutex))
>        ++global_variable;
>Rather than go to much effort here, we rely on the SSA optimizers,
>which do a good enough job these days.  */
> return FALSE;
>
> It looks a bit hard for the compiler to know that the program semantics are 
> related to a mutex and avoid store speculation. Without removing this 
> overkill, the fix mentioned in (1) would not work. Any idea? An alternative 
> solution is to detect the following pattern conservatively,

But it's important to not break this.  Note we do have
--param allow-store-data-races which the user can use to override this.
Note that accesses to the stack can obviously not cause store
speculation if their address didn't escape.  But that's probably what is
referred to by "rely on the SSA optimizers".

>      if (a_function_call(...))
>        ++local_variable;
>
> If the local_variable doesn't have its address taken at all in the current 
> function, it cannot carry pthread mutex lock semantics, and then the 
> following patch would work to solve (1) and pass the last case as described 
> in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430.
>
> Index: gcc/cse.c
> ===
> --- gcc/cse.c   (revision 268256)
> +++ gcc/cse.c   (working copy)
> @@ -540,7 +540,6 @@
> already as part of an already processed extended basic block.  */
>  static sbitmap cse_visited_basic_blocks;
>
> -static bool fixed_base_plus_p (rtx x);
>  static int notreg_cost (rtx, machine_mode, enum rtx_code, int);
>  static int preferable (int, int, int, int);
>  static void new_basic_block (void);
> @@ -606,30 +605,7 @@
>
>  static const struct rtl_hooks cse_rtl_hooks = RTL_HOOKS_INITIALIZER;
>
>
> -/* Nonzero if X has the form (PLUS frame-pointer integer).  */
>
> -static bool
> -fixed_base_plus_p (rtx x)
> -{
> -  switch (GET_CODE (x))
> -{
> -case REG:
> -  if (x == frame_pointer_rtx || x == hard_frame_pointer_rtx)
> -   return true;
> -  if (x == arg_pointer_rtx && fixed_regs[ARG_POINTER_REGNUM])
> -   return true;
> -  return false;
> -
> -case PLUS:
> -  if (!CONST_INT_P (XEXP (x, 1)))
> -   return false;
> -  return fixed_base_plus_p (XEXP (x, 0));
> -
> -default:
> -  return false;
> -}
> -}
> -
>  /* Dump the expressions in the equivalence class indicated by CLASSP.
> This function is used only for debugging.  */
>  DEBUG_FUNCTION void
> Index: gcc/ifcvt.c
> ===
> --- gcc/ifcvt.c (revision 268256)
> +++ gcc/ifcvt.c (working copy)
> @@ -76,6 +76,9 @@
>  /* Whether conditional execution changes were made.  */
>  static int cond_exec_changed_p;
>
> > +/* Bitmap for stack frame pointer definition insns.  */
> +static bitmap bba_sets_sfp;
> +
>  /* Forward references.  */
>  static int count_bb_insns (const_basic_block);
>  static bool cheap_bb_rtx_cost_p (const_basic_block, profile_probability, 
> int);
> @@ -99,6 +102,7 @@
>edge, int);
>  static void noce_emit_move_insn (rtx, rtx);
>  static rtx_insn *block_has_only_trap (basic_block);
> +static void collect_all_fp_insns (void);
>
>
>  /* Count the number of non-jump active insns in BB.  */
>
> @@ -2029,6 +2033,110 @@
>return true;
>  }
>
> > +/* Return true if MEM X is on the stack.  A_INSN contains X, if it exists.  */
> +
> +static bool
> +noce_mem_is_on_stack (rtx_insn *a_insn, const_rtx x)
> +{
> +  df_ref use;
> +
> +  gcc_assert (x);
> +  gcc_assert (MEM_P (x));
> +
> > +  /* Exit early if the base register is a stack register.  */
> +  rtx a = XEXP (x, 0);
> +  if (fixed_base_plus_p(a)

Re: Enable BF16 support (Please ignore my former email)

2019-05-08 Thread Uros Bizjak
On Wed, May 8, 2019 at 5:06 AM Hongtao Liu  wrote:
>
> On Wed, May 8, 2019 at 2:33 AM Uros Bizjak  wrote:
> >
> > On Tue, May 7, 2019 at 8:49 AM Hongtao Liu  wrote:
> >
> > > > > > > > > > > This patch enables support for bfloat16, 
> > > > > > > > > > > which will be in future Cooper Lake.  Please refer to 
> > > > > > > > > > > https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference
> > > > > > > > > > > for more details about BF16.
> > > > > > > > > > >
> > > > > > > > > > > There are 3 instructions for AVX512BF16: VCVTNE2PS2BF16, 
> > > > > > > > > > > VCVTNEPS2BF16 and DPBF16PS instructions, which are Vector 
> > > > > > > > > > > Neural Network Instructions supporting:
> > > > > > > > > > >
> > > > > > > > > > > -   VCVTNE2PS2BF16: Convert Two Packed Single Data to 
> > > > > > > > > > > One Packed BF16 Data.
> > > > > > > > > > > -   VCVTNEPS2BF16: Convert Packed Single Data to 
> > > > > > > > > > > Packed BF16 Data.
> > > > > > > > > > > -   VDPBF16PS: Dot Product of BF16 Pairs Accumulated 
> > > > > > > > > > > into Packed Single Precision.
> > > > > > > > > > >
> > > > > > > > > > > Since only BF16 intrinsics are supported, we treat it as 
> > > > > > > > > > > HI for simplicity.
> > > > > > > > > >
> > > > > > > > > > I think it was a mistake declaring cvtps2ph and cvtph2ps 
> > > > > > > > > > using HImode
> > > > > > > > > > instead of HFmode. Is there a compelling reason not to 
> > > > > > > > > > introduce
> > > > > > > > > > corresponding bf16_format supporting infrastructure and 
> > > > > > > > > > declare these
> > > > > > > > > > intrinsics using half-binary (HBmode ?) mode instead?
> > > > > > > > > >
> > > > > > > > > > Uros.
> > > > > > > > >
> > > > > > > > > Bfloat16 isn't the IEEE standard format, which we want to 
> > > > > > > > > reserve HFmode for.
> > > > > > > >
> > > > > > > > True.
> > > > > > > >
> > > > > > > > > The IEEE 754 standard specifies a binary16 as having the 
> > > > > > > > > following format:
> > > > > > > > > Sign bit: 1 bit
> > > > > > > > > Exponent width: 5 bits
> > > > > > > > > Significand precision: 11 bits (10 explicitly stored)
> > > > > > > > >
> > > > > > > > > Bfloat16 has the following format:
> > > > > > > > > Sign bit: 1 bit
> > > > > > > > > Exponent width: 8 bits
> > > > > > > > > Significand precision: 8 bits (7 explicitly stored), as 
> > > > > > > > > opposed to 24
> > > > > > > > > bits in a classical single-precision floating-point format
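> > > > > > > > >
> > > > > > > > > So a bfloat16 value is just the high half of the binary32
> > > > > > > > > encoding; a minimal round-to-nearest-even conversion sketch
> > > > > > > > > (NaN handling elided for brevity):
> > > > > > > > >
> > > > > > > > >   #include <stdint.h>
> > > > > > > > >   #include <string.h>
> > > > > > > > >
> > > > > > > > >   static uint16_t
> > > > > > > > >   float_to_bf16 (float f)
> > > > > > > > >   {
> > > > > > > > >     uint32_t bits;
> > > > > > > > >     memcpy (&bits, &f, sizeof bits);
> > > > > > > > >     /* Add 0x7fff plus the parity of bit 16 (round to
> > > > > > > > >        nearest even), then keep the top 16 bits.  */
> > > > > > > > >     bits += 0x7fff + ((bits >> 16) & 1);
> > > > > > > > >     return (uint16_t) (bits >> 16);
> > > > > > > > >   }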
> > > > > > > >
> > > > > > > > This is why I proposed to introduce HBmode (and corresponding
> > > > > > > > bfloat16_format) to distinguish between IEEE HFmode and BFmode.
> > > > > > > >
> > > > > > >
> > > > > > > Unless there is BF16 language level support,  HBmode has no 
> > > > > > > advantage
> > > > > > > over HImode.   We can add HBmode when we gain BF16 language 
> > > > > > > support.
> > > > > > >
> > > > > > > --
> > > > > > > H.J.
> > > > > >
> > > > > > Any other comments?  If not, I'll merge this to trunk.
> > > > >
> > > > > It is not a regression, so please no.
> > > >
> > > > Ehm, "regression fix" ...
> > > >
> > > > Uros.
> > >
> > > Update patch.
> >
> > Index: gcc/config/i386/i386-builtins.c
> > ===
> > --- gcc/config/i386/i386-builtins.c(revision 270934)
> > +++ gcc/config/i386/i386-builtins.c(working copy)
> > @@ -1920,6 +1920,7 @@
> >F_VPCLMULQDQ,
> >F_AVX512VNNI,
> >F_AVX512BITALG,
> > +  F_AVX512BF16,
> >F_MAX
> >  };
> >
> > @@ -2064,7 +2065,8 @@
> >{"gfni",F_GFNI,P_ZERO},
> >{"vpclmulqdq", F_VPCLMULQDQ, P_ZERO},
> >{"avx512vnni", F_AVX512VNNI, P_ZERO},
> > -  {"avx512bitalg", F_AVX512BITALG, P_ZERO}
> > +  {"avx512bitalg", F_AVX512BITALG, P_ZERO},
> > +  {"avx512bf16", F_AVX512BF16, P_ZERO}
> >  };
> >
> >  /* This parses the attribute arguments to target in DECL and determines
> >
> > You also need to update cpuinfo.h and cpuinfo.c in libgcc/config/i386
> > with avx512bf16, plus relevant test files.
> >
> > Index: gcc/testsuite/gcc.target/i386/avx-1.c
> > Index: gcc/testsuite/gcc.target/i386/avx-2.c
> >
> > No need to update above two files, sse-*.c changes are enough to cover
> > new functionality.
> >
> > Otherwise LGTM, but please repost updated patch with the ChangeLog
> > entry (please see [1]).
> >
> > [1] https://www.gnu.org/software/gcc/contribute.html#patches
> >
> > Uros.
>
> Update patch:
> 1. Add Changelog.

In future, please add ChangeLog entries as plaintext to the message,
as instructed in [1]. Also, please read the links to GCC coding
conventions and GNU Coding Standards for further information.

[1] https://www.gnu.org/software/gcc/contribute.html#patches

> 2. Update libgcc part.

Please add one blank line after

+2019-05-07  Wei Xiao  
+
+*

in all ChangeLog entries (please follow the form of existing entries
as an example).

+* gcc.target/i386/avx512bf16vl-vdpbf16ps-1.c: New 

[PATCH] Improve tail call analysis and inliner EH clobber through variable life analysis (PR tree-optimization/89060, take 2)

2019-05-08 Thread Jakub Jelinek
On Mon, May 06, 2019 at 04:17:01PM +0200, Richard Biener wrote:
> > > +struct compute_live_vars_data {
> > > +  /* Vector of bitmaps for live vars at the end of basic blocks,
> > > + indexed by bb->index.  ACTIVE[ENTRY_BLOCK] must be empty bitmap,
> > > + ACTIVE[EXIT_BLOCK] is used for STOP_AFTER.  */
> > > +  vec<bitmap_head> active;
> > > +  /* Work bitmap of currently live variables.  */
> > > +  bitmap work;
> > > +  /* Bitmap of interesting variables.  Variables with uids not in this
> > > + bitmap are not tracked.  */
> > > +  bitmap vars;
> > > +};
> 
> How dense are the bitmaps?  Storage requirement will be quadratic
> so eventually using a sparseset for 'vars' and using the sparseset
> index for the bitmaps will improve things?  Or re-using live
> code mapping?

Here is an updated patch that, instead of using a bitmap of vars, uses a
hash_map from DECL_UID to dense indexes used in the rest of the
bitmaps.
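
Roughly, the dense indexes are assigned on first use, e.g. (a sketch of
the pattern; num_live_vars is assumed to be the running counter):

  bool existed;
  unsigned int &v = live_vars->get_or_insert (DECL_UID (var), &existed);
  if (!existed)
    v = num_live_vars++;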

> > > +   tree lhs = gimple_assign_lhs (stmt);
> > > +   if (VAR_P (lhs) && bitmap_bit_p (data->vars, DECL_UID (lhs)))
> > > + bitmap_clear_bit (data->work, DECL_UID (lhs));
> 
> I think this clearing causes PR90348.

This stuff stays as is; it can be changed depending on what we decide for
PR90348.

> > > +  /* If any such variables are found, try harder using variable life
> > > + tracking to see if those variables aren't already out of scope.  */
> > > +  if (local_decls)
> > > +{
> > > +  vec<bitmap_head> live = compute_live_vars (cfun, local_decls, 
> > > call);
> > > +  bool any_live_p = !bitmap_empty_p (&live[EXIT_BLOCK]);
> > > +  destroy_live_vars (live);
> > > +  BITMAP_FREE (local_decls);
> > > +  if (any_live_p)
> > >   return;
> > >  }
> 
> I wonder if it is possible to split analysis and transform here
> some more - the above is called for all preds of EXIT, if we
> first analyze all of them and then compute live once if needed,
> pruning invalid ones and then doing transform this would at least
> make LIVE only computed once.
> 
> It should be also possible to restrict LIVE to the sub-CFG leading to
> the EXIT preds with tail-call candidates as well, no?

And here it lazily computes the analysis the first time we need it in a
function, for all addressable vars in the function, and then uses it for
that and all following tail-call locations, where it just walks the last
bb until after the call stmt.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2019-05-08  Jakub Jelinek  

PR tree-optimization/89060
* tree-ssa-live.h (live_vars_map): New typedef.
(compute_live_vars, live_vars_at_stmt, destroy_live_vars): Declare.
* tree-ssa-live.c: Include gimple-walk.h and cfganal.h.
(struct compute_live_vars_data): New type.
(compute_live_vars_visit, compute_live_vars_1, compute_live_vars,
live_vars_at_stmt, destroy_live_vars): New functions.
* tree-tailcall.c: Include tree-ssa-live.h.
(live_vars, live_vars_vec): New global variables.
(find_tail_calls): Perform variable life analysis before punting.
(tree_optimize_tail_calls_1): Clean up live_vars and live_vars_vec.
* tree-inline.h (struct copy_body_data): Add eh_landing_pad_dest
member.
* tree-inline.c (add_clobbers_to_eh_landing_pad): Remove BB argument.
Perform variable life analysis to select variables that really need
clobbers added.
(copy_edges_for_bb): Don't call add_clobbers_to_eh_landing_pad here,
instead set id->eh_landing_pad_dest and assert it is the same.
(copy_cfg_body): Call it here if id->eh_landing_pad_dest is non-NULL.

* gcc.dg/tree-ssa/pr89060.c: New test.

--- gcc/tree-ssa-live.h.jj  2019-02-08 08:10:16.954053427 +0100
+++ gcc/tree-ssa-live.h 2019-05-07 17:43:00.369174630 +0200
@@ -266,6 +266,11 @@ extern void debug (tree_live_info_d &ref
 extern void debug (tree_live_info_d *ptr);
 extern void dump_live_info (FILE *, tree_live_info_p, int);
 
+typedef hash_map<int_hash <unsigned int, -1U>, unsigned int> live_vars_map;
+extern vec<bitmap_head> compute_live_vars (struct function *, live_vars_map *);
+extern bitmap live_vars_at_stmt (vec<bitmap_head> &, live_vars_map *,
+				 gimple *);
+extern void destroy_live_vars (vec<bitmap_head> &);
 
 /*  Return TRUE if P is marked as a global in LIVE.  */
 
--- gcc/tree-ssa-live.c.jj  2019-02-08 08:10:16.902054295 +0100
+++ gcc/tree-ssa-live.c 2019-05-07 17:45:25.775844167 +0200
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.
 #include "stringpool.h"
 #include "attribs.h"
 #include "optinfo.h"
+#include "gimple-walk.h"
+#include "cfganal.h"
 
 static void verify_live_on_entry (tree_live_info_p);
 
@@ -1194,8 +1196,149 @@ calculate_live_ranges (var_map map, bool
 
   return live;
 }
+
+/* Data structure for compute_live_vars* functions.  */
 
+struct compute_live_vars_data {
+  /* Vector of bitmaps for live vars indices at the end of basic blocks,
+ indexed by bb->index.  ACTIVE[ENTRY_BLOCK] must be empty

[ada, build] Avoid cp -p failures during Ada make install

2019-05-08 Thread Rainer Orth
Prompted by a known make install failure on Linux/x86_64, I decided to
finally rework my ancient patch

http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01669.html

along the lines Mike suggested back then, i.e. use cp && touch -r
instead of cp -p.  This avoids the failures like

for file in rts/*.ad[sb]*; do \
cp -p $file 
/vol/gcc/obj/gcc/gcc-10/vol/gcc/lib/gcc/x86_64-pc-linux-gnu/10.0.0/adainclude; \
done
cp: preserving permissions for 
‘/vol/gcc/obj/gcc/gcc-10/vol/gcc/lib/gcc/x86_64-pc-linux-gnu/10.0.0/adainclude/a-assert.adb’:
 Operation not supported

when installing e.g. to an NFSv3 destination which doesn't support POSIX
ACLs from a local filesystem that does.

The fix turned out to be a bit more involved than I initially thought:

* I've changed INSTALL_DATA_DATE to use GNU make's $(call) function
  since the args are now needed twice.

* The destination argument now needs to include the filename: while cp
  can use a directory as destination, touch -r needs the full filename
  to preserve the timestamps.

* In order for $(notdir) to work which operates textually on its args,
  the loops had to be changed from shell for loops to make $(foreach)
  loops.

* And finally, to avoid exceeding the command line length limit, the
  individual commands have been separated by newlines, a trick learned
  from 
https://stackoverflow.com/questions/7039811/how-do-i-process-extremely-long-lists-of-files-in-a-make-recipe

When first testing this, the make install failed trying to install
standard.ads.h: while this gave just a warning with the old code

cp: cannot stat 'rts/standard.ads.h': No such file or directory

it now fails.  It turns out this is a dangling symlink

$ ls -l gcc/ada/rts/standard.ads.h
lrwxrwxrwx 1 ro gcc 50 May  7 18:13 gcc/ada/rts/standard.ads.h -> 
/vol/gcc/src/hg/trunk/local/gcc/ada/standard.ads.h
$ ls -lL gcc/ada/rts/standard.ads.h
ls: cannot access 'gcc/ada/rts/standard.ads.h': No such file or directory

introduced in r260921.  This is only present on mainline and the gcc-9
branch and the file seems never to have existed in tree, so I'm just
removing the reference.

Tested on x86_64-pc-linux-gnu installing both to a local filesystem and
an NFSv3 filesystem.

Ok for mainline (and the gcc-9 and gcc-8 branches eventually)?

Thanks.
Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2019-05-04  Rainer Orth  

* Makefile.rtl (LIBGNAT_SRCS): Remove standard.ads.h.
* gcc-interface/Makefile.in (INSTALL_DATA_DATE): Replace cp -p
with cp && touch -r.
(NL): Define.
(install-gcc-specs, install-gnatlib): Adapt INSTALL_DATA_DATE
uses.  Switch to $(foreach).  Separate commands by $(NL).

# HG changeset patch
# Parent  53354151a0c54f05832420ddfa64adf8e22f6251
Avoid cp -p failures during Ada make install

diff --git a/gcc/ada/Makefile.rtl b/gcc/ada/Makefile.rtl
--- a/gcc/ada/Makefile.rtl
+++ b/gcc/ada/Makefile.rtl
@@ -2610,7 +2610,7 @@ LIBGNAT_OBJS = adadecode.o adaint.o argv
 #  from ADA_INCLUDE_SRCS.
 
 LIBGNAT_SRCS = $(patsubst %.o,%.c,$(LIBGNAT_OBJS))			\
-  adadecode.h adaint.h env.h gsocket.h raise.h standard.ads.h		\
+  adadecode.h adaint.h env.h gsocket.h raise.h\
   tb-gcc.c libgnarl/thread.c $(EXTRA_LIBGNAT_SRCS)
 
 # memtrack.o is special as not put into libgnat.
diff --git a/gcc/ada/gcc-interface/Makefile.in b/gcc/ada/gcc-interface/Makefile.in
--- a/gcc/ada/gcc-interface/Makefile.in
+++ b/gcc/ada/gcc-interface/Makefile.in
@@ -98,7 +98,7 @@ COMPILER_FLAGS = $(CFLAGS)
 SHELL = @SHELL@
 PWD_COMMAND = $${PWDCMD-pwd}
 # How to copy preserving the date
-INSTALL_DATA_DATE = cp -p
+INSTALL_DATA_DATE = cp $(1) $(2) && touch -r $(1) $(2)
 MAKEINFO = makeinfo
 TEXI2DVI = texi2dvi
 TEXI2PDF = texi2pdf
@@ -502,26 +502,31 @@ gnatlink-re: ../stamp-tools gnatmake-re
 	  true; \
 	fi
 
+# Used as line separator to avoid overflowing command lines.
+define NL
+
+
+endef
+
 install-gcc-specs:
 #	Install all the requested GCC spec files.
 
-	$(foreach f,$(GCC_SPEC_FILES), \
-	$(INSTALL_DATA_DATE) $(srcdir)/ada/$(f) $(DESTDIR)$(libsubdir)/;)
+	$(foreach f, $(GCC_SPEC_FILES), \
+	$(call INSTALL_DATA_DATE, $(srcdir)/ada/$(f), $(libsubdir))/$(notdir $(f));)
 
 install-gnatlib: ../stamp-gnatlib-$(RTSDIR) install-gcc-specs
 	$(RMDIR) $(DESTDIR)$(ADA_RTL_OBJ_DIR)
 	$(RMDIR) $(DESTDIR)$(ADA_INCLUDE_DIR)
 	-$(MKDIR) $(DESTDIR)$(ADA_RTL_OBJ_DIR)
 	-$(MKDIR) $(DESTDIR)$(ADA_INCLUDE_DIR)
-	for file in $(RTSDIR)/*.ali; do \
-	$(INSTALL_DATA_DATE) $$file $(DESTDIR)$(ADA_RTL_OBJ_DIR); \
-	done
+	$(foreach file, $(wildcard $(RTSDIR)/*.ali), \
+	$(call INSTALL_DATA_DATE, $(file), $(DESTDIR)$(ADA_RTL_OBJ_DIR)/$(notdir $(file)))$(NL))
 	-cd $(RTSDIR); for file in *$(arext);do \
 	$(INSTALL_DATA) $$file $(DESTDIR)$(ADA_RTL_OBJ_DIR); \
 	$(RANLIB_FOR_TARGET) $(DESTDIR)$(ADA_RTL_OBJ_DIR)/$$file; \
 	done
 	-$(foreach file, $(EXTRA_ADALIB_OBJS), \

[C++ Patch] One more location fix

2019-05-08 Thread Paolo Carlini

Hi again,

one more straightforward fixlet which remained in my tree for a while. 
Tested x86_64-linux.


Thanks, Paolo.

/

/cp
2019-04-29  Paolo Carlini  

* decl.c (grokvardecl): Use an accurate location in error message
about main as a global variable.

/testsuite
2019-04-29  Paolo Carlini  

* g++.dg/diagnostic/main1.C: New.
Index: cp/decl.c
===
--- cp/decl.c   (revision 270643)
+++ cp/decl.c   (working copy)
@@ -9475,7 +9475,8 @@ grokvardecl (tree type,
   if (DECL_NAME (decl)
   && MAIN_NAME_P (DECL_NAME (decl))
   && scope == global_namespace)
-error ("cannot declare %<::main%> to be a global variable");
+error_at (DECL_SOURCE_LOCATION (decl),
+ "cannot declare %<::main%> to be a global variable");
 
   /* Check that the variable can be safely declared as a concept.
  Note that this also forbids explicit specializations.  */
Index: testsuite/g++.dg/diagnostic/main1.C
===
--- testsuite/g++.dg/diagnostic/main1.C (nonexistent)
+++ testsuite/g++.dg/diagnostic/main1.C (working copy)
@@ -0,0 +1 @@
+int main __attribute__((unused));  // { dg-error "5:cannot declare" }


Re: [ada, build] Avoid cp -p failures during Ada make install

2019-05-08 Thread Arnaud Charlet
> Tested on x86_64-pc-linux-gnu installing both to a local filesystem and
> an NFSv3 filesystem.
> 
> Ok for mainline (and the gcc-9 and gcc-8 branches eventually)?

No, this is not OK.

I'd rather keep the simple current logic and either stick to cp -p, or
use a proper $(INSTALL_whatever) as done elsewhere rather than adding more
kludges.

Also, standard.ads.h is a valid file, so the reference shouldn't be removed.

I'll add it to the repository, this was an oversight, thanks for noticing.

Arno


Re: Enable BF16 support (Please ignore my former email)

2019-05-08 Thread Hongtao Liu
Sorry for the indentation issue, and thanks for your reminder.

On Wed, May 8, 2019 at 3:39 PM Uros Bizjak  wrote:
>
> On Wed, May 8, 2019 at 5:06 AM Hongtao Liu  wrote:
> >
> > On Wed, May 8, 2019 at 2:33 AM Uros Bizjak  wrote:
> > >
> > > On Tue, May 7, 2019 at 8:49 AM Hongtao Liu  wrote:
> > >
> > > > > > > > > > > > This patch enables support for bfloat16, 
> > > > > > > > > > > > which will be in future Cooper Lake.  Please refer to 
> > > > > > > > > > > > https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference
> > > > > > > > > > > > for more details about BF16.
> > > > > > > > > > > >
> > > > > > > > > > > > There are 3 instructions for AVX512BF16: 
> > > > > > > > > > > > VCVTNE2PS2BF16, VCVTNEPS2BF16 and DPBF16PS 
> > > > > > > > > > > > instructions, which are Vector Neural Network 
> > > > > > > > > > > > Instructions supporting:
> > > > > > > > > > > >
> > > > > > > > > > > > -   VCVTNE2PS2BF16: Convert Two Packed Single Data 
> > > > > > > > > > > > to One Packed BF16 Data.
> > > > > > > > > > > > -   VCVTNEPS2BF16: Convert Packed Single Data to 
> > > > > > > > > > > > Packed BF16 Data.
> > > > > > > > > > > > -   VDPBF16PS: Dot Product of BF16 Pairs 
> > > > > > > > > > > > Accumulated into Packed Single Precision.
> > > > > > > > > > > >
> > > > > > > > > > > > Since only BF16 intrinsics are supported, we treat it 
> > > > > > > > > > > > as HI for simplicity.
> > > > > > > > > > >
> > > > > > > > > > > I think it was a mistake declaring cvtps2ph and cvtph2ps 
> > > > > > > > > > > using HImode
> > > > > > > > > > > instead of HFmode. Is there a compelling reason not to 
> > > > > > > > > > > introduce
> > > > > > > > > > > corresponding bf16_format supporting infrastructure and 
> > > > > > > > > > > declare these
> > > > > > > > > > > intrinsics using half-binary (HBmode ?) mode instead?
> > > > > > > > > > >
> > > > > > > > > > > Uros.
> > > > > > > > > >
> > > > > > > > > > Bfloat16 isn't the IEEE standard format, which we want to 
> > > > > > > > > > reserve HFmode for.
> > > > > > > > >
> > > > > > > > > True.
> > > > > > > > >
> > > > > > > > > > The IEEE 754 standard specifies a binary16 as having the 
> > > > > > > > > > following format:
> > > > > > > > > > Sign bit: 1 bit
> > > > > > > > > > Exponent width: 5 bits
> > > > > > > > > > Significand precision: 11 bits (10 explicitly stored)
> > > > > > > > > >
> > > > > > > > > > Bfloat16 has the following format:
> > > > > > > > > > Sign bit: 1 bit
> > > > > > > > > > Exponent width: 8 bits
> > > > > > > > > > Significand precision: 8 bits (7 explicitly stored), as 
> > > > > > > > > > opposed to 24
> > > > > > > > > > bits in a classical single-precision floating-point format
> > > > > > > > >
> > > > > > > > > This is why I proposed to introduce HBmode (and corresponding
> > > > > > > > > bfloat16_format) to distinguish between IEEE HFmode and BFmode.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Unless there is BF16 language level support,  HBmode has no 
> > > > > > > > advantage
> > > > > > > > over HImode.   We can add HBmode when we gain BF16 language 
> > > > > > > > support.
> > > > > > > >
> > > > > > > > --
> > > > > > > > H.J.
> > > > > > >
> > > > > > > Any other comments?  If not, I'll merge this to trunk.
> > > > > >
> > > > > > It is not a regression, so please no.
> > > > >
> > > > > Ehm, "regression fix" ...
> > > > >
> > > > > Uros.
> > > >
> > > > Update patch.
> > >
> > > Index: gcc/config/i386/i386-builtins.c
> > > ===
> > > --- gcc/config/i386/i386-builtins.c(revision 270934)
> > > +++ gcc/config/i386/i386-builtins.c(working copy)
> > > @@ -1920,6 +1920,7 @@
> > >F_VPCLMULQDQ,
> > >F_AVX512VNNI,
> > >F_AVX512BITALG,
> > > +  F_AVX512BF16,
> > >F_MAX
> > >  };
> > >
> > > @@ -2064,7 +2065,8 @@
> > >{"gfni",F_GFNI,P_ZERO},
> > >{"vpclmulqdq", F_VPCLMULQDQ, P_ZERO},
> > >{"avx512vnni", F_AVX512VNNI, P_ZERO},
> > > -  {"avx512bitalg", F_AVX512BITALG, P_ZERO}
> > > +  {"avx512bitalg", F_AVX512BITALG, P_ZERO},
> > > +  {"avx512bf16", F_AVX512BF16, P_ZERO}
> > >  };
> > >
> > >  /* This parses the attribute arguments to target in DECL and determines
> > >
> > > You also need to update cpuinfo.h and cpuinfo.c in libgcc/config/i386
> > > with avx512bf16, plus relevant test files.
> > >
> > > Index: gcc/testsuite/gcc.target/i386/avx-1.c
> > > Index: gcc/testsuite/gcc.target/i386/avx-2.c
> > >
> > > No need to update above two files, sse-*.c changes are enough to cover
> > > new functionality.
> > >
> > > Otherwise LGTM, but please repost updated patch with the ChangeLog
> > > entry (please see [1]).
> > >
> > > [1] https://www.gnu.org/software/gcc/contribute.html#patches
> > >
> > > Uros.
> >
> > Update patch:
> > 1. Add Changelog.
>
> In future, please add ChangeLog entries as plaintext to the message,
> as instructed in [1]. Also, please read the lin

Re: [ada, build] Avoid cp -p failures during Ada make install

2019-05-08 Thread Arnaud Charlet
> Also, standard.ads.h is a valid file, so the reference shouldn't be removed.
> 
> I'll add it to the repository, this was an oversight, thanks for noticing.

I've added it now.

2019-05-08  Arnaud Charlet  

* standard.ads.h: New file.




[committed] Address compiler diagnostics in libgomp.oacc-c-c++-common/pr87835.c (was: [committed][nvptx, libgomp] Fix map_push)

2019-05-08 Thread Thomas Schwinge
Hi!

On Wed, 23 Jan 2019 09:19:33 +0100, Tom de Vries  wrote:
> The map field of a struct ptx_stream is [...]

> The current implementation gets at least the first and most basic scenario 
> wrong:
> [...]

> This problem causes the test-case asyncwait-1.c to fail intermittently on some
> systems.  The pr87835.c test-case added here is a a minimized and modified
> version of asyncwait-1.c (avoiding the kernel construct) that is more likely 
> to
> fail.

Indeed, with one OpenACC directive fixed (see below), I've been able to
reliably reproduce the failure, too, for all optimization levels I tried.

> Fix this by rewriting map_pop more robustly, by: [...]

Thanks, belatedly.

Regarding the test case:

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c
> @@ -0,0 +1,62 @@
> +/* { dg-do run { target openacc_nvidia_accel_selected } } */
> +/* { dg-additional-options "-lcuda" } */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include "cuda.h"
> +
> +#include <openacc.h>
> +
> +#define n 128
> +
> +int
> +main (void)
> +{
> +  CUresult r;
> +  CUstream stream1;
> +  int N = n;
> +  int a[n];
> +  int b[n];

source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c:19:7: 
warning: unused variable 'b' [-Wunused-variable]
   19 |   int b[n];
  |   ^

> +  int c[n];
> +
> +  acc_init (acc_device_nvidia);
> +
> +  r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
> +  if (r != CUDA_SUCCESS)
> +{
> +  fprintf (stderr, "cuStreamCreate failed: %d\n", r);
> +  abort ();
> +}
> +
> +  acc_set_cuda_stream (1, stream1);
> +
> +  for (int i = 0; i < n; i++)
> +{
> +  a[i] = 3;
> +  c[i] = 0;
> +}
> +
> +#pragma acc data copy (a, b, c) copyin (N)
> +  {
> +#pragma acc parallel async (1)
> +;
> +
> +#pragma acc parallel async (1) num_gangs (320)
> +#pragma loop gang

source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c:45: 
warning: ignoring #pragma loop gang [-Wunknown-pragmas]
   45 | #pragma loop gang
  | 

> +for (int ii = 0; ii < N; ii++)
> +  c[ii] = (a[ii] + a[N - ii - 1]);
> +
> +#pragma acc parallel async (1)
> +#pragma acc loop seq
> +for (int ii = 0; ii < n; ii++)
> +  a[ii] = 6;
> +
> +#pragma acc wait (1)
> +  }
> +
> +  for (int i = 0; i < n; i++)
> +if (c[i] != 6)
> +  abort ();
> +
> +  return 0;
> +}

Addressed on trunk in r271004, and on gcc-9-branch in r271005, see
attached.


Grüße
 Thomas


From 253ef38b3c248b69e8ab493b19b1585f291c9843 Mon Sep 17 00:00:00 2001
From: tschwinge 
Date: Wed, 8 May 2019 10:01:30 +
Subject: [PATCH] Address compiler diagnostics in
 libgomp.oacc-c-c++-common/pr87835.c

source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c: In function 'main':
source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c:45: warning: ignoring #pragma loop gang [-Wunknown-pragmas]
   45 | #pragma loop gang
  |
source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c:19:7: warning: unused variable 'b' [-Wunused-variable]
   19 |   int b[n];
  |   ^

	libgomp/
	PR target/87835
	* testsuite/libgomp.oacc-c-c++-common/pr87835.c: Update.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@271004 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog | 5 +
 libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c | 5 ++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/libgomp/ChangeLog b/libgomp/ChangeLog
index 64e0a8ad8df..a8ce3c241fc 100644
--- a/libgomp/ChangeLog
+++ b/libgomp/ChangeLog
@@ -1,3 +1,8 @@
+2019-05-07  Thomas Schwinge  
+
+	PR target/87835
+	* testsuite/libgomp.oacc-c-c++-common/pr87835.c: Update.
+
 2019-05-06  Thomas Schwinge  
 
 	* oacc-parallel.c: Add comments to legacy entry points (GCC 5).
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c
index 310a485e74f..88c2c7763cc 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c
@@ -16,7 +16,6 @@ main (void)
   CUstream stream1;
   int N = n;
   int a[n];
-  int b[n];
   int c[n];
 
   acc_init (acc_device_nvidia);
@@ -36,13 +35,13 @@ main (void)
   c[i] = 0;
 }
 
-#pragma acc data copy (a, b, c) copyin (N)
+#pragma acc data copy (a, c) copyin (N)
   {
 #pragma acc parallel async (1)
 ;
 
 #pragma acc parallel async (1) num_gangs (320)
-#pragma loop gang
+#pragma acc loop gang
 for (int ii = 0; ii < N; ii++)
   c[ii] = (a[ii] + a[N - ii - 1]);
 
-- 
2.17.1

From 9f852e24d6d75f00ccca80acb5a6804912a33282 Mon Sep 17 00:00:00 2001
From: tschwinge 
Date: Wed, 8 May 2019 10:03:04 +
Subject: [PATCH] Address compiler diagnostics in
 libgomp.oacc-c-c++-common/pr87835.c

source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr87835.c: In function 'main':
source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/pr8

[PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread JunMa

Hi

As PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106) describes,
when GCC meets a builtin function call like:

  y = sqrt (x);

the cdce pass tries to transform the call into an internal function
call, conditionally executing the original call after a simple range
check on the arguments that detects the majority of cases in which
errno does not need to be set.  It looks like:

  y = IFN_SQRT (x);
  if (__builtin_isless (x, 0))
    sqrt (x);

However, if the call is in tail position, for example:

  y =  sqrt (x);
  return y;

will become:

  y = IFN_SQRT (x);
  if (__builtin_isless (x, 0))
    sqrt (x);
  return y;

This transformation breaks the tailcall pattern and prevents
later tailcall optimizations.

So this patch transforms a builtin call with a return value into an
if-then-else form, which looks like:

  y =  sqrt (x);
   ==>
  if (__builtin_isless (x, 0))
    y = sqrt (x);
  else
    y = IFN_SQRT (x);

BTW, y = sqrt (x) could also be transformed like:

  y = IFN_SQRT (x);
  if (__builtin_isless (x, 0))
    y = sqrt (x);

We don't choose this pattern because it emits worse assembly
code (more move instructions and more register usage) on x86_64.
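
For instance, in a minimal example (not from the patch) like

  #include <math.h>

  double
  f (double x)
  {
    return sqrt (x);
  }

compiled at -O2 -fmath-errno, the slow path can now end in a sibling
call ("jmp sqrt") instead of call/ret, which is what the adjusted
cdce1.c/cdce2.c testcases check for pow and log.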

Bootstrapped/regtested on x86_64-linux, ok for trunk?

Regards
JunMa


gcc/ChangeLog

2019-05-07  Jun Ma 

    PR tree-optimization/90106
    * tree-call-cdce.c (shrink_wrap_one_built_in_call_with_conds): Add
    new parameter as new internal function call, also move it to new
    basic block.
    (use_internal_fn): Pass internal function call to
    shrink_wrap_one_built_in_call_with_conds.

gcc/testsuite/ChangeLog

2019-05-07  Jun Ma 

    PR tree-optimization/90106
    * gcc.dg/cdce1.c: Check tailcall code generation after cdce pass.
    * gcc.dg/cdce2.c: Likewise.

---
 gcc/testsuite/gcc.dg/cdce1.c |  3 +-
 gcc/testsuite/gcc.dg/cdce2.c |  3 +-
 gcc/tree-call-cdce.c | 90 +---
 3 files changed, 71 insertions(+), 25 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/cdce1.c b/gcc/testsuite/gcc.dg/cdce1.c
index b23ad63..424d80f 100644
--- a/gcc/testsuite/gcc.dg/cdce1.c
+++ b/gcc/testsuite/gcc.dg/cdce1.c
@@ -1,7 +1,8 @@
 /* { dg-do  run  } */
 /* { dg-options "-O2 -fmath-errno -fdump-tree-cdce-details  -lm" } */
 /* { dg-require-effective-target int32plus } */
-/* { dg-final { scan-tree-dump  "cdce1.c:16: .* function call is 
shrink-wrapped into error conditions\."  "cdce" } } */
+/* { dg-final { scan-tree-dump  "cdce1.c:17: .* function call is 
shrink-wrapped into error conditions\."  "cdce" } } */
+/* { dg-final { scan-assembler "jmp pow" } } */
 /* { dg-require-effective-target large_double } */
 
 #include <stdlib.h>
diff --git a/gcc/testsuite/gcc.dg/cdce2.c b/gcc/testsuite/gcc.dg/cdce2.c
index 30e7cb1..2af2893 100644
--- a/gcc/testsuite/gcc.dg/cdce2.c
+++ b/gcc/testsuite/gcc.dg/cdce2.c
@@ -1,7 +1,8 @@
 /* { dg-do  run  } */
 /* { dg-skip-if "doubles are floats" { "avr-*-*" } } */
 /* { dg-options "-O2 -fmath-errno -fdump-tree-cdce-details  -lm" } */
-/* { dg-final { scan-tree-dump  "cdce2.c:15: .* function call is 
shrink-wrapped into error conditions\." "cdce" } } */
+/* { dg-final { scan-tree-dump  "cdce2.c:16: .* function call is 
shrink-wrapped into error conditions\." "cdce" } } */
+/* { dg-final { scan-assembler "jmp log" } } */
  
 #include <stdlib.h>
 #include <math.h>
diff --git a/gcc/tree-call-cdce.c b/gcc/tree-call-cdce.c
index 2e482b3..9e3372f 100644
--- a/gcc/tree-call-cdce.c
+++ b/gcc/tree-call-cdce.c
@@ -93,10 +93,10 @@ along with GCC; see the file COPYING3.  If not see
 
y = sqrt (x);
  ==>
-   y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
-   sqrt (x);
-
+ y =  sqrt (x);
+   else
+ y = IFN_SQRT (x);
  In the vast majority of cases we should then never need to call sqrt.
 
Note that library functions are not supposed to clear errno to zero without
@@ -793,14 +793,16 @@ gen_shrink_wrap_conditions (gcall *bi_call, vec<gimple *> 
conds,
 }
 
 /* Shrink-wrap BI_CALL so that it is only called when one of the NCONDS
-   conditions in CONDS is false.  */
+   conditions in CONDS is false.  Also move BI_NEWCALL to a new basic block
+   when it is non-null; it is called when all of the CONDS are true.  */
 
 static void
-shrink_wrap_one_built_in_call_with_conds (gcall *bi_call, vec<gimple *> conds,
- unsigned int nconds)
+ unsigned int nconds,
+ gcall *bi_newcall = NULL)
 {
   gimple_stmt_iterator bi_call_bsi;
-  basic_block bi_call_bb, join_tgt_bb, guard_bb;
+  basic_block bi_call_bb, bi_newcall_bb, join_tgt_bb, guard_bb;
   edge join_tgt_in_edge_from_call, join_tgt_in_edge_fall_thru;
   edge bi_call_in_edge0, guard_bb_in_edge;
   unsigned tn_cond_stmts;
@@ -809,27 +811,26 @@ shrink_wrap_one_built_in_call_with_conds (gcall *bi_call, 
vec<gimple *> conds,
   gimple *cond_expr_start;
 
   /* The cfg we want to create looks like this:
-
-  [guard n-1] <- guard_bb (old block)
-|\
-   

[PATCH, d]: Fix PR90261, FAIL: libphobos.phobos/std/file.d on CentOS 5.11, Linux 2.6.18

2019-05-08 Thread Uros Bizjak
Hello!

CentOS 5.11 (glibc 2.5) does not have the utimensat function, so there is
no nanosecond precision of file times available. Currently, the test
fails with:

/tmp/cc36u3o7.o: In function
`_D3std4file17__T8setTimesTAyaZ8setTimesFAyaS3std8datetime7systime7SysTimeS3std8datetime7systime7SysTimeZ16trustedUtimensatFNbNiNeiPxaKxG2S4core3sys5posix6signal8timespeciZi':
/home/uros/git/gcc/libphobos/testsuite/../src/std/file.d:1272:
undefined reference to `utimensat'
collect2: error: ld returned 1 exit status
compiler exited with status 1

The attached patch detects the utimensat function at configure time and
falls back to utimes in case utimensat is not available.
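
For reference, a minimal C sketch of the fallback path (hypothetical
helper; error handling and UTIME_NOW/UTIME_OMIT elided, and utimes only
has microsecond resolution):

  #include <sys/time.h>
  #include <time.h>

  static int
  set_times_fallback (const char *path, const struct timespec ts[2])
  {
    struct timeval tv[2];
    tv[0].tv_sec = ts[0].tv_sec;
    tv[0].tv_usec = ts[0].tv_nsec / 1000;  /* access time */
    tv[1].tv_sec = ts[1].tv_sec;
    tv[1].tv_usec = ts[1].tv_nsec / 1000;  /* modification time */
    return utimes (path, tv);
  }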

2019-05-08  Uroš Bizjak  

PR d/90261
* m4/druntime/libraries.m4 (DRUNTIME_LIBRARIES_CLIB):
Check for utimensat function.
* configure: Regenerate.
* Makefile.in: Regenerate.
* libdruntime/gcc/config.d.in: Add Have_Utimensat.
* libdruntime/Makefile.in: Regenerate.
* libdruntime/core/sys/posix/sys/stat.d [version (CRuntime_Glibc)]:
Declare utimensat and futimens only with Have_Utimensat.
* src/Makefile.in: Regenerate.
* src/std/file.d: Call testTimes with non-zero argument only
when utimensat is defined.
* testsuite/Makefile.in: Regenerate.

BTW: The same fix as applied to CRuntime_Glibc can also be applied to
FreeBSD version, which currently reads:

// Since FreeBSD 11:
version (none)
{
int utimensat(int dirfd, const char *pathname,
ref const(timespec)[2] times, int flags);
int futimens(int fd, ref const(timespec)[2] times);
}

BTW2: The testcase now fails in another place in src/std/file.d on
CentOS 5.11 (and probably other non-modern systems):

// Tests sub-second precision of querying file times.
// Should pass on most modern systems running on modern filesystems.
// Exceptions:
// - FreeBSD, where one would need to first set the
//   vfs.timestamp_precision sysctl to a value greater than zero.
// - OS X, where the native filesystem (HFS+) stores filesystem
//   timestamps with 1-second precision.

This test should check the availability of utimensat on Linux;
otherwise the resolution is only in the seconds range.

Patch was bootstrapped and regression tested on x86_64-linux-gnu
{,-m32} with CentOS 5.11 and Fedora 30.

Uros.
diff --git a/libphobos/Makefile.in b/libphobos/Makefile.in
index 58368c92b492..de5c7d04fee3 100644
--- a/libphobos/Makefile.in
+++ b/libphobos/Makefile.in
@@ -215,6 +215,7 @@ DCFG_HAVE_64BIT_ATOMICS = @DCFG_HAVE_64BIT_ATOMICS@
 DCFG_HAVE_ATOMIC_BUILTINS = @DCFG_HAVE_ATOMIC_BUILTINS@
 DCFG_HAVE_LIBATOMIC = @DCFG_HAVE_LIBATOMIC@
 DCFG_HAVE_QSORT_R = @DCFG_HAVE_QSORT_R@
+DCFG_HAVE_UTIMENSAT = @DCFG_HAVE_UTIMENSAT@
 DCFG_MINFO_BRACKETING = @DCFG_MINFO_BRACKETING@
 DCFG_THREAD_MODEL = @DCFG_THREAD_MODEL@
 DEFS = @DEFS@
diff --git a/libphobos/configure b/libphobos/configure
index 95a2b4232187..a33debdd97b0 100755
--- a/libphobos/configure
+++ b/libphobos/configure
@@ -651,6 +651,7 @@ LIBATOMIC
 DCFG_HAVE_LIBATOMIC
 DCFG_HAVE_64BIT_ATOMICS
 DCFG_HAVE_ATOMIC_BUILTINS
+DCFG_HAVE_UTIMENSAT
 DCFG_HAVE_QSORT_R
 OS_LINK_SPEC
 DCFG_DLPI_TLS_MODID
@@ -11635,7 +11636,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11638 "configure"
+#line 11639 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11741,7 +11742,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11744 "configure"
+#line 11745 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -14393,6 +14394,13 @@ if test "x$ac_cv_func_qsort_r" = xyes; then :
 fi
 
 
+  DCFG_HAVE_UTIMENSAT=false
+  ac_fn_c_check_func "$LINENO" "utimensat" "ac_cv_func_utimensat"
+if test "x$ac_cv_func_utimensat" = xyes; then :
+  DCFG_HAVE_UTIMENSAT=true
+fi
+
+
   ac_ext=c
 ac_cpp='$CPP $CPPFLAGS'
 ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
diff --git a/libphobos/libdruntime/Makefile.in 
b/libphobos/libdruntime/Makefile.in
index 19ee94fc370d..bdcc1979046f 100644
--- a/libphobos/libdruntime/Makefile.in
+++ b/libphobos/libdruntime/Makefile.in
@@ -547,6 +547,7 @@ DCFG_HAVE_64BIT_ATOMICS = @DCFG_HAVE_64BIT_ATOMICS@
 DCFG_HAVE_ATOMIC_BUILTINS = @DCFG_HAVE_ATOMIC_BUILTINS@
 DCFG_HAVE_LIBATOMIC = @DCFG_HAVE_LIBATOMIC@
 DCFG_HAVE_QSORT_R = @DCFG_HAVE_QSORT_R@
+DCFG_HAVE_UTIMENSAT = @DCFG_HAVE_UTIMENSAT@
 DCFG_MINFO_BRACKETING = @DCFG_MINFO_BRACKETING@
 DCFG_THREAD_MODEL = @DCFG_THREAD_MODEL@
 DEFS = @DEFS@
diff --git a/libphobos/libdruntime/core/sys/posix/sys/stat.d 
b/libphobos/libdruntime/core/sys/posix/sys/stat.d
index 76e4460550df..9161912b8cb9 100644
--- a/libphobos/libdruntime/core/sys/posix/sys/stat.d
+++ b/libphobos/libdruntime/core/sys/posix/sys/stat.d
@@ -975,9 +975,14 @@ version (CRuntime_Glibc)
 enum UTIME_NOW = 0x3fff;
 enum UTIME_OMIT = 0x3ffe;
 
-int utimensat(int dirfd, const char *pathname,
-re

Re: V6 [PATCH] Optimize vector constructor

2019-05-08 Thread Richard Biener
On Fri, May 3, 2019 at 6:54 PM H.J. Lu  wrote:
>
> On Thu, May 2, 2019 at 10:53 AM H.J. Lu  wrote:
> >
> > On Thu, May 2, 2019 at 7:55 AM Richard Biener
> >  wrote:
> > >
> > > On Thu, May 2, 2019 at 4:54 PM Richard Biener
> > >  wrote:
> > > >
> > > > On Mon, Mar 11, 2019 at 8:03 AM H.J. Lu  wrote:
> > > > >
> > > > > On Fri, Mar 8, 2019 at 7:03 PM Richard Biener
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Mar 8, 2019 at 9:49 AM H.J. Lu  wrote:
> > > > > > >
> > > > > > > On Thu, Mar 7, 2019 at 9:51 AM H.J. Lu  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Mar 6, 2019 at 8:33 PM Richard Biener
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu  
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew 
> > > > > > > > > > > > > Pinski wrote:
> > > > > > > > > > > > > > )
> > > > > > > > > > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For vector init constructor:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > typedef float __v4sf __attribute__ 
> > > > > > > > > > > > > > > ((__vector_size__ (16)));
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > __v4sf
> > > > > > > > > > > > > > > foo (__v4sf x, float f)
> > > > > > > > > > > > > > > {
> > > > > > > > > > > > > > >   __v4sf y = { f, x[1], x[2], x[3] };
> > > > > > > > > > > > > > >   return y;
> > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > we can optimize vector init constructor with 
> > > > > > > > > > > > > > > vector copy or permute
> > > > > > > > > > > > > > > followed by a single scalar insert:
> > > > > > > > > > >
> > > > > > > > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR 
> > > > > > > > > > > > here.  The easiest way
> > > > > > > > > > > > is to emit a new stmt for _2 = copy ...; and do the 
> > > > > > > > > > > > set_rhs with the
> > > > > > > > > > > > BIT_INSERT_EXPR.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for BIT_INSERT_EXPR suggestion.  I am testing this 
> > > > > > > > > > > patch.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > H.J.
> > > > > > > > > > > ---
> > > > > > > > > > > We can optimize vector constructor with vector copy or 
> > > > > > > > > > > permute followed
> > > > > > > > > > > by a single scalar insert:
> > > > > > > > > > >
> > > > > > > > > > >   __v4sf y;
> > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > >   float _1;
> > > > > > > > > > >   float _2;
> > > > > > > > > > >   float _3;
> > > > > > > > > > >
> > > > > > > > > > >   <bb 2> :
> > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > >   y_6 = {f_5(D), _3, _2, _1};
> > > > > > > > > > >   return y_6;
> > > > > > > > > > >
> > > > > > > > > > > with
> > > > > > > > > > >
> > > > > > > > > > >  __v4sf y;
> > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > >   float _1;
> > > > > > > > > > >   float _2;
> > > > > > > > > > >   float _3;
> > > > > > > > > > >   vector(4) float _8;
> > > > > > > > > > >
> > > > > > > > > > >   <bb 2> :
> > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > >   _8 = x_9(D);
> > > > > > > > > > >   y_6 = BIT_INSERT_EXPR <_8, f_5(D), 0>;
> > > > > > > > > > >   return y_6;
> > > > > > > > > > >
> > > > > > > > > > > gcc/
> > > > > > > > > > >
> > > > > > > > > > > PR tree-optimization/88828
> > > > > > > > > > > * tree-ssa-forwprop.c 
> > > > > > > > > > > (simplify_vector_constructor): Optimize
> > > > > > > > > > > vector init constructor with vector copy or 
> > > > > > > > > > > permute followed
> > > > > > > > > > > by a single scalar insert.
> > > > > > > > > > >
> > > > > > > > > > > gcc/testsuite/
> > > > > > > > > > >
> > > > > > > > > > > PR tree-optimization/88828
> > > > > > > > > > > * gcc.target/i386/pr88828-1a.c: New test.
> > > > > > > > > > > * gcc.target/i386/pr88828-2b.c: Likewise.
> > > > > > > > > > > * gcc.target/i386/pr88828-2.c: Likewise.
> > > > > > > > > > > * gcc.target/i386/pr88828-3a.c: Likewise.
> > > > > > > > > > > * gcc.target/i386/pr88828-3b.c: Likewise.
> > > > > > > > > > > * gcc.target/i386/pr88828-3c.c: Likewise.
> > > > > > > > > > > * gcc.target/i386/pr88828-3d.c:

Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Richard Biener
On Tue, 7 May 2019, Andrew Pinski wrote:

> On Mon, May 6, 2019 at 7:24 AM Jiufu Guo  wrote:
> >
> > Hi,
> >
> > This patch implements the optimization in PR77820.  The optimization
> > eliminates the phi and the phi's basic block, if the phi is used only by
> > a conditional branch and the phi's incoming values are the results of
> > CMP statements.
> >
> > This optimization eliminates:
> > 1. saving the CMP result: p0 = a CMP b.
> > 2. the additional CMP on the branch: if (phi != 0).
> > 3. the conversion of the CMP result: phi = (INT_CONV) p0, if there is one.
> >
> >   <P0>
> >   p0 = a CMP b
> >   goto <X>;
> >
> >   <P1>
> >   p1 = c CMP d
> >   goto <X>;
> >
> >   <X>
> >   # phi = PHI <p0 (P0), p1 (P1)>
> >   if (phi != 0) goto <Y>; else goto <Z>;
> >
> > Transform to:
> >
> >   <P0>
> >   p0 = a CMP b
> >   if (p0 != 0) goto <Y>; else goto <Z>;
> >
> >   <P1>
> >   p1 = c CMP d
> >   if (p1 != 0) goto <Y>; else goto <Z>;
> >
> > Bootstrapped and tested on powerpc64le with no regressions, and new
> > testcases were added. Is this ok for trunk?
> 
> forwprop was created originally to do something similar, but this case is a
> specific case of backwards prop (almost).
> I wonder if it could be combined with that or, as Richard mentioned,
> jump threading.

The awkward part is that it changes the CFG so it doesn't fit forwprop
well.  I thought of ifcombine which kind-of does a reverse transform
but in the end it is really duplicating the block to get rid of the PHI and
then applying forwprop.

Richard.

> Thanks,
> Andrew Pinski
> 
> >
> > Thanks!
> >
> > [gcc]
> > 2019-05-06  Jiufu Guo  
> > Lijia He  
> >
> > PR tree-optimization/77820
> > * tree-ssa-mergephicmp.c: New file.
> > * Makefile.in (OBJS): Add tree-ssa-mergephicmp.o.
> > * common.opt (ftree-mergephicmp): New flag.
> > * passes.def (pass_mergephicmp): New pass.
> > * timevar.def (TV_TREE_MERGEPHICMP): New timevar.
> > * tree-pass.h: New file.
> >
> > [gcc/testsuite]
> > 2019-05-06  Jiufu Guo  
> > Lijia He  
> >
> > PR tree-optimization/77820
> > * gcc.dg/tree-ssa/phi_on_compare-1.c: New testcase.
> > * gcc.dg/tree-ssa/phi_on_compare-2.c: New testcase.
> >
> >
> > ---
> >  gcc/Makefile.in  |   1 +
> >  gcc/common.opt   |   4 +
> >  gcc/passes.def   |   1 +
> >  gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c |  31 +++
> >  gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-2.c |  31 +++
> >  gcc/timevar.def  |   1 +
> >  gcc/tree-pass.h  |   1 +
> >  gcc/tree-ssa-mergephicmp.c   | 260 
> > +++
> >  8 files changed, 330 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-2.c
> >  create mode 100644 gcc/tree-ssa-mergephicmp.c
> >
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index d186d71..9729501 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1567,6 +1567,7 @@ OBJS = \
> > tree-ssa-reassoc.o \
> > tree-ssa-sccvn.o \
> > tree-ssa-scopedtables.o \
> > +   tree-ssa-mergephicmp.o \
> > tree-ssa-sink.o \
> > tree-ssa-strlen.o \
> > tree-ssa-structalias.o \
> > diff --git a/gcc/common.opt b/gcc/common.opt
> > index d342c4f..5ea5ed2 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -2702,6 +2702,10 @@ ftree-salias
> >  Common Ignore
> >  Does nothing.  Preserved for backward compatibility.
> >
> > +ftree-mergephicmp
> > +Common Report Var(flag_mergephicmp) Init(1) Optimization
> > +Enable optimization on branch phi compare on trees.
> > +
> >  ftree-sink
> >  Common Report Var(flag_tree_sink) Optimization
> >  Enable SSA code sinking on trees.
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index 446a7c4..e3a3913 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -249,6 +249,7 @@ along with GCC; see the file COPYING3.  If not see
> >NEXT_PASS (pass_lim);
> >NEXT_PASS (pass_walloca, false);
> >NEXT_PASS (pass_pre);
> > +  NEXT_PASS (pass_mergephicmp);
> >NEXT_PASS (pass_sink_code);
> >NEXT_PASS (pass_sancov);
> >NEXT_PASS (pass_asan);
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
> > new file mode 100644
> > index 000..2e3f4f6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi_on_compare-1.c
> > @@ -0,0 +1,31 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-mergephicmp" } */
> > +
> > +void g (void);
> > +void g1 (void);
> > +
> > +void
> > +f (long a, long b, long c, long d, int x)
> > +{
> > +  _Bool t;
> > +  if (x)
> > +{
> > +  t = a < b;
> > +}
> > +  else if (d == a + b)
> > +{
> > +  t = c < d;
> > +}
> > +  else
> > +{
> > +  t = a == c;
> > +

Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Richard Biener
On Wed, 8 May 2019, Jiufu Guo wrote:

> Hi,
> 
> Thanks Richard, Segher, Andrew and all.
> 
> Segher Boessenkool  writes:
> 
> > Let me try to answer some of this...
> >
> > On Tue, May 07, 2019 at 03:31:27PM +0200, Richard Biener wrote:
> >> On Mon, 6 May 2019, Jiufu Guo wrote:
> >> > This patch implements the optimization in PR77820.  The optimization
> >> > eliminates phi and phi's basic block, if the phi is used only by
> >> > condition branch, and the phi's incoming value in the result of a
> >> > CMP result.
> >
> >> I'm not sure I like a new pass here ;)  The transform is basically
> >> tail-duplicating the PHI block because the exit conditional can
> >> be "simplfied" - that's something jump threading generally does
> >> though it relies on "simplified" being a little bit more simplified
> >> than above.
> >
> Adding a new pass is just because it may be relatively easy to extend
> and maintain. 
> 
> Adding this micro-optimization to other passes is also a good
> suggestion. 
> 
> - Similar to jump threading, this new transform replaces the jump's
> old destination with a new destination. While for this case, the new
> destination cannot be determined at compile time. 
> 
> - And as Andrew said: forwprop/backprop are similar, but this case is the
> opposite: it spreads common code (phi+gcond) into different preds.
> 
> - And there is the phiopt pass which focuses on making 'phi' stmts better. 
> It also does some similar things:
>  bb0:
>    if (a != b) goto bb2; else goto bb1;
>  bb1:
>  bb2:
>    x = PHI <a (bb1), b (bb0), ...>;
> 
>    transform to:
> 
>  bb0:
>  bb2:
>    x = PHI <b (bb0), ...>;
> 
> If I avoid adding a new pass, any suggestions about which existing pass
> (jumpthreading/forwprop/phiopt/...) would be more suitable to extend to
> support this new transform? 

The main thing the transform does is tail-duplicate the PHI block,
if we'd just do that followup transforms would do the rest.

So my suggestion would be to look at passes doing exactly that,
which would be tracer and path splitting (but the latter is just run for
loops).  If you want to fit it into jump threading then it would
be the backwards threader, since the heuristics would look at
opportunities to simplify the conditional, see it fed by a PHI,
and that in turn by (two) conditional(s).

You do not put any constraints on the block relation of the
conditional generators, for example path splitting looks
for diamonds, duplicating the merge block.

It might be that extending the backwards threader with the
pattern matching is easiest (this one also runs quite a few times).

> > Right, but where in the pipeline does this fit in?
> >
> >> I suspect this transform was implemented because of some benchmark?
> >
> > Something in SPEC...  2006 iirc...  Will need to dig it up, I forgot
> > the details.
> >
> >> I suspect the performance benefit is because of better branch
> >> prediction by not mangling both conditional branches into one?
> >
> > No, it is that previously a condition was moved to a GPR, and then compared
> > again.  See PR77820.  This is expensive, and serial, too.
> >
> This transform focuses on eliminating the additional GPR save and the
> additional compare, so it helps a lot if the comparison is on a
> hot path.
> >> The transform is also somewhat similar to tail-duplication done
> >> in path splitting or tracer.
> >
> > Yes.
> >
> >> The pass itself does quite strict pattern-matching but I wonder
> >> if more cases should be handled this way.
> >
> > Maybe.  Probably.  But which?
> Currently this pass handles the case where there is only one PHI and the
> phi is used by a gcond.  If there are other suitable cases, we can just
> extend this transform.

I was just thinking of a small number of unrelated stmts in those
blocks, or the case where just one PHI argument is fed by a conditional
and thus only one path would simplify (but maybe the hot one if you look
at edge probabilities).
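
Something like this, say (my example, not from the PR):

  void g (void);

  void
  h (long a, long b, bool q, int x)
  {
    bool t;
    if (x)
      t = a < b;   /* this arm feeds the PHI with a comparison result,  */
    else
      t = q;       /* but this one is just a plain copy -- only the
                      first path would simplify after duplication.  */
    if (t)
      g ();
  }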

> >
> >> Any specific reason for the pass placement between PRE and sinking?
> >> tracer and path splitting run much later, jump threading runs
> >> all over the place.
> >
> > Dunno.  Jiufu, does the pass placement matter much?
> This pass would just need to be inserted after ssa, and before the last dce
> and forwprop, which could help to eliminate the temp conversion from bool to
> int:
>  _1 = a_8(D) < b_9(D);
>  t_14 = (int) _1;
>  if (_1 != 0)
> 
> Putting it after PRE is just because some cases are better handled by PRE,
> like the case below:  if (x) t = a < b; else t = a < b; if (t) xx.
> Actually, this pass is OK to insert at other placements.

OK, so all of the suggestions above would work since we have a late
forwprop and DCE pass to clean up.

Btw, I wonder if, on RTL, basic-block reordering (which also does
some tail duplication) could be a place to do such a transform?
Or is it too late to do the desired cleanups after that?
Possibly since we're after RA.

Richard.


Re: [PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread Richard Biener
On Wed, May 8, 2019 at 12:09 PM JunMa  wrote:
>
> Hi
>
> As PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106),
> when gcc meets builtin function call like:
>
>y = sqrt (x);
>
> The cdce pass tries to transform the call into an internal function
> call and conditionally executes the library call, guarded by a simple
> range check on the arguments which detects most cases where errno does
> not need to be set.  It looks like:
>
>y = IFN_SQRT (x);
>if (__builtin_isless (x, 0))
>  sqrt (x);
>
> However, if the call is in tail position, for example:
>
>y =  sqrt (x);
>return y;
>
> will become:
>
>y = IFN_SQRT (x);
>if (__builtin_isless (x, 0))
>  sqrt (x);
>return y;
>
> This transformation breaks the tailcall pattern, and prevents
> later tailcall optimizations.
>
> So this patch transforms a builtin call with a return value into
> an if-then-else form, which looks like:
>
>y =  sqrt (x);
> ==>
>if (__builtin_isless (x, 0))
>  y = sqrt (x);
>else
>  y = IFN_SQRT (x);
>
> BTW, y = sqrt (x) can also transform like:
>
>y = IFN_SQRT (x);
>if (__builtin_isless (x, 0))
>  y = sqrt (x);
>
> We don't choose this pattern because it emits worse assembly
> code (more move instructions and more register use) on x86_64.
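>
> To make the tail-call point concrete, here is a rough C-level model of
> the chosen form (a sketch only -- __builtin_sqrt stands in for the
> errno-free IFN_SQRT, which cannot be spelled directly in C):
>
>   #include <math.h>
>
>   double
>   wrapped_sqrt (double x)
>   {
>     double y;
>     if (__builtin_isless (x, 0.0))
>       y = sqrt (x);            /* errno path */
>     else
>       y = __builtin_sqrt (x);  /* stands in for the internal fn */
>     return y;                  /* sqrt () above can still become a sibcall */
>   }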
>
> Bootstrapped/regtested on x86_64-linux, ok for trunk?

OK.

Thanks,
Richard.


> Regards
> JunMa
>
>
> gcc/ChangeLog
>
> 2019-05-07  Jun Ma 
>
>  PR tree-optimization/90106
>  * tree-call-cdce.c (shrink_wrap_one_built_in_call_with_conds): Add
>  new parameter as new internal function call, also move it to new
>  basic block.
>  (use_internal_fn): Pass internal function call to
>  shrink_wrap_one_built_in_call_with_conds.
>
> gcc/testsuite/ChangeLog
>
> 2019-05-07  Jun Ma 
>
>  PR tree-optimization/90106
>  * gcc.dg/cdce1.c: Check tailcall code generation after cdce pass.
>  * gcc.dg/cdce2.c: Likewise.
>


Re: [Vectorizer] Add SLP support for masked loads

2019-05-08 Thread Richard Biener
On Fri, Apr 26, 2019 at 3:14 PM Richard Sandiford
 wrote:
>
> Alejandro Martinez Vicente  writes:
> > Hi,
> >
> > Current vectorizer doesn't support masked loads for SLP. We should add 
> > that, to
> > allow things like:
> >
> > void
> > f (int *restrict x, int *restrict y, int *restrict z, int n)
> > {
> >   for (int i = 0; i < n; i += 2)
> > {
> >   x[i] = y[i] ? z[i] : 1;
> >   x[i + 1] = y[i + 1] ? z[i + 1] : 2;
> > }
> > }
> >
> > to be vectorized using contiguous loads rather than LD2 and ST2.
> >
> > This patch was motivated by SVE, but it is completely generic and should 
> > apply
> > to any architecture with masked loads.
> >
> > After the patch is applied, the above code generates this output
> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
> >
> > 0000000000000000 <f>:
> >    0:	717f	cmp	w3, #0x0
> >    4:	540002cd	b.le	5c <f+0x5c>
> >    8:	51000464	sub	w4, w3, #0x1
> >    c:	d283	mov	x3, #0x0	// #0
> >   10:	9005	adrp	x5, 0 <f>
> >   14:	25d8e3e0	ptrue	p0.d
> >   18:	53017c84	lsr	w4, w4, #1
> >   1c:	91a5	add	x5, x5, #0x0
> >   20:	11000484	add	w4, w4, #0x1
> >   24:	85c0e0a1	ld1rd	{z1.d}, p0/z, [x5]
> >   28:	2598e3e3	ptrue	p3.s
> >   2c:	d37ff884	lsl	x4, x4, #1
> >   30:	25a41fe2	whilelo	p2.s, xzr, x4
> >   34:	d503201f	nop
> >   38:	a5434820	ld1w	{z0.s}, p2/z, [x1, x3, lsl #2]
> >   3c:	25808c11	cmpne	p1.s, p3/z, z0.s, #0
> >   40:	25808810	cmpne	p0.s, p2/z, z0.s, #0
> >   44:	a5434040	ld1w	{z0.s}, p0/z, [x2, x3, lsl #2]
> >   48:	05a1c400	sel	z0.s, p1, z0.s, z1.s
> >   4c:	e5434800	st1w	{z0.s}, p2, [x0, x3, lsl #2]
> >   50:	04b0e3e3	incw	x3
> >   54:	25a41c62	whilelo	p2.s, x3, x4
> >   58:	5401	b.ne	38 <f+0x38>  // b.any
> >   5c:	d65f03c0	ret
> >
> >
> > I tested this patch in an aarch64 machine bootstrapping the compiler and
> > running the checks.
> >
> > Alejandro
> >
> > gcc/Changelog:
> >
> > 2019-01-16  Alejandro Martinez  
> >
> >   * config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
> >   (xorsign<mode>3): Likewise.
> >   internal-fn.c: Marked mask_load_direct and mask_store_direct as
> >   vectorizable.
> >   tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
> >   tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
> >   combined even if masks different.
> >   (slp_vect_only_p): New function to detect masked loads that are only
> >   vectorizable using SLP.
> >   (vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
> >   tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
> >   dissolve SLP-only vectorizable groups when SLP has been discarded.
> >   (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when needed.
> >   tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
> >   masks.
> >   (vect_build_slp_tree_1): Fixed comment typo.
> >   (vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
> >   tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to get
> >   vec_defs for operand with optional SLP and vectype.
> >   (vectorizable_load): Allow vectorization of masked loads for SLP only.
> >   tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
> >   vectorizable.
> >   tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
> >
> > gcc/testsuite/Changelog:
> >
> > 2019-01-16  Alejandro Martinez  
> >
> >   * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
> >   vectorized masked loads.
> >
> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > index 4f2ef45..67eee59 100644
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -100,11 +100,11 @@ init_internal_fns ()
> >  /* Create static initializers for the information returned by
> > direct_internal_fn.  */
> >  #define not_direct { -2, -2, false }
> > -#define mask_load_direct { -1, 2, false }
> > +#define mask_load_direct { -1, 2, true }
> >  #define load_lanes_direct { -1, -1, false }
> >  #define mask_load_lanes_direct { -1, -1, false }
> >  #define gather_load_direct { -1, -1, false }
> > -#define mask_store_direct { 3, 2, false }
> > +#define mask_store_direct { 3, 2, true }
> >  #define store_lanes_direct { 0, 0, false }
> >  #define mask_store_lanes_direct { 0, 0, false }
> >  #define scatter_store_direct { 3, 3, false }
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c 
> > b/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c
> > new file mode 100644
> > index 000..b106cae
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c
> > @@ -0,0 +1,74 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -ftree-vectorize" } */
> > +
> > +#include 
> > +
>

Re: [Vectorizer] Add SLP support for masked loads

2019-05-08 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Apr 26, 2019 at 3:14 PM Richard Sandiford
>  wrote:
>>
>> Alejandro Martinez Vicente  writes:
>> > Hi,
>> >
>> > Current vectorizer doesn't support masked loads for SLP. We should add 
>> > that, to
>> > allow things like:
>> >
>> > void
>> > f (int *restrict x, int *restrict y, int *restrict z, int n)
>> > {
>> >   for (int i = 0; i < n; i += 2)
>> > {
>> >   x[i] = y[i] ? z[i] : 1;
>> >   x[i + 1] = y[i + 1] ? z[i + 1] : 2;
>> > }
>> > }
>> >
>> > to be vectorized using contiguous loads rather than LD2 and ST2.
>> >
>> > This patch was motivated by SVE, but it is completely generic and should 
>> > apply
>> > to any architecture with masked loads.
>> >
>> > After the patch is applied, the above code generates this output
>> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
>> >
>> > 0000000000000000 <f>:
>> >    0:	717f	cmp	w3, #0x0
>> >    4:	540002cd	b.le	5c <f+0x5c>
>> >    8:	51000464	sub	w4, w3, #0x1
>> >    c:	d283	mov	x3, #0x0	// #0
>> >   10:	9005	adrp	x5, 0 <f>
>> >   14:	25d8e3e0	ptrue	p0.d
>> >   18:	53017c84	lsr	w4, w4, #1
>> >   1c:	91a5	add	x5, x5, #0x0
>> >   20:	11000484	add	w4, w4, #0x1
>> >   24:	85c0e0a1	ld1rd	{z1.d}, p0/z, [x5]
>> >   28:	2598e3e3	ptrue	p3.s
>> >   2c:	d37ff884	lsl	x4, x4, #1
>> >   30:	25a41fe2	whilelo	p2.s, xzr, x4
>> >   34:	d503201f	nop
>> >   38:	a5434820	ld1w	{z0.s}, p2/z, [x1, x3, lsl #2]
>> >   3c:	25808c11	cmpne	p1.s, p3/z, z0.s, #0
>> >   40:	25808810	cmpne	p0.s, p2/z, z0.s, #0
>> >   44:	a5434040	ld1w	{z0.s}, p0/z, [x2, x3, lsl #2]
>> >   48:	05a1c400	sel	z0.s, p1, z0.s, z1.s
>> >   4c:	e5434800	st1w	{z0.s}, p2, [x0, x3, lsl #2]
>> >   50:	04b0e3e3	incw	x3
>> >   54:	25a41c62	whilelo	p2.s, x3, x4
>> >   58:	5401	b.ne	38 <f+0x38>  // b.any
>> >   5c:	d65f03c0	ret
>> >
>> >
>> > I tested this patch in an aarch64 machine bootstrapping the compiler and
>> > running the checks.
>> >
>> > Alejandro
>> >
>> > gcc/Changelog:
>> >
>> > 2019-01-16  Alejandro Martinez  
>> >
>> >   * config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
>> >   (xorsign<mode>3): Likewise.
>> >   internal-fn.c: Marked mask_load_direct and mask_store_direct as
>> >   vectorizable.
>> >   tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
>> >   tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
>> >   combined even if masks different.
>> >   (slp_vect_only_p): New function to detect masked loads that are only
>> >   vectorizable using SLP.
>> >   (vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
>> >   tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
>> >   dissolve SLP-only vectorizable groups when SLP has been discarded.
>> >   (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when 
>> > needed.
>> >   tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
>> >   masks.
>> >   (vect_build_slp_tree_1): Fixed comment typo.
>> >   (vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
>> >   tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to 
>> > get
>> >   vec_defs for operand with optional SLP and vectype.
>> >   (vectorizable_load): Allow vectorization of masked loads for SLP only.
>> >   tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
>> >   vectorizable.
>> >   tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
>> >
>> > gcc/testsuite/Changelog:
>> >
>> > 2019-01-16  Alejandro Martinez  
>> >
>> >   * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
>> >   vectorized masked loads.
>> >
>> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
>> > index 4f2ef45..67eee59 100644
>> > --- a/gcc/internal-fn.c
>> > +++ b/gcc/internal-fn.c
>> > @@ -100,11 +100,11 @@ init_internal_fns ()
>> >  /* Create static initializers for the information returned by
>> > direct_internal_fn.  */
>> >  #define not_direct { -2, -2, false }
>> > -#define mask_load_direct { -1, 2, false }
>> > +#define mask_load_direct { -1, 2, true }
>> >  #define load_lanes_direct { -1, -1, false }
>> >  #define mask_load_lanes_direct { -1, -1, false }
>> >  #define gather_load_direct { -1, -1, false }
>> > -#define mask_store_direct { 3, 2, false }
>> > +#define mask_store_direct { 3, 2, true }
>> >  #define store_lanes_direct { 0, 0, false }
>> >  #define mask_store_lanes_direct { 0, 0, false }
>> >  #define scatter_store_direct { 3, 3, false }
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c 
>> > b/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c
>> > new file mode 100644
>> > index 000..b106cae
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target

[PATCH][AArch64] Make use of FADDP in simple reductions

2019-05-08 Thread Elen Kalda
Hi,

This patch adds a pattern to support the FADDP (scalar) instruction.

Before the patch, the C code

typedef double v2df __attribute__((vector_size (16)));

double
foo (v2df x)
{
  return x[1] + x[0];
}

generated:
foo:
dup d1, v0.d[0]
dup d0, v0.d[1]
faddd0, d1, d0
ret

After patch:
foo:
faddp   d0, v0.2d
ret


Bootstrapped and ran regression tests on aarch64-none-linux-gnu -
no issues found.

Best wishes,
Elen


gcc/ChangeLog:

2019-04-24  Elen Kalda  

* config/aarch64/aarch64-simd.md (*aarch64_faddp<mode>): New.

gcc/testsuite/ChangeLog:

2019-04-24  Elen Kalda  

* gcc.target/aarch64/simd/scalar_faddp.c: New test.diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index e3852c5d182b70978d7603225fce55c0b8ee2894..89fedc6cb3f0c6eb74c6f8d0b21cedb5ae20a095 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2372,6 +2372,21 @@
   [(set_attr "type" "neon_fp_reduc_add_")]
 )
 
+(define_insn "*aarch64_faddp"
+  [(set (match_operand: 0 "register_operand" "=w")
+(plus:
+  (vec_select: (match_operand:VHSDF 1 "register_operand" "w")
+		(parallel[(match_operand 2 "const_int_operand" "n")]))
+  (vec_select: (match_dup:VHSDF 1)
+		(parallel[(match_operand 3 "const_int_operand" "n")]]
+  "TARGET_SIMD
+  && ((INTVAL (operands[2]) == 0 && INTVAL (operands[3]) == 1)
+|| (INTVAL (operands[2]) == 1 && INTVAL (operands[3]) == 0))"
+  "faddp\t%0, %1.2"
+  [(set_attr "type" "neon_fp_reduc_add_")]
+)
+
+
 (define_insn "aarch64_reduc_plus_internal"
  [(set (match_operand:VDQV 0 "register_operand" "=w")
(unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
new file mode 100644
index ..2396286d483c16c8b70b16fa08bffeb15f034a93
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/scalar_faddp.c
@@ -0,0 +1,31 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_fp16_scalar_ok } */
+/* { dg-add-options arm_v8_2a_fp16_scalar } */
+/* { dg-additional-options "-save-temps -O1" } */
+/* { dg-final { scan-assembler-not "dup" } } */
+
+
+typedef double v2df __attribute__((vector_size (16)));
+typedef float v4sf __attribute__((vector_size (16)));
+typedef __fp16 v8hf __attribute__((vector_size (16)));
+
+double
+foo (v2df x)
+{
+  return x[1] + x[0];
+}
+/* { dg-final { scan-assembler-times {faddp\td[0-9]+, v[0-9]+.2d} 1 } } */
+
+float
+foo1 (v4sf x)
+{
+  return x[0] + x[1];
+}
+/* { dg-final { scan-assembler-times {faddp\ts[0-9]+, v[0-9]+.2s} 1 } } */
+
+__fp16
+foo2 (v8hf x)
+{
+  return x[0] + x[1];
+}
+/* { dg-final { scan-assembler-times {faddp\th[0-9]+, v[0-9]+.2h} 1 } } */


Re: [PATCH] Improve tail call analysis and inliner EH clobber through variable life analysis (PR tree-optimization/89060, take 2)

2019-05-08 Thread Richard Biener
On Wed, 8 May 2019, Jakub Jelinek wrote:

> On Mon, May 06, 2019 at 04:17:01PM +0200, Richard Biener wrote:
> > > > +struct compute_live_vars_data {
> > > > +  /* Vector of bitmaps for live vars at the end of basic blocks,
> > > > + indexed by bb->index.  ACTIVE[ENTRY_BLOCK] must be empty bitmap,
> > > > + ACTIVE[EXIT_BLOCK] is used for STOP_AFTER.  */
> > > > +  vec<bitmap_head> active;
> > > > +  /* Work bitmap of currently live variables.  */
> > > > +  bitmap work;
> > > > +  /* Bitmap of interesting variables.  Variables with uids not in this
> > > > + bitmap are not tracked.  */
> > > > +  bitmap vars;
> > > > +};
> > 
> > How dense are the bitmaps?  Storage requirement will be quadratic
> > so eventually using a sparseset for 'vars' and using the sparseset
> > index for the bitmaps will improve things?  Or re-using live
> > code mapping?
> 
> Here is an updated patch that instead of using a bitmap of vars uses a
> hash_map from DECL_UID to some dense indexes used in the rest of the
> bitmaps.
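> 
> Roughly like this, I mean (a sketch with made-up names, not the
> actual patch):
> 
>   /* Give each interesting VAR_DECL a dense index 0..num-1, so the
>      per-BB live bitmaps need num bits rather than max-DECL_UID bits.  */
>   live_vars_map ids;
>   unsigned num = 0;
>   for (tree var : vars_of_interest)
>     if (!ids.get (DECL_UID (var)))
>       ids.put (DECL_UID (var), num++);
>   /* Later: bitmap_set_bit (live, *ids.get (DECL_UID (var)));  */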
> 
> > > > + tree lhs = gimple_assign_lhs (stmt);
> > > > + if (VAR_P (lhs) && bitmap_bit_p (data->vars, DECL_UID (lhs)))
> > > > +   bitmap_clear_bit (data->work, DECL_UID (lhs));
> > 
> > I think this clearing causes PR90348.
> 
> This stuff keeps as is, can be changed depending on what we decide for
> PR90348.
> 
> > > > +  /* If any such variables are found, try harder using variable life
> > > > + tracking to see if those variables aren't already out of scope.  
> > > > */
> > > > +  if (local_decls)
> > > > +{
> > > > +  vec<bitmap_head> live = compute_live_vars (cfun, local_decls, 
> > > > call);
> > > > +  bool any_live_p = !bitmap_empty_p (&live[EXIT_BLOCK]);
> > > > +  destroy_live_vars (live);
> > > > +  BITMAP_FREE (local_decls);
> > > > +  if (any_live_p)
> > > > return;
> > > >  }
> > 
> > I wonder if it is possible to split analysis and transform here
> > some more - the above is called for all preds of EXIT, if we
> > first analyze all of them and then compute live once if needed,
> > pruning invalid ones and then doing transform this would at least
> > make LIVE only computed once.
> > 
> > It should be also possible to restrict LIVE to the sub-CFG leading to
> > the EXIT preds with tail-call candidates as well, no?
> 
> And here it lazily computes the analysis the first time we need it in a
> function for all addressable vars in the function and then uses it for that
> and all following tail call locations, where it just walks the last bb until
> after the call stmt.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

I'm still worried about complexity - for the inliner case we compute
liveness for the whole destination function for each function with
an EH landing pad we expand inline.  It seems to me the liveness
should be computed on the callee function body, not the caller
after inlining?  Is it really worth optimizing the number of clobbers
here with this potentially expensive work?

For the tailcall case consider a function like

foo ()
{
>   T a, b, c, ...;
  if (..) baz(&a);
  if (..) baz(&b);
...
  bar(&a,&b,&c...);
}

or similar, that is, a lot of decls we need to track, a lot of BBs
and a lot of tailcall opportunities with the case of all decls being
live throughout the whole function.

-foptimize-sibling-calls is only enabled at -O2 but the inlining
part at -O1 where we hit those huge auto-generated testcases.

Richard.

> 2019-05-08  Jakub Jelinek  
> 
>   PR tree-optimization/89060
>   * tree-ssa-live.h (live_vars_map): New typedef.
>   (compute_live_vars, live_vars_at_stmt, destroy_live_vars): Declare.
>   * tree-ssa-live.c: Include gimple-walk.h and cfganal.h.
>   (struct compute_live_vars_data): New type.
>   (compute_live_vars_visit, compute_live_vars_1, compute_live_vars,
>   live_vars_at_stmt, destroy_live_vars): New functions.
>   * tree-tailcall.c: Include tree-ssa-live.h.
>   (live_vars, live_vars_vec): New global variables.
>   (find_tail_calls): Perform variable life analysis before punting.
>   (tree_optimize_tail_calls_1): Clean up live_vars and live_vars_vec.
>   * tree-inline.h (struct copy_body_data): Add eh_landing_pad_dest
>   member.
>   * tree-inline.c (add_clobbers_to_eh_landing_pad): Remove BB argument.
>   Perform variable life analysis to select variables that really need
>   clobbers added.
>   (copy_edges_for_bb): Don't call add_clobbers_to_eh_landing_pad here,
>   instead set id->eh_landing_pad_dest and assert it is the same.
>   (copy_cfg_body): Call it here if id->eh_landing_pad_dest is non-NULL.
> 
>   * gcc.dg/tree-ssa/pr89060.c: New test.
> 
> --- gcc/tree-ssa-live.h.jj2019-02-08 08:10:16.954053427 +0100
> +++ gcc/tree-ssa-live.h   2019-05-07 17:43:00.369174630 +0200
> @@ -266,6 +266,11 @@ extern void debug (tree_live_info_d &ref
>  extern void debug (tree_live_info_d *ptr);
>  extern void dump

Re: [PATCH] Improve tail call analysis and inliner EH clobber through variable life analysis (PR tree-optimization/89060, take 2)

2019-05-08 Thread Jakub Jelinek
On Wed, May 08, 2019 at 03:51:42PM +0200, Richard Biener wrote:
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> I'm still worried about complexity - for the inliner case we compute
> liveness for the whole destination function for each function with
> an EH landing pad we expand inline.  It seems to me the liveness

No, in the inliner we compute it for the whole callee body only:
  vec<bitmap_head> live = compute_live_vars (id->src_cfun, vars);
caller body would be id->dest_cfun instead.

> should be computed on the callee function body, not the caller
> after inlining?  Is it really worth optimizing the number of clobbers
> here with this potentially expensive work?

I've seen in several testcases that we were adding big amounts of clobbers
and that is something we really won't get rid of until RTL expansion; if it is
in functions we inline further it will be duplicated etc.

> For the tailcall case consider a function like
> 
> foo ()
> {
>   T a, b, ;
>   if (..) baz(&a);
>   if (..) baz(&b);
> ...
>   bar(&a,&b,&c...);
> }
> 
> or similar, that is, a lot of decls we need to track, a lot of BBs
> and a lot of tailcall opportunities with the case of all decls being
> live throughout the whole function.

True, sure, there can be corner cases with many variables.  On the other
side, the tail call optimization is in some cases very important for some
programs (actually required in some cases where e.g. the recursion is very
deep), so I think the optimization is worth it.
If we run into some corner cases, we could punt if we e.g. notice that the
number of variables to check is too high with too large cfg.

> -foptimize-sibling-calls is only enabled at -O2 but the inlining
> part at -O1 where we hit those huge auto-generated testcases.

If we want, we could either just for -O1 or all cases again punt if the cfg
is too large and/or number of variables is too large.

Jakub


Re: [PATCH] Improve tail call analysis and inliner EH clobber through variable life analysis (PR tree-optimization/89060, take 2)

2019-05-08 Thread Richard Biener
On Wed, 8 May 2019, Jakub Jelinek wrote:

> On Wed, May 08, 2019 at 03:51:42PM +0200, Richard Biener wrote:
> > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> > 
> > I'm still worried about complexity - for the inliner case we compute
> > liveness for the whole destination function for each function with
> > an EH landing pad we expand inline.  It seems to me the liveness
> 
> No, in the inliner we compute it for the whole callee body only:
>   vec<bitmap_head> live = compute_live_vars (id->src_cfun, vars);
> caller body would be id->dest_cfun instead.

Ah, OK.  So what we could still do is cache this for functions
inlined multiple times (though it probably only matters for large
functions which are probably not inlined anyways, not multiple times
at least).

> > should be computed on the callee function body, not the caller
> > after inlining?  Is it really worth optimizing the number of clobbers
> > here with this potentially expensive work?
> 
> I've seen in several testcases that we were adding big amounts of clobbers
> and that is something we really won't get rid of until RTL expansion; if it is
> in functions we inline further it will be duplicated etc.

I see.

> > For the tailcall case consider a function like
> > 
> > foo ()
> > {
> >   T a, b, c, ...;
> >   if (..) baz(&a);
> >   if (..) baz(&b);
> > ...
> >   bar(&a,&b,&c...);
> > }
> > 
> > or similar, that is, a lot of decls we need to track, a lot of BBs
> > and a lot of tailcall opportunities with the case of all decls being
> > live throughout the whole function.
> 
> True, sure, there can be corner cases with many variables.  On the other
> side, the tail call optimization is in some cases very important for some
> programs (actually required in some cases where e.g. the recursion is very
> deep), so I think the optimization is worth it.
> If we run into some corner cases, we could punt if we e.g. notice that the
> number of variables to check is too high with too large cfg.

OK.

> > -foptimize-sibling-calls is only enabled at -O2 but the inlining
> > part at -O1 where we hit those huge auto-generated testcases.
> 
> If we want, we could either just for -O1 or all cases again punt if the cfg
> is too large and/or number of variables is too large.

I guess I'm fine with the patch then given the inliner worry is
solved.

Thanks,
Richard.


Re: [PATCH][AArch64] Emit TARGET_DOTPROD-specific sequence for sadv16qi

2019-05-08 Thread Kyrill Tkachov

Hi Richard,

On 5/4/19 5:13 PM, Richard Sandiford wrote:

Kyrill Tkachov  writes:

@@ -764,6 +780,13 @@ (define_insn "aarch64_adalp_3"
  ;; UABAL  tmp.8h, op1.16b, op2.16b
  ;; UADALP op3.4s, tmp.8h
  ;; MOV op0, op3 // should be eliminated in later passes.
+;;
+;; For TARGET_DOTPROD we do:
+;; MOV tmp1.16b, #1 // Can be CSE'd and hoisted out of loops.
+;; UABDtmp2.16b, op1.16b, op2.16b
+;; UDOTop3.4s, tmp2.16b, tmp1.16b
+;; MOV op0, op3 // RA will tie the operands of UDOT appropriately.
+;;
  ;; The signed version just uses the signed variants of the above instructions.

It looks like the code does what the comment says, and uses SDOT for the
signed optab.  Doesn't it need to be UDOT for both?  The signedness of the
optab applies to the inputs (and so to SABD vs. UABD), but the absolute
difference is always unsigned.


I think you're right, updated.
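
(For example -- my arithmetic, not from the PR -- with signed char
inputs -128 and 127 the absolute difference is 255, which only fits in
an unsigned byte, so the accumulation step must be UDOT for both the
signed and unsigned optabs.)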

  
  (define_expand "sadv16qi"

@@ -773,6 +796,18 @@ (define_expand "sadv16qi"
 (use (match_operand:V4SI 3 "register_operand"))]
"TARGET_SIMD"
{
+if (TARGET_DOTPROD)
+  {
+   rtx ones = gen_reg_rtx (V16QImode);
+   emit_move_insn (ones,
+   aarch64_simd_gen_const_vector_dup (V16QImode,
+   HOST_WIDE_INT_1));

Easier as:

   rtx ones = force_reg (V16QImode, CONST1_RTX (V16QImode));


Indeed.


+   rtx abd = gen_reg_rtx (V16QImode);
+   emit_insn (gen_abdv16qi_3 (abd, operands[1], operands[2]));
+   emit_insn (gen_aarch64_dotv16qi (operands[0], operands[3],
+  abd, ones));

Nit: indented too far.


Thanks, fixed (and a couple of other minor edits after seeing 
Alejandro's SVE patch).


2019-05-08  Kyrylo Tkachov  

    * config/aarch64/iterators.md (MAX_OPP): New code attr.
    * config/aarch64/aarch64-simd.md (abd_3): New define_expand.
    (*aarch64_abd_3): Rename to...
    (aarch64_abd_3): ... This.
    (sadv16qi): Add TARGET_DOTPROD expansion.

2019-05-08  Kyrylo Tkachov 

    * gcc.target/aarch64/ssadv16qi.c: Add +nodotprod to pragma.
    * gcc.target/aarch64/usadv16qi.c: Likewise.
    * gcc.target/aarch64/ssadv16qi-dotprod.c: New test.
    * gcc.target/aarch64/usadv16qi-dotprod.c: Likewise.



Thanks,
Richard


+   DONE;
+  }
  rtx reduc = gen_reg_rtx (V8HImode);
  emit_insn (gen_aarch64_abdl2v16qi_3 (reduc, operands[1],
   operands[2]));
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 
16e4dbda73ab928054590c47a4398408162c0332..5afb692493c6e9fa31355693e7843e4f0b1b281c
 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1059,6 +1059,9 @@ (define_code_attr f16mac [(plus "a") (minus "s")])
  ;; Map smax to smin and umax to umin.
  (define_code_attr max_opp [(smax "smin") (umax "umin")])
  
+;; Same as above, but louder.

+(define_code_attr MAX_OPP [(smax "SMIN") (umax "UMIN")])
+
  ;; The number of subvectors in an SVE_STRUCT.
  (define_mode_attr vector_count [(VNx32QI "2") (VNx16HI "2")
(VNx8SI  "2") (VNx4DI  "2")
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c 
b/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c
new file mode 100644
index 
..e08c33785303e86815554e67a300189a67dfc1da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_ok } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+/* { dg-additional-options "-O3" } */
+
+#pragma GCC target "+nosve"
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabd\t} } } */
+/* { dg-final { scan-assembler {\tsdot\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c 
b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
index 
40b28843616e84df137210b45ec16abed2a37c75..85a867a113013f560bfd0a3142805b9c95ad8c5a
 100644
--- a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -1,7 +1,7 @@
  /* { dg-do compile } */
  /* { dg-options "-O3" } */
  
-#pragma GCC target "+nosve"

+#pragma GCC target "+nosve+nodotprod"
  
  #define N 1024
  
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi-dotprod.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi-dotprod.c

new file mode 100644
index 
..ea8de4d69758bd6bc9af9e33e1498f838b706949
-

Re: [PATCH][Tree-optimization/PR89772]fold memchr builtins for character not in constant nul-padded string

2019-05-08 Thread Richard Biener
On Tue, May 7, 2019 at 4:34 AM JunMa  wrote:
>
> On 2019/5/6 7:58 PM, JunMa wrote:
> > On 2019/5/6 6:02 PM, Richard Biener wrote:
> >> On Thu, Mar 21, 2019 at 5:57 AM JunMa  wrote:
> >>> Hi
> >>> For now, gcc cannot fold code like:
> >>>
> >>> const char a[5] = "123";
> >>> __builtin_memchr (a, '7', sizeof a)
> >>>
> >>> It tries to avoid folding past the string length although the length
> >>> of a is 5.
> >>> This is a bit conservative: it's safe to fold memchr/bcmp/memcmp
> >>> builtins when the constant string is stored in an array with some
> >>> trailing nuls.
> >>>
> >>> This patch folds these cases by exposing the additional length of
> >>> trailing nuls in c_getstr().
> >>> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> >> It's probably better if gimple_fold_builtin_memchr uses string_constant
> >> directly instead?
> > We can use string_constant in gimple_fold_builtin_memchr, however it is a
> > bit complex to use it in memchr/memcmp constant folding.
> >> You are changing memcmp constant folding but only have a testcase
> >> for memchr.
> > I'll add later.
> >> If we decide to amend c_getstr I would rather make it return the
> >> total length instead of the number of trailing zeros.
> > I think return the total length is safe in other place as well.
> > I used the argument in patch since I do not want any impact on
> > other part at all.
> >
>
> Although it is safe to use the total length, I found that the function
> inline_expand_builtin_string_cmp(), which uses c_getstr(), may emit
> redundant RTL for trailing nul chars when the total length is returned.
>
> Also it is trivial to handle constant strings with trailing nul chars.
>
> So this updated patch follows Richard's suggestion: using string_constant
> directly.
>
> Bootstrapped/regtested on x86_64-linux, ok for trunk?

Doesn't this fold to NULL for the case where searching for '0' and it
doesn't occur in the string constant but only the zero-padding?
So you'd need to conditionalize on c being not zero (or handle
that case specially by actually finding the first zero in the padding)?
I think there was work to have all string constants zero terminated
but I'm not sure if we can rely on that here.  Bernd?
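
Concretely, the worry is (my example):

  const char a[5] = "123";   /* a[3] and a[4] are zero padding */
  void *p = __builtin_memchr (a, 0, sizeof a);

p has to be &a[3] -- the first nul of the padding -- rather than NULL,
even though no nul occurs in the string constant "123" itself.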

Richard.

> Regards
> JunMa
>
> gcc/ChangeLog
>
> 2019-05-07  Jun Ma 
>
>  PR tree-optimization/89772
>  * gimple-fold.c (gimple_fold_builtin_memchr): Consider trailing nuls in
>  out-of-bounds access checking.
>
> gcc/testsuite/ChangeLog
>
> 2019-05-07  Jun Ma 
>
>  PR tree-optimization/89772
>  * gcc.dg/builtin-memchr-4.c: New test.
> > Thanks
> > JunMa
> >> Richard.
> >>
> >>> Regards
> >>> JunMa
> >>>
> >>>
> >>> gcc/ChangeLog
> >>>
> >>> 2019-03-21  Jun Ma 
> >>>
> >>>   PR tree-optimization/89772
> >>>   * fold-const.c (c_getstr): Add new parameter to get length of
> >>> additional
> >>>   trailing nuls after constant string.
> >>>   * gimple-fold.c (gimple_fold_builtin_memchr): Consider
> >>> trailing nuls in
> >>>   out-of-bounds access checking.
> >>>   * fold-const-call.c (fold_const_call): Likewise.
> >>>
> >>>
> >>> gcc/testsuite/ChangeLog
> >>>
> >>> 2019-03-21  Jun Ma 
> >>>
> >>>   PR tree-optimization/89772
> >>>   * gcc.dg/builtin-memchr-4.c: New test.
> >
>


Re: [PATCH, arm][PR88167] Fix __builtin_return_address returns invalid address

2019-05-08 Thread Richard Earnshaw (lists)
On 18/12/2018 12:53, Mihail Ionescu wrote:
> 
> 
> On 12/18/2018 09:32 AM, Mihail Ionescu wrote:
>> Hi All,
>>
>> In Thumb mode when the function prologue gets expanded, in case of a
>> multiple register push, additional mov instructions are generated to
>> save the high registers, which results in lr getting overwritten before
>> its value can be used to retrieve the return address.
>>
>> The fix consists of detecting if lr is alive after the prologue, in
>> which case, the lr register won't be used as a scratch.
>>
>> Regression tested on arm-none-eabi.
>>
>> gcc/ChangeLog:
>> 2018-11-23  Mihail Ionescu  
>>
>>  PR target/88167
>>  * config/arm/arm.c: Add lr liveness check.
>>
>> gcc/testsuite/ChangeLog
>> 2018-11-23  Mihail Ionescu  
>>
>>  PR target/88167
>>  * gcc.target/arm/pr88167.c: New test.
>>
>> If everything is ok for trunk, could someone commit it on my behalf?
>>
>> Best regards,
>>     Mihail

Hi Mihail,

Thanks for the patch.  There are a couple of minor issues with your
patch.  Firstly, the name of the variable doesn't really describe its
use clearly.  It wasn't obvious from your changes what was the case
being considered and it had to be derived from looking at the testcase
and bug report.  A comment on the code would have helped clarify things
somewhat.

Secondly the testcase cannot simply add options like "-mcpu=cortex-m0"
to the command-line options, even after testing for thumb1 being OK.
The arm_thumb1_ok test says that it is OK to add -mthumb to the options,
but it doesn't allow you to change much else, especially anything
related to the CPU or architecture.

Whilst reviewing this patch I noticed that we could probably do better
when generating prologue and epilogue code in situations similar to
this.  Specifically, that we could often make more use of the low
registers rather than forcing a work register and then pushing (and
popping) every high register one at a time.  So what I'm committing
below adds this to your original changes (with some slight tweaking).

2019-05-08  Mihail Ionescu  
Richard Earnshaw  

gcc:

PR target/88167
* config/arm/arm.c (thumb1_prologue_unused_call_clobbered_lo_regs): New
function.
(thumb1_epilogue_unused_call_clobbered_lo_regs): New function.
(thumb1_compute_save_core_reg_mask): Don't force a spare work
register if both the epilogue and prologue can use call-clobbered
regs.
(thumb1_unexpanded_epilogue): Use
thumb1_epilogue_unused_call_clobbered_lo_regs.
Reverse the logic for picking temporaries for restoring high regs
to match that of the prologue where possible.
(thumb1_expand_prologue): Add any usable call-clobbered low
registers to the list of work registers.  Detect if the return
address is still live at the end of the prologue and avoid using
it for a work register if so.  If the return address is not
live, add LR to the list of pushable regs after the first pass.

testsuite:

PR target/88167
* gcc.target/arm/pr88167-1.c: New test.
* gcc.target/arm/pr88167-2.c: New test.

R.
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 45abcd89963..91bb65130b8 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -19670,6 +19670,35 @@ arm_compute_save_core_reg_mask (void)
   return save_reg_mask;
 }
 
+/* Return a mask for the call-clobbered low registers that are unused
+   at the end of the prologue.  */
+static unsigned long
+thumb1_prologue_unused_call_clobbered_lo_regs (void)
+{
+  unsigned long mask = 0;
+
+  for (int reg = 0; reg <= LAST_LO_REGNUM; reg++)
+if (!callee_saved_reg_p (reg)
+	&& !REGNO_REG_SET_P (df_get_live_out (ENTRY_BLOCK_PTR_FOR_FN (cfun)),
+			 reg))
+  mask |= 1 << reg;
+  return mask;
+}
+
+/* Similarly for the start of the epilogue.  */
+static unsigned long
+thumb1_epilogue_unused_call_clobbered_lo_regs (void)
+{
+  unsigned long mask = 0;
+
+  for (int reg = 0; reg <= LAST_LO_REGNUM; reg++)
+if (!callee_saved_reg_p (reg)
+	&& !REGNO_REG_SET_P (df_get_live_in (EXIT_BLOCK_PTR_FOR_FN (cfun)),
+			 reg))
+  mask |= 1 << reg;
+  return mask;
+}
+
 /* Compute a bit mask of which core registers need to be
saved on the stack for the current function.  */
 static unsigned long
@@ -19701,10 +19730,19 @@ thumb1_compute_save_core_reg_mask (void)
   if (mask & 0xff || thumb_force_lr_save ())
 mask |= (1 << LR_REGNUM);
 
-  /* Make sure we have a low work register if we need one.
- We will need one if we are going to push a high register,
- but we are not currently intending to push a low register.  */
+  bool call_clobbered_scratch
+= (thumb1_prologue_unused_call_clobbered_lo_regs ()
+   && thumb1_epilogue_unused_call_clobbered_lo_regs ());
+
+  /* Make sure we have a low work register if we need one.  We will
+ need one if we are going to push a high register, but we are not
+ currently intendin

Re: abstract out EH propagation cleanups

2019-05-08 Thread Aldy Hernandez

On 5/8/19 2:30 AM, Richard Biener wrote:

On Tue, May 7, 2019 at 11:55 PM Jeff Law  wrote:


On 5/7/19 3:45 AM, Richard Biener wrote:

On Tue, May 7, 2019 at 11:13 AM Aldy Hernandez  wrote:


Hi.

We seem to have numerous copies of the same EH propagation cleanups
scattered throughout the compiler.  The attached patch moves all the
logic into one class that allows for easy marking of statements and
automatic cleanup once it goes out of scope.
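
In use it boils down to this shape (a sketch; the class name here is
made up, but mark_stmt/record_change are the actual entry points):

  {
    eh_cleanup_fixups fixups;        /* collects the pending work */
    ...
    fixups.mark_stmt (old_stmt);     /* before mutating the stmt */
    /* ... substitute/fold ... */
    fixups.record_change (stmt);     /* after mutating it */
  }  /* EH edge purging and noreturn fixups run at scope exit */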

Tested on x86-64 Linux.

OK for trunk? (*)


Ugh :/

First of all I don't like the fact that the actual cleanup is done
upon destructor execution.  Please make it explicit
and in the destructor assert that nothing is to be done.

I'm of a mixed mind here.  I have railed against implicit code being run
behind my back for decades.

However as I've had to debug locking issues and the like in other C++
codebases I've become more and more of a fan of RAII and its basic
concepts.  This has made me more open to code running behind my back
like this implicitly when the object gets destroyed.

There's something to be said for embedding this little class into other
objects like Aldy has done and just letting things clean up
automatically as the object goes out of scope.  No more missing calls to
run the cleanup bits, it "just works".

But I won't object if you want it to be more explicit.  I've been there
and understand why one might want the cleanup step to be explicit.





Then I'm not sure this is a 1:1 transform since for example

@@ -1061,8 +1173,6 @@
substitute_and_fold_dom_walker::before_dom_children (basic_block bb)
 }

gimple *old_stmt = stmt;
-  bool was_noreturn = (is_gimple_call (stmt)
-  && gimple_call_noreturn_p (stmt));

/* Replace real uses in the statement.  */
did_replace |= substitute_and_fold_engine->replace_uses_in (stmt);
@@ -1110,25 +1220,7 @@
substitute_and_fold_dom_walker::before_dom_children (basic_block bb)
/* Now cleanup.  */
if (did_replace)
 {
...
+ fixups.record_change (old_stmt, stmt);

here we no longer can reliably determine whether old_stmt was noreturn since
we substitute into stmt itself.  It's no longer a correctness issue if
we do _not_
fixup_noreturn since we now have GF_CALL_CTRL_ALTERING, it's merely
an optimization issue.  So there may be no testcase for this (previously such
cases ICEd).

But AFAICT we don't care in the case Aldy is changing.  If we really
want to know if the old statement was a noreturn we can test prior to
queing it.


The code isn't doing what it did before after the change.  That's a bug.


If this is indeed a problem in the cases that I changed, it's easily 
fixed by marking the statement ahead of time and keeping track of the 
noreturn bit (as I have done in the attached patch).




To be more explicit here - adding this kind of new and fancy C++ stuff
just for the sake of it, thereby adding yet _another_ way of handling these
things instead of forcing it a new way (remove the other APIs this
encapsulates) is very bad(TM) IMHO, both for maintainance and for
new people coming around trying to understand GCC.


I'm not adding them for the sake of it.  I'm adding them because we have 
5 copies of virtually the same code, and if I add any (additional) 
propagation smarts to the compiler, I'm going to have to add a 6th copy. 
 My patch abstracts away 4 out of the 5 versions, and I am volunteering 
to fix the last one (forwprop) as an incremental patch.


I don't agree that this is worse.  On the contrary, I am transforming 
multiple copies of this:


-  bool was_noreturn = (is_gimple_call (stmt)
-  && gimple_call_noreturn_p (stmt));
...
...
- /* If we cleaned up EH information from the statement,
-remove EH edges.  */
- if (maybe_clean_or_replace_eh_stmt (old_stmt, stmt))
-   bitmap_set_bit (need_eh_cleanup, bb->index);
-
- /* If we turned a not noreturn call into a noreturn one
-schedule it for fixup.  */
- if (!was_noreturn
- && is_gimple_call (stmt)
- && gimple_call_noreturn_p (stmt))
-   stmts_to_fixup.safe_push (stmt);
-
- if (gimple_assign_single_p (stmt))
-   {
- tree rhs = gimple_assign_rhs1 (stmt);
- if (TREE_CODE (rhs) == ADDR_EXPR)
-   recompute_tree_invariant_for_addr_expr (rhs);
-   }
-   }

by this:

+  fixups.mark_stmt (old_stmt);
...
...
+   fixups.record_change (stmt);

And we no longer have to clean things up manually at scope end:

-  if (!bitmap_empty_p (need_eh_cleanup))
-gimple_purge_all_dead_eh_edges (need_eh_cleanup);
-
-  /* Fixup stmts that became noreturn calls.  This may require splitting
- blocks and thus isn't possible during the dominator walk.  Do this
- in reverse order so we don't inadvertedly remove a stmt we want to
- fixup by visiting a dominating now noreturn call first.  

Re: [RS6000] PR89271, gcc.target/powerpc/vsx-simode2.c

2019-05-08 Thread Segher Boessenkool
Hi Alan,

On Wed, May 08, 2019 at 03:02:48PM +0930, Alan Modra wrote:
> This is https://gcc.gnu.org/ml/gcc-patches/2019-03/msg01299.html with
> the fixes Segher requested, plus a few more:
> - delete PREFERRED_RELOAD_CLASS changes
> - adjust for recent register renumbering
> - use defines rather than hard coding register numbers

Did you need to adjust for the renumbering _at all_ after that?
(edit: Ah, the reg sets).

> - flip altivec/float test when dealing with moves within vsx regs,
>   so that the altivec hard reg count is preferred over the fp hard reg
>   count when both reg types are possible.
> - use 2 for power9 direct move cost, and remove more '?'s from insns.
> - use reg_class_subset_p in the test for slow LR/CTR moves

Super minor:

> +   if (TARGET_DIRECT_MOVE)
> + {
> +   if (rs6000_tune != PROCESSOR_POWER9)
> + ret = 4 * hard_regno_nregs (FIRST_GPR_REGNO, mode);
> +   else
> + ret = 2 * hard_regno_nregs (FIRST_GPR_REGNO, mode);

Please write that the other way around, positive conditions are easier
to read?  And newer stuff first is nice, too.
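
I.e. something like:

	   if (rs6000_tune == PROCESSOR_POWER9)
	     ret = 2 * hard_regno_nregs (FIRST_GPR_REGNO, mode);
	   else
	     ret = 4 * hard_regno_nregs (FIRST_GPR_REGNO, mode);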


Thanks for the patch!  Please apply to trunk.  And watch for fallout...
It is kind of inevitable I'm afraid.  But it's stage 1 now.


Segher


Re: abstract out EH propagation cleanups

2019-05-08 Thread Aldy Hernandez




On 5/8/19 2:30 AM, Richard Biener wrote:

On Tue, May 7, 2019 at 11:55 PM Jeff Law  wrote:


On 5/7/19 3:45 AM, Richard Biener wrote:

On Tue, May 7, 2019 at 11:13 AM Aldy Hernandez  wrote:


Hi.

We seem to have numerous copies of the same EH propagation cleanups
scattered throughout the compiler.  The attached patch moves all the
logic into one class that allows for easy marking of statements and
automatic cleanup once it goes out of scope.

Tested on x86-64 Linux.

OK for trunk? (*)


Ugh :/

First of all I don't like the fact that the actual cleanup is done
upon destructor execution.  Please make it explicit
and in the destructor assert that nothing is to be done.

I'm of a mixed mind here.  I have railed against implicit code being run
behind my back for decades.

However as I've had to debug locking issues and the like in other C++
codebases I've become more and more of a fan of RAII and its basic
concepts.  This has made me more open to code running behind my back
like this implicitly when the object gets destroyed.

There's something to be said for embedding this little class into other
objects like Aldy has done and just letting things clean up
automatically as the object goes out of scope.  No more missing calls to
run the cleanup bits, it "just works".

But I won't object if you want it to be more explicit.  I've been there
and understand why one might want the cleanup step to be explicit.





Then I'm not sure this is a 1:1 transform since for example

@@ -1061,8 +1173,6 @@
substitute_and_fold_dom_walker::before_dom_children (basic_block bb)
 }

gimple *old_stmt = stmt;
-  bool was_noreturn = (is_gimple_call (stmt)
-  && gimple_call_noreturn_p (stmt));

/* Replace real uses in the statement.  */
did_replace |= substitute_and_fold_engine->replace_uses_in (stmt);
@@ -1110,25 +1220,7 @@
substitute_and_fold_dom_walker::before_dom_children (basic_block bb)
/* Now cleanup.  */
if (did_replace)
 {
...
+ fixups.record_change (old_stmt, stmt);

here we no longer can reliably determine whether old_stmt was noreturn since
we substitute into stmt itself.  It's no longer a correctness issue if
we do _not_
fixup_noreturn since we now have GF_CALL_CTRL_ALTERING, it's merely
an optimization issue.  So there may be no testcase for this (previously such
cases ICEd).

But AFAICT we don't care in the case Aldy is changing.  If we really
want to know if the old statement was a noreturn we can test prior to
queing it.


The code isn't doing what it did before after the change.  That's a bug.

To be more explicit here - adding this kind of new and fancy C++ stuff
just for the sake of it, thereby adding yet _another_ way of handling these
things instead of forcing it a new way (remove the other APIs this
encapsulates) is very bad(TM) IMHO, both for maintainance and for
new people coming around trying to understand GCC.

Not to say that all the places that are refactored with this patch
look sligtly different and thus the pattern doesnt' exactly match
(leading to issues like the one I pointed out above).

So - please no.

Richard.





I'm also not sure I like to put all these (unrelated) things into a
single class,
it really also hides the details of what is performed immediately and what
delayed and what kind of changes - this makes understanding of pass
transforms hard.

On the other hand this class defines a contract for what it can and will
do for us and allows us to bring consistency in that handling.  We
declare manual management of this stuff verboten.  Ideally we'd declare
the class final and avoid derivation, but I doubt we can do that
immediately.

Jeff


BTW, in case it wasn't clear.  I do understand that not all copies are 
identical, that is why the interface has the recompute_invariants and 
m_todo_flags fields.  It is *supposed* to work as before, and if it 
isn't, it's a bug and I'll gladly fix it.


Aldy


Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Segher Boessenkool
On Wed, May 08, 2019 at 02:20:19PM +0200, Richard Biener wrote:
> Btw, I wonder if, on RTL, basic-block reordering (which also does
> some tail duplication) could be a place to do such a transform?
> Or is it too late to do the desired cleanups after that?
> Possibly since we're after RA.

It is *much* too late; it is too late to do it at combine time, already
(and combine of course cannot do this).  You need the early RTL passes to
clean up all that mess the conditionals generate -- the point of this patch
is to never have that conditional-to-integer-reg stuff to begin with.

RTL isn't nice for doing cross-BB transforms, either.


Segher


Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Richard Biener
On Wed, 8 May 2019, Segher Boessenkool wrote:

> On Wed, May 08, 2019 at 02:20:19PM +0200, Richard Biener wrote:
> > Btw, I wonder if, on RTL, basic-block reordering (which also does
> > some tail duplication) could be a place to do such a transform?
> > Or is it too late to do the desired cleanups after that?
> > Possibly since we're after RA.
> 
> It is *much* too late; it is too late to do it at combine time, already
> (and combine of course cannot do this).  You need the early RTL passes to
> clean up all that mess the conditionals generate -- the point of this patch
> is to never have that conditional-to-integer-reg stuff to begin with.
> 
> RTL isn't nice for doing cross-BB transforms, either.

Ah, well - I thought it works reasonably well for extended BBs ;)

We can of course also match this at RTL expansion time but exposing
this earlier definitely looks valuable.

Richard.


Make deque iterator operators hidden friends

2019-05-08 Thread François Dumont
Here is a patch to reduce the number of operators exposed at std
namespace scope.
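
As a reminder of the mechanism: a hidden friend is found only by ADL on
its class type, so iterator comparisons keep working while the operators
stop participating in unrelated overload resolution at namespace scope.
A minimal illustration (my example, not part of the patch):

  #include <deque>

  bool
  cmp (std::deque<int>::iterator lhs, std::deque<int>::iterator rhs)
  { return lhs < rhs; }   // still found, via ADL on _Deque_iterator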


    * include/bits/stl_deque.h
    (_Deque_iterator<>::operator+(difference_type)): Make hidden friend.
    (_Deque_iterator<>::operator-(difference_type)): Likewise.
    (operator==(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator!=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.

Tested under Linux x86_64 normal/debug modes.

Ok to commit?

François

diff --git a/libstdc++-v3/include/bits/stl_deque.h b/libstdc++-v3/include/bits/stl_deque.h
index 7a7a42aa903..72420c27646 100644
--- a/libstdc++-v3/include/bits/stl_deque.h
+++ b/libstdc++-v3/include/bits/stl_deque.h
@@ -238,24 +238,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 	return *this;
   }
 
-  _Self
-  operator+(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-	_Self __tmp = *this;
-	return __tmp += __n;
-  }
-
   _Self&
   operator-=(difference_type __n) _GLIBCXX_NOEXCEPT
   { return *this += -__n; }
 
-  _Self
-  operator-(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-	_Self __tmp = *this;
-	return __tmp -= __n;
-  }
-
   reference
   operator[](difference_type __n) const _GLIBCXX_NOEXCEPT
   { return *(*this + __n); }
@@ -272,123 +258,116 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 	_M_first = *__new_node;
 	_M_last = _M_first + difference_type(_S_buffer_size());
   }
-};
-
-  // Note: we also provide overloads whose operands are of the same type in
-  // order to avoid ambiguous overload resolution when std::rel_ops operators
-  // are in scope (for additional details, see libstdc++/3628)
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator==(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	   const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return __x._M_cur == __y._M_cur; }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator==(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return __x._M_cur == __y._M_cur; }
 
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator!=(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	   const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return !(__x == __y); }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator!=(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return !(__x == __y); }
-
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator<(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	  const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return (__x._M_node == __y._M_node) ? (__x._M_cur < __y._M_cur)
-	  : (__x._M_node < __y._M_node); }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator<(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	  const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return (__x._M_node == __y._M_node) ? (__x._M_cur < __y._M_cur)
-	  : (__x._M_node < __y._M_node); }
-
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator>(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	  const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return __y < __x; }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator>(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	  const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return __y < __x; }
-
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator<=(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	   const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return !(__y < __x); }
+  friend bool
+  operator==(const _Self& __x, const _Self& __y) _GLIBCXX_NOEXCEPT
+  { return __x._M_cur == __y._M_cur; }
+
+  // Note: we also provide overloads whose operands are of the same type in
+  // order to avoid ambiguous overload resolution when std::rel_ops operators
+  // are in scope (for additional details, see libstdc++/3628)
+      template<typename _RefR, typename _PtrR>
+	friend bool
+	operator==(const _Self& __x,
+		   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
+	{ return __x._M_cur == __y._M_cur; }
+
+  friend bool
+  operator!=(const _Self& __x, const _Self& __y) _GLIBCXX_NOEXCEPT
+  { return !(__x == __y); }
+
+      template<typename _RefR, typename _PtrR>
+	friend bool
+	operator!=(const _Self& __x,
+		   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
+	{ return !(__x == __y); }
+
+  friend bool
+  operator<(const _Self& __x, const _Self& __y) _GLIBCXX_NOEXCEPT
+  {
+	return (__x._M_node == __y._M_node)
+	  ? (__x._M_cur < __y._M_cur) : (__x._M_node < __y._M_node);
+  }
 
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>

Go patch committed: Remove trailing spaces

2019-05-08 Thread Ian Lance Taylor
This patch to the Go frontend by Ben Shi removes trailing spaces from
the source files.  Bootstrapped and ran Go testsuite on
x86_64-pc-linux-gnu.  Committed to mainline.

Ian
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 271000)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-fcbf847c3bf76fb475c9020e1c57057134407263
+0b4cf8ded107ccbfbd4af7f4e056f23f941d0f86
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: gcc/go/gofrontend/backend.h
===
--- gcc/go/gofrontend/backend.h (revision 270877)
+++ gcc/go/gofrontend/backend.h (working copy)
@@ -564,7 +564,7 @@ class Backend
   // ASM_NAME is encoded assembler-friendly version of the name, or the
   // empty string if no encoding is needed.
   //
-  // TYPE is the type of the implicit variable. 
+  // TYPE is the type of the implicit variable.
   //
   // IS_HIDDEN will be true if the descriptor should only be visible
   // within the current object.
@@ -634,7 +634,7 @@ class Backend
   //
   // TYPE will be a struct type; the type of the returned expression
   // must be a pointer to this struct type.
-  // 
+  //
   // We must create the named structure before we know its
   // initializer, because the initializer may refer to its own
   // address.  After calling this the frontend will call
@@ -668,7 +668,7 @@ class Backend
  Btype* type, Location) = 0;
 
   // Labels.
-  
+
   // Create a new label.  NAME will be empty if this is a label
   // created by the frontend for a loop construct.  The location is
   // where the label is defined.
Index: gcc/go/gofrontend/expressions.cc
===
--- gcc/go/gofrontend/expressions.cc(revision 270877)
+++ gcc/go/gofrontend/expressions.cc(working copy)
@@ -1704,7 +1704,7 @@ class Boolean_expression : public Expres
   void
   do_dump_expression(Ast_dump_context* ast_dump_context) const
   { ast_dump_context->ostream() << (this->val_ ? "true" : "false"); }
-  
+
  private:
   // The constant.
   bool val_;
@@ -2015,8 +2015,8 @@ String_info_expression::do_dump_expressi
   ast_dump_context->ostream() << "stringinfo(";
   this->string_->dump_expression(ast_dump_context);
   ast_dump_context->ostream() << ",";
-  ast_dump_context->ostream() << 
-  (this->string_info_ == STRING_INFO_DATA ? "data" 
+  ast_dump_context->ostream() <<
+  (this->string_info_ == STRING_INFO_DATA ? "data"
 : this->string_info_ == STRING_INFO_LENGTH ? "length"
 : "unknown");
   ast_dump_context->ostream() << ")";
@@ -2694,7 +2694,7 @@ class Complex_expression : public Expres
   // Write REAL/IMAG to dump context.
   static void
   dump_complex(Ast_dump_context* ast_dump_context, const mpc_t val);
-  
+
  protected:
   bool
   do_is_constant() const
Index: gcc/go/gofrontend/expressions.h
===
--- gcc/go/gofrontend/expressions.h (revision 270877)
+++ gcc/go/gofrontend/expressions.h (working copy)
@@ -2806,7 +2806,7 @@ class Unknown_expression : public Parser
 
   void
   do_dump_expression(Ast_dump_context*) const;
-  
+
  private:
   // The unknown name.
   Named_object* named_object_;
@@ -2832,7 +2832,7 @@ class Index_expression : public Parser_e
   // Dump an index expression, i.e. an expression of the form
   // expr[expr], expr[expr:expr], or expr[expr:expr:expr] to a dump context.
   static void
-  dump_index_expression(Ast_dump_context*, const Expression* expr, 
+  dump_index_expression(Ast_dump_context*, const Expression* expr,
 const Expression* start, const Expression* end,
 const Expression* cap);
 
@@ -2988,7 +2988,7 @@ class Array_index_expression : public Ex
 
   void
   do_dump_expression(Ast_dump_context*) const;
-  
+
  private:
   // The array we are getting a value from.
   Expression* array_;
@@ -3828,7 +3828,7 @@ class Map_construction_expression : publ
 
   void
   do_dump_expression(Ast_dump_context*) const;
-  
+
  private:
   // The type of the map to construct.
   Type* type_;
Index: gcc/go/gofrontend/parse.cc
===
--- gcc/go/gofrontend/parse.cc  (revision 270877)
+++ gcc/go/gofrontend/parse.cc  (working copy)
@@ -3483,7 +3483,7 @@ Parse::expression(Precedence precedence,
 
   if (is_parenthesized != NULL)
*is_parenthesized = false;
-  
+
   Operator op = token->op();
   Location binop_location = token->location();
 
@@ -3576,7 +3576,7 @@ Parse::unary_expr(bool may_be_sink, bool
 
   // There is a complex parse for <- chan.  The choices are
   // Convert x to type <- chan int:
-  //   (<- chan int)(x) 
+  //   (<- chan int)(x)
   // Receive from (x converted to type c

[C++ PATCH] Kill DECL_SAVED_FUNCTION_DATA

2019-05-08 Thread Nathan Sidwell
I discovered that DECL_SAVED_FUNCTION_DATA became obsolete during or 
after the gimple conversion.  But we continued to faithfully save and 
restore it in the C++ FE.  We then extended it to hold the auto return 
pattern a function might have been declared with.


This patch removes the saved function data, and stores the auto pattern 
in the decl's slot that used to point to the saved data.


nathan
--
Nathan Sidwell
2019-05-08  Nathan Sidwell  

	Kill DECL_SAVED_FUNCTION_DATA .
	* cp-tree.h (language_function): Remove x_auto_return_pattern.
	(current_function_auto_return_pattern): Delete.
	(lang_decl_fn): Replace saved_language_function with
	saved_auto_return type.
	(DECL_SAVED_FUNCTION_DATA): Delete.
	(DECL_SAVED_AUTO_RETURN_TYPE): New.
	(FNDECL_USED_AUTO): Correct documentation.
	* decl.c (duplicate_decls): Adjust AUTO return handling.
	(start_preparsed_function): Replace
	current_function_auto_return_pattern with
	DECL_SAVED_AUTO_RETURN_TYPE.  Remove DECL_SAVED_FUNCTION_DATA
	zapping.
	(finish_function): Likewise.
	(save_function_data): Delete.
	(fndecl_declared_return_type): Reimplement.
	* mangle.c (write_unqualified_name): Use DECL_SAVED_AUTO_RETURN_TYPE.
	* method.c (make_thunk, make_alias_for): Likewise.
	* parser.c (cp_parser_jump_statement): Likewise.
	* pt.c (do_auto_deduction): Likewise.
	* typeck.c (check_return_expr): Likewise.

Index: gcc/cp/cp-tree.h
===
--- gcc/cp/cp-tree.h	(revision 271012)
+++ gcc/cp/cp-tree.h	(working copy)
@@ -1813,7 +1813,6 @@ struct GTY(()) language_function {
   tree x_in_charge_parm;
   tree x_vtt_parm;
   tree x_return_value;
-  tree x_auto_return_pattern;
 
   BOOL_BITFIELD returns_value : 1;
   BOOL_BITFIELD returns_null : 1;
@@ -1909,11 +1908,6 @@ struct GTY(()) language_function {
 #define current_function_return_value \
   (cp_function_chain->x_return_value)
 
-/* A type involving 'auto' to be used for return type deduction.  */
-
-#define current_function_auto_return_pattern \
-  (cp_function_chain->x_auto_return_pattern)
-
 /* In parser.c.  */
 extern tree cp_literal_operator_id (const char *);
 
@@ -2654,8 +2648,7 @@ struct GTY(()) lang_decl_fn {
   union lang_decl_u3
   {
 struct cp_token_cache * GTY ((tag ("1"))) pending_inline_info;
-struct language_function * GTY ((tag ("0")))
-  saved_language_function;
+tree GTY ((tag ("0"))) saved_auto_return_type;
   } GTY ((desc ("%1.pending_inline_p"))) u;
 
 };
@@ -3700,10 +3693,10 @@ struct GTY(()) lang_decl {
 #define FOLD_EXPR_INIT(NODE) \
   TREE_OPERAND (BINARY_FOLD_EXPR_CHECK (NODE), 2)
 
-/* In a FUNCTION_DECL, the saved language-specific per-function data.  */
-#define DECL_SAVED_FUNCTION_DATA(NODE)			\
+/* In a FUNCTION_DECL, the saved auto-return pattern.  */
+#define DECL_SAVED_AUTO_RETURN_TYPE(NODE)		\
   (LANG_DECL_FN_CHECK (FUNCTION_DECL_CHECK (NODE))	\
-   ->u.saved_language_function)
+   ->u.saved_auto_return_type)
 
 /* True if NODE is an implicit INDIRECT_REF from convert_from_reference.  */
 #define REFERENCE_REF_P(NODE)\
@@ -3934,7 +3927,7 @@ more_aggr_init_expr_args_p (const aggr_i
 /* True if NODE was declared with auto in its return type, but it has
started compilation and so the return type might have been changed by
return type deduction; its declared return type should be found in
-   DECL_STRUCT_FUNCTION(NODE)->language->x_auto_return_pattern.  */
+   DECL_SAVED_AUTO_RETURN_TYPE (NODE).   */
 #define FNDECL_USED_AUTO(NODE) \
   TREE_LANG_FLAG_2 (FUNCTION_DECL_CHECK (NODE))
 
Index: gcc/cp/decl.c
===
--- gcc/cp/decl.c	(revision 271012)
+++ gcc/cp/decl.c	(working copy)
@@ -80,7 +80,6 @@ static void maybe_deduce_size_from_array
 static void layout_var_decl (tree);
 static tree check_initializer (tree, tree, int, vec **);
 static void make_rtl_for_nonlocal_decl (tree, tree, const char *);
-static void save_function_data (tree);
 static void copy_type_enum (tree , tree);
 static void check_function_type (tree, tree);
 static void finish_constructor_body (void);
@@ -2480,9 +2479,9 @@ duplicate_decls (tree newdecl, tree oldd
 	}
 	  else if (DECL_PENDING_INLINE_P (newdecl))
 	;
-	  else if (DECL_SAVED_FUNCTION_DATA (newdecl) == NULL)
-	DECL_SAVED_FUNCTION_DATA (newdecl)
-	  = DECL_SAVED_FUNCTION_DATA (olddecl);
+	  else if (DECL_SAVED_AUTO_RETURN_TYPE (newdecl) == NULL)
+	DECL_SAVED_AUTO_RETURN_TYPE (newdecl)
+	  = DECL_SAVED_AUTO_RETURN_TYPE (olddecl);
 
 	  DECL_DECLARED_INLINE_P (newdecl) |= DECL_DECLARED_INLINE_P (olddecl);
 
@@ -15449,20 +15448,21 @@ start_preparsed_function (tree decl1, tr
   current_stmt_tree ()->stmts_are_full_exprs_p = 1;
   current_binding_level = bl;
 
+  /* If we are (erroneously) defining a function that we have already
+ defined before, wipe out what we knew before.  */
+  gcc_checking_assert (!DECL_PENDING_INLINE_P (decl1));
+  FNDECL_USED_AUTO (decl1) = false;
+  DECL_SAVED_AU

libgo patch committed: Use builtin memmove directly

2019-05-08 Thread Ian Lance Taylor
This libgo patch by Cherry Zhang changes the runtime package to use
the builtin memmove function directly, rather than calling through C
code.  Bootstrapped and ran Go testsuite on x86_64-pc-linux-gnu.
Committed to mainline.

Ian

2019-05-08  Cherry Zhang  

* go-gcc.cc (Gcc_backend::Gcc_backend): Define memmove builtin.
Index: gcc/go/go-gcc.cc
===
--- gcc/go/go-gcc.cc(revision 270877)
+++ gcc/go/go-gcc.cc(working copy)
@@ -604,6 +604,15 @@ Gcc_backend::Gcc_backend()
NULL_TREE),
   false, false);
 
+  // We use __builtin_memmove for copying data.
+  this->define_builtin(BUILT_IN_MEMMOVE, "__builtin_memmove", "memmove",
+  build_function_type_list(void_type_node,
+   ptr_type_node,
+   const_ptr_type_node,
+   size_type_node,
+   NULL_TREE),
+  false, false);
+
   // Used by runtime/internal/sys.
   this->define_builtin(BUILT_IN_CTZ, "__builtin_ctz", "ctz",
   build_function_type_list(integer_type_node,
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 271014)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-0b4cf8ded107ccbfbd4af7f4e056f23f941d0f86
+3a9bccfbf4af1c756978c40967838d9f6a4e7a62
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: libgo/Makefile.am
===
--- libgo/Makefile.am   (revision 270877)
+++ libgo/Makefile.am   (working copy)
@@ -461,7 +461,6 @@ runtime_files = \
runtime/go-memclr.c \
runtime/go-memcmp.c \
runtime/go-memequal.c \
-   runtime/go-memmove.c \
runtime/go-nanotime.c \
runtime/go-now.c \
runtime/go-nosys.c \
Index: libgo/go/runtime/stubs.go
===
--- libgo/go/runtime/stubs.go   (revision 270877)
+++ libgo/go/runtime/stubs.go   (working copy)
@@ -100,6 +100,7 @@ func reflect_memclrNoHeapPointers(ptr un
 
 // memmove copies n bytes from "from" to "to".
 //go:noescape
+//extern __builtin_memmove
 func memmove(to, from unsafe.Pointer, n uintptr)
 
 //go:linkname reflect_memmove reflect.memmove
Index: libgo/runtime/go-memmove.c
===
--- libgo/runtime/go-memmove.c  (revision 270877)
+++ libgo/runtime/go-memmove.c  (nonexistent)
@@ -1,16 +0,0 @@
-/* go-memmove.c -- move one memory buffer to another
-
-   Copyright 2016 The Go Authors. All rights reserved.
-   Use of this source code is governed by a BSD-style
-   license that can be found in the LICENSE file.  */
-
-#include "runtime.h"
-
-void move(void *, void *, uintptr)
-  __asm__ (GOSYM_PREFIX "runtime.memmove");
-
-void
-move (void *p1, void *p2, uintptr len)
-{
-  __builtin_memmove (p1, p2, len);
-}


Re: Make deque iterator operators hidden friends

2019-05-08 Thread Jonathan Wakely

On 08/05/19 18:50 +0200, François Dumont wrote:
Here is a patch to reduce the number of operators exposed at std
namespace scope.


    * include/bits/stl_deque.h
    (_Deque_iterator<>::operator+(difference_type)): Make hidden friend.
    (_Deque_iterator<>::operator-(difference_type)): Likewise.
    (operator==(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator!=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.

Tested under Linux x86_64 normal/debug modes.

Ok to commit ?

François




diff --git a/libstdc++-v3/include/bits/stl_deque.h 
b/libstdc++-v3/include/bits/stl_deque.h
index 7a7a42aa903..72420c27646 100644
--- a/libstdc++-v3/include/bits/stl_deque.h
+++ b/libstdc++-v3/include/bits/stl_deque.h
@@ -238,24 +238,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
return *this;
  }

-  _Self
-  operator+(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-   _Self __tmp = *this;
-   return __tmp += __n;
-  }
-
  _Self&
  operator-=(difference_type __n) _GLIBCXX_NOEXCEPT
  { return *this += -__n; }

-  _Self
-  operator-(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-   _Self __tmp = *this;
-   return __tmp -= __n;
-  }
-


These two are already not visible at namespace-scope, but it doesn't
hurt to make them hidden friends for consistency.


+  friend _Self
+  operator+(const _Self& __x, difference_type __n) _GLIBCXX_NOEXCEPT
+  {
+   _Self __tmp = __x;
+   return __tmp += __n;


I know you've just reused the original implementation, but it has a
problem that we might as well fix while touching this code.

This function prevents NRVO copy elision, because __tmp += __n returns
a reference, which gets copied into the return value.

If you write it like this then the copy is elided, and __tmp is
constructed directly in the return value slot:

 _Self __tmp = __x;
 __tmp += __n;
 return __tmp;
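
(For illustration, a minimal standalone sketch of the difference, using
a hypothetical iterator-like type It rather than the real
_Deque_iterator:

  struct It { It& operator+=(long n); };

  It f1(const It& y, long n)
  { It x = y; return x += n; }    // return expr is an lvalue: forced copy

  It f2(const It& y, long n)
  { It x = y; x += n; return x; } // named local: NRVO builds x in place

In f1 the return expression is a reference to x, so its referent must
be copied into the return value; in f2 the named object can be
constructed directly in the return slot.)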



+  friend _Self
+  operator-(const _Self& __x, difference_type __n) _GLIBCXX_NOEXCEPT
+  {
+   _Self __tmp = __x;
+   return __tmp -= __n;


Same problem here. It would be better to write:

 _Self __tmp = __x;
 __tmp -= __n;
 return __tmp;

OK for trunk with those two changes, thanks.




Re: [Patch, fortran] ISO_Fortran_binding PRs 90093, 90352 & 90355

2019-05-08 Thread Paul Richard Thomas
Unless there are any objections to this patch, I plan to commit to
trunk and 9-branch tomorrow night, with the change to the testcase
pointed out by Dominique.

I sincerely hope that will be the end of CFI PRs for a little while,
at least. I have a load of pending patches and want to get on with
fixing PDTs.

Cheers

Paul

On Mon, 6 May 2019 at 19:59, Paul Richard Thomas
 wrote:
>
> It helps to attach the patch!
>
> On Mon, 6 May 2019 at 19:57, Paul Richard Thomas
>  wrote:
> >
> > Unfortunately, this patch was still in the making at the release of
> > 9.1. It is more or less self explanatory with the ChangeLogs.
> >
> > It should be noted that gfc_conv_expr_present could not be used in the
> > fix for PR90093 because the passed descriptor is a CFI type. Instead,
> > the test is for a null pointer passed.
> >
> > The changes to trans-array.c(gfc_trans_create_temp_array) have an eye
> > on the future, as well as PR90355. I am progressing towards the point
> > where all descriptors have 'span' set correctly so that
> > trans.c(get_array_span) can be eliminated and much of the code in the
> > callers can be simplified.
> >
> > Bootstrapped and regtested on FC29/x86_64 - OK for trunk and 9-branch?
> >
> > Paul
> >
> > 2019-05-06  Paul Thomas  
> >
> > PR fortran/90093
> > * trans-decl.c (convert_CFI_desc): Test that the dummy is
> > present before doing any of the conversions.
> >
> > PR fortran/90352
> > * decl.c (gfc_verify_c_interop_param): Restore the error for
> > charlen > 1 actual arguments passed to bind(C) procs.
> > Clean up trailing white space.
> >
> > PR fortran/90355
> > * trans-array.c (gfc_trans_create_temp_array): Set the 'span'
> > field to the element length for all types.
> > (gfc_conv_expr_descriptor): The force_no_tmp flag is used to
> > prevent temporary creation, especially for substrings.
> > * trans-decl.c (gfc_trans_deferred_vars): Rather than assert
> > that the backend decl for the string length is non-null, use it
> > as a condition before calling gfc_trans_vla_type_sizes.
> > * trans-expr.c (gfc_conv_gfc_desc_to_cfi_desc): 'force_no_tmp'
> > is set before calling gfc_conv_expr_descriptor.
> > * trans.c (get_array_span): Move the code for extracting 'span'
> > from gfc_build_array_ref to this function. This is specific to
> > descriptors that are component and indirect references.
> > * trans.h : Add the force_no_tmp flag bitfield to gfc_se.
> >
> > 2019-05-06  Paul Thomas  
> >
> > PR fortran/90093
> > * gfortran.dg/ISO_Fortran_binding_12.f90: New test.
> > * gfortran.dg/ISO_Fortran_binding_12.c: Supplementary code.
> >
> > PR fortran/90352
> > * gfortran.dg/iso_c_binding_char_1.f90: New test.
> >
> > PR fortran/90355
> > * gfortran.dg/ISO_Fortran_binding_4.f90: Add 'substr' to test
> > the direct passing of substrings as descriptors to bind(C).
> > * gfortran.dg/assign_10.f90: Increase the tree_dump count of
> > 'atmp' to account for the setting of the 'span' field.
> > * gfortran.dg/transpose_optimization_2.f90: Ditto.



-- 
"If you can't explain it simply, you don't understand it well enough"
- Albert Einstein


New German PO file for 'gcc' (version 9.1.0)

2019-05-08 Thread Translation Project Robot
Hello, gentle maintainer.

This is a message from the Translation Project robot.

A revised PO file for textual domain 'gcc' has been submitted
by the German team of translators.  The file is available at:

https://translationproject.org/latest/gcc/de.po

(This file, 'gcc-9.1.0.de.po', has just now been sent to you in
a separate email.)

All other PO files for your package are available in:

https://translationproject.org/latest/gcc/

Please consider including all of these in your next release, whether
official or a pretest.

Whenever you have a new distribution with a new version number ready,
containing a newer POT file, please send the URL of that distribution
tarball to the address below.  The tarball may be just a pretest or a
snapshot, it does not even have to compile.  It is just used by the
translators when they need some extra translation context.

The following HTML page has been updated:

https://translationproject.org/domain/gcc.html

If any question arises, please contact the translation coordinator.

Thank you for all your work,

The Translation Project robot, in the
name of your translation coordinator.




Re: [PATCH][Tree-optimization/PR89772]fold memchr builtins for character not in constant nul-padded string

2019-05-08 Thread Bernd Edlinger
On 5/8/19 4:31 PM, Richard Biener wrote:
> On Tue, May 7, 2019 at 4:34 AM JunMa  wrote:
>>
>> On 2019/5/6 7:58 PM, JunMa wrote:
>>> On 2019/5/6 6:02 PM, Richard Biener wrote:
 On Thu, Mar 21, 2019 at 5:57 AM JunMa  wrote:
> Hi
> For now, gcc can not fold code like:
>
> const char a[5] = "123"
> __builtin_memchr (a, '7', sizeof a)
>
> It tries to avoid folding beyond the string length, although the length
> of a is 5.
> This is a bit conservative; it's safe to fold memchr/bcmp/memcmp
> builtins when the constant string is stored in an array with some
> trailing NULs.
>
> This patch folds these cases by exposing additional length of
> trailing nuls in c_getstr().
> Bootstrapped/regtested on x86_64-linux, ok for trunk?
 It's probably better if gimple_fold_builtin_memchr uses string_constant
 directly instead?
>>> We can use string_constant in gimple_fold_builtin_memchr, however it is a
>>> bit complex to use it in memchr/memcmp constant folding.
 You are changing memcmp constant folding but only have a testcase
 for memchr.
>>> I'll add later.
 If we decide to amend c_getstr I would rather make it return the
 total length instead of the number of trailing zeros.
>>> I think returning the total length is safe in other places as well.
>>> I used the argument in the patch since I did not want any impact on
>>> other parts at all.
>>>
>>
>> Although it is safe to use the total length, I found that the function
>> inline_expand_builtin_string_cmp(), which uses c_getstr(), may emit
>> redundant RTL for trailing NUL chars when the total length is returned.
>>
>> Also, it is trivial to handle a constant string with trailing NUL chars.
>>
>> So this updated patch follows Richard's suggestion: using string_constant
>> directly.
>>
>> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> 
> Doesn't this fold to NULL for the case where searching for '0' and it
> doesn't occur in the string constant but only the zero-padding?
> So you'd need to conditionalize on c being not zero (or handle
> that case specially by actually finding the first zero in the padding)?
> I think there was work to have all string constants zero terminated
> but I'm not sure if we can rely on that here.  Bernd?
> 

It turned out that there is a number of languages that don't have
zero-terminated strings by default, which would have complicated the
patch just too much for too little benefit.

In the end, it was more important to guarantee that mem_size >=
string_length holds.

In C it is just a convention that string constants usually have a
zero-termination, but even in C there are ways for string constants to
not be zero-terminated.

There can in theory be optimizations that introduce not zero-terminated
strings, like tree-ssa-forwprop.c, where a not zero-terminated string
constant is folded in simplify_builtin_call.

In such a case, c_getstr might in theory return a string without
zero-termination, but I think it will be rather difficult to find a C
test case for that.

Well, if I had a test case for that I would probably fix it in c_getstr
to consider the implicit padding as equivalent to an explicit
zero-termination.
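
For concreteness, a small sketch of the padding case under discussion
(the array size and contents are illustrative only):

  const char a[5] = "123";   /* stored bytes: '1' '2' '3' '\0' '\0' */

  /* memchr (a, '7', sizeof a): '7' occurs nowhere in the object, so
     folding to NULL is safe even though strlen (a) == 3.  */
  /* memchr (a, '\0', sizeof a): must yield a + 3, the first NUL, which
     lies in the zero-padding rather than in the literal proper.  */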


Bernd.


> Richard.
> 
>> Regards
>> JunMa
>>
>> gcc/ChangeLog
>>
>> 2019-05-07  Jun Ma 
>>
>>  PR Tree-optimization/89772
>>  * gimple-fold.c (gimple_fold_builtin_memchr): consider trailing nuls in
>>  out-of-bound accesses checking.
>>
>> gcc/testsuite/ChangeLog
>>
>> 2019-05-07  Jun Ma 
>>
>>  PR Tree-optimization/89772
>>  * gcc.dg/builtin-memchr-4.c: New test.
>>> Thanks
>>> JunMa
 Richard.

> Regards
> JunMa
>
>
> gcc/ChangeLog
>
> 2019-03-21  Jun Ma 
>
>   PR Tree-optimization/89772
>   * fold-const.c (c_getstr): Add new parameter to get length of
> additional
>   trailing nuls after constant string.
>   * gimple-fold.c (gimple_fold_builtin_memchr): consider
> trailing nuls in
>   out-of-bound accesses checking.
>   * fold-const-call.c (fold_const_call): Likewise.
>
>
> gcc/testsuite/ChangeLog
>
> 2019-03-21  Jun Ma 
>
>   PR Tree-optimization/89772
>   * gcc.dg/builtin-memchr-4.c: New test.
>>>
>>


New French PO file for 'gcc' (version 9.1.0)

2019-05-08 Thread Translation Project Robot
Hello, gentle maintainer.

This is a message from the Translation Project robot.

A revised PO file for textual domain 'gcc' has been submitted
by the French team of translators.  The file is available at:

https://translationproject.org/latest/gcc/fr.po

(This file, 'gcc-9.1.0.fr.po', has just now been sent to you in
a separate email.)

All other PO files for your package are available in:

https://translationproject.org/latest/gcc/

Please consider including all of these in your next release, whether
official or a pretest.

Whenever you have a new distribution with a new version number ready,
containing a newer POT file, please send the URL of that distribution
tarball to the address below.  The tarball may be just a pretest or a
snapshot, it does not even have to compile.  It is just used by the
translators when they need some extra translation context.

The following HTML page has been updated:

https://translationproject.org/domain/gcc.html

If any question arises, please contact the translation coordinator.

Thank you for all your work,

The Translation Project robot, in the
name of your translation coordinator.




Go patch committed: Generate memmove for non-pointer slice copy

2019-05-08 Thread Ian Lance Taylor
This Go frontend patch by Cherry Zhang generates a call to
__builtin_memmove for a non-pointer slice copy.  The builtin copy
function is lowered to runtime functions slicecopy, stringslicecopy,
or typedslicecopy.  The first two are basically thin wrappers of
memmove.  Instead of making a runtime call, we can just use
__builtin_memmove.  This gives the compiler backend opportunities for
further optimizations.

Move the lowering of the builtin copy function to the flatten phase for
ease of rewriting.

Also do this optimization for the copy part of append(s1, s2...).
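
Schematically, for a non-pointer element type the new lowering computes
the following (an illustrative rendering, not the exact generated IR;
see the expressions.cc hunk below):

  // n  = min (len (arg1), len (arg2))
  // sz = n * sizeof (element type)
  // __builtin_memmove (arg1.ptr, arg2.ptr, sz)
  // the result of copy() is n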

Bootstrapped and ran Go testsuite on x86_64-pc-linux-gnu.  Committed
to mainline.

Ian
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 271016)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-3a9bccfbf4af1c756978c40967838d9f6a4e7a62
+859e8ed3d632d9fe43d03fb81f6abefecf5fe3a6
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: gcc/go/gofrontend/expressions.cc
===
--- gcc/go/gofrontend/expressions.cc(revision 271014)
+++ gcc/go/gofrontend/expressions.cc(working copy)
@@ -7743,8 +7743,9 @@ Builtin_call_expression::do_lower(Gogo*,
   return this;
 }
 
-// Flatten a builtin call expression.  This turns the arguments of copy and
-// append into temporary expressions.
+// Flatten a builtin call expression.  This turns the arguments of some
+// builtin calls into temporary expressions.  Also expand copy and append
+// to runtime calls.
 
 Expression*
 Builtin_call_expression::do_flatten(Gogo* gogo, Named_object* function,
@@ -7781,6 +7782,85 @@ Builtin_call_expression::do_flatten(Gogo
*pa = Expression::make_temporary_reference(temp, loc);
  }
  }
+
+// Lower to runtime call.
+const Expression_list* args = this->args();
+go_assert(args != NULL && args->size() == 2);
+Expression* arg1 = args->front();
+Expression* arg2 = args->back();
+go_assert(arg1->is_variable());
+go_assert(arg2->is_variable());
+bool arg2_is_string = arg2->type()->is_string_type();
+
+Expression* ret;
+Type* et = at->array_type()->element_type();
+if (et->has_pointer())
+  {
+Expression* td = Expression::make_type_descriptor(et, loc);
+ret = Runtime::make_call(Runtime::TYPEDSLICECOPY, loc,
+ 3, td, arg1, arg2);
+  }
+else
+  {
+Type* int_type = Type::lookup_integer_type("int");
+Type* uintptr_type = Type::lookup_integer_type("uintptr");
+
+// l1 = len(arg1)
+Named_object* lenfn = gogo->lookup_global("len");
+Expression* lenref = Expression::make_func_reference(lenfn, NULL, 
loc);
+Expression_list* len_args = new Expression_list();
+len_args->push_back(arg1->copy());
+Expression* len1 = Expression::make_call(lenref, len_args, false, 
loc);
+gogo->lower_expression(function, inserter, &len1);
+gogo->flatten_expression(function, inserter, &len1);
+Temporary_statement* l1tmp = Statement::make_temporary(int_type, 
len1, loc);
+inserter->insert(l1tmp);
+
+// l2 = len(arg2)
+len_args = new Expression_list();
+len_args->push_back(arg2->copy());
+Expression* len2 = Expression::make_call(lenref, len_args, false, 
loc);
+gogo->lower_expression(function, inserter, &len2);
+gogo->flatten_expression(function, inserter, &len2);
+Temporary_statement* l2tmp = Statement::make_temporary(int_type, 
len2, loc);
+inserter->insert(l2tmp);
+
+// n = (l1 < l2 ? l1 : l2)
+Expression* l1ref = Expression::make_temporary_reference(l1tmp, 
loc);
+Expression* l2ref = Expression::make_temporary_reference(l2tmp, 
loc);
+Expression* cond = Expression::make_binary(OPERATOR_LT, l1ref, 
l2ref, loc);
+Expression* n = Expression::make_conditional(cond,
+ l1ref->copy(),
+ l2ref->copy(),
+ loc);
+Temporary_statement* ntmp = Statement::make_temporary(NULL, n, 
loc);
+inserter->insert(ntmp);
+
+// sz = n * sizeof(elem_type)
+Expression* nref = Expression::make_temporary_reference(ntmp, loc);
+nref = Expression::make_cast(uintptr_type, nref, loc);
+Expression* sz = Expression::make_type_info(et, TYPE_INFO_SIZE);
+sz = Expression::make_binary(OPERATOR_MULT, sz, nref, loc);
+
+// memmove(arg1.ptr, arg2.ptr, sz)
+  

Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Jeff Law
On 5/8/19 6:20 AM, Richard Biener wrote:
> On Wed, 8 May 2019, Jiufu Guo wrote:
> 
>> Hi,
>>
>> Thanks Richard, Segher, Andrew and all.
>>
>> Segher Boessenkool  writes:
>>
>>> Let me try to answer some of this...
>>>
>>> On Tue, May 07, 2019 at 03:31:27PM +0200, Richard Biener wrote:
 On Mon, 6 May 2019, Jiufu Guo wrote:
> This patch implements the optimization in PR77820.  The optimization
> eliminates the phi and the phi's basic block, if the phi is used only
> by a condition branch, and the phi's incoming values are the results
> of CMPs.
>>>
 I'm not sure I like a new pass here ;)  The transform is basically
 tail-duplicating the PHI block because the exit conditional can
 be "simplfied" - that's something jump threading generally does
 though it relies on "simplified" being a little bit more simplified
 than above.
>>>
>> Adding a new pass is just because it may be relatively easy to extend
>> and maintain. 
>>
>> Adding this micro-optimization into other passes is also a good
>> suggestion.
>>
>> - Similar to jump threading, this new transform replaces the jump's
>> old destination with a new destination.  But in this case, the new
>> destination cannot be determined at compile time.
>>
>> - And as Andrew said: forwprop/backprop are similar, but this case is
>> the opposite: it spreads common code (phi+gcond) into different preds.
>>
>> - And there is the phiopt pass, which focuses on making 'phi' stmts
>> better.  It also does something similar:
>>  bb0:
>>if (a != b) goto bb2; else goto bb1;
>>  bb1:
>>  bb2:
>>x = PHI <a (bb1), ...>;
>>
>>transform to:
>>
>>  bb0:
>>  bb2:
>>x = PHI <b (bb0), ...>;
>>
>> If I avoid adding a new pass, any suggestions about which existing pass
>> (jumpthreading/forwprop/phiopt/...) would be more suitable to extend to
>> support this new transform?
> 
> The main thing the transform does is tail-duplicate the PHI block,
> if we'd just do that, followup transforms would do the rest.
One might even claim this is really a transformation for cfg cleanups.
If we ignore the PHI, what we have is an unconditional jump to a
conditional jump.  The obvious way to optimize that is to replace the
unconditional jump with a copy of the conditional jump.



>> Currently this pass handles the case where there is only one PHI and
>> the phi is used by a gcond.  If there are other suitable cases, we can
>> just extend this transform.
> 
> I was just thinking of a small number of unrelated stmts in those
> blocks or the case where just one PHI argument is fed by a conditional
> thus only one path would simplify (but maybe the hot one if you look
> at edge probabilities).
Perhaps, but do these happen enough in practice?  With the code motions
we do I'd be a bit surprised.

> 
>>>
 Any specific reason for the pass placement between PRE and sinking?
 tracer and path splitting run much later, jump threading runs
 all over the place.
>>>
>>> Dunno.  Jiufu, does the pass placement matter much?
>> This pass would just need to be inserted after ssa, and before the last
>> dce and forwprop, which could help to eliminate the temporary conversion
>> from bool to int:
>>  _1 = a_8(D) < b_9(D);
>>  t_14 = (int) _1;
>>  if (_1 != 0)
>>
>> Putting it after PRE is just because some cases are better handled by
>> PRE first, like the case below:  if (x) t = a < b; else t = a < b; if (t) xx.
>> Actually, though, this pass is OK to insert at other placements.
> 
> OK, so all of the suggestions above would work since we have a late
> forwprop and DCE pass to clean up.
I'd expect this transformation to be useful for jump threading, VRP,
DOM, etc.

> 
> Btw, I wonder if RTL basic-block reordering (which also does
> some tail duplication) could be a place to do such transform?
> Or is it too late to do the desired cleanups after that?
> Possibly since we're after RA.
IIRC we've had code in jump.c to do this in the past.  It may have
morphed through the years, but I'm pretty sure we've done this kind of
thing on a low level before.
jeff


Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Jeff Law
On 5/6/19 8:24 AM, Jiufu Guo wrote:
> Hi,
> 
> This patch implements the optimization in PR77820.  The optimization
> eliminates the phi and the phi's basic block, if the phi is used only
> by a condition branch, and the phi's incoming values are the results
> of CMPs.
> 
> This optimization eliminates:
> 1. saving CMP result: p0 = a CMP b.
> 2. additional CMP on branch: if (phi != 0).
> 3. converting CMP result if there is phi = (INT_CONV) p0 if there is.
> 
>   <bb 1>:
>   p0 = a CMP b
>   goto <bb 3>;
> 
>   <bb 2>:
>   p1 = c CMP d
>   goto <bb 3>;
> 
>   <bb 3>:
>   # phi = PHI <p0 (bb 1), p1 (bb 2)>
>   if (phi != 0) goto <bb 4>; else goto <bb 5>;
> 
> Transform to:
> 
>   <bb 1>:
>   p0 = a CMP b
>   if (p0 != 0) goto <bb 4>; else goto <bb 5>;
> 
>   <bb 2>:
>   p1 = c CMP d
>   if (p1 != 0) goto <bb 4>; else goto <bb 5>;
> 
> Bootstrapped and tested on powerpc64le with no regressions, and testcases were
> saved. Is this ok for trunk?
So there's been talk of somehow integrating with jump threading.   The
big problem with that idea is jump threading is only concerned with
rearranging the CFG when doing so allows it to determine the result of a
conditional branch.

This is actually a generalized form of path splitting that we currently
do for loop latches.  You may find you can twiddle that code to do what
you want.

One of the things that falls out of that realization is that you have to
be very careful that this doesn't muck up if-conversion.

Jeff
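
For reference, a source-level shape that produces the quoted pattern
(a sketch; names are illustrative):

  /* Both arms compute a compare and the branch tests the PHI of the
     two compare results -- the pattern the proposed pass rewrites.  */
  int f (int a, int b, int c, int d, int x)
  {
    int t;
    if (x)
      t = a < b;
    else
      t = c < d;
    if (t)
      return 1;
    return 0;
  }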



Re: [PATCH] PR libstdc++/71579 assert that type traits are not misused with an incomplete type

2019-05-08 Thread Jonathan Wakely

On 06/05/19 14:19 +0300, Antony Polukhin wrote:

@@ -924,14 +984,25 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  template<typename _Tp>
struct is_default_constructible
: public __is_default_constructible_safe<_Tp>::type
-{ };
+{
+  static_assert(std::__is_complete_or_unbounded(__type_identity<_Tp>{}),
+   "template argument must be a complete class or an unbounded array");
+};

  /// is_constructible


This "///" comment is for Doxygen, and should remain with the
is_constructible trait, not __is_constructible_impl.


-  template<typename _Tp, typename... _Args>
-struct is_constructible
+  template<typename _Tp, typename... _Args>
+struct __is_constructible_impl
  : public __bool_constant<__is_constructible(_Tp, _Args...)>
{ };

+  template<typename _Tp, typename... _Args>
+struct is_constructible
+  : public __is_constructible_impl<_Tp, _Args...>
+{
+  static_assert(std::__is_complete_or_unbounded(__type_identity<_Tp>{}),
+   "template argument must be a complete class or an unbounded array");
+};
+
  template<typename _Tp, bool = __is_referenceable<_Tp>::value>
struct __is_copy_constructible_impl;
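
For reference, a sketch of what the new assertion catches (the type
name is illustrative):

  struct Incomplete;   // declared but never defined

  // Undefined behavior before the patch, a hard static_assert error
  // after it:
  //   std::is_default_constructible<Incomplete>::value
  // Unbounded arrays remain allowed, per the assertion text:
  //   std::is_default_constructible<Incomplete[]>::value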



The rest looks reasonable, but I'll finish reviewing it ASAP, when I
have more time.



Re: [PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread Jeff Law
On 5/8/19 6:28 AM, Richard Biener wrote:
> On Wed, May 8, 2019 at 12:09 PM JunMa  wrote:
>>
>> Hi
>>
>> As described in PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106),
>> when gcc meets a builtin function call like:
>>
>>y = sqrt (x);
>>
>> The cdce pass tries to transform the call into an internal function
>> call and conditionally executes the call, with a simple range check on
>> the arguments which can detect most cases where errno does not need
>> to be set.  It looks like:
>>
>>y = IFN_SQRT (x);
>>if (__builtin_isless (x, 0))
>>  sqrt (x);
>>
>> However, if the call is in tail position, for example:
>>
>>y =  sqrt (x);
>>return y;
>>
>> will become:
>>
>>y = IFN_SQRT (x);
>>if (__builtin_isless (x, 0))
>>  sqrt (x);
>>return y;
>>
>> This transformation breaks the tailcall pattern, and prevents
>> later tailcall optimizations.
>>
>> So this patch transforms a builtin call with a return value into an
>> if-then-else form, which looks like:
>>
>>y =  sqrt (x);
>> ==>
>>if (__builtin_isless (x, 0))
>>  y = sqrt (x);
>>else
>>  y = IFN_SQRT (x);
>>
>> BTW, y = sqrt (x) can also be transformed like:
>>
>>y = IFN_SQRT (x);
>>if (__builtin_isless (x, 0))
>>  y = sqrt (x);
>>
>> We don't choose this pattern because it emits worse assembly code
>> (more move instructions and more register use) on x86_64.
>>
>> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> 
> OK.
JunMa -- do you have a copyright assignment on file and write access to
the repository?  If not we should take care of that before proceeding
further.

Jeff
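
For reference, a minimal caller shape (illustrative only) showing why
the if-then-else form quoted above matters for tail calls:

  double f (double x)
  {
    return sqrt (x);   /* tail-call candidate */
  }

  /* Old shaping: y = IFN_SQRT (x); if (isless (x, 0)) sqrt (x);
     return y;  -- the slow-path sqrt call is followed by further
     statements, so it can never be a tail call.  The new shaping puts
     each call in its own arm, and each arm flows straight into the
     return.  */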


[patch, fortran] C prototype writing improvements for gfortran

2019-05-08 Thread Thomas Koenig

Hello world,

the attached patch fixes PR 90351 (not all prototypes were written
to standard output with -fc-prototypes) and introduces new
functionality to also write C prototypes for external functions,
at the same time discouraging their use (because BIND(C) is really
the better, standard-conforming and portable way).  While looking
at the code, I also noticed that COMPLEX was not handled before,
so I added that, too.

Example:

$ cat c.f90
integer function r(i)
end

subroutine foo(a,b,c)
  character*(*) a
  real b
  complex c
end

character*(*) function x(r, c1,c2)
  real r
  character*(*) c1,c2
end
$ gfortran -fsyntax-only -fc-prototypes-external c.f90
/* Prototypes for external procedures generated from c.f90
   by GNU Fortran (GCC) 10.0.0 20190427 (experimental).

   Use of this interface is dicsouraged, consider using the
   BIND(C) feature of standard Fortran instead.  */

int r_ (int *i);
void foo_ (char *a, float *b, float complex *c, size_t a_len);
void x_ (char *result_x, size_t result_x_len, float *r, char *c1, char 
*c2, size_t c1_len, size_t c2_len);
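
For reference, a sketch of how such a dumped prototype would be
consumed from C++ (the caller code is illustrative, not part of the
patch):

  extern "C" int r_ (int *i);   // prototype taken from the dump above

  int call_r ()
  {
    int i = 42;
    return r_ (&i);   // scalars are passed by reference; hidden string
  }                   // lengths (if any) are appended last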


I'd like to commit this to trunk and to gcc-9, to help users of
old-fashioned Lapack bindings, such as R, with their transition
to something that does not violate gfortran's ABI.

Tested with "make dvi" and "make info".  Otherwise, since these flags
are not tested in the testsuite (maybe they should be, I just don't
know how), regression test passed.

OK?

2019-05-08  Thomas Koenig  

PR fortran/90351
PR fortran/90329
* dump-parse-tree.c: Include version.h.
(gfc_dump_external_c_prototypes): New function.
(get_c_type_name): Select "char" as a name for a simple char.
Adjust to handling external functions. Also handle complex.
(write_decl): Add argument bind_c. Adjust for dumping of external
procedures.
(write_proc): Likewise.
(write_interop_decl): Add bind_c argument to call of write_proc.
* gfortran.h: Add prototype for gfc_dump_external_c_prototypes.
* lang.opt: Add -fc-prototypes-external flag.
* parse.c (gfc_parse_file): Move dumping of BIND(C) prototypes.
Call gfc_dump_external_c_prototypes if option is set.
* invoke.texi: Document -fc-prototypes-external.


Index: dump-parse-tree.c
===
--- dump-parse-tree.c	(Revision 270622)
+++ dump-parse-tree.c	(Arbeitskopie)
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "coretypes.h"
 #include "gfortran.h"
 #include "constructor.h"
+#include "version.h"
 
 /* Keep track of indentation for symbol tree dumps.  */
 static int show_level = 0;
@@ -3074,6 +3075,7 @@ gfc_dump_parse_tree (gfc_namespace *ns, FILE *file
 /* This part writes BIND(C) definition for use in external C programs.  */
 
 static void write_interop_decl (gfc_symbol *);
+static void write_proc (gfc_symbol *, bool);
 
 void
 gfc_dump_c_prototypes (gfc_namespace *ns, FILE *file)
@@ -3086,6 +3088,33 @@ gfc_dump_c_prototypes (gfc_namespace *ns, FILE *fi
   gfc_traverse_ns (ns, write_interop_decl);
 }
 
+/* Loop over all global symbols, writing out their declarations.  */
+
+void
+gfc_dump_external_c_prototypes (FILE * file)
+{
+  dumpfile = file;
+  fprintf (dumpfile,
+	   _("/* Prototypes for external procedures generated from %s\n"
+	 "   by GNU Fortran %s%s.\n\n"
+	 "   Use of this interface is dicsouraged, consider using the\n"
+	 "   BIND(C) feature of standard Fortran instead.  */\n\n"),
+	   gfc_source_file, pkgversion_string, version_string);
+
+  for (gfc_current_ns = gfc_global_ns_list; gfc_current_ns;
+   gfc_current_ns = gfc_current_ns->sibling)
+{
+  gfc_symbol *sym = gfc_current_ns->proc_name;
+
+  if (sym == NULL || sym->attr.flavor != FL_PROCEDURE
+	  || sym->attr.is_bind_c)
+	continue;
+
+  write_proc (sym, false);
+}
+  return;
+}
+
 enum type_return { T_OK=0, T_WARN, T_ERROR };
 
 /* Return the name of the type for later output.  Both function pointers and
@@ -3104,7 +3133,7 @@ get_c_type_name (gfc_typespec *ts, gfc_array_spec
   *asterisk = false;
   *post = "";
   *type_name = "";
-  if (ts->type == BT_REAL || ts->type == BT_INTEGER)
+  if (ts->type == BT_REAL || ts->type == BT_INTEGER || ts->type == BT_COMPLEX)
 {
   if (ts->is_c_interop && ts->interop_kind)
 	{
@@ -3113,6 +3142,12 @@ get_c_type_name (gfc_typespec *ts, gfc_array_spec
 	*type_name = "signed char";
 	  else if (strcmp (*type_name, "size_t") == 0)
 	*type_name = "ssize_t";
+	  else if (strcmp (*type_name, "float_complex") == 0)
+	*type_name = "float complex";
+	  else if (strcmp (*type_name, "double_complex") == 0)
+	*type_name = "double complex";
+	  else if (strcmp (*type_name, "long_double_complex") == 0)
+	*type_name = "long double complex";
 
 	  ret = T_OK;
 	}
@@ -3130,6 +3165,12 @@ get_c_type_name (gfc_typespec *ts, gfc_array_spec
 		*type_name = "signed cha

Re: [PATCH] Add simplification rule tanh (x) * cosh (x) -> sinh (x)

2019-05-08 Thread Jeff Law
On 5/4/19 6:21 PM, Giuliano Belinassi wrote:
> Hi
> 
> On 04/30, Jeff Law wrote:
>> On 4/30/19 8:00 AM, Jakub Jelinek wrote:
>>> On Tue, Apr 30, 2019 at 07:57:20AM -0600, Jeff Law wrote:
> Just curious, do we want to add math identities like above to match.pd ?
 I'd think so.


> In practice, I am not sure how often we'd see  "tanh * cosh" instead
> of sinh directly in source,
 We're often surprised by what ultimately shows up in sources :-)  And
 a transformation that allows us to turn two transcendentals and a
 multiplication into a single transcendental is going to be a big win.
>>>
> 
> I wonder why these kinds of optimizations are not in the EasyHacks page.
I doubt anyone has really thought about them.  It's certainly
appropriate to add more stuff to that page ;-)

> It is somewhat easy to add it and can give huge performance
> improvements.
Agreed.

> 
> There is a blogpost that I wrote about the previous patches I submitted
> here. I tried to be didactic, discussing from the mathematical
> standpoint, to floating point numbers, and the GCC implementation.
> Although there may be several English errors, I think it is still
> useful:
> 
> https://flusp.ime.usp.br/gcc/2019/03/26/making-gcc-optimize-some-trigonometric-functions/
Thanks.  It's funny you mention high school math, I had to do a fair
amount of refreshing when we were working on your patches ;-)

jeff
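
For reference, the identity behind the simplification is elementary and
holds for all real x, since cosh(x) is never zero:

  tanh(x) * cosh(x) = (sinh(x) / cosh(x)) * cosh(x) = sinh(x)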


Re: [patch, fortran] C prototype writing improvements for gfortran

2019-05-08 Thread Steve Kargl
On Wed, May 08, 2019 at 11:30:57PM +0200, Thomas Koenig wrote:
> $ gfortran -fsyntax-only -fc-prototypes-external c.f90
> /* Prototypes for external procedures generated from c.f90
> by GNU Fortran (GCC) 10.0.0 20190427 (experimental).
> 
> Use of this interface is dicsouraged, consider using the

dicsouraged?

Otherwise, looks ok to me.

-- 
Steve


libgo patch committed: Add Debugging section to README

2019-05-08 Thread Ian Lance Taylor
This libgo patch adds a Debugging section to the README file.
Committed to mainline.

Ian
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 271017)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-859e8ed3d632d9fe43d03fb81f6abefecf5fe3a6
+f813c670deb8e0454c3f64de74bedb5dcedd10b4
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: libgo/README
===
--- libgo/README(revision 270877)
+++ libgo/README(working copy)
@@ -42,3 +42,28 @@ http://code.google.com/legal/corporate-c
 If the copyright holder for your code has already completed the
 agreement in connection with another Google open source project, it
 does not need to be completed again.
+
+Debugging
+=
+
+This describes how to test libgo when built as part of gccgo.
+
+To test a specific package, cd to the libgo build directory
+(TARGET/libgo) and run `make PKG/check`.  For example, `make
+bytes/check`.
+
+To see the exact commands that it runs, including how the compiler is
+invoked, run `make GOTESTFLAGS=--trace bytes/check`.  This will
+display the commands if the test fails.  If the test passes, the
+commands and other output will be visible in a file named
+check-testlog in a subdirectory with the name of the package being
+checked.  In the case of bytes/check, this will create
+bytes/check-testlog.
+
+To leave the test program behind, run `make GOTESTFLAGS=--keep
+bytes/check`.  That will leave a gotest/test directory in the
+libgo build directory.  In that directory you can run
+`LD_LIBRARY_PATH=../../.libs ./a.out -test.short` to run the tests.
+You can run specific failing tests using a -test.run option.  You can
+see the tests being run with the -test.v option.  You can run the
+program under a debugger such as gdb.


Re: [patch, fortran] C prototype writing improvements for gfortran

2019-05-08 Thread Thomas Koenig

Hi Steve,

> dicsouraged?

Fixed.


Otherwise, looks ok to me.


Committed, thanks.

Let's see where this leads...

Regards

Thomas


Go patch committed: Avoid copy for string([]byte) in string comparison

2019-05-08 Thread Ian Lance Taylor
This Go frontend patch by Cherry Zhang avoids a copy for a
string([]byte) conversion used in a string comparison.  If a
string([]byte) conversion is used immediately in a string comparison,
we don't need to copy the backing store of the byte slice, as the
string comparison doesn't hold any reference to it.  Instead, just
create a string header from the byte slice and pass it for comparison.
A new type of expression, String_value_expression, is introduced, for
constructing string headers.  Bootstrapped and ran Go testsuite on
x86_64-pc-linux-gnu.  Committed to mainline.
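
Conceptually, the new expression just builds the two-word string header
in place; an illustrative struct-level view (not the actual gofrontend
API):

  // what String_value_expression constructs:
  struct go_string { const unsigned char *ptr; size_t len; };
  // for a comparison, string(b) becomes { b.ptr, b.len } directly:
  // no allocation and no copy of b's backing array.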

Ian

2019-05-08  Cherry Zhang  

* go.dg/cmpstring.go: New test.
Index: gcc/go/gofrontend/MERGE
===
--- gcc/go/gofrontend/MERGE (revision 271019)
+++ gcc/go/gofrontend/MERGE (working copy)
@@ -1,4 +1,4 @@
-f813c670deb8e0454c3f64de74bedb5dcedd10b4
+9c8581187b1c1a30036263728370f31cb846a274
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
Index: gcc/go/gofrontend/expressions.cc
===
--- gcc/go/gofrontend/expressions.cc(revision 271017)
+++ gcc/go/gofrontend/expressions.cc(working copy)
@@ -2031,6 +2031,90 @@ Expression::make_string_info(Expression*
   return new String_info_expression(string, string_info, location);
 }
 
+// An expression that represents a string value: a struct with value pointer
+// and length fields.
+
+class String_value_expression : public Expression
+{
+ public:
+  String_value_expression(Expression* valptr, Expression* len, Location 
location)
+  : Expression(EXPRESSION_STRING_VALUE, location),
+valptr_(valptr), len_(len)
+  { }
+
+ protected:
+  int
+  do_traverse(Traverse*);
+
+  Type*
+  do_type()
+  { return Type::make_string_type(); }
+
+  void
+  do_determine_type(const Type_context*)
+  { go_unreachable(); }
+
+  Expression*
+  do_copy()
+  {
+return new String_value_expression(this->valptr_->copy(),
+   this->len_->copy(),
+   this->location());
+  }
+
+  Bexpression*
+  do_get_backend(Translate_context* context);
+
+  void
+  do_dump_expression(Ast_dump_context*) const;
+
+ private:
+  // The value pointer.
+  Expression* valptr_;
+  // The length.
+  Expression* len_;
+};
+
+int
+String_value_expression::do_traverse(Traverse* traverse)
+{
+  if (Expression::traverse(&this->valptr_, traverse) == TRAVERSE_EXIT
+  || Expression::traverse(&this->len_, traverse) == TRAVERSE_EXIT)
+return TRAVERSE_EXIT;
+  return TRAVERSE_CONTINUE;
+}
+
+Bexpression*
+String_value_expression::do_get_backend(Translate_context* context)
+{
+  std::vector<Bexpression*> vals(2);
+  vals[0] = this->valptr_->get_backend(context);
+  vals[1] = this->len_->get_backend(context);
+
+  Gogo* gogo = context->gogo();
+  Btype* btype = Type::make_string_type()->get_backend(gogo);
+  return gogo->backend()->constructor_expression(btype, vals, 
this->location());
+}
+
+void
+String_value_expression::do_dump_expression(
+Ast_dump_context* ast_dump_context) const
+{
+  ast_dump_context->ostream() << "stringvalue(";
+  ast_dump_context->ostream() << "value: ";
+  this->valptr_->dump_expression(ast_dump_context);
+  ast_dump_context->ostream() << ", length: ";
+  this->len_->dump_expression(ast_dump_context);
+  ast_dump_context->ostream() << ")";
+}
+
+Expression*
+Expression::make_string_value(Expression* valptr, Expression* len,
+  Location location)
+{
+  return new String_value_expression(valptr, len, location);
+}
+
 // Make an integer expression.
 
 class Integer_expression : public Expression
@@ -3702,9 +3786,11 @@ Type_conversion_expression::do_check_typ
 Expression*
 Type_conversion_expression::do_copy()
 {
-  return new Type_conversion_expression(this->type_->copy_expressions(),
-   this->expr_->copy(),
-   this->location());
+  Expression* ret = new 
Type_conversion_expression(this->type_->copy_expressions(),
+   this->expr_->copy(),
+   this->location());
+  ret->conversion_expression()->set_no_copy(this->no_copy_);
+  return ret;
 }
 
 // Get the backend representation for a type conversion.
@@ -3764,7 +3850,22 @@ Type_conversion_expression::do_get_backe
 
   Runtime::Function code;
   if (e->integer_type()->is_byte())
-code = Runtime::SLICEBYTETOSTRING;
+{
+  if (this->no_copy_)
+{
+  if (gogo->debug_optimization())
+go_inform(loc, "no copy string([]byte)");
+  Expression* ptr = Expression::make_slice_info(this->expr_,
+
SLICE_INFO_VALUE_POINTER,
+loc);

Re: [PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread Bin.Cheng
On Thu, May 9, 2019 at 5:31 AM Jeff Law  wrote:
>
> On 5/8/19 6:28 AM, Richard Biener wrote:
> > On Wed, May 8, 2019 at 12:09 PM JunMa  wrote:
> >>
> >> Hi
> >>
> >> As described in PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106),
> >> when gcc meets a builtin function call like:
> >>
> >>y = sqrt (x);
> >>
> >> The cdce pass tries to transform the call into an internal function
> >> call and conditionally executes the call, with a simple range check on
> >> the arguments which can detect most cases where errno does not need
> >> to be set.  It looks like:
> >>
> >>y = IFN_SQRT (x);
> >>if (__builtin_isless (x, 0))
> >>  sqrt (x);
> >>
> >> However, if the call is in tail position, for example:
> >>
> >>y =  sqrt (x);
> >>return y;
> >>
> >> will become:
> >>
> >>y = IFN_SQRT (x);
> >>if (__builtin_isless (x, 0))
> >>  sqrt (x);
> >>return y;
> >>
> >> This transformation breaks the tailcall pattern, and prevents
> >> later tailcall optimizations.
> >>
> >> So this patch transforms a builtin call with a return value into an
> >> if-then-else form, which looks like:
> >>
> >>y =  sqrt (x);
> >> ==>
> >>if (__builtin_isless (x, 0))
> >>  y = sqrt (x);
> >>else
> >>  y = IFN_SQRT (x);
> >>
> >> BTW, y = sqrt (x) can also be transformed like:
> >>
> >>y = IFN_SQRT (x);
> >>if (__builtin_isless (x, 0))
> >>  y = sqrt (x);
> >>
> >> We don't choose this pattern because it emits worse assembly code
> >> (more move instructions and more register use) on x86_64.
> >>
> >> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> >
> > OK.
> JunMa -- do you have a copyright assignment on file and write access to
> the repository?  If not we should take care of that before proceeding
> further.
Hi Jeff,
Thanks very much for helping.
Yes, he is under Alibaba's copyright assignment.  What else should we
do other than noting it in this mailing list message?
BTW, I think JunMa needs to apply for write access, though.

Thanks,
bin
>
> Jeff


Re: [PATCH][Tree-optimization/PR89772]fold memchr builtins for character not in constant nul-padded string

2019-05-08 Thread JunMa

On 2019/5/8 10:31 PM, Richard Biener wrote:

On Tue, May 7, 2019 at 4:34 AM JunMa  wrote:

On 2019/5/6 7:58 PM, JunMa wrote:

On 2019/5/6 6:02 PM, Richard Biener wrote:

On Thu, Mar 21, 2019 at 5:57 AM JunMa  wrote:

Hi
For now, gcc can not fold code like:

const char a[5] = "123"
__builtin_memchr (a, '7', sizeof a)

It tries to avoid folding beyond the string length, although the length
of a is 5.
This is a bit conservative; it's safe to fold memchr/bcmp/memcmp
builtins when the constant string is stored in an array with some
trailing NULs.

This patch folds these cases by exposing additional length of
trailing nuls in c_getstr().
Bootstrapped/regtested on x86_64-linux, ok for trunk?

It's probably better if gimple_fold_builtin_memchr uses string_constant
directly instead?

We can use string_constant in gimple_fold_builtin_memchr, however it is a
bit complex to use it in memchr/memcmp constant folding.

You are changing memcmp constant folding but only have a testcase
for memchr.

I'll add later.

If we decide to amend c_getstr I would rather make it return the
total length instead of the number of trailing zeros.

I think returning the total length is safe in other places as well.
I used the argument in the patch since I did not want any impact on
other parts at all.


Although it is safe to use the total length, I found that the function
inline_expand_builtin_string_cmp(), which uses c_getstr(), may emit
redundant RTL for trailing NUL chars when the total length is returned.

Also, it is trivial to handle a constant string with trailing NUL chars.

So this updated patch follows Richard's suggestion: using string_constant
directly.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

Doesn't this fold to NULL for the case where searching for '0' and it
doesn't occur in the string constant but only the zero-padding?



For a C case like:

  const char str[100] = "hello world";
  const char ch = '\0';
  ret = (char*)memchr(str, ch, strlen(str));
  ret2 = (char*)memchr(str, ch, sizeof str);

then ret is NULL (there is no NUL within the first strlen(str) bytes),
while ret2 is not: it points at the first padding NUL, str + 11.



So you'd need to conditionalize on c being not zero (or handle
that case specially by actually finding the first zero in the padding)?



Yes, the string length returned from c_getstr() includes the terminating
NUL most of the time (when string_length <= string_size).
If ch is NUL and the string length >= the 3rd argument of memchr,
then we do not need to handle this.

Thanks
JunMa



I think there was work to have all string constants zero terminated
but I'm not sure if we can rely on that here.  Bernd?

Richard.


Regards
JunMa

gcc/ChangeLog

2019-05-07  Jun Ma 

  PR Tree-optimization/89772
  * gimple-fold.c (gimple_fold_builtin_memchr): consider trailing nuls in
  out-of-bound accesses checking.

gcc/testsuite/ChangeLog

2019-05-07  Jun Ma 

  PR Tree-optimization/89772
  * gcc.dg/builtin-memchr-4.c: New test.

Thanks
JunMa

Richard.


Regards
JunMa


gcc/ChangeLog

2019-03-21  Jun Ma 

   PR Tree-optimization/89772
   * fold-const.c (c_getstr): Add new parameter to get length of
additional
   trailing nuls after constant string.
   * gimple-fold.c (gimple_fold_builtin_memchr): consider
trailing nuls in
   out-of-bound accesses checking.
   * fold-const-call.c (fold_const_call): Likewise.


gcc/testsuite/ChangeLog

2019-03-21  Jun Ma 

   PR Tree-optimization/89772
   * gcc.dg/builtin-memchr-4.c: New test.





Re: [PATCH][Tree-optimization/PR89772]fold memchr builtins for character not in constant nul-padded string

2019-05-08 Thread JunMa

On 2019/5/9 3:02 AM, Bernd Edlinger wrote:

On 5/8/19 4:31 PM, Richard Biener wrote:

On Tue, May 7, 2019 at 4:34 AM JunMa  wrote:

On 2019/5/6 7:58 PM, JunMa wrote:

On 2019/5/6 6:02 PM, Richard Biener wrote:

On Thu, Mar 21, 2019 at 5:57 AM JunMa  wrote:

Hi
For now, gcc can not fold code like:

const char a[5] = "123"
__builtin_memchr (a, '7', sizeof a)

It tries to avoid folding beyond the string length, although the length
of a is 5.
This is a bit conservative; it's safe to fold memchr/bcmp/memcmp
builtins when the constant string is stored in an array with some
trailing NULs.

This patch folds these cases by exposing additional length of
trailing nuls in c_getstr().
Bootstrapped/regtested on x86_64-linux, ok for trunk?

It's probably better if gimple_fold_builtin_memchr uses string_constant
directly instead?

We can use string_constant in gimple_fold_builtin_memchr, however it is a
bit complex to use it in memchr/memcmp constant folding.

You are changing memcmp constant folding but only have a testcase
for memchr.

I'll add later.

If we decide to amend c_getstr I would rather make it return the
total length instead of the number of trailing zeros.

I think returning the total length is safe in other places as well.
I used the argument in the patch since I did not want any impact on
other parts at all.


Although it is safe to use the total length, I found that the function
inline_expand_builtin_string_cmp(), which uses c_getstr(), may emit
redundant RTL for trailing NUL chars when the total length is returned.

Also, it is trivial to handle a constant string with trailing NUL chars.

So this updated patch follows Richard's suggestion: using string_constant
directly.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

Doesn't this fold to NULL for the case where searching for '0' and it
doesn't occur in the string constant but only the zero-padding?
So you'd need to conditionalize on c being not zero (or handle
that case specially by actually finding the first zero in the padding)?
I think there was work to have all string constants zero terminated
but I'm not sure if we can rely on that here.  Bernd?


It turned out that there is a number of languages that don't have
zero-terminated strings by default, which would have complicated the
patch just too much for too little benefit.

In the end, it was more important to guarantee that mem_size >=
string_length holds.


The string_length returned from c_getstr() is equal to mem_size when
string_length > string_size, so I'll add an assert in the patch.


In C it is just a convention that string constants usually have a
zero-termination, but even in C there are ways for string constants to
not be zero-terminated.

There can in theory be optimizations that introduce not zero-terminated strings,
like tree-ssa-forwprop.c, where a not zero-terminated string constant is folded
in simplify_builtin_call.

In such a case, c_getstr might in theory return a string without 
zero-termination,
but I think it will be rather difficult to find a C test case for that.

Well, if I had a test case for that I would probably fix it in c_getstr to 
consider
the implicit padding as equivalent to an explicit zero-termination.



c_getstr() makes sure the returned string has zero-termination when the
2nd argument is NULL, but not when string_length is returned.  I think
it is the caller's responsibility to take care of both the returned
string and the length.
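
As a rough sketch (GCC-internal usage, shown schematically; 'arg' and
'len' are hypothetical names), the two caller patterns discussed above
differ like this:

  unsigned HOST_WIDE_INT len;

  /* 2nd argument NULL: only succeeds for a zero-terminated constant,
     so the result can be treated as a C string.  */
  const char *p = c_getstr (arg);

  /* 2nd argument non-NULL: the length comes back in 'len' and the
     caller must bound every access by it; zero-termination is not
     guaranteed.  */
  const char *q = c_getstr (arg, &len);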



Thanks
JunMa




Bernd.



Richard.


Regards
JunMa

gcc/ChangeLog

2019-05-07  Jun Ma 

  PR tree-optimization/89772
  * gimple-fold.c (gimple_fold_builtin_memchr): Consider trailing nuls
  in out-of-bound accesses checking.

gcc/testsuite/ChangeLog

2019-05-07  Jun Ma 

  PR tree-optimization/89772
  * gcc.dg/builtin-memchr-4.c: New test.

Thanks
JunMa

Richard.


Regards
JunMa


gcc/ChangeLog

2019-03-21  Jun Ma 

   PR tree-optimization/89772
   * fold-const.c (c_getstr): Add new parameter to get length of
   additional trailing nuls after constant string.
   * gimple-fold.c (gimple_fold_builtin_memchr): Consider trailing
   nuls in out-of-bound accesses checking.
   * fold-const-call.c (fold_const_call): Likewise.


gcc/testsuite/ChangeLog

2019-03-21  Jun Ma 

   PR tree-optimization/89772
   * gcc.dg/builtin-memchr-4.c: New test.





Re: [PATCH][PR90106] Builtin call transformation changes in cdce pass

2019-05-08 Thread JunMa

On 2019/5/9 9:20 AM, Bin.Cheng wrote:

On Thu, May 9, 2019 at 5:31 AM Jeff Law  wrote:

On 5/8/19 6:28 AM, Richard Biener wrote:

On Wed, May 8, 2019 at 12:09 PM JunMa  wrote:

Hi

As described in PR90106 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90106),
when gcc meets a builtin function call like:

y = sqrt (x);

the cdce pass tries to transform the call into an internal function
call and conditionally executes the call, guarded by a simple range
check on the arguments which detects the common cases where errno does
not need to be set.  It looks like:

y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
  sqrt (x);

However, if the call is in a tail position, for example:

y =  sqrt (x);
return y;

will become:

y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
  sqrt (x);
return y;

This transformation breaks the tail-call pattern and prevents
later tail-call optimizations.

So this patch transforms a builtin call with a return value into an
if-then-else form, which looks like:

y =  sqrt (x);
 ==>
if (__builtin_isless (x, 0))
  y = sqrt (x);
else
  y = IFN_SQRT (x);

BTW, y = sqrt (x) could also be transformed like:

y = IFN_SQRT (x);
if (__builtin_isless (x, 0))
  y = sqrt (x);

We don't choose this pattern because it emits worse assembly
code (more move instructions and more register usage) on x86_64.

Bootstrapped/regtested on x86_64-linux, ok for trunk?
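
As an illustration (a hand-written sketch, not compiler output; here
__builtin_sqrt stands in for IFN_SQRT), the chosen form leaves a call
followed directly by the return in each arm:

  #include <math.h>

  double
  f (double x)
  {
    double y;
    if (__builtin_isless (x, 0))
      y = sqrt (x);           /* out of domain: libm call sets errno */
    else
      y = __builtin_sqrt (x); /* in domain: no errno handling needed */
    return y;                 /* each arm can still become a tail call */
  }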

OK.

JunMa -- do you have a copyright assignment on file and write access to
the repository?  If not we should take care of that before proceeding
further.

Hi Jeff,
Thanks very much for helping.
Yes, he is under Alibaba's copyright assignment.  What else should we
do other than noting it in this mailing-list message?
BTW, I think JunMa needs to apply for write access though.


Yes, I do not have write access, FYI.

Thanks
JunMa



Thanks,
bin

Jeff





Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Jiufu Guo
Jeff Law  writes:

> On 5/6/19 8:24 AM, Jiufu Guo wrote:
>> Hi,
>> 
>> This patch implements the optimization in PR77820.  The optimization
>> eliminates the phi and the phi's basic block, if the phi is used only
>> by a condition branch, and the phi's incoming values are the results
>> of CMP statements.
>> 
>> This optimization eliminates:
>> 1. saving CMP result: p0 = a CMP b.
>> 2. additional CMP on branch: if (phi != 0).
>> 3. converting the CMP result when there is a conversion like phi = (INT_CONV) p0.
>> 
>>   <bb 1>
>>   p0 = a CMP b
>>   goto <bb 3>;
>> 
>>   <bb 2>
>>   p1 = c CMP d
>>   goto <bb 3>;
>> 
>>   <bb 3>
>>   # phi = PHI <p0 (bb 1), p1 (bb 2)>
>>   if (phi != 0) goto <bb 4>; else goto <bb 5>;
>> 
>> Transform to:
>> 
>>   <bb 1>
>>   p0 = a CMP b
>>   if (p0 != 0) goto <bb 4>; else goto <bb 5>;
>> 
>>   <bb 2>
>>   p1 = c CMP d
>>   if (p1 != 0) goto <bb 4>; else goto <bb 5>;
>> 
>> Bootstrapped and tested on powerpc64le with no regressions, and
>> testcases were saved.  Is this ok for trunk?
> So there's been talk of somehow integrating with jump threading.   The
> big problem with that idea is jump threading is only concerned with
> rearranging the CFG when doing so allows it to determine the result of a
> conditional branch.
>
> This is actually a generalized form of path splitting that we currently
> do for loop latches.  You may find you can twiddle that code to do what
> you want.
>
> One of the things that falls out of that realization is that you have to
> be very careful that this doesn't muck up if-conversion.
>
> Jeff

Thanks for all your suggestions: Richard, Segher, Andrew and Jeff!
I will send a new patch after investigating and refining the code.

Jiufu Guo.
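
For context, a source pattern of the kind PR77820 describes (a
hypothetical reduction, not taken from the PR) would be:

  /* The short-circuit condition yields two CMPs feeding a single PHI
     that only the conditional branch consumes.  */
  int
  f (long a, long b, long c, long d)
  {
    if (a < b || c < d)
      return 1;
    return 0;
  }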



Re: Make deque iterator operators hidden friends

2019-05-08 Thread François Dumont

Thanks for the tip, nice to know.

Attached patch applied.

François

On 5/8/19 8:28 PM, Jonathan Wakely wrote:

On 08/05/19 18:50 +0200, François Dumont wrote:
Here is a patch to reduce the number of operators exposed at std
namespace scope.


    * include/bits/stl_deque.h
    (_Deque_iterator<>::operator+(difference_type)): Make hidden friend.
    (_Deque_iterator<>::operator-(difference_type)): Likewise.
    (operator==(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator!=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator<=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.
    (operator>=(const _Deque_iterator<>&, const _Deque_iterator<>&)):
    Likewise.

Tested under Linux x86_64 normal/debug modes.

Ok to commit ?

François



diff --git a/libstdc++-v3/include/bits/stl_deque.h b/libstdc++-v3/include/bits/stl_deque.h
index 7a7a42aa903..72420c27646 100644
--- a/libstdc++-v3/include/bits/stl_deque.h
+++ b/libstdc++-v3/include/bits/stl_deque.h
@@ -238,24 +238,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
return *this;
  }

-  _Self
-  operator+(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-    _Self __tmp = *this;
-    return __tmp += __n;
-  }
-
  _Self&
  operator-=(difference_type __n) _GLIBCXX_NOEXCEPT
  { return *this += -__n; }

-  _Self
-  operator-(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-    _Self __tmp = *this;
-    return __tmp -= __n;
-  }
-


These two are already not visible at namespace-scope, but it doesn't
hurt to make them hidden friends for consistency.
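
(As background, an illustrative snippet with hypothetical names, not
from the patch: a hidden friend is found only by argument-dependent
lookup, so it never appears at namespace scope.)

  namespace lib
  {
    struct iter
    {
      int pos;

      // Hidden friend: defined in-class and not redeclared at
      // namespace scope, so only ADL on 'iter' arguments finds it.
      friend bool
      operator==(const iter& x, const iter& y)
      { return x.pos == y.pos; }
    };
  }

  bool b = lib::iter{1} == lib::iter{1};  // found via ADL
  // 'lib::operator==' cannot be named directly, and the overload is
  // never considered for comparisons of unrelated types.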


+  friend _Self
+  operator+(const _Self& __x, difference_type __n) _GLIBCXX_NOEXCEPT
+  {
+    _Self __tmp = __x;
+    return __tmp += __n;


I know you've just reused the original implementation, but it has a
problem that we might as well fix while touching this code.

This function prevents NRVO copy elision, because __tmp += __n returns
a reference, which gets copied into the return value.

If you write it like this then the copy is elided, and __tmp is
constructed directly in the return value slot:

 _Self __tmp = __x;
 __tmp += __n;
 return __tmp;



+  friend _Self
+  operator-(const _Self& __x, difference_type __n) _GLIBCXX_NOEXCEPT
+  {
+    _Self __tmp = __x;
+    return __tmp -= __n;


Same problem here. It would be better to write:

 _Self __tmp = __x;
 __tmp -= __n;
 return __tmp;

OK for trunk with those two changes, thanks.
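
(To see why the original form blocks the elision, a standalone
illustration with hypothetical names, not from libstdc++:)

  struct It
  {
    int pos;
    It& operator+=(int n) { pos += n; return *this; }
  };

  It
  plus_no_nrvo (const It& x, int n)
  {
    It tmp = x;
    return tmp += n;  // operator+= yields It&, so the return value is
                      // copy-constructed from an lvalue: no elision
  }

  It
  plus_nrvo (const It& x, int n)
  {
    It tmp = x;
    tmp += n;
    return tmp;       // NRVO: tmp is built directly in the return slot
  }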





diff --git a/libstdc++-v3/include/bits/stl_deque.h b/libstdc++-v3/include/bits/stl_deque.h
index 7a7a42aa903..c050d1bf023 100644
--- a/libstdc++-v3/include/bits/stl_deque.h
+++ b/libstdc++-v3/include/bits/stl_deque.h
@@ -238,24 +238,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 	return *this;
   }
 
-  _Self
-  operator+(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-	_Self __tmp = *this;
-	return __tmp += __n;
-  }
-
   _Self&
   operator-=(difference_type __n) _GLIBCXX_NOEXCEPT
   { return *this += -__n; }
 
-  _Self
-  operator-(difference_type __n) const _GLIBCXX_NOEXCEPT
-  {
-	_Self __tmp = *this;
-	return __tmp -= __n;
-  }
-
   reference
   operator[](difference_type __n) const _GLIBCXX_NOEXCEPT
   { return *(*this + __n); }
@@ -272,123 +258,118 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 	_M_first = *__new_node;
 	_M_last = _M_first + difference_type(_S_buffer_size());
   }
-};
-
-  // Note: we also provide overloads whose operands are of the same type in
-  // order to avoid ambiguous overload resolution when std::rel_ops operators
-  // are in scope (for additional details, see libstdc++/3628)
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator==(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	   const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return __x._M_cur == __y._M_cur; }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator==(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return __x._M_cur == __y._M_cur; }
-
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator!=(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	   const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return !(__x == __y); }
-
-  template<typename _Tp, typename _RefL, typename _PtrL,
-	   typename _RefR, typename _PtrR>
-inline bool
-operator!=(const _Deque_iterator<_Tp, _RefL, _PtrL>& __x,
-	   const _Deque_iterator<_Tp, _RefR, _PtrR>& __y) _GLIBCXX_NOEXCEPT
-{ return !(__x == __y); }
-
-  template<typename _Tp, typename _Ref, typename _Ptr>
-inline bool
-operator<(const _Deque_iterator<_Tp, _Ref, _Ptr>& __x,
-	  const _Deque_iterator<_Tp, _Ref, _Ptr>& __y) _GLIBCXX_NOEXCEPT
-{ return (__x._M_node == __y._M_node) ? (__x._M_cur < __y._M_cur)
-	  : (__x._M_node < __y._M_node); }
 
-  template
-inline bool
-o

Make vector iterator operators hidden friends

2019-05-08 Thread François Dumont

Hi

    Here is a patch similar to the one I just applied for the deque
iterator, including the NRVO copy elision fix.


    * include/bits/stl_bvector.h
    (operator==(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Make hidden friend.
    (operator<(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (operator!=(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (operator>(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (operator<=(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (operator>=(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (operator-(const _Bit_iterator_base&, const _Bit_iterator_base&)):
    Likewise.
    (_Bit_iterator::operator+(difference_type)): Likewise and allow NRVO
    copy elision.
    (_Bit_iterator::operator-(difference_type)): Likewise.
    (operator+(ptrdiff_t, const _Bit_iterator&)): Make hidden friend.
    (_Bit_const_iterator::operator+(difference_type)): Likewise and allow
    NRVO copy elision.
    (_Bit_const_iterator::operator-(difference_type)): Likewise.
    (operator+(ptrdiff_t, const _Bit_const_iterator&)): Make hidden friend.

Tested under Linux x86_64 normal mode.

Ok to commit ?

François

diff --git a/libstdc++-v3/include/bits/stl_bvector.h b/libstdc++-v3/include/bits/stl_bvector.h
index c60f4f0ef1c..6b9fb2b42b3 100644
--- a/libstdc++-v3/include/bits/stl_bvector.h
+++ b/libstdc++-v3/include/bits/stl_bvector.h
@@ -182,40 +182,40 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   _M_offset = static_cast<unsigned int>(__n);
 }
 
-bool
-operator==(const _Bit_iterator_base& __i) const
-{ return _M_p == __i._M_p && _M_offset == __i._M_offset; }
+friend bool
+operator==(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{ return __x._M_p == __y._M_p && __x._M_offset == __y._M_offset; }
 
-bool
-operator<(const _Bit_iterator_base& __i) const
+friend bool
+operator<(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
 {
-  return _M_p < __i._M_p
-	|| (_M_p == __i._M_p && _M_offset < __i._M_offset);
+  return __x._M_p < __y._M_p
+	|| (__x._M_p == __y._M_p && __x._M_offset < __y._M_offset);
 }
 
-bool
-operator!=(const _Bit_iterator_base& __i) const
-{ return !(*this == __i); }
+friend bool
+operator!=(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{ return !(__x == __y); }
 
-bool
-operator>(const _Bit_iterator_base& __i) const
-{ return __i < *this; }
+friend bool
+operator>(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{ return __y < __x; }
 
-bool
-operator<=(const _Bit_iterator_base& __i) const
-{ return !(__i < *this); }
+friend bool
+operator<=(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{ return !(__y < __x); }
 
-bool
-operator>=(const _Bit_iterator_base& __i) const
-{ return !(*this < __i); }
-  };
+friend bool
+operator>=(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{ return !(__x < __y); }
 
-  inline ptrdiff_t
-  operator-(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
-  {
-return (int(_S_word_bit) * (__x._M_p - __y._M_p)
-	+ __x._M_offset - __y._M_offset);
-  }
+friend ptrdiff_t
+operator-(const _Bit_iterator_base& __x, const _Bit_iterator_base& __y)
+{
+  return (int(_S_word_bit) * (__x._M_p - __y._M_p)
+	  + __x._M_offset - __y._M_offset);
+}
+  };
 
   struct _Bit_iterator : public _Bit_iterator_base
   {
@@ -280,29 +280,31 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   return *this;
 }
 
-iterator
-operator+(difference_type __i) const
+reference
+operator[](difference_type __i) const
+{ return *(*this + __i); }
+
+friend iterator
+operator+(const iterator& __x, difference_type __n)
 {
-  iterator __tmp = *this;
-  return __tmp += __i;
+  iterator __tmp = __x;
+  __tmp += __n;
+  return __tmp;
 }
 
-iterator
-operator-(difference_type __i) const
+friend iterator
+operator+(difference_type __n, const iterator& __x)
+{ return __x + __n; }
+
+friend iterator
+operator-(const iterator& __x, difference_type __n)
 {
-  iterator __tmp = *this;
-  return __tmp -= __i;
+  iterator __tmp = __x;
+  __tmp -= __n;
+  return __tmp;
 }
-
-reference
-operator[](difference_type __i) const
-{ return *(*this + __i); }
   };
 
-  inline _Bit_iterator
-  operator+(ptrdiff_t __n, const _Bit_iterator& __x)
-  { return __x + __n; }
-
   struct _Bit_const_iterator : public _Bit_iterator_base
   {
 typedef bool reference;
@@ -370,29 +372,29 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   return *this;
 }
 
-const_iterator
-operator+(difference_type __i) const
+const_reference
+operator[](difference_type __i) const
+{ retur

RE: Fixing ifcvt issue as exposed by BZ89430

2019-05-08 Thread JiangNing OS


> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, May 8, 2019 3:35 PM
> To: JiangNing OS 
> Cc: gcc-patches@gcc.gnu.org; Richard Biener ;
> pins...@gcc.gnu.org
> Subject: Re: Fixing ifcvt issue as exposed by BZ89430
> 
> On Thu, Feb 28, 2019 at 1:26 PM JiangNing OS
>  wrote:
> >
> > To solve BZ89430 the followings are needed,
> >
> > (1) The code below in noce_try_cmove_arith needs to be fixed.
> >
> >   /* ??? We could handle this if we knew that a load from A or B could
> >  not trap or fault.  This is also true if we've already loaded
> >  from the address along the path from ENTRY.  */
> >   else if (may_trap_or_fault_p (a) || may_trap_or_fault_p (b))
> > return FALSE;
> >
> > Finding dominating memory access could help to decide whether the
> memory access following the conditional move is valid or not.
> > * If there is a dominating memory write to the same memory address in
> test_bb as the one from x=a, it is always safe.
> > * When the dominating memory access to the same memory address in
> test_bb is a read, for the load out of x=a, it is always safe. For the store 
> out
> of x=a, if the memory is on stack, it is still safe.
> >
> > Besides, there is a bug around rearranging the then_bb and else_bb layout,
> which needs to be fixed.
> >
> > (2) The fix by https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02998.html
> > is an overkill: if the write target following the conditional move is a
> > memory access, it bails out early.
> >
> >   if (!set_b && MEM_P (orig_x))
> > /* We want to avoid store speculation to avoid cases like
> >  if (pthread_mutex_trylock(mutex))
> >++global_variable;
> >Rather than go to much effort here, we rely on the SSA optimizers,
> >which do a good enough job these days.  */
> > return FALSE;
> >
> > It looks a bit hard for the compiler to know that the program
> > semantics are related to a mutex and so to avoid store speculation.
> > Without removing this overkill, the fix mentioned in (1) would not
> > work.  Any ideas?  An alternative solution is to detect the following
> > pattern conservatively,
> 
> But it's important to not break this.  Note we do have --param allow-store-
> data-races which the user can use to override this.
> Note that accesses to the stack can obviously not cause store speculation if
> its address didn't escape.  But that's probably what is referred to by "rely
> on the SSA optimizers".

Yes! Two months ago I sent out a patch titled "[PATCH] improve ifcvt
optimization (PR rtl-optimization/89430)" that detects accesses to the
stack.  The SSA optimization pass called "tree-if-conv" in the
middle-end can't really cover this case.  The key part of my change is:

-/* We want to avoid store speculation to avoid cases like
-if (pthread_mutex_trylock(mutex))
-  ++global_variable;
-   Rather than go to much effort here, we rely on the SSA optimizers,
-   which do a good enough job these days.  */
-return FALSE;
+{
+  /* We want to avoid store speculation to avoid cases like
+   if (pthread_mutex_trylock(mutex))
+ ++global_variable;  
+ Tree if conversion cannot handle this case well, and it intends to
+ help vectorization for loops only. */
+  if (!noce_mem_is_on_stack(insn_a, orig_x))
+return FALSE;
 
+  /* For case like,
+   if (pthread_mutex_trylock(mutex))
+ ++local_variable;
+ If any stack variable address is taken, potentially this local
+ variable could be modified by other threads and introduce store
+ speculation. */
+  if (!cfun_no_stack_address_taken)
+return FALSE;
+}

Can you please take a look if you have time? Thank you!
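
(A minimal illustration of the store-speculation hazard, hypothetical
code rather than anything from the PR:)

  #include <pthread.h>

  extern pthread_mutex_t mutex;
  extern int global_variable;

  void
  f (void)
  {
    if (pthread_mutex_trylock (&mutex) == 0)
      {
        ++global_variable;              /* written only under the lock */
        pthread_mutex_unlock (&mutex);
      }
  }

  /* If ifcvt turned the guarded store into a conditional move, the
     store itself would execute unconditionally:

       tmp = global_variable;                 -- speculative load
       global_variable = ok ? tmp + 1 : tmp;  -- unconditional store

     That store races with a thread that does hold the lock, which is
     why the store-speculation check must stay conservative.  */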

> 
> >  if (a_function_call(...))
> >++local_variable;
> >
> > If the local_variable doesn't have address taken at all in current 
> > function, it
> mustn't be a pthread mutex lock semantic, and then the following patch
> would work to solve (1) and pass the last case as described in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430.
> >
> > Index: gcc/cse.c
> >
> 
> ===
> > --- gcc/cse.c   (revision 268256)
> > +++ gcc/cse.c   (working copy)
> > @@ -540,7 +540,6 @@
> > already as part of an already processed extended basic block.  */
> > static sbitmap cse_visited_basic_blocks;
> >
> > -static bool fixed_base_plus_p (rtx x);  static int notreg_cost (rtx,
> > machine_mode, enum rtx_code, int);  static int preferable (int, int,
> > int, int);  static void new_basic_block (void); @@ -606,30 +605,7 @@
> >
> >  static const struct rtl_hooks cse_rtl_hooks = RTL_HOOKS_INITIALIZER;
> >
> >
> > -/* Nonzero if X has the form (PLUS frame-pointer integer).  */
> >
> > -static bool
> > -fixed_base_plus_p (rtx x)
> > -{
> > -  switch (GET_CODE (x))
> > -{
> > -case REG:
> > -  if (x == frame_pointer_rtx || x == hard_frame_pointer_rtx)

Re: [PATCH] Eliminates phi on branch for CMP result

2019-05-08 Thread Richard Biener
On Wed, 8 May 2019, Jeff Law wrote:

> On 5/8/19 6:20 AM, Richard Biener wrote:
> > On Wed, 8 May 2019, Jiufu Guo wrote:
> > 
> >> Hi,
> >>
> >> Thanks Richard, Segher, Andrew and all.
> >>
> >> Segher Boessenkool  writes:
> >>
> >>> Let me try to answer some of this...
> >>>
> >>> On Tue, May 07, 2019 at 03:31:27PM +0200, Richard Biener wrote:
>  On Mon, 6 May 2019, Jiufu Guo wrote:
> > This patch implements the optimization in PR77820.  The optimization
> > eliminates phi and phi's basic block, if the phi is used only by
> > condition branch, and the phi's incoming value in the result of a
> > CMP result.
> >>>
>  I'm not sure I like a new pass here ;)  The transform is basically
>  tail-duplicating the PHI block because the exit conditional can
>  be "simplified" - that's something jump threading generally does
>  though it relies on "simplified" being a little bit more simplified
>  than above.
> >>>
> >> Adding a new pass is just because it may be relatively easy to extend
> >> and maintain. 
> >>
> >> Adding this micro-optimization to other passes is also a good
> >> suggestion. 
> >>
> >> - Similar to jump threading, this new transform replaces the jump's
> >> old destination with a new destination. But in this case, the new
> >> destination cannot be determined at compile time. 
> >>
> >> - And as Andrew said: forwprop/backprop are similar, but this case is the
> >> opposite: it spreads common code (phi+gcond) into the different preds.
> >>
> >> - And there is the phiopt pass, which focuses on making 'phi' stmts
> >> better.  It also does something similar:
> >>  bb0:
> >>    if (a != b) goto bb2; else goto bb1;
> >>  bb1:
> >>  bb2:
> >>    x = PHI <a (bb1), b (bb0), ...>;
> >>
> >>  transform to:
> >>
> >>  bb0:
> >>  bb2:
> >>    x = PHI <b (bb0), ...>;
> >>
> >> If I avoid adding a new pass, any suggestions about which existing pass
> >> (jump threading/forwprop/phiopt/...) would be more suitable to extend to
> >> support this new transform? 
> > 
> > The main thing the transform does is tail-duplicate the PHI block;
> > if we'd just do that, followup transforms would do the rest.
> One might even claim this is really a transformation for cfg cleanups.
> If we ignore the PHI, what we have is an unconditional jump to a
> conditional jump.  The obvious way to optimize that is to replace the
> unconditional jump with a copy of the conditional jump.

I thought about this too, but then quickly disregarded it as too gross ;)

> >> Currently this pass handles the case where there is only one PHI and
> >> the phi is used by a gcond. If there are other suitable cases, we can
> >> just extend this transform.
> > 
> > I was just thinking of a small number of unrelated stmts in those
> > blocks, or the case where just one PHI argument is fed by a conditional
> > and thus only one path would simplify (but maybe the hot one if you look
> > at edge probabilities).
> Perhaps, but do these happen enough in practice?  With the code motions
> we do I'd be a bit surprised.

It probably depends on what kind of source this typically arises from.

Richard.