Re: [RFC] [Patch] [Debug] Add new NOTE to be used for debugging.

2019-01-21 Thread Alan Hayward
(Apologies this isn’t using the reply id)

FYI, I tried the patch below and I’m happy it fixes the GDB issue in PR 
debug/88432.

In the meantime, GDB HEAD has a workaround to disable stack protection in
the GDB testsuite, and a KFAILed test for the exact bug. See:
https://sourceware.org/ml/gdb-patches/2019-01/msg00425.html


Alan.


Re: [RFC] [Patch] [Debug] Add new FUNCTION_BEG NOTE to be used for debugging.

• From: Matthew Malcomson 
• To: "gcc-patches at gcc dot gnu dot org" 
• Cc: "jason at redhat dot com" , nd ,  "ccoutant at gmail dot com" , 
"wilson at tuliptree dot org"  , Eric Botcazou 

• Date: Fri, 18 Jan 2019 15:45:58 +
• Subject: Re: [RFC] [Patch] [Debug] Add new FUNCTION_BEG NOTE to be 
used for debugging.
• References: 


Ping.
(note -- when running the GDB testsuite with -fstack-protector-all used
for compiling each testcase, this patch fixes over 1500 FAILs)

On 10/01/19 13:28, Matthew Malcomson wrote:
> At the moment NOTE_INSN_FUNCTION_BEG is used for three different purposes.
> The first is as a marker just before the first insn coming from a
> "source code statement" of the function.
> Bug 88432 is due to the fact that the note does not accurately point to
> this logical position in a function -- in that case the stack protect
> prologue is directly after NOTE_INSN_FUNCTION_BEG.
> 
> The second is (I believe) to make assumptions about what values are in the
> parameter passing registers (in alias.c and calls.c).
> (I'm not sure about this second use; if I am correctly reading this code then
> it seems like a bug -- e.g. asan_emit_stack_protection inserts insns in the
> stream that break the assumption that seems to be made.)
> 
> The third is as a marker to determine where to put extra code later in
> sjlj_emit_function_enter from except.c, where to insert profiling code for a
> function in final.c, and where to insert variable expansion code in
> pass_expand::execute from cfgexpand.c.
> 
> These three uses seem to be at odds with each other -- insns that change the
> values in the parameter passing registers can come from automatically
> inserted code like stack protection, and some requirements on where
> instructions should get inserted have moved the position of this NOTE
> (e.g. see bugzilla bug 81186).
> 
> This patch splits the current note into two different notes, one to retain
> uses 2 and 3 above, and one for use in generating debug information.
> 
> The first two uses are still attached to NOTE_INSN_FUNCTION_BEG, while the
> debugging use is now implemented with NOTE_INSN_DEBUG_FUNCTION_BEG.
> 
> These two notes are put into the function's insn chain in different
> places during the expand pass, and can hence satisfy their respective
> uses.
> 
> Bootstrapped and regtested on aarch64.
> TODO -- Manual tests done on resulting debug information -- yet to be 
> automated.
> 
> gcc/ChangeLog:
> 
> 2019-01-10  Matthew Malcomson  
> 
>   PR debug/88432
>   * cfgexpand.c (pass_expand::execute): Insert
>   NOTE_INSN_DEBUG_FUNCTION_BEG.
>   * function.c (thread_prologue_and_epilogue_insns): Account
>   for NOTE_INSN_DEBUG_FUNCTION_BEG.
>   * cfgrtl.c (duplicate_insn_chain): Account for new NOTE.
>   * doc/rtl.texi: Document new NOTE.
>   * dwarf2out.c (dwarf2out_source_line): Change comment to
>   reference new NOTE.
>   * final.c (asm_show_source): As above.
>   (final_scan_insn_1): Split action on NOTE_INSN_FUNCTION_BEG into
>   two, and move debugging info action to trigger on
>   NOTE_INSN_DEBUG_FUNCTION_BEG.
>   * insn-notes.def (INSN_NOTE): Add new NOTE.
> 
> 
> 
> ### Attachment also inlined for ease of reply
> ###
> 
> 
> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
> index 
> 60c1cfb4556e1a659db19f6719adccc1dab0fe46..491f441d01de226ba5aff2af8c71680b78648a12
>  100644
> --- a/gcc/cfgexpand.c
> +++ b/gcc/cfgexpand.c
> @@ -6476,6 +6476,12 @@ pass_expand::execute (function *fun)
> if (crtl->stack_protect_guard && targetm.stack_protect_runtime_enabled_p 
> ())
>   stack_protect_prologue ();
>   
> +  /* Insert a NOTE that marks the end of "generated code" and the start of code
> + that comes from the user.  This is the point which dwarf2out.c will treat
> + as the beginning of the user's code in this function.  e.g. GDB will stop
> + just after this note when breaking on entry to the function.  */
> +  emit_note (NOTE_INSN_DEBUG_FUNCTION_BEG);
> +
> expand_phi_nodes (&SA);
>   
> /* Release any stale SSA redirection data.  */
> diff --git a/gcc/cfgrtl.c b/gcc/cfgrtl.c
> index 
> 172bdf585d036e27bcf53dba89c1ffc1b6cb84c7..d0cbca84aa3f14002a568a65e70016c3e15d6b9c
>  100644
> --- a/gcc/cfgrtl.c
> +++ b/gcc/cfgrtl.c
> @@ -4215,6 +4215,7 @@ duplicate_insn_chain (rtx_insn *from, rtx_insn *to)
>   case NOTE_INSN_DELETED_DEBUG_LABEL:
> /* No problem to strip thes

PING [PATCH v2 0/7] Support partial clobbers around TLS calls on Aarch64 SVE

2018-08-02 Thread Alan Hayward



> On 26 Jul 2018, at 10:13, Alan Hayward  wrote:
> 
> This is a rebasing of the patch posted in November.
> Its aim is to support aarch64 SVE register preservation around TLS calls
> by adding a CLOBBER_HIGH expression.
> 
> Across a TLS call, Aarch64 SVE does not explicitly preserve the
> SVE vector registers. However, the Neon vector registers are preserved.
> Due to overlapping of registers, this means the lower 128 bits of all
> SVE vector registers will be preserved.
> 
> The existing GCC code assumes all Neon and SVE registers are clobbered
> across TLS calls.
> 
> This patch introduces a CLOBBER_HIGH expression. This behaves a bit like
> a CLOBBER expression. CLOBBER_HIGH can only refer to a single register.
> The mode of the expression indicates the size of the lower bits which
> will be preserved. If the register contains a value bigger than this
> mode then the code will treat the register as clobbered, otherwise the
> register remains untouched.
> 
> This means that in order to evaluate whether a clobber high is relevant, we need to
> ensure the mode of the existing value in a register is tracked.
> 
> The first two patches introduce CLOBBER_HIGH and generation support.
> The next patch adds a helper function for easily determining if a register
> gets clobbered by a CLOBBER_HIGH.
> The next three patches add clobber high checks to all of the passes. I
> couldn't think of a better way of splitting this up (maybe needs dividing
> into smaller patches?).
> Finally the last patch adds the CLOBBER_HIGHS around a TLS call for
> aarch64 SVE and some test cases.
> 
> Alan Hayward (7):
>  Add CLOBBER_HIGH expression
>  Generation support for CLOBBER_HIGH
>  Add func to check if register is clobbered by clobber_high
>  lra support for clobber_high
>  cse support for clobber_high
>  Remaining support for clobber high
>  Enable clobber high for tls descs on Aarch64
> 
> gcc/alias.c|  11 ++
> gcc/cfgexpand.c|   1 +
> gcc/combine-stack-adj.c|   1 +
> gcc/combine.c  |  38 -
> gcc/config/aarch64/aarch64.md  |  69 ++--
> gcc/cse.c  | 187 ++---
> gcc/cselib.c   |  42 +++--
> gcc/cselib.h   |   2 +-
> gcc/dce.c  |  11 +-
> gcc/df-scan.c  |   6 +
> gcc/doc/rtl.texi   |  15 +-
> gcc/dwarf2out.c|   1 +
> gcc/emit-rtl.c |  16 ++
> gcc/genconfig.c|   1 +
> gcc/genemit.c  |  51 +++---
> gcc/genrecog.c |   3 +-
> gcc/haifa-sched.c  |   3 +
> gcc/ira-build.c|   5 +
> gcc/ira-costs.c|   7 +
> gcc/ira.c  |   6 +-
> gcc/jump.c |   1 +
> gcc/lra-eliminations.c |  11 ++
> gcc/lra-int.h  |   2 +
> gcc/lra-lives.c|  31 ++--
> gcc/lra.c  |  66 +---
> gcc/postreload-gcse.c  |  21 ++-
> gcc/postreload.c   |  25 ++-
> gcc/print-rtl.c|   1 +
> gcc/recog.c|   9 +-
> gcc/regcprop.c |  10 +-
> gcc/reginfo.c  |   4 +
> gcc/reload1.c  |  16 +-
> gcc/reorg.c|  27 ++-
> gcc/resource.c |  24 ++-
> gcc/rtl.c  |  15 ++
> gcc/rtl.def|  10 ++
> gcc/rtl.h  |  27 ++-
> gcc/rtlanal.c  |  47 +-
> gcc/sched-deps.c   |  15 +-
> .../gcc.target/aarch64/sve_tls_preserve_1.c|  19 +++
> .../gcc.target/aarch64/sve_tls_preserve_2.c|  24 +++
> .../gcc.target/aarch64/sve_tls_preserve_3.c|  24 +++
> 42 files changed, 725 insertions(+), 180 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c
> 
> -- 
> 2.15.2 (Apple Git-101.1)
> 



[PATCH v2 6/7] Remaining support for clobber high

2018-07-26 Thread Alan Hayward
Add the remainder of clobber high checks.
Happy to split this into smaller patches if required (there didn't
seem to be anything obvious to split it into).
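
To make the pattern concrete, here is an illustrative sketch (mine, not code
from the patch) of the shape of check each pass gains; invalidate_tracked_value
and tracked_value_mode are hypothetical stand-ins for whatever state the
particular pass keeps for the register:

  /* Wherever a pass already handles CLOBBER, a CLOBBER_HIGH only kills the
     tracked value when that value is wider than the preserved low part.  */
  if (GET_CODE (setter) == CLOBBER_HIGH)
    {
      if (reg_is_clobbered_by_clobber_high (regno, tracked_value_mode,
                                            XEXP (setter, 0)))
        invalidate_tracked_value (regno);  /* hypothetical helper */
      /* Otherwise the value living in the low bits survives.  */
    }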

2018-07-25  Alan Hayward  

* alias.c (record_set): Check for clobber high.
* cfgexpand.c (expand_gimple_stmt): Likewise.
* combine-stack-adj.c (single_set_for_csa): Likewise.
* combine.c (find_single_use_1): Likewise.
(set_nonzero_bits_and_sign_copies): Likewise.
(get_combine_src_dest): Likewise.
(is_parallel_of_n_reg_sets): Likewise.
(try_combine): Likewise.
(record_dead_and_set_regs_1): Likewise.
(reg_dead_at_p_1): Likewise.
(reg_dead_at_p): Likewise.
* dce.c (deletable_insn_p): Likewise.
(mark_nonreg_stores_1): Likewise.
(mark_nonreg_stores_2): Likewise.
* df-scan.c (df_find_hard_reg_defs): Likewise.
(df_uses_record): Likewise.
(df_get_call_refs): Likewise.
* dwarf2out.c (mem_loc_descriptor): Likewise.
* haifa-sched.c (haifa_classify_rtx): Likewise.
* ira-build.c (create_insn_allocnos): Likewise.
* ira-costs.c (scan_one_insn): Likewise.
* ira.c (equiv_init_movable_p): Likewise.
(rtx_moveable_p): Likewise.
(interesting_dest_for_shprep): Likewise.
* jump.c (mark_jump_label_1): Likewise.
* postreload-gcse.c (record_opr_changes): Likewise.
* postreload.c (reload_cse_simplify): Likewise.
(struct reg_use): Add source expr.
(reload_combine): Check for clobber high.
(reload_combine_note_use): Likewise.
(reload_cse_move2add): Likewise.
(move2add_note_store): Likewise.
* print-rtl.c (print_pattern): Likewise.
* recog.c (decode_asm_operands): Likewise.
(store_data_bypass_p): Likewise.
(if_test_bypass_p): Likewise.
* regcprop.c (kill_clobbered_value): Likewise.
(kill_set_value): Likewise.
* reginfo.c (reg_scan_mark_refs): Likewise.
* reload1.c (maybe_fix_stack_asms): Likewise.
(eliminate_regs_1): Likewise.
(elimination_effects): Likewise.
(mark_not_eliminable): Likewise.
(scan_paradoxical_subregs): Likewise.
(forget_old_reloads_1): Likewise.
* reorg.c (find_end_label): Likewise.
(try_merge_delay_insns): Likewise.
(redundant_insn): Likewise.
(own_thread_p): Likewise.
(fill_simple_delay_slots): Likewise.
(fill_slots_from_thread): Likewise.
(dbr_schedule): Likewise.
* resource.c (update_live_status): Likewise.
(mark_referenced_resources): Likewise.
(mark_set_resources): Likewise.
* rtl.c (copy_rtx): Likewise.
* rtlanal.c (reg_referenced_p): Likewise.
(single_set_2): Likewise.
(noop_move_p): Likewise.
(note_stores): Likewise.
* sched-deps.c (sched_analyze_reg): Likewise.
(sched_analyze_insn): Likewise.
---
 gcc/alias.c | 11 +++
 gcc/cfgexpand.c |  1 +
 gcc/combine-stack-adj.c |  1 +
 gcc/combine.c   | 38 +-
 gcc/dce.c   | 11 +--
 gcc/df-scan.c   |  6 ++
 gcc/dwarf2out.c |  1 +
 gcc/haifa-sched.c   |  3 +++
 gcc/ira-build.c |  5 +
 gcc/ira-costs.c |  7 +++
 gcc/ira.c   |  6 +-
 gcc/jump.c  |  1 +
 gcc/postreload-gcse.c   | 21 -
 gcc/postreload.c| 25 -
 gcc/print-rtl.c |  1 +
 gcc/recog.c |  9 ++---
 gcc/regcprop.c  | 10 --
 gcc/reginfo.c   |  4 
 gcc/reload1.c   | 16 +++-
 gcc/reorg.c | 27 ++-
 gcc/resource.c  | 24 +---
 gcc/rtl.c   |  4 
 gcc/rtlanal.c   | 18 +++---
 gcc/sched-deps.c| 15 ++-
 24 files changed, 225 insertions(+), 40 deletions(-)

diff --git a/gcc/alias.c b/gcc/alias.c
index 2091dfbf3d7..748da2b6951 100644
--- a/gcc/alias.c
+++ b/gcc/alias.c
@@ -1554,6 +1554,17 @@ record_set (rtx dest, const_rtx set, void *data 
ATTRIBUTE_UNUSED)
  new_reg_base_value[regno] = 0;
  return;
}
+  /* A CLOBBER_HIGH only wipes out the old value if the mode of the old
+value is greater than that of the clobber.  */
+  else if (GET_CODE (set) == CLOBBER_HIGH)
+   {
+ if (new_reg_base_value[regno] != 0
+ && reg_is_clobbered_by_clobber_high (
+  regno, GET_MODE (new_reg_base_value[regno]), XEXP (set, 0)))
+   new_reg_base_value[regno] = 0;
+ return;
+   }
+
   src = SET_SRC (set);
 }
   else
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index d6e3c382085..39db6ed435f 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -3750,6 +3750,7 @@ expand_gimple_stmt 

[PATCH v2 7/7] Enable clobber high for tls descs on Aarch64

2018-07-26 Thread Alan Hayward
Add the clobber high expressions to tls_desc for aarch64.
The patch also adds three tests.

In addition I also tested by taking the gcc torture test suite and making
all global variables __thread. I then amended the suite to compile with -fpic,
save the .s file, and run at only one given -O level.
I ran this before and after the patch and compared the resulting .s files,
ensuring that there were no ASM changes.
I discarded the 10% of tests that failed to compile (due to the code in
the test now being invalid C).
I did this for -O0, -O2 and -O3 on both x86 and aarch64 and observed no difference
between ASM files before and after the patch.

Alan.

2018-07-25  Alan Hayward  

gcc/
* config/aarch64/aarch64.md: Add clobber highs to tls_desc.

gcc/testsuite/
* gcc.target/aarch64/sve_tls_preserve_1.c: New test.
* gcc.target/aarch64/sve_tls_preserve_2.c: New test.
* gcc.target/aarch64/sve_tls_preserve_3.c: New test.
---
 gcc/config/aarch64/aarch64.md  | 69 ++
 .../gcc.target/aarch64/sve_tls_preserve_1.c| 19 ++
 .../gcc.target/aarch64/sve_tls_preserve_2.c| 24 
 .../gcc.target/aarch64/sve_tls_preserve_3.c| 24 
 4 files changed, 124 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index e9c16f9697b..a41d6e15bc8 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -57,14 +57,36 @@
 (LR_REGNUM 30)
 (SP_REGNUM 31)
 (V0_REGNUM 32)
+(V1_REGNUM 33)
+(V2_REGNUM 34)
+(V3_REGNUM 35)
 (V4_REGNUM 36)
+(V5_REGNUM 37)
+(V6_REGNUM 38)
+(V7_REGNUM 39)
 (V8_REGNUM 40)
+(V9_REGNUM 41)
+(V10_REGNUM42)
+(V11_REGNUM43)
 (V12_REGNUM44)
+(V13_REGNUM45)
+(V14_REGNUM46)
 (V15_REGNUM47)
 (V16_REGNUM48)
+(V17_REGNUM49)
+(V18_REGNUM50)
+(V19_REGNUM51)
 (V20_REGNUM52)
+(V21_REGNUM53)
+(V22_REGNUM54)
+(V23_REGNUM55)
 (V24_REGNUM56)
+(V25_REGNUM57)
+(V26_REGNUM58)
+(V27_REGNUM59)
 (V28_REGNUM60)
+(V29_REGNUM61)
+(V30_REGNUM62)
 (V31_REGNUM63)
 (LAST_SAVED_REGNUM 63)
 (SFP_REGNUM64)
@@ -6302,24 +6324,47 @@
   [(set_attr "type" "call")
(set_attr "length" "16")])
 
-;; For SVE, model tlsdesc calls as clobbering all vector and predicate
-;; registers, on top of the usual R0 and LR.  In reality the calls
-;; preserve the low 128 bits of the vector registers, but we don't
-;; yet have a way of representing that in the instruction pattern.
+;; For SVE, model tlsdesc calls as preserving only the lower 128 bits of
+;; each vector register and clobbering the rest, and as clobbering all
+;; predicate registers, on top of the usual R0 and LR.
 (define_insn "tlsdesc_small_sve_"
   [(set (reg:PTR R0_REGNUM)
 (unspec:PTR [(match_operand 0 "aarch64_valid_symref" "S")]
UNSPEC_TLSDESC))
(clobber (reg:DI LR_REGNUM))
(clobber (reg:CC CC_REGNUM))
-   (clobber (reg:XI V0_REGNUM))
-   (clobber (reg:XI V4_REGNUM))
-   (clobber (reg:XI V8_REGNUM))
-   (clobber (reg:XI V12_REGNUM))
-   (clobber (reg:XI V16_REGNUM))
-   (clobber (reg:XI V20_REGNUM))
-   (clobber (reg:XI V24_REGNUM))
-   (clobber (reg:XI V28_REGNUM))
+   (clobber_high (reg:TI V0_REGNUM))
+   (clobber_high (reg:TI V1_REGNUM))
+   (clobber_high (reg:TI V2_REGNUM))
+   (clobber_high (reg:TI V3_REGNUM))
+   (clobber_high (reg:TI V4_REGNUM))
+   (clobber_high (reg:TI V5_REGNUM))
+   (clobber_high (reg:TI V6_REGNUM))
+   (clobber_high (reg:TI V7_REGNUM))
+   (clobber_high (reg:TI V8_REGNUM))
+   (clobber_high (reg:TI V9_REGNUM))
+   (clobber_high (reg:TI V10_REGNUM))
+   (clobber_high (reg:TI V11_REGNUM))
+   (clobber_high (reg:TI V12_REGNUM))
+   (clobber_high (reg:TI V13_REGNUM))
+   (clobber_high (reg:TI V14_REGNUM))
+   (clobber_high (reg:TI V15_REGNUM))
+   (clobber_high (reg:TI V16_REGNUM))
+   (clobber_high (reg:TI V17_REGNUM))
+   (clobber_high (reg:TI V18_REGNUM))
+   (clobber_high (reg:TI V19_REGNUM))
+   (clobber_high (reg:TI V20_REGNUM))
+   (clobber_high (reg:TI V21_REGNUM))
+   (clobber_high (reg:TI V22_REGNUM))
+   (clobber_high (reg:TI V23_REGNUM))
+   (clobber_high (reg:TI V24_REGNUM))
+   (clobber_high (reg:TI V25_REGNUM))

[PATCH v2 5/7] cse support for clobber_high

2018-07-26 Thread Alan Hayward
The cse specific changes for clobber_high.

2018-07-25  Alan Hayward  

* cse.c (invalidate_reg): New function extracted from...
(invalidate): ...here.
(canonicalize_insn): Check for clobber high.
(invalidate_from_clobbers): invalidate clobber highs.
(invalidate_from_sets_and_clobbers): Likewise.
(count_reg_usage): Check for clobber high.
(insn_live_p): Likewise.
* cselib.c (cselib_expand_value_rtx_1): Likewise.
(cselib_invalidate_regno): Check for clobber in setter.
(cselib_invalidate_rtx): Pass through setter.
(cselib_invalidate_rtx_note_stores):
(cselib_process_insn): Check for clobber high.
* cselib.h (cselib_invalidate_rtx): Add operand.
---
 gcc/cse.c| 187 +++
 gcc/cselib.c |  42 ++
 gcc/cselib.h |   2 +-
 3 files changed, 156 insertions(+), 75 deletions(-)

diff --git a/gcc/cse.c b/gcc/cse.c
index 4e94152b380..3d7888b7093 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -559,6 +559,7 @@ static struct table_elt *insert_with_costs (rtx, struct 
table_elt *, unsigned,
 static struct table_elt *insert (rtx, struct table_elt *, unsigned,
 machine_mode);
 static void merge_equiv_classes (struct table_elt *, struct table_elt *);
+static void invalidate_reg (rtx, bool);
 static void invalidate (rtx, machine_mode);
 static void remove_invalid_refs (unsigned int);
 static void remove_invalid_subreg_refs (unsigned int, poly_uint64,
@@ -1818,7 +1819,85 @@ check_dependence (const_rtx x, rtx exp, machine_mode 
mode, rtx addr)
 }
   return false;
 }
-
+
+/* Remove from the hash table, or mark as invalid, all expressions whose
+   values could be altered by storing in register X.
+
+   CLOBBER_HIGH is set if X was part of a CLOBBER_HIGH expression.  */
+
+static void
+invalidate_reg (rtx x, bool clobber_high)
+{
+  gcc_assert (GET_CODE (x) == REG);
+
+  /* If X is a register, dependencies on its contents are recorded
+ through the qty number mechanism.  Just change the qty number of
+ the register, mark it as invalid for expressions that refer to it,
+ and remove it itself.  */
+  unsigned int regno = REGNO (x);
+  unsigned int hash = HASH (x, GET_MODE (x));
+
+  /* Remove REGNO from any quantity list it might be on and indicate
+ that its value might have changed.  If it is a pseudo, remove its
+ entry from the hash table.
+
+ For a hard register, we do the first two actions above for any
+ additional hard registers corresponding to X.  Then, if any of these
+ registers are in the table, we must remove any REG entries that
+ overlap these registers.  */
+
+  delete_reg_equiv (regno);
+  REG_TICK (regno)++;
+  SUBREG_TICKED (regno) = -1;
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+{
+  gcc_assert (!clobber_high);
+  remove_pseudo_from_table (x, hash);
+}
+  else
+{
+  HOST_WIDE_INT in_table = TEST_HARD_REG_BIT (hard_regs_in_table, regno);
+  unsigned int endregno = END_REGNO (x);
+  unsigned int rn;
+  struct table_elt *p, *next;
+
+  CLEAR_HARD_REG_BIT (hard_regs_in_table, regno);
+
+  for (rn = regno + 1; rn < endregno; rn++)
+   {
+ in_table |= TEST_HARD_REG_BIT (hard_regs_in_table, rn);
+ CLEAR_HARD_REG_BIT (hard_regs_in_table, rn);
+ delete_reg_equiv (rn);
+ REG_TICK (rn)++;
+ SUBREG_TICKED (rn) = -1;
+   }
+
+  if (in_table)
+   for (hash = 0; hash < HASH_SIZE; hash++)
+ for (p = table[hash]; p; p = next)
+   {
+ next = p->next_same_hash;
+
+ if (!REG_P (p->exp) || REGNO (p->exp) >= FIRST_PSEUDO_REGISTER)
+   continue;
+
+ if (clobber_high)
+   {
+ if (reg_is_clobbered_by_clobber_high (p->exp, x))
+   remove_from_table (p, hash);
+   }
+ else
+   {
+ unsigned int tregno = REGNO (p->exp);
+ unsigned int tendregno = END_REGNO (p->exp);
+ if (tendregno > regno && tregno < endregno)
+   remove_from_table (p, hash);
+   }
+   }
+}
+}
+
 /* Remove from the hash table, or mark as invalid, all expressions whose
values could be altered by storing in X.  X is a register, a subreg, or
a memory reference with nonvarying address (because, when a memory
@@ -1841,65 +1920,7 @@ invalidate (rtx x, machine_mode full_mode)
   switch (GET_CODE (x))
 {
 case REG:
-  {
-   /* If X is a register, dependencies on its contents are recorded
-  through the qty number mechanism.  Just change the qty number of
-  the register, mark it as invalid for expressions that refer to it,
-  and remove it itself.  */
-   unsigned int regno = REGNO (x);
-   unsigned int hash = HASH (x, G

[PATCH v2 2/7] Generation support for CLOBBER_HIGH

2018-07-26 Thread Alan Hayward
Ensure clobber high is a register expression.
Info is passed through for the error case.
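
For context, a rough sketch (my illustration, not text from the patch) of what
this means for generated code: when an insn pattern contains, say,
(clobber_high (reg:TI V0_REGNUM)), the gen_* function produced by genemit can
reuse a shared rtx through the new emit-rtl.c helper, presumably analogous to
the existing CLOBBER handling:

  /* Roughly what genemit prints for that element of the PARALLEL; the
     exact generated output may differ.  32 is V0_REGNUM on aarch64.  */
  gen_hard_reg_clobber_high (TImode, 32)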

2018-07-25  Alan Hayward  

* emit-rtl.c (verify_rtx_sharing): Check for CLOBBER_HIGH.
(copy_insn_1): Likewise.
(gen_hard_reg_clobber_high): New gen function.
* genconfig.c (walk_insn_part): Check for CLOBBER_HIGH.
* genemit.c (gen_exp): Likewise.
(gen_emit_seq): Pass through info.
(gen_insn): Check for CLOBBER_HIGH.
(gen_expand): Pass through info.
(gen_split): Likewise.
(output_add_clobbers): Likewise.
* genrecog.c (validate_pattern): Check for CLOBBER_HIGH.
(remove_clobbers): Likewise.
* rtl.h (gen_hard_reg_clobber_high): New declaration.
---
 gcc/emit-rtl.c  | 16 
 gcc/genconfig.c |  1 +
 gcc/genemit.c   | 51 +++
 gcc/genrecog.c  |  3 ++-
 gcc/rtl.h   |  1 +
 5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index e4b070486e8..6a32bcbdaf6 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -2865,6 +2865,7 @@ verify_rtx_sharing (rtx orig, rtx insn)
   /* SCRATCH must be shared because they represent distinct values.  */
   return;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -3118,6 +3119,7 @@ repeat:
   /* SCRATCH must be shared because they represent distinct values.  */
   return;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -5690,6 +5692,7 @@ copy_insn_1 (rtx orig)
 case SIMPLE_RETURN:
   return orig;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -6508,6 +6511,19 @@ gen_hard_reg_clobber (machine_mode mode, unsigned int 
regno)
gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno)));
 }
 
+static GTY((deletable)) rtx
+hard_reg_clobbers_high[NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER];
+
+rtx
+gen_hard_reg_clobber_high (machine_mode mode, unsigned int regno)
+{
+  if (hard_reg_clobbers_high[mode][regno])
+return hard_reg_clobbers_high[mode][regno];
+  else
+return (hard_reg_clobbers_high[mode][regno]
+   = gen_rtx_CLOBBER_HIGH (VOIDmode, gen_rtx_REG (mode, regno)));
+}
+
 location_t prologue_location;
 location_t epilogue_location;
 
diff --git a/gcc/genconfig.c b/gcc/genconfig.c
index c1bfde8d54b..745d5374b39 100644
--- a/gcc/genconfig.c
+++ b/gcc/genconfig.c
@@ -72,6 +72,7 @@ walk_insn_part (rtx part, int recog_p, int non_pc_set_src)
   switch (code)
 {
 case CLOBBER:
+case CLOBBER_HIGH:
   clobbers_seen_this_insn++;
   break;
 
diff --git a/gcc/genemit.c b/gcc/genemit.c
index f4179e2b631..86e792f7396 100644
--- a/gcc/genemit.c
+++ b/gcc/genemit.c
@@ -79,7 +79,7 @@ gen_rtx_scratch (rtx x, enum rtx_code subroutine_type)
substituting any operand references appearing within.  */
 
 static void
-gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
+gen_exp (rtx x, enum rtx_code subroutine_type, char *used, md_rtx_info *info)
 {
   RTX_CODE code;
   int i;
@@ -123,7 +123,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
   for (i = 0; i < XVECLEN (x, 1); i++)
{
  printf (",\n\t\t");
- gen_exp (XVECEXP (x, 1, i), subroutine_type, used);
+ gen_exp (XVECEXP (x, 1, i), subroutine_type, used, info);
}
   printf (")");
   return;
@@ -137,7 +137,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
   for (i = 0; i < XVECLEN (x, 2); i++)
{
  printf (",\n\t\t");
- gen_exp (XVECEXP (x, 2, i), subroutine_type, used);
+ gen_exp (XVECEXP (x, 2, i), subroutine_type, used, info);
}
   printf (")");
   return;
@@ -163,12 +163,21 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
 case CLOBBER:
   if (REG_P (XEXP (x, 0)))
{
- printf ("gen_hard_reg_clobber (%smode, %i)", GET_MODE_NAME (GET_MODE 
(XEXP (x, 0))),
-REGNO (XEXP (x, 0)));
+ printf ("gen_hard_reg_clobber (%smode, %i)",
+ GET_MODE_NAME (GET_MODE (XEXP (x, 0))),
+ REGNO (XEXP (x, 0)));
  return;
}
   break;
-
+case CLOBBER_HIGH:
+  if (!REG_P (XEXP (x, 0)))
+   

[PATCH v2 1/7] Add CLOBBER_HIGH expression

2018-07-26 Thread Alan Hayward
Includes documentation.

2018-07-25  Alan Hayward  

* doc/rtl.texi (clobber_high): Add.
(parallel): Add in clobber high
* rtl.c (rtl_check_failed_code3): Add function.
* rtl.def (CLOBBER_HIGH): Add expression.
* rtl.h (RTL_CHECKC3): Add macro.
(rtl_check_failed_code3): Add declaration.
(XC3EXP): Add macro.
---
 gcc/doc/rtl.texi | 15 ++-
 gcc/rtl.c| 11 +++
 gcc/rtl.def  | 10 ++
 gcc/rtl.h| 16 +++-
 4 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index a37d9ac5389..20c57732679 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -3296,6 +3296,18 @@ There is one other known use for clobbering a pseudo 
register in a
 clobbered by the insn.  In this case, using the same pseudo register in
 the clobber and elsewhere in the insn produces the expected results.
 
+@findex clobber_high
+@item (clobber_high @var{x})
+Represents the storing or possible storing of an unpredictable,
+undescribed value into the upper parts of @var{x}. The mode of the expression
+represents the lower parts of the register which will not be overwritten.
+@var{x} must be a @code{reg} expression.
+
+One place this is used is when calling into functions where the registers are
+preserved, but only up to a given number of bits.  For example when using
+Aarch64 SVE, calling a TLS descriptor will cause only the lower 128 bits of
+each of the vector registers to be preserved.
+
 @findex use
 @item (use @var{x})
 Represents the use of the value of @var{x}.  It indicates that the
@@ -3349,7 +3361,8 @@ Represents several side effects performed in parallel.  
The square
 brackets stand for a vector; the operand of @code{parallel} is a
 vector of expressions.  @var{x0}, @var{x1} and so on are individual
 side effect expressions---expressions of code @code{set}, @code{call},
-@code{return}, @code{simple_return}, @code{clobber} or @code{use}.
+@code{return}, @code{simple_return}, @code{clobber}, @code{use} or
+@code{clobber_high}.
 
 ``In parallel'' means that first all the values used in the individual
 side-effects are computed, and second all the actual side-effects are
diff --git a/gcc/rtl.c b/gcc/rtl.c
index 90bbc7c6861..985db1c14f0 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -856,6 +856,17 @@ rtl_check_failed_code2 (const_rtx r, enum rtx_code code1, 
enum rtx_code code2,
  func, trim_filename (file), line);
 }
 
+void
+rtl_check_failed_code3 (const_rtx r, enum rtx_code code1, enum rtx_code code2,
+   enum rtx_code code3, const char *file, int line,
+   const char *func)
+{
+  internal_error
+("RTL check: expected code '%s', '%s' or '%s', have '%s' in %s, at %s:%d",
+ GET_RTX_NAME (code1), GET_RTX_NAME (code2), GET_RTX_NAME (code3),
+ GET_RTX_NAME (GET_CODE (r)), func, trim_filename (file), line);
+}
+
 void
 rtl_check_failed_code_mode (const_rtx r, enum rtx_code code, machine_mode mode,
bool not_mode, const char *file, int line,
diff --git a/gcc/rtl.def b/gcc/rtl.def
index 2578a0ccb9e..0ed27505545 100644
--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -312,6 +312,16 @@ DEF_RTL_EXPR(USE, "use", "e", RTX_EXTRA)
is considered undeletable before reload.  */
 DEF_RTL_EXPR(CLOBBER, "clobber", "e", RTX_EXTRA)
 
+/* Indicate that the upper parts of something are clobbered in a way that we
+   don't want to explain.  The MODE references the lower bits that will be
+   preserved.  Anything above that size will be clobbered.
+
+   CLOBBER_HIGH only occurs as the operand of a PARALLEL rtx.  It cannot appear
+   in other contexts, and unlike CLOBBER, it cannot appear on its own.
+   CLOBBER_HIGH can only be used with fixed register rtxes.  */
+
+DEF_RTL_EXPR(CLOBBER_HIGH, "clobber_high", "e", RTX_EXTRA)
+
 /* Call a subroutine.
Operand 1 is the address to call.
Operand 2 is the number of arguments.  */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 565ce3abbe4..5e07e9bee80 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -1100,6 +1100,14 @@ is_a_helper ::test (rtx_insn *insn)
   __FUNCTION__); \
  &_rtx->u.fld[_n]; }))
 
+#define RTL_CHECKC3(RTX, N, C1, C2, C3) __extension__  \
+(*({ __typeof (RTX) const _rtx = (RTX); const int _n = (N);\
+ const enum rtx_code _code = GET_CODE (_rtx);  \
+ if (_code != (C1) && _code != (C2) && _code != (C3))  \
+   rtl_check_failed_code3 (_rtx, (C1), (C2), (C3), __FILE__,   \
+  __LINE__, __FUNCTION__); \
+ &_rtx->u.fld[_n]; }))
+
 #define RTVEC_ELT(RTVEC, I) __extension__  \
 (*({ __typeof (RTVEC) const _rtvec = (RTVEC); const int _i = (I);  \
  

[PATCH v2 3/7] Add func to check if register is clobbered by clobber_high

2018-07-26 Thread Alan Hayward
Given a CLOBBER_HIGH expression and a register, it checks if
the register will be clobbered.

A second version exists for the cases where the expressions are
not available.

The function will be used throughout the following patches.
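
A minimal usage sketch (not part of the patch): with the rtl.h wrapper below,
a pass that tracks a value in hard register DEST (a REG rtx whose mode is the
mode of the tracked value) can ask whether a clobber_high operand kills that
value; mark_value_dead is a hypothetical stand-in for the pass's own
bookkeeping:

  if (GET_CODE (setter) == CLOBBER_HIGH
      && reg_is_clobbered_by_clobber_high (dest, XEXP (setter, 0)))
    mark_value_dead (dest);  /* hypothetical caller-side action */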

2018-07-25  Alan Hayward  

* rtl.h (reg_is_clobbered_by_clobber_high): Add declarations.
* rtlanal.c (reg_is_clobbered_by_clobber_high): Add function.
---
 gcc/rtl.h | 10 ++
 gcc/rtlanal.c | 29 +
 2 files changed, 39 insertions(+)

diff --git a/gcc/rtl.h b/gcc/rtl.h
index f42d749511d..d549b0aad0b 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -3467,6 +3467,16 @@ extern bool tablejump_p (const rtx_insn *, rtx_insn **, 
rtx_jump_table_data **);
 extern int computed_jump_p (const rtx_insn *);
 extern bool tls_referenced_p (const_rtx);
 extern bool contains_mem_rtx_p (rtx x);
+extern bool reg_is_clobbered_by_clobber_high (unsigned int, machine_mode,
+ const_rtx);
+
+/* Convenient wrapper for reg_is_clobbered_by_clobber_high.  */
+inline bool
+reg_is_clobbered_by_clobber_high (const_rtx x, const_rtx clobber_high_op)
+{
+  return reg_is_clobbered_by_clobber_high (REGNO (x), GET_MODE (x),
+  clobber_high_op);
+}
 
 /* Overload for refers_to_regno_p for checking a single register.  */
 inline bool
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 9f84d7f2a8c..1cab1545744 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -6551,3 +6551,32 @@ tls_referenced_p (const_rtx x)
   return true;
   return false;
 }
+
+/* Return true if reg REGNO with mode REG_MODE would be clobbered by the
+   clobber_high operand in CLOBBER_HIGH_OP.  */
+
+bool
+reg_is_clobbered_by_clobber_high (unsigned int regno, machine_mode reg_mode,
+ const_rtx clobber_high_op)
+{
+  unsigned int clobber_regno = REGNO (clobber_high_op);
+  machine_mode clobber_mode = GET_MODE (clobber_high_op);
+  unsigned char regno_nregs = hard_regno_nregs (regno, reg_mode);
+
+  /* Clobber high should always span exactly one register.  */
+  gcc_assert (REG_NREGS (clobber_high_op) == 1);
+
+  /* Clobber high needs to match with one of the registers in X.  */
+  if (clobber_regno < regno || clobber_regno >= regno + regno_nregs)
+return false;
+
+  gcc_assert (reg_mode != BLKmode && clobber_mode != BLKmode);
+
+  if (reg_mode == VOIDmode)
+return clobber_mode != VOIDmode;
+
+  /* Clobber high will clobber if its size might be greater than the size of
+ register regno.  */
+  return maybe_gt (exact_div (GET_MODE_SIZE (reg_mode), regno_nregs),
+GET_MODE_SIZE (clobber_mode));
+}
-- 
2.15.2 (Apple Git-101.1)



[PATCH v2 4/7] lra support for clobber_high

2018-07-26 Thread Alan Hayward
The lra specific changes for clobber_high.

2018-07-25  Alan Hayward  

* lra-eliminations.c (lra_eliminate_regs_1): Check for clobber high.
(mark_not_eliminable): Likewise.
* lra-int.h (struct lra_insn_reg): Add clobber high marker.
* lra-lives.c (process_bb_lives): Check for clobber high.
* lra.c (new_insn_reg): Remember clobber highs.
(collect_non_operand_hard_regs): Check for clobber high.
(lra_set_insn_recog_data): Likewise.
(add_regs_to_insn_regno_info): Likewise.
(lra_update_insn_regno_info): Likewise.
---
 gcc/lra-eliminations.c | 11 +
 gcc/lra-int.h  |  2 ++
 gcc/lra-lives.c| 31 
 gcc/lra.c  | 66 +++---
 4 files changed, 80 insertions(+), 30 deletions(-)

diff --git a/gcc/lra-eliminations.c b/gcc/lra-eliminations.c
index f5f104020b3..d0cfaa8714a 100644
--- a/gcc/lra-eliminations.c
+++ b/gcc/lra-eliminations.c
@@ -654,6 +654,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode 
mem_mode,
   return x;
 
 case CLOBBER:
+case CLOBBER_HIGH:
 case SET:
   gcc_unreachable ();
 
@@ -806,6 +807,16 @@ mark_not_eliminable (rtx x, machine_mode mem_mode)
setup_can_eliminate (ep, false);
   return;
 
+case CLOBBER_HIGH:
+  gcc_assert (REG_P (XEXP (x, 0)));
+  gcc_assert (REGNO (XEXP (x, 0)) < FIRST_PSEUDO_REGISTER);
+  for (ep = reg_eliminate;
+  ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
+  ep++)
+   if (reg_is_clobbered_by_clobber_high (ep->to_rtx, XEXP (x, 0)))
+ setup_can_eliminate (ep, false);
+  return;
+
 case SET:
   if (SET_DEST (x) == stack_pointer_rtx
  && GET_CODE (SET_SRC (x)) == PLUS
diff --git a/gcc/lra-int.h b/gcc/lra-int.h
index 86e103b7480..5267b53c5e3 100644
--- a/gcc/lra-int.h
+++ b/gcc/lra-int.h
@@ -168,6 +168,8 @@ struct lra_insn_reg
   /* True if there is an early clobber alternative for this
  operand.  */
   unsigned int early_clobber : 1;
+  /* True if the reg is clobber highed by the operand.  */
+  unsigned int clobber_high : 1;
   /* The corresponding regno of the register.  */
   int regno;
   /* Next reg info of the same insn.  */
diff --git a/gcc/lra-lives.c b/gcc/lra-lives.c
index 920fd02b997..433c819d9e3 100644
--- a/gcc/lra-lives.c
+++ b/gcc/lra-lives.c
@@ -658,7 +658,7 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
   bool call_p;
   int n_alt, dst_regno, src_regno;
   rtx set;
-  struct lra_insn_reg *reg;
+  struct lra_insn_reg *reg, *hr;
 
   if (!NONDEBUG_INSN_P (curr_insn))
continue;
@@ -690,11 +690,12 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
break;
  }
  for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
-   if (reg->type != OP_IN)
+   if (reg->type != OP_IN && !reg->clobber_high)
  {
remove_p = false;
break;
  }
+
  if (remove_p && ! volatile_refs_p (PATTERN (curr_insn)))
{
  dst_regno = REGNO (SET_DEST (set));
@@ -812,14 +813,24 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
 unused values because they still conflict with quantities
 that are live at the time of the definition.  */
   for (reg = curr_id->regs; reg != NULL; reg = reg->next)
-   if (reg->type != OP_IN)
- {
-   need_curr_point_incr
- |= mark_regno_live (reg->regno, reg->biggest_mode,
- curr_point);
-   check_pseudos_live_through_calls (reg->regno,
- last_call_used_reg_set);
- }
+   {
+ if (reg->type != OP_IN)
+   {
+ need_curr_point_incr
+   |= mark_regno_live (reg->regno, reg->biggest_mode,
+   curr_point);
+ check_pseudos_live_through_calls (reg->regno,
+   last_call_used_reg_set);
+   }
+
+ if (reg->regno >= FIRST_PSEUDO_REGISTER)
+   for (hr = curr_static_id->hard_regs; hr != NULL; hr = hr->next)
+ if (hr->clobber_high
+ && maybe_gt (GET_MODE_SIZE (PSEUDO_REGNO_MODE (reg->regno)),
+  GET_MODE_SIZE (hr->biggest_mode)))
+   SET_HARD_REG_BIT (lra_reg_info[reg->regno].conflict_hard_regs,
+ hr->regno);
+   }
 
   for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
if (reg->type != OP_IN)
diff --git a/gcc/lra.c b/gcc/lra.c
index b410b90f126..aa768fb2a23 100644
--- a/gcc/lra.c
+++ b/gcc/lra.c
@@ -535

[PATCH v2 0/7] Support partial clobbers around TLS calls on Aarch64 SVE

2018-07-26 Thread Alan Hayward
This is a rebasing of the patch posted in November.
Its aim is to support aarch64 SVE register preservation around TLS calls
by adding a CLOBBER_HIGH expression.

Across a TLS call, Aarch64 SVE does not explicitly preserve the
SVE vector registers. However, the Neon vector registers are preserved.
Due to overlapping of registers, this means the lower 128 bits of all
SVE vector registers will be preserved.

The existing GCC code assumes all Neon and SVE registers are clobbered
across TLS calls.

This patch introduces a CLOBBER_HIGH expression. This behaves a bit like
a CLOBBER expression. CLOBBER_HIGH can only refer to a single register.
The mode of the expression indicates the size of the lower bits which
will be preserved. If the register contains a value bigger than this
mode then the code will treat the register as clobbered, otherwise the
register remains untouched.

This means that in order to evaluate whether a clobber high is relevant, we need to
ensure the mode of the existing value in a register is tracked.
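
As an illustration of these semantics (a sketch, not code from the series): a
(clobber_high (reg:TI V0_REGNUM)) in the tlsdesc pattern says that only the
low 128 bits (TImode) of V0 survive the call, so whether a live value is
invalidated depends on its mode; `value' below is a hypothetical rtx holding
the tracked value:

  machine_mode preserved = TImode;            /* mode of the clobber_high */
  machine_mode live_mode = GET_MODE (value);  /* mode of the value in V0  */

  if (maybe_gt (GET_MODE_SIZE (live_mode), GET_MODE_SIZE (preserved)))
    {
      /* e.g. a full SVE vector wider than 128 bits: treat V0 as clobbered
         across the TLS call.  */
    }
  else
    {
      /* e.g. a 128-bit Neon value: V0 is left untouched.  */
    }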

The first two patches introduce CLOBBER_HIGH and generation support.
The next patch adds a helper function for easily determining if a register
gets clobbered by a CLOBBER_HIGH.
The next three patches add clobber high checks to all of the passes. I
couldn't think of a better way of splitting this up (maybe needs dividing
into smaller patches?).
Finally the last patch adds the CLOBBER_HIGHS around a TLS call for
aarch64 SVE and some test cases.

Alan Hayward (7):
  Add CLOBBER_HIGH expression
  Generation support for CLOBBER_HIGH
  Add func to check if register is clobbered by clobber_high
  lra support for clobber_high
  cse support for clobber_high
  Remaining support for clobber high
  Enable clobber high for tls descs on Aarch64

 gcc/alias.c|  11 ++
 gcc/cfgexpand.c|   1 +
 gcc/combine-stack-adj.c|   1 +
 gcc/combine.c  |  38 -
 gcc/config/aarch64/aarch64.md  |  69 ++--
 gcc/cse.c  | 187 ++---
 gcc/cselib.c   |  42 +++--
 gcc/cselib.h   |   2 +-
 gcc/dce.c  |  11 +-
 gcc/df-scan.c  |   6 +
 gcc/doc/rtl.texi   |  15 +-
 gcc/dwarf2out.c|   1 +
 gcc/emit-rtl.c |  16 ++
 gcc/genconfig.c|   1 +
 gcc/genemit.c  |  51 +++---
 gcc/genrecog.c |   3 +-
 gcc/haifa-sched.c  |   3 +
 gcc/ira-build.c|   5 +
 gcc/ira-costs.c|   7 +
 gcc/ira.c  |   6 +-
 gcc/jump.c |   1 +
 gcc/lra-eliminations.c |  11 ++
 gcc/lra-int.h  |   2 +
 gcc/lra-lives.c|  31 ++--
 gcc/lra.c  |  66 +---
 gcc/postreload-gcse.c  |  21 ++-
 gcc/postreload.c   |  25 ++-
 gcc/print-rtl.c|   1 +
 gcc/recog.c|   9 +-
 gcc/regcprop.c |  10 +-
 gcc/reginfo.c  |   4 +
 gcc/reload1.c  |  16 +-
 gcc/reorg.c|  27 ++-
 gcc/resource.c |  24 ++-
 gcc/rtl.c  |  15 ++
 gcc/rtl.def|  10 ++
 gcc/rtl.h  |  27 ++-
 gcc/rtlanal.c  |  47 +-
 gcc/sched-deps.c   |  15 +-
 .../gcc.target/aarch64/sve_tls_preserve_1.c|  19 +++
 .../gcc.target/aarch64/sve_tls_preserve_2.c|  24 +++
 .../gcc.target/aarch64/sve_tls_preserve_3.c|  24 +++
 42 files changed, 725 insertions(+), 180 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c

-- 
2.15.2 (Apple Git-101.1)



Re: [PATCH 2/7] Support >26 operands in generation code.

2018-02-07 Thread Alan Hayward


> On 6 Feb 2018, at 17:03, Richard Sandiford  
> wrote:
> 
> Alan Hayward  writes:
>> This patch adds support for CLOBBER_HIGH in the generation code.
>> 
>> An aarch64 will require 31 clobber high expressions, plus two
>> clobbers.
>> 
>> The existing gen code restricts to 26 vector operands by virtue
>> of using the operators [a-z]. This patch extends this to 52 by
>> supporting [a-zA-Z].
> 
> Although the main CLOBBER_HIGH patch will obviously need to wait
> until GCC 9 now, we can work around it for GCC 8 by making the
> tlsdesc pattern clobber all vector and predicate registers for SVE.
> We'll still need the support for more than 26 operands in order
> to do that though.
> 
>> 2017-11-16  Alan Hayward  
>> 
>> [...]
>>  * genextract.c (push_pathstr_operand): New function to
>>  support [a-zA-Z].
>>  (walk_rtx): Call push_pathstr_operand.
>>  (print_path): Support [a-zA-Z].
>> [...]
>> diff --git a/gcc/genextract.c b/gcc/genextract.c
>> index 
>> 258d234d2729bf16b152b90bb1833a37a6eb0bdc..e1fb716e459b9bd219e89cf36c30556d520305a2
>>  100644
>> --- a/gcc/genextract.c
>> +++ b/gcc/genextract.c
>> @@ -33,9 +33,10 @@ along with GCC; see the file COPYING3.  If not see
>> 
>>The string for each operand describes that path to the operand and
>>contains `0' through `9' when going into an expression and `a' through
>> -   `z' when going into a vector.  We assume here that only the first operand
>> -   of an rtl expression is a vector.  genrecog.c makes the same assumption
>> -   (and uses the same representation) and it is currently true.  */
>> +   `z' then 'A' through to 'Z' when going into a vector.  We assume here 
>> that
>> +   only the first operand of an rtl expression is a vector.  genrecog.c 
>> makes
>> +   the same assumption (and uses the same representation) and it is 
>> currently
>> +   true.  */
>> 
>> typedef char *locstr;
>> 
>> @@ -80,6 +81,22 @@ struct accum_extract
>> /* Forward declarations.  */
>> static void walk_rtx (md_rtx_info *, rtx, struct accum_extract *);
>> 
>> +#define UPPER_OFFSET ('A' - ('z' - 'a' + 1))
>> +
>> +/* Convert OPERAND into a character - either into [a-zA-Z] for vector 
>> operands
>> +   or [0-9] for integer operands - and push onto the end of the path ACC.  
>> */
>> +static void
>> +push_pathstr_operand (int operand, bool is_vector,
>> + struct accum_extract *acc)
>> +{
>> +  if (is_vector && 'a' + operand > 'z')
>> +acc->pathstr.safe_push (operand + UPPER_OFFSET);
>> +  else if (is_vector)
>> +acc->pathstr.safe_push (operand + 'a');
>> +  else
>> +acc->pathstr.safe_push (operand + '0');
>> +}
>> +
>> static void
>> gen_insn (md_rtx_info *info)
>> {
>> @@ -98,7 +115,7 @@ gen_insn (md_rtx_info *info)
>>   else
>> for (i = XVECLEN (insn, 1) - 1; i >= 0; i--)
>>   {
>> -acc.pathstr.safe_push ('a' + i);
>> +push_pathstr_operand (i, true, &acc);
>>  walk_rtx (info, XVECEXP (insn, 1, i), &acc);
>>  acc.pathstr.pop ();
>>   }
>> @@ -208,7 +225,7 @@ static void
>> walk_rtx (md_rtx_info *info, rtx x, struct accum_extract *acc)
>> {
>>   RTX_CODE code;
>> -  int i, len, base;
>> +  int i, len;
>>   const char *fmt;
>> 
>>   if (x == 0)
>> @@ -234,10 +251,9 @@ walk_rtx (md_rtx_info *info, rtx x, struct 
>> accum_extract *acc)
>>   VEC_safe_set_locstr (info, &acc->oplocs, XINT (x, 0),
>> VEC_char_to_string (acc->pathstr));
>> 
>> -  base = (code == MATCH_OPERATOR ? '0' : 'a');
>>   for (i = XVECLEN (x, 2) - 1; i >= 0; i--)
>>  {
>> -  acc->pathstr.safe_push (base + i);
>> +  push_pathstr_operand (i, code != MATCH_OPERATOR, acc);
>>walk_rtx (info, XVECEXP (x, 2, i), acc);
>>acc->pathstr.pop ();
>> }
>> @@ -252,10 +268,9 @@ walk_rtx (md_rtx_info *info, rtx x, struct 
>> accum_extract *acc)
>>   if (code == MATCH_DUP)
>>  break;
>> 
>> -  base = (code == MATCH_OP_DUP ? '0' : 'a');
>>   for (i = XVECLEN (x, 1) - 1; i >= 0; i--)
>> {
>> -  acc->pathstr.safe_push (base + i);
>> +  push_pathstr_operand (i, code != MATCH_OP

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2018-01-24 Thread Alan Hayward
Ping.
Any comments on this?
The one line summary is that using self sets instead of clobber high would 
result in a
patch roughly the same, but with different condition checks.
It depends on whether people think it really is useful for self sets to not be live.

Given that we are at stage 4 now, and this can’t go in until stage 1, I’m happy 
to
leave the discussion until stage 1, but would appreciate any suggestions before 
then.

Thanks,
Alan.

> On 12 Jan 2018, at 11:58, Alan Hayward  wrote:
> 
> 
> 
>> On 19 Dec 2017, at 16:27, Jeff Law  wrote:
>> 
>> On 12/19/2017 03:12 AM, Alan Hayward wrote:
>>> Ping ping.
>>> I think there should be enough information in the first test to show that 
>>> any "set to self”
>>> registers become live. Let me know if there’s anything I’ve missed.
>> I think that both Richi and I would like you investigate fixing the df
>> infrastructure so that a self-set doesn't make things go live.  Sorry if
>> we weren't clear about that.  ie, level of complexity and likely fallout
>> so that we can evaluate that vs CLOBBER_HIGH.
>> 
>> 
>> 
>> jeff
>> 
> 
> 
> Right, sorry, I misunderstood. Ok, so I’ve been looking at trying to do this.
> 
> To summarise: to do this we need to check for 1) all reg sets to self where
> 2) the existing value of the register fits within the mode of the new set. In
> these cases we then want to (in effect) pretend the set doesn’t exist.
> To test this, I add a set to self in a tls_desc call for every V register ( 
> eg: (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)), (set (reg:TI V1_REGNUM) 
> (reg:TI V1_REGNUM))  etc etc). If the patch works, then these registers will 
> not be backed up around a tls call.
> 
> First I added some checks into df-scan and successfully stopped the sets to
> self from being added to the live and uses lists.  [ To simplify the issue 
> for now I ignored the existing value of the register, and just assumed it 
> fits ].
> However, running my test case, the code still results in my vector registers 
> being backed up around tls. Even though the dumps show that my vector 
> registers are no longer live.
> 
> Debugging further, finds that the register backing up is happening as part of 
> combine.c. Ok, I can add a check for reg set to self in here too…..
> But as part of my clobber_high patch I already had code in combine.c that
> checked for CLOBBER_HIGH and then checked the mode of the previous value in 
> the register. My new code ends up looking very similar to my clobber high 
> patch.
> 
> Instead of:
> 
> if (GET_CODE (setter) == CLOBBER_HIGH
>&& reg_is_clobbered_by_clobber_high(REGNO(dest), GET_MODE 
> (rsp->last_set_value))
> 
> Now becomes something like:
> 
> if (GET_CODE (setter) == SET
>&& REG_P (dest) && HARD_REGISTER_P (dest) && REG_P (src) && REGNO(dst) == 
> REGNO(src)
>&& reg_is_clobbered_by_self_set(REGNO(dest), GET_MODE 
> (rsp->last_set_value))
> 
> I then need to find the next pass that has similar checks…. and again it’ll 
> be the same places I already have clobber high code.
> 
> I suspect in the end I’ll be able to remove the df-scan changes, because as I 
> effectively found out with clobber high, they aren’t causing any register 
> backups to happen.
> 
> Ok, in the new patch we do save a bit of code because there is no new 
> expression to add. But that was a small part of the full patch.
> 
> I could rewrite the patch in this way, but personally I feel it’s now 
> exploiting a side effect of a set to self, rather than being explicit in what 
> it is trying to do. 
> 
> 
> Alan.
> 



Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2018-01-12 Thread Alan Hayward


> On 19 Dec 2017, at 16:27, Jeff Law  wrote:
> 
> On 12/19/2017 03:12 AM, Alan Hayward wrote:
>> Ping ping.
>> I think there should be enough information in the first test to show that 
>> any "set to self”
>> registers become live. Let me know if there’s anything I’ve missed.
> I think that both Richi and I would like you investigate fixing the df
> infrastructure so that a self-set doesn't make things go live.  Sorry if
> we weren't clear about that.  ie, level of complexity and likely fallout
> so that we can evaluate that vs CLOBBER_HIGH.
> 
> 
> 
> jeff
> 


Right, sorry, I misunderstood. Ok, so I’ve been looking at trying to do this.

To summarise: to do this we need to check for 1) all reg sets to self where 2)
the existing value of the register fits within the mode of the new set. In
these cases we then want to (in effect) pretend the set doesn’t exist.
To test this, I add a set to self in a tls_desc call for every V register ( eg: 
(set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)), (set (reg:TI V1_REGNUM) (reg:TI 
V1_REGNUM))  etc etc). If the patch works, then these registers will not be 
backed up around a tls call.

First I added some checks into df-scan and successfully stopped the sets to self
from being added to the live and uses lists.  [ To simplify the issue for now I 
ignored the existing value of the register, and just assumed it fits ].
However, running my test case, the code still results in my vector registers 
being backed up around tls. Even though the dumps show that my vector registers 
are no longer live.

Debugging further, finds that the register backing up is happening as part of 
combine.c. Ok, I can add a check for reg set to self in here too…..
But as part of my clobber_high patch I already had code in combine.c that
checked for CLOBBER_HIGH and then checked the mode of the previous value in the 
register. My new code ends up looking very similar to my clobber high patch.

Instead of:

if (GET_CODE (setter) == CLOBBER_HIGH
&& reg_is_clobbered_by_clobber_high(REGNO(dest), GET_MODE 
(rsp->last_set_value))

Now becomes something like:

if (GET_CODE (setter) == SET
&& REG_P (dest) && HARD_REGISTER_P (dest) && REG_P (src) && REGNO(dst) == 
REGNO(src)
&& reg_is_clobbered_by_self_set(REGNO(dest), GET_MODE (rsp->last_set_value))

I then need to find the next pass that has similar checks…. and again it’ll be 
the same places I already have clobber high code.

I suspect in the end I’ll be able to remove the df-scan changes, because as I 
effectively found out with clobber high, they aren’t causing any register 
backups to happen.

Ok, in the new patch we do save a bit of code because there is no new 
expression to add. But that was a small part of the full patch.

I could rewrite the patch in this way, but personally I feel it’s now 
exploiting a side effect of a set to self, rather than being explicit in what 
it is trying to do. 


Alan.



Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-12-19 Thread Alan Hayward
Ping ping.
I think there should be enough information in the first test to show that any
“set to self” registers become live. Let me know if there’s anything I’ve missed.
registers become live. Let me know if there’s anything I’ve missed.


Thanks,
Alan.

> On 12 Dec 2017, at 11:11, Alan Hayward  wrote:
> 
> Ping.
> 
>> On 30 Nov 2017, at 11:03, Alan Hayward  wrote:
>> 
>> 
>>> On 27 Nov 2017, at 17:29, Jeff Law  wrote:
>>> 
>>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>> 
>>>>> On 22 Nov 2017, at 17:33, Jeff Law  wrote:
>>>>> 
>>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>> 
>>>>>>> On 21 Nov 2017, at 03:13, Jeff Law  wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>>>> to what you need…
>>>>>>>> 
>>>>>>>> Yes, some of the code is similar to the way
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>>> CLOBBER expr code served as a starting point for writing the patch. 
>>>>>>>> The main difference
>>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a 
>>>>>>>> specific Instruction,
>>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied 
>>>>>>>> to an expression (tls_desc).
>>>>>>>> It meant there wasn’t really any opportunity to reuse any existing
>>>>>>>> code.
>>>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>>> represented as normal insns, not call_insn.
>>>>>>> 
>>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>>> the register.
>>>>>>> 
>>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>>> because that's the existing semantics of a subreg store.
>>>>>>> 
>>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>>> wrong given the semantics of a partial store.
>>>>>>> 
>>>>>> 
>>>>>> So, instead of using clobber_high(reg X), use set(reg X, reg X).
>>>>>> It’s something we considered, and then dismissed.
>>>>>> 
>>>>>> The problem then is you are now using SET semantics on those registers, 
>>>>>> and it
>>>>>> would make the register live around the function, which might not be the 
>>>>>> case.
>>>>>> Whereas clobber semantics will just make the register dead - which is 
>>>>>> exactly
>>>>>> what we want (but only conditionally).
>>>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>>>> low part is preserved while the upper part is clobbered across the TLS
>>>>> insns.
>>>>> 
>>>>> jeff
>>>> 
>>>> Consider where the TLS call is inside a loop. The compiler would normally 
>>>> want
>>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of 
>>>> the tls_desc we
>>>> are now making x live across the loop, x is dependent on the value from
>>>> the previous
>>>> iteration, and the tls_desc can no longer be hoisted.
>>> Hmm.  I think I see the problem you're trying to point out.  Let me
>>> restate it and see if you agree.
>>> 
>>> The low subreg set does clearly indicate the upper part of the SVE
>>> register is clobbered.  The problem is from a liveness standpoint the
>>> compiler is considering the low part li

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-12-12 Thread Alan Hayward
Ping.

> On 30 Nov 2017, at 11:03, Alan Hayward  wrote:
> 
> 
>> On 27 Nov 2017, at 17:29, Jeff Law  wrote:
>> 
>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>> 
>>>> On 22 Nov 2017, at 17:33, Jeff Law  wrote:
>>>> 
>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>> 
>>>>>> On 21 Nov 2017, at 03:13, Jeff Law  wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>>> to what you need…
>>>>>>> 
>>>>>>> Yes, some of the code is similar to the way
>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>> CLOBBER expr code served as a starting point for writing the patch. The 
>>>>>>> main difference
>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a 
>>>>>>> specific Instruction,
>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied 
>>>>>>> to an expression (tls_desc).
>>>>>>> It meant there wasn’t really any opportunity to reuse any existing
>>>>>>> code.
>>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>> represented as normal insns, not call_insn.
>>>>>> 
>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>> the register.
>>>>>> 
>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>> because that's the existing semantics of a subreg store.
>>>>>> 
>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>> wrong given the semantics of a partial store.
>>>>>> 
>>>>> 
>>>>> So, instead of using clobber_high(reg X), use set(reg X, reg X).
>>>>> It’s something we considered, and then dismissed.
>>>>> 
>>>>> The problem then is you are now using SET semantics on those registers, 
>>>>> and it
>>>>> would make the register live around the function, which might not be the 
>>>>> case.
>>>>> Whereas clobber semantics will just make the register dead - which is 
>>>>> exactly
>>>>> what we want (but only conditionally).
>>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>>> low part is preserved while the upper part is clobbered across the TLS
>>>> insns.
>>>> 
>>>> jeff
>>> 
>>> Consider where the TLS call is inside a loop. The compiler would normally 
>>> want
>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of 
>>> the tls_desc we
>>> are now making x live across the loop, x is dependant on the value from the 
>>> previous
>>> iteration, and the tls_desc can no longer be hoisted.
>> Hmm.  I think I see the problem you're trying to point out.  Let me
>> restate it and see if you agree.
>> 
>> The low subreg set does clearly indicate the upper part of the SVE
>> register is clobbered.  The problem is from a liveness standpoint the
>> compiler is considering the low part live, even though it's a self-set.
>> 
> 
> Yes.
> 
>> In fact, if that is the case, then a single TLS call (independent of a
>> loop) would make the low part of the register globally live.  This
>> should be testable.  Include one of these low part self sets on the
>> existing TLS calls and compile a little test function and let's look at
>> the liveness data.
>> 
> 
> 
> Ok, so I’ve put 
> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-30 Thread Alan Hayward

> On 27 Nov 2017, at 17:29, Jeff Law  wrote:
> 
> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>> 
>>> On 22 Nov 2017, at 17:33, Jeff Law  wrote:
>>> 
>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>> 
>>>>> On 21 Nov 2017, at 03:13, Jeff Law  wrote:
>>>>>> 
>>>>>>> 
>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>> to what you need…
>>>>>> 
>>>>>> Yes, some of the code is similar to the way
>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>> CLOBBER expr code served as a starting point for writing the patch. The 
>>>>>> main difference
>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a 
>>>>>> specific Instruction,
>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to 
>>>>>> an expression (tls_desc).
>>>>>> It meant there wasn’t really any opportunity to resume any existing code.
>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>> represented as normal insns, not call_insn.
>>>>> 
>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>> the register.
>>>>> 
>>>>> That should tell the generic parts of the compiler that you're just
>>>>> clobbering the upper part and at least in theory you can implement in
>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>> because that's the existing semantics of a subreg store.
>>>>> 
>>>>> The only worry would be if a pass tried to get overly smart and
>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>> wrong given the semantics of a partial store.
>>>>> 
>>>> 
>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>>>> It’s something we considered, and then dismissed.
>>>> 
>>>> The problem then is you are now using SET semantics on those registers, 
>>>> and it
>>>> would make the register live around the function, which might not be the 
>>>> case.
>>>> Whereas clobber semantics will just make the register dead - which is 
>>>> exactly
>>>> what we want (but only conditionally).
>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>> low part is preserved while the upper part is clobbered across the TLS
>>> insns.
>>> 
>>> jeff
>> 
>> Consider where the TLS call is inside a loop. The compiler would normally 
>> want
>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the 
>> tls_desc we
>> are now making x live across the loop, x is dependant on the value from the 
>> previous
>> iteration, and the tls_desc can no longer be hoisted.
> Hmm.  I think I see the problem you're trying to point out.  Let me
> restate it and see if you agree.
> 
> The low subreg set does clearly indicate the upper part of the SVE
> register is clobbered.  The problem is from a liveness standpoint the
> compiler is considering the low part live, even though it's a self-set.
> 

Yes.

> In fact, if that is the case, then a single TLS call (independent of a
> loop) would make the low part of the register globally live.  This
> should be testable.  Include one of these low part self sets on the
> existing TLS calls and compile a little test function and let's look at
> the liveness data.
> 


Ok, so I’ve put 
(set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
(set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the 
neon registers.

In an ideal world, this change would do nothing as neon reg size is TI.

First I compiled up sve_tls_preserve_1.c (Test1 below).

Without the SET changes, gcc does not back up any neon registers before the
tlsdesc call (good).
With the SET changes, gcc backs up all the neon registers (not ideal).

Without the SET changes, dump logs give:

cse1:
;;  regs ever l

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-23 Thread Alan Hayward

> On 22 Nov 2017, at 17:33, Jeff Law  wrote:
> 
> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>> 
>>> On 21 Nov 2017, at 03:13, Jeff Law  wrote:
>>>> 
>>>>> 
>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>> to what you need…
>>>> 
>>>> Yes, some of the code is similar to the way
>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>> CLOBBER expr code served as a starting point for writing the patch. The 
>>>> main difference
>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a 
>>>> specific Instruction,
>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to 
>>>> an expression (tls_desc).
>>>> It meant there wasn’t really any opportunity to resume any existing code.
>>> Understood.  Though your first patch mentions that you're trying to
>>> describe partial preservation "around TLS calls". Presumably those are
>>> represented as normal insns, not call_insn.
>>> 
>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>> to itself using whatever mode is wide enough to cover the neon part of
>>> the register.
>>> 
>>> That should tell the generic parts of the compiler that you're just
>>> clobbering the upper part and at least in theory you can implement in
>>> the aarch64 backend and the rest of the compiler should "just work"
>>> because that's the existing semantics of a subreg store.
>>> 
>>> The only worry would be if a pass tried to get overly smart and
>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>> wrong given the semantics of a partial store.
>>> 
>> 
>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>> It’s something we considered, and then dismissed.
>> 
>> The problem then is you are now using SET semantics on those registers, and 
>> it
>> would make the register live around the function, which might not be the 
>> case.
>> Whereas clobber semantics will just make the register dead - which is exactly
>> what we want (but only conditionally).
> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
> low part is preserved while the upper part is clobbered across the TLS
> insns.
> 
> jeff

Consider where the TLS call is inside a loop. The compiler would normally want
to hoist that out of the loop. By adding a set(x,x) into the parallel of the
tls_desc we are now making x live across the loop, x is dependent on the value
from the previous iteration, and the tls_desc can no longer be hoisted.

Or consider a stream of code containing two tls_desc calls (ok, the compiler
might optimise one of the tls calls away, but this approach should be reusable
for other exprs). Between the two set(x,x)’s, x is considered live, so the
register allocator can’t use that register. Given that we are applying this to
all the neon registers, the register allocator now throws an ICE because it
can’t find any free hard neon registers to use.
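
For illustration only (a minimal sketch, not part of the patch or its tests;
the __thread variable t is hypothetical), the kind of loop in question:

__thread int t;

int
sum_tls (int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] + t;   /* With -fpic the address of t comes from a tlsdesc
                        call, which is loop-invariant and which we would
                        like to remain hoistable out of the loop.  */
  return s;
}

A set(x,x) in the tls_desc parallel would make x appear live across the loop
and block exactly that hoisting.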


Alan.

Re: [PATCH] Fix result for conditional reductions matching at index 0

2017-11-23 Thread Alan Hayward

> On 22 Nov 2017, at 16:57, Kilian Verhetsel  
> wrote:
> 
> 
> Thank you both for your comments.
> 
> I have added the check to ensure the index vector won't cause an
> overflow. I also added tests to the testsuite in order to check that the
> loop is vectorized for UINT_MAX - 1 iterations but not UINT_MAX
> iterations. I was not able to write code that triggers
> INTEGER_INDUC_COND_REDUCTION when using char or other smaller types
> (changing the types of last, min_v and a to something else causes
> COND_REDUCTION to be used instead), so these tests are only compiled and
> not executed.

I had similar problems when playing around with -14.c, but didn’t have a chance
to investigate why. It’s possibly worth raising a bug mentioning there is a
missed optimisation. It’d be nice to figure out why.

> 
> I also moved an instruction that generates a vector of zeroes (used for
> COND_REDUCTION) in the branch of code run only for COND_REDUCTION, this
> should remove the unused vector that Alan noticed.

Patch is ok for me.


Alan.



Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-22 Thread Alan Hayward

> On 21 Nov 2017, at 03:13, Jeff Law  wrote:
>> 
>>> 
>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>> totally forgotten about it.  And in fact it seems to come pretty close
>>> to what you need…
>> 
>> Yes, some of the code is similar to the way
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>> CLOBBER expr code served as a starting point for writing the patch. The main 
>> difference
>> here, is that _PART_CLOBBERED is around all calls and is not tied to a 
>> specific Instruction,
>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to an 
>> expression (tls_desc).
>> It meant there wasn’t really any opportunity to resume any existing code.
> Understood.  Though your first patch mentions that you're trying to
> describe partial preservation "around TLS calls".  Presumably those are
> represented as normal insns, not call_insn.
> 
> That brings me back to Richi's idea of exposing a set of the low subreg
> to itself using whatever mode is wide enough to cover the neon part of
> the register.
> 
> That should tell the generic parts of the compiler that you're just
> clobbering the upper part and at least in theory you can implement in
> the aarch64 backend and the rest of the compiler should "just work"
> because that's the existing semantics of a subreg store.
> 
> The only worry would be if a pass tried to get overly smart and
> considered that kind of set a nop -- but I think I'd argue that's simply
> wrong given the semantics of a partial store.
> 

So, instead of using clobber_high(reg X), use set(reg X, reg X).
It’s something we considered, and then dismissed.

The problem then is that you are now using SET semantics on those registers,
which would make the register live around the function - and that might not be
the case. Whereas clobber semantics will just make the register dead, which is
exactly what we want (but only conditionally).


Alan.

Re: [PATCH] Fix result for conditional reductions matching at index 0

2017-11-22 Thread Alan Hayward

> On 22 Nov 2017, at 09:14, Richard Biener  wrote:
> 
> On Tue, Nov 21, 2017 at 5:43 PM, Kilian Verhetsel
>  wrote:
>> 
>>> This is PR81179 I think, please mention that in the changelog.
>> 
>> Correct, my bad for missing that.
>> 
>>> This unconditionally pessimizes code even if there is no valid index
>>> zero, right?
>> 
>> Almost, since for a loop such as:
>> 
>>  #define OFFSET 1
>>  unsigned int find(const unsigned int *a, unsigned int v) {
>>unsigned int result = 120;
>>for (unsigned int i = OFFSET; i < 32+OFFSET; i++) {
>>  if (a[i-OFFSET] == v) result = i;
>>}
>>return result;
>>  }
>> 
>> the index i will match the contents of the index vector used here ---
>> but this does indeed pessimize the code generated for, say, OFFSET
>> = 2. It is probably more sensible to use the existing code path in those
>> situations.
>> 
>>> The issue with the COND_REDUCITION index vector is overflow IIRC.
>> 
>> Does that mean such overflows can already manifest themselves for
>> regular COND_REDUCTION? I had assumed sufficient checks were already in
>> place because of the presence of the is_nonwrapping_integer_induction
>> test.
> 
> But only if we need the index vector?  The patch looked like you're changing
> how other modes are handled (in my look I didn't make myself familiar with
> the various modes again...).  Anyway, Alan hopefully remembers what he
> coded so I'll defer to him here.
> 
> If Alan is happy with the patch consider it approved.
> 

Richard’s right with his question.

The optimisation needs to fail if the number of iterations of the loop + 1
doesn’t fit into the data type used for the result.

I took the test pr65947-14.c.
First I set N to 0x-1. This compiled and vectorised. That’s ok.
Now if I set N to 0x it still vectorises, but this should fail.

Compare to pr65947-14.c where we set last = a[i]; inside the if.
When I set N to 0x-1, it compiled and vectorised. That’s ok.
When I set N to 0x it fails to vectorise with the message
“loop size is greater than data size”.

Looks like you might just need to add the one check.

Also see pr65947-9.c which uses the slightly more useful char indexes.
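
As a rough illustration (a sketch only, not one of the existing testcases),
the shape of loop involved:

/* The conditional reduction stores the loop index into an unsigned int
   result.  The index vector used by COND_REDUCTION starts at 1, so it
   needs niters + 1 distinct values; if the iteration count + 1 cannot be
   represented in unsigned int, vectorisation has to be rejected.  */
unsigned int
last_match (unsigned char *a, unsigned char v, unsigned int n)
{
  unsigned int last = 0;
  for (unsigned int i = 0; i < n; i++)
    if (a[i] == v)
      last = i;
  return last;
}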


Alan.





Re: [PATCH] Fix result for conditional reductions matching at index 0

2017-11-21 Thread Alan Hayward

> On 21 Nov 2017, at 16:43, Kilian Verhetsel  
> wrote:
> 
> 
>> This is PR81179 I think, please mention that in the changelog.
> 
> Correct, my bad for missing that.
> 
>> This unconditionally pessimizes code even if there is no valid index
>> zero, right?
> 
> Almost, since for a loop such as:
> 
>  #define OFFSET 1
>  unsigned int find(const unsigned int *a, unsigned int v) {
>unsigned int result = 120;
>for (unsigned int i = OFFSET; i < 32+OFFSET; i++) {
>  if (a[i-OFFSET] == v) result = i;
>}
>return result;
>  }
> 
> the index i will match the contents of the index vector used here ---
> but this does indeed pessimize the code generated for, say, OFFSET
> = 2. It is probably more sensible to use the existing code path in those
> situations.
> 

Looking at the final asm on aarch64 for -14.c, the code has only grown
by a single instruction in the epilogue, which is good, given the vector
pass dump for this patch is quite a bit longer.

In the vector dump, there is a vector of 0’s
_57 = { 0, 0, 0, 0 };
created which is then never used (and later gets optimised away).
It’d be nice if that could avoid getting created.
I’ve not had a chance to scrutinise the patch yet to see where it is created.


>> The issue with the COND_REDUCITION index vector is overflow IIRC.
> 
> Does that mean such overflows can already manifest themselves for
> regular COND_REDUCTION? I had assumed sufficient checks were already in
> place because of the presence of the is_nonwrapping_integer_induction
> test.

As an aside, it’s possibly worth mentioning that this issue will go away for
aarch64-sve when Richard Sandiford’s SVE patch goes in, as that’ll support
CLASTB reduction.



Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-20 Thread Alan Hayward

> On 17 Nov 2017, at 19:31, Jeff Law  wrote:
> 
> On 11/16/2017 11:50 AM, Alan Hayward wrote:
>> 
>>> On 16 Nov 2017, at 18:24, Richard Biener  wrote:
>>> 
>>> On November 16, 2017 7:05:30 PM GMT+01:00, Jeff Law  wrote:
>>>> On 11/16/2017 05:34 AM, Alan Hayward wrote:
>>>>> This is a set of patches aimed at supporting aarch64 SVE register
>>>>> preservation around TLS calls.
>>>>> 
>>>>> Across a TLS call, Aarch64 SVE does not explicitly preserve the
>>>>> SVE vector registers. However, the Neon vector registers are
>>>> preserved.
>>>>> Due to overlapping of registers, this means the lower 128bits of all
>>>>> SVE vector registers will be preserved.
>>>>> 
>>>>> The existing GCC code will currently incorrectly assume preservation
>>>>> of all of the SVE registers.
>>>>> 
>>>>> This patch introduces a CLOBBER_HIGH expression. This behaves a bit
>>>> like
>>>>> a CLOBBER expression. CLOBBER_HIGH can only refer to a single
>>>> register.
>>>>> The mode of the expression indicates the size of the lower bits which
>>>>> will be preserved. If the register contains a value bigger than this
>>>>> mode then the code will treat the register as clobbered.
>>>>> 
>>>>> The means in order to evaluate if a clobber high is relevant, we need
>>>> to ensure
>>>>> the mode of the existing value in a register is tracked.
>>>>> 
>>>>> The following patches in this series add support for the
>>>> CLOBBER_HIGH,
>>>>> with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for
>>>>> aarch64. The testing performed on these patches is also detailed in
>>>> the
>>>>> final patch.
>>>>> 
>>>>> These patches are based on top of the linaro-dev/sve branch.
>>>>> 
>>>>> A simpler alternative to this patch would be to assume all Neon and
>>>> SVE
>>>>> registers are clobbered across TLS calls, however this would be a
>>>>> performance regression against all Aarch64 targets.
>>>> So just a couple design questions.
>>>> 
>>>> Presumably there's no reasonable way to set up GCC's view of the
>>>> register file to avoid this problem?  ISTM that if the SVE register was
>>>> split into two, one for the part that overlapped with the neon register
>>>> and one that did not, then this could be handled via standard
>>>> mechanisms?
>>>> 
>> 
>> Yes, that was an early alternative option for the patch.
>> 
>> With that it would effect every operation that uses SVE registers. A simple
>> add of two registers now has 4 inputs and two outputs. It would get in the
>> way when debugging any sve dumps and be generally annoying.
>> Possible that the code for that in would all be in the aarch64 target,
>> (making everyone else happy!) But I suspect that there would be still be
>> strange dependency issues that’d need sorting in the common code.
>> 
>> Whereas with this patch, there are no new oddities in non-tls compiles/dumps.
>> Although the patch touches a lot of files, the changes are mostly restricted
>> to places where standard clobbers were already being checked.
> I'm not entirely sure that it would require doubling the number of
> inputs/outputs.  It's not conceptually much different than how we
> describe DImode operations on 32bit targets.  The mode selects one or
> more consecutive registers, so you don't actually need anything weird in
> your patterns.  This is pretty standard stuff.

Ok, fair enough.

> 
> 
> It would be painful in that the Neon regs would have to interleave with
> the upper part of the SVE regs in terms of register numbers.  It would
> also mean that you couldn't tie together multiple neon regs into
> something wider.  I'm not sure if the latter would be an issue or not.

And there’s also the weirdness that the register would not be split evenly - 
it’ll be a TI
reg followed by a reg of the size of multiple TIs.

All of that has the potential to complicate all non-sve aarch64 code.

> 
> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
> totally forgotten about it.  And in fact it seems to come pretty close
> to what you need…

Yes, some of the code is similar to the way
TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
CLOBBER expr code served as a starting point for writing the patch. The main
difference here is that _PART_CLOBBERED applies around all calls and is not
tied to a specific instruction - it’s part of the calling ABI. Whereas
clobber_high is explicitly tied to an expression (tls_desc).
It meant there wasn’t really any opportunity to reuse any existing code.


Alan.




Re: [PATCH 7/7]: Enable clobber high for tls descs on Aarch64

2017-11-17 Thread Alan Hayward

> On 16 Nov 2017, at 19:32, Andrew Pinski  wrote:
> 
> On Thu, Nov 16, 2017 at 4:35 AM, Alan Hayward  wrote:
>> This final patch adds the clobber high expressions to tls_desc for aarch64.
>> It also adds three tests.
>> 
>> In addition I also tested by taking the gcc torture test suite and making
>> all global variables __thread. Then emended the suite to compile with -fpic,
>> save the .s file and only for one given O level.
>> I ran this before and after the patch and compared the resulting .s files,
>> ensuring that there were no ASM changes.
>> I discarded the 10% of tests that failed to compile (due to the code in
>> the test now being invalid C).
>> I did this for O0,O2,O3 on both x86 and aarch64 and observed no difference
>> between ASM files before and after the patch.
> 
> Isn't the ABI defined as non-clobbering the lower 64bits for normal
> function calls?  Or is the TLS function "special" in that it
> saves/restores the 128bit registers; is that documented anywhere?  The
> main reason why I am asking is because glibc is not the only libc out
> there and someone could have a slightly different ABI here.
> 

In NEON, all the SIMD registers are preserved around TLS calls - all
128 bits of each register. That’s standard ABI behaviour for NEON.

SVE doesn’t have any explicit preserving of its SIMD registers.

However, the NEON and SVE registers share the same silicon - the lower
128 bits of each SVE register are the same as the corresponding NEON
register. The side effect of this is that the lower 128 bits of the SVE
registers get backed up.

Neither glibc nor any other libraries need updating to support this.
But compilers do need to be aware of this.


Alan.

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-16 Thread Alan Hayward

> On 16 Nov 2017, at 18:24, Richard Biener  wrote:
> 
> On November 16, 2017 7:05:30 PM GMT+01:00, Jeff Law  wrote:
>> On 11/16/2017 05:34 AM, Alan Hayward wrote:
>>> This is a set of patches aimed at supporting aarch64 SVE register
>>> preservation around TLS calls.
>>> 
>>> Across a TLS call, Aarch64 SVE does not explicitly preserve the
>>> SVE vector registers. However, the Neon vector registers are
>> preserved.
>>> Due to overlapping of registers, this means the lower 128bits of all
>>> SVE vector registers will be preserved.
>>> 
>>> The existing GCC code will currently incorrectly assume preservation
>>> of all of the SVE registers.
>>> 
>>> This patch introduces a CLOBBER_HIGH expression. This behaves a bit
>> like
>>> a CLOBBER expression. CLOBBER_HIGH can only refer to a single
>> register.
>>> The mode of the expression indicates the size of the lower bits which
>>> will be preserved. If the register contains a value bigger than this
>>> mode then the code will treat the register as clobbered.
>>> 
>>> The means in order to evaluate if a clobber high is relevant, we need
>> to ensure
>>> the mode of the existing value in a register is tracked.
>>> 
>>> The following patches in this series add support for the
>> CLOBBER_HIGH,
>>> with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for
>>> aarch64. The testing performed on these patches is also detailed in
>> the
>>> final patch.
>>> 
>>> These patches are based on top of the linaro-dev/sve branch.
>>> 
>>> A simpler alternative to this patch would be to assume all Neon and
>> SVE
>>> registers are clobbered across TLS calls, however this would be a
>>> performance regression against all Aarch64 targets.
>> So just a couple design questions.
>> 
>> Presumably there's no reasonable way to set up GCC's view of the
>> register file to avoid this problem?  ISTM that if the SVE register was
>> split into two, one for the part that overlapped with the neon register
>> and one that did not, then this could be handled via standard
>> mechanisms?
>> 

Yes, that was an early alternative option for the patch.

With that it would affect every operation that uses SVE registers. A simple
add of two registers now has 4 inputs and two outputs. It would get in the
way when debugging any sve dumps and be generally annoying.
Possibly the code for that would all be in the aarch64 target
(making everyone else happy!), but I suspect that there would still be
strange dependency issues that’d need sorting in the common code.

Whereas with this patch, there are no new oddities in non-tls compiles/dumps.
Although the patch touches a lot of files, the changes are mostly restricted
to places where standard clobbers were already being checked.


>> Alternately would it be easier to clobber a subreg representing the
>> high
>> part of the register?  Hmm, probably not.
> 
> I thought of a set of the preserved part to itself that leaves the upper part 
> undefined. Not sure if we have such thing or if it would work in all places 
> that a clobber does.

I’ve not seen such a thing in the code. But it would need specific handling in
all the existing clobber code.


Alan.




[PATCH 7/7]: Enable clobber high for tls descs on Aarch64

2017-11-16 Thread Alan Hayward
This final patch adds the clobber high expressions to tls_desc for aarch64.
It also adds three tests.

In addition, I also tested by taking the gcc torture test suite and making
all global variables __thread. I then amended the suite to compile with -fpic
and save the .s file, for only one given O level at a time.
I ran this before and after the patch and compared the resulting .s files,
ensuring that there were no ASM changes.
I discarded the 10% of tests that failed to compile (due to the code in
the test now being invalid C).
I did this for O0,O2,O3 on both x86 and aarch64 and observed no difference
between ASM files before and after the patch.

Alan.

2017-11-16  Alan Hayward  

gcc/
* config/aarch64/aarch64.md: Add clobber highs to tls_desc.

gcc/testsuite/  
* gcc.target/aarch64/sve_tls_preserve_1.c: New test.
* gcc.target/aarch64/sve_tls_preserve_2.c: New test.
* gcc.target/aarch64/sve_tls_preserve_3.c: New test.



diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
6a15ff0b61d775cf30189b8503cfa45987701228..1f332b254fe0e37954efbe92982f214100d7046f
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -57,7 +57,36 @@
 (LR_REGNUM 30)
 (SP_REGNUM 31)
 (V0_REGNUM 32)
+(V1_REGNUM 33)
+(V2_REGNUM 34)
+(V3_REGNUM 35)
+(V4_REGNUM 36)
+(V5_REGNUM 37)
+(V6_REGNUM 38)
+(V7_REGNUM 39)
+(V8_REGNUM 40)
+(V9_REGNUM 41)
+(V10_REGNUM42)
+(V11_REGNUM43)
+(V12_REGNUM44)
+(V13_REGNUM45)
+(V14_REGNUM46)
 (V15_REGNUM47)
+(V16_REGNUM48)
+(V17_REGNUM49)
+(V18_REGNUM50)
+(V19_REGNUM51)
+(V20_REGNUM52)
+(V21_REGNUM53)
+(V22_REGNUM54)
+(V23_REGNUM55)
+(V24_REGNUM56)
+(V25_REGNUM57)
+(V26_REGNUM58)
+(V27_REGNUM59)
+(V28_REGNUM60)
+(V29_REGNUM61)
+(V30_REGNUM62)
 (V31_REGNUM63)
 (LAST_SAVED_REGNUM 63)
 (SFP_REGNUM64)
@@ -5745,6 +5774,38 @@
   UNSPEC_TLSDESC))
(clobber (reg:DI LR_REGNUM))
(clobber (reg:CC CC_REGNUM))
+   (clobber_high (reg:TI V0_REGNUM))
+   (clobber_high (reg:TI V1_REGNUM))
+   (clobber_high (reg:TI V2_REGNUM))
+   (clobber_high (reg:TI V3_REGNUM))
+   (clobber_high (reg:TI V4_REGNUM))
+   (clobber_high (reg:TI V5_REGNUM))
+   (clobber_high (reg:TI V6_REGNUM))
+   (clobber_high (reg:TI V7_REGNUM))
+   (clobber_high (reg:TI V8_REGNUM))
+   (clobber_high (reg:TI V9_REGNUM))
+   (clobber_high (reg:TI V10_REGNUM))
+   (clobber_high (reg:TI V11_REGNUM))
+   (clobber_high (reg:TI V12_REGNUM))
+   (clobber_high (reg:TI V13_REGNUM))
+   (clobber_high (reg:TI V14_REGNUM))
+   (clobber_high (reg:TI V15_REGNUM))
+   (clobber_high (reg:TI V16_REGNUM))
+   (clobber_high (reg:TI V17_REGNUM))
+   (clobber_high (reg:TI V18_REGNUM))
+   (clobber_high (reg:TI V19_REGNUM))
+   (clobber_high (reg:TI V20_REGNUM))
+   (clobber_high (reg:TI V21_REGNUM))
+   (clobber_high (reg:TI V22_REGNUM))
+   (clobber_high (reg:TI V23_REGNUM))
+   (clobber_high (reg:TI V24_REGNUM))
+   (clobber_high (reg:TI V25_REGNUM))
+   (clobber_high (reg:TI V26_REGNUM))
+   (clobber_high (reg:TI V27_REGNUM))
+   (clobber_high (reg:TI V28_REGNUM))
+   (clobber_high (reg:TI V29_REGNUM))
+   (clobber_high (reg:TI V30_REGNUM))
+   (clobber_high (reg:TI V31_REGNUM))
(clobber (match_scratch:DI 1 "=r"))]
   "TARGET_TLS_DESC"
   "adrp\\tx0, %A0\;ldr\\t%1, [x0, #%L0]\;add\\t0, 0, 
%L0\;.tlsdesccall\\t%0\;blr\\t%1"
diff --git a/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
new file mode 100644
index 
..5bad829568130181ef1ab386545bd3ee164c322e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fpic -march=armv8-a+sve" } */
+
+/* Clobber highs do not need to be spilled around tls usage.  */
+
+typedef float v4si __attribute__ ((vector_size (16)));
+
+__thread v4si tx;
+
+v4si foo (v4si a, v4si b, v4si c)
+{
+  v4si y;
+
+  y = a + tx + b + c;
+
+  return y + 7;
+}
+
+/* { dg-final { scan-assembler-not {\tstr\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
new file mode 100644
index 
..69e8829287b8418c28f8c227391c4f8d2186ea63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
@@ -0,0 +1

[PATCH 5/7]: cse support for clobber_high

2017-11-16 Thread Alan Hayward
This patch simply adds the cse specific changes for clobber_high.

Alan.

2017-11-16  Alan Hayward  

* cse.c (invalidate_reg): New function extracted from...
(invalidate): ...here.
(canonicalize_insn): Check for clobber high.
(invalidate_from_clobbers): invalidate clobber highs.
(invalidate_from_sets_and_clobbers): Likewise.
(count_reg_usage): Check for clobber high.
(insn_live_p): Likewise.
* cselib.c (cselib_expand_value_rtx_1):Likewise.
(cselib_invalidate_regno): Check for clobber in setter.
(cselib_invalidate_rtx): Pass through setter.
(cselib_invalidate_rtx_note_stores):
(cselib_process_insn): Check for clobber high.
* cselib.h (cselib_invalidate_rtx): Add operand.

diff --git a/gcc/cse.c b/gcc/cse.c
index 
e3c0710215df0acca924ce74ffa54582125d0136..f15ae8693fbb243323dd049e21648a49546a4608
 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -559,6 +559,7 @@ static struct table_elt *insert_with_costs (rtx, struct 
table_elt *, unsigned,
 static struct table_elt *insert (rtx, struct table_elt *, unsigned,
 machine_mode);
 static void merge_equiv_classes (struct table_elt *, struct table_elt *);
+static void invalidate_reg (rtx, bool);
 static void invalidate (rtx, machine_mode);
 static void remove_invalid_refs (unsigned int);
 static void remove_invalid_subreg_refs (unsigned int, poly_uint64,
@@ -1818,7 +1819,85 @@ check_dependence (const_rtx x, rtx exp, machine_mode 
mode, rtx addr)
 }
   return false;
 }
-

+
+/* Remove from the hash table, or mark as invalid, all expressions whose
+   values could be altered by storing in register X.
+
+   CLOBBER_HIGH is set if X was part of a CLOBBER_HIGH expression.  */
+
+static void
+invalidate_reg (rtx x, bool clobber_high)
+{
+  gcc_assert (GET_CODE (x) == REG);
+
+  /* If X is a register, dependencies on its contents are recorded
+ through the qty number mechanism.  Just change the qty number of
+ the register, mark it as invalid for expressions that refer to it,
+ and remove it itself.  */
+  unsigned int regno = REGNO (x);
+  unsigned int hash = HASH (x, GET_MODE (x));
+
+  /* Remove REGNO from any quantity list it might be on and indicate
+ that its value might have changed.  If it is a pseudo, remove its
+ entry from the hash table.
+
+ For a hard register, we do the first two actions above for any
+ additional hard registers corresponding to X.  Then, if any of these
+ registers are in the table, we must remove any REG entries that
+ overlap these registers.  */
+
+  delete_reg_equiv (regno);
+  REG_TICK (regno)++;
+  SUBREG_TICKED (regno) = -1;
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+{
+  gcc_assert (!clobber_high);
+  remove_pseudo_from_table (x, hash);
+}
+  else
+{
+  HOST_WIDE_INT in_table = TEST_HARD_REG_BIT (hard_regs_in_table, regno);
+  unsigned int endregno = END_REGNO (x);
+  unsigned int rn;
+  struct table_elt *p, *next;
+
+  CLEAR_HARD_REG_BIT (hard_regs_in_table, regno);
+
+  for (rn = regno + 1; rn < endregno; rn++)
+   {
+ in_table |= TEST_HARD_REG_BIT (hard_regs_in_table, rn);
+ CLEAR_HARD_REG_BIT (hard_regs_in_table, rn);
+ delete_reg_equiv (rn);
+ REG_TICK (rn)++;
+ SUBREG_TICKED (rn) = -1;
+   }
+
+  if (in_table)
+   for (hash = 0; hash < HASH_SIZE; hash++)
+ for (p = table[hash]; p; p = next)
+   {
+ next = p->next_same_hash;
+
+ if (!REG_P (p->exp) || REGNO (p->exp) >= FIRST_PSEUDO_REGISTER)
+   continue;
+
+ if (clobber_high)
+   {
+ if (reg_is_clobbered_by_clobber_high (p->exp, x))
+   remove_from_table (p, hash);
+   }
+ else
+   {
+ unsigned int tregno = REGNO (p->exp);
+ unsigned int tendregno = END_REGNO (p->exp);
+ if (tendregno > regno && tregno < endregno)
+   remove_from_table (p, hash);
+   }
+   }
+}
+}
+
 /* Remove from the hash table, or mark as invalid, all expressions whose
values could be altered by storing in X.  X is a register, a subreg, or
a memory reference with nonvarying address (because, when a memory
@@ -1841,65 +1920,7 @@ invalidate (rtx x, machine_mode full_mode)
   switch (GET_CODE (x))
 {
 case REG:
-  {
-   /* If X is a register, dependencies on its contents are recorded
-  through the qty number mechanism.  Just change the qty number of
-  the register, mark it as invalid for expressions that refer to it,
-  and remove it itself.  */
-   unsigned int regno = REGNO (x);
-   unsigned int hash = HASH (x, GET_MODE (x));
-
-   /* Remove REGNO from any quantity list it might be on and indicate
-  th

[PATCH 6/7]: Remaining support for clobber high

2017-11-16 Thread Alan Hayward
This patch simply adds the remainder of clobber high checks.
Happy to split this into smaller patches if required (there didn't
seem to be anything obvious to split it into).


Alan.

2017-11-16  Alan Hayward  

* alias.c (record_set): Check for clobber high.
* cfgexpand.c (expand_gimple_stmt): Likewise.
* combine-stack-adj.c (single_set_for_csa): Likewise.
* combine.c (find_single_use_1): Likewise.
(set_nonzero_bits_and_sign_copies): Likewise.
(get_combine_src_dest): Likewise.
(is_parallel_of_n_reg_sets): Likewise.
(try_combine): Likewise.
(record_dead_and_set_regs_1): Likewise.
(reg_dead_at_p_1): Likewise.
(reg_dead_at_p): Likewise.
* dce.c (deletable_insn_p): Likewise.
(mark_nonreg_stores_1): Likewise.
(mark_nonreg_stores_2): Likewise.
* df-scan.c (df_find_hard_reg_defs): Likewise.
(df_uses_record): Likewise.
(df_get_call_refs): Likewise.
* dwarf2out.c (mem_loc_descriptor): Likewise.
* haifa-sched.c (haifa_classify_rtx): Likewise.
* ira-build.c (create_insn_allocnos): Likewise.
* ira-costs.c (scan_one_insn): Likewise.
* ira.c (equiv_init_movable_p): Likewise.
(rtx_moveable_p): Likewise.
(interesting_dest_for_shprep): Likewise.
* jump.c (mark_jump_label_1): Likewise.
* postreload-gcse.c (record_opr_changes): Likewise.
* postreload.c (reload_cse_simplify): Likewise.
(struct reg_use): Add source expr.
(reload_combine): Check for clobber high.
(reload_combine_note_use): Likewise.
(reload_cse_move2add): Likewise.
(move2add_note_store): Likewise.
* print-rtl.c (print_pattern): Likewise.
* recog.c (decode_asm_operands): Likewise.
(store_data_bypass_p): Likewise.
(if_test_bypass_p): Likewise.
* regcprop.c (kill_clobbered_value): Likewise.
(kill_set_value): Likewise.
* reginfo.c (reg_scan_mark_refs): Likewise.
* reload1.c (maybe_fix_stack_asms): Likewise.
(eliminate_regs_1): Likewise.
(elimination_effects): Likewise.
(mark_not_eliminable): Likewise.
(scan_paradoxical_subregs): Likewise.
(forget_old_reloads_1): Likewise.
* reorg.c (find_end_label): Likewise.
(try_merge_delay_insns): Likewise.
(redundant_insn): Likewise.
(own_thread_p): Likewise.
(fill_simple_delay_slots): Likewise.
(fill_slots_from_thread): Likewise.
(dbr_schedule): Likewise.
* resource.c (update_live_status): Likewise.
(mark_referenced_resources): Likewise.
(mark_set_resources): Likewise.
* rtl.c (copy_rtx): Likewise.
* rtlanal.c (reg_referenced_p): Likewise.
(single_set_2): Likewise.
(noop_move_p): Likewise.
(note_stores): Likewise.
* sched-deps.c (sched_analyze_reg): Likewise.
(sched_analyze_insn): Likewise.


diff --git a/gcc/alias.c b/gcc/alias.c
index 
c69ef410edac2ab0ab93e8ec9fe4c89a7078c001..6a6734bd7d5732c255c009be47e68aa073a9ebb1
 100644
--- a/gcc/alias.c
+++ b/gcc/alias.c
@@ -1554,6 +1554,17 @@ record_set (rtx dest, const_rtx set, void *data 
ATTRIBUTE_UNUSED)
  new_reg_base_value[regno] = 0;
  return;
}
+  /* A CLOBBER_HIGH only wipes out the old value if the mode of the old
+value is greater than that of the clobber.  */
+  else if (GET_CODE (set) == CLOBBER_HIGH)
+   {
+ if (new_reg_base_value[regno] != 0
+ && reg_is_clobbered_by_clobber_high (
+  regno, GET_MODE (new_reg_base_value[regno]), XEXP (set, 0)))
+   new_reg_base_value[regno] = 0;
+ return;
+   }
+
   src = SET_SRC (set);
 }
   else
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 
06a8af8a1663c9e518a8169650a0c9969990df1f..ea6fc265f757543cff635a805fd4045a10add23e
 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -3803,6 +3803,7 @@ expand_gimple_stmt (gimple *stmt)
  /* If we want exceptions for non-call insns, any
 may_trap_p instruction may throw.  */
  && GET_CODE (PATTERN (insn)) != CLOBBER
+ && GET_CODE (PATTERN (insn)) != CLOBBER_HIGH
  && GET_CODE (PATTERN (insn)) != USE
  && insn_could_throw_p (insn))
make_reg_eh_region_note (insn, 0, lp_nr);
diff --git a/gcc/combine-stack-adj.c b/gcc/combine-stack-adj.c
index 
09f0be814f98922b6926a929401894809a890f61..595e83c73760a97e0f8ebd99e12b1853d1d52b92
 100644
--- a/gcc/combine-stack-adj.c
+++ b/gcc/combine-stack-adj.c
@@ -133,6 +133,7 @@ single_set_for_csa (rtx_insn *insn)
  && SET_SRC (this_rtx) == SET_DEST (this_rtx))
;
   else if (GET_CODE (this_rtx) != CLOBBER
+  && GET_CODE (this_rtx) != CLOBBER_HIGH
   && GET_CODE (this_rtx)

[PATCH 4/7]: lra support for clobber_high

2017-11-16 Thread Alan Hayward
This patch simply adds the lra specific changes for clobber_high.

Alan.


2017-11-16  Alan Hayward  

* lra-eliminations.c (lra_eliminate_regs_1): Check for clobber high.
(mark_not_eliminable): Likewise.
* lra-int.h (struct lra_insn_reg): Add clobber high marker.
* lra-lives.c (process_bb_lives): Check for clobber high. 
* lra.c (new_insn_reg): Remember clobber highs.
(collect_non_operand_hard_regs): Check for clobber high.
(lra_set_insn_recog_data): Likewise.
(add_regs_to_insn_regno_info): Likewise.
(lra_update_insn_regno_info): Likewise.


diff --git a/gcc/lra-eliminations.c b/gcc/lra-eliminations.c
index 
bea8b023b7cb7a512f7482a2f014647c30462870..251e539530456722e3f4231b928c2124f9d602b6
 100644
--- a/gcc/lra-eliminations.c
+++ b/gcc/lra-eliminations.c
@@ -654,6 +654,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode 
mem_mode,
   return x;

 case CLOBBER:
+case CLOBBER_HIGH:
 case SET:
   gcc_unreachable ();

@@ -806,6 +807,16 @@ mark_not_eliminable (rtx x, machine_mode mem_mode)
setup_can_eliminate (ep, false);
   return;

+case CLOBBER_HIGH:
+  gcc_assert (REG_P (XEXP (x, 0)));
+  gcc_assert (REGNO (XEXP (x, 0)) < FIRST_PSEUDO_REGISTER);
+  for (ep = reg_eliminate;
+  ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
+  ep++)
+   if (reg_is_clobbered_by_clobber_high (ep->to_rtx, XEXP (x, 0)))
+ setup_can_eliminate (ep, false);
+  return;
+
 case SET:
   if (SET_DEST (x) == stack_pointer_rtx
  && GET_CODE (SET_SRC (x)) == PLUS
diff --git a/gcc/lra-int.h b/gcc/lra-int.h
index 
6c219eacee3940054ed480a44cda1ed07993ad16..be439b95e9cedb358e9ba6c63f8a9490af5c816d
 100644
--- a/gcc/lra-int.h
+++ b/gcc/lra-int.h
@@ -166,6 +166,8 @@ struct lra_insn_reg
   /* True if there is an early clobber alternative for this
  operand.  */
   unsigned int early_clobber : 1;
+  /* True if the reg is clobber highed by the operand.  */
+  unsigned int clobber_high : 1;
   /* The corresponding regno of the register.  */
   int regno;
   /* Next reg info of the same insn.  */
diff --git a/gcc/lra-lives.c b/gcc/lra-lives.c
index 
df7e2537dd09a4abd5ce517f4bb556cc32000fa0..82a1811f837b425a326753c0d956149382d752cb
 100644
--- a/gcc/lra-lives.c
+++ b/gcc/lra-lives.c
@@ -655,7 +655,7 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
   bool call_p;
   int n_alt, dst_regno, src_regno;
   rtx set;
-  struct lra_insn_reg *reg;
+  struct lra_insn_reg *reg, *hr;

   if (!NONDEBUG_INSN_P (curr_insn))
continue;
@@ -687,11 +687,12 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
break;
  }
  for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
-   if (reg->type != OP_IN)
+   if (reg->type != OP_IN && !reg->clobber_high)
  {
remove_p = false;
break;
  }
+
  if (remove_p && ! volatile_refs_p (PATTERN (curr_insn)))
{
  dst_regno = REGNO (SET_DEST (set));
@@ -809,14 +810,24 @@ process_bb_lives (basic_block bb, int &curr_point, bool 
dead_insn_p)
 unused values because they still conflict with quantities
 that are live at the time of the definition.  */
   for (reg = curr_id->regs; reg != NULL; reg = reg->next)
-   if (reg->type != OP_IN)
- {
-   need_curr_point_incr
- |= mark_regno_live (reg->regno, reg->biggest_mode,
- curr_point);
-   check_pseudos_live_through_calls (reg->regno,
- last_call_used_reg_set);
- }
+   {
+ if (reg->type != OP_IN)
+   {
+ need_curr_point_incr
+   |= mark_regno_live (reg->regno, reg->biggest_mode,
+   curr_point);
+ check_pseudos_live_through_calls (reg->regno,
+   last_call_used_reg_set);
+   }
+
+ if (reg->regno >= FIRST_PSEUDO_REGISTER)
+   for (hr = curr_static_id->hard_regs; hr != NULL; hr = hr->next)
+ if (hr->clobber_high
+ && may_gt (GET_MODE_SIZE (PSEUDO_REGNO_MODE (reg->regno)),
+GET_MODE_SIZE (hr->biggest_mode)))
+   SET_HARD_REG_BIT (lra_reg_info[reg->regno].conflict_hard_regs,
+ hr->regno);
+   }

   for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
if (reg->type != OP_IN)
diff --git a/gcc/lra.c b/gcc/lra.c
index 
8d44c75b0b4f89ff9fe94d0b8dfb2e77d43fee26..d6775d629655700484760ab85f78b9fd16189ca0
 100644
--- a/gcc/lra.c
+++ b/gcc/lra.c
@@ -535

[PATCH 3/7] Add func to check if register is clobbered by clobber_high

2017-11-16 Thread Alan Hayward
This patch adds the function reg_is_clobbered_by_clobber_high.
Given a CLOBBER_HIGH expression and a register, it checks if
the register will be clobbered.

A second version exists for the cases where the expressions are
not available.

The function will be used throughout the following patches.
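
As a rough usage sketch (the caller here is hypothetical; only the
reg_is_clobbered_by_clobber_high entry points come from this patch):

/* Hypothetical helper: given CLOBBER_HIGH_RTX, a (clobber_high (reg ...))
   expression, return true if the value currently held in hard register REG
   survives it, i.e. REG either does not overlap the clobbered register or
   its current mode fits within the preserved lower part.  */
static bool
value_survives_clobber_high_p (rtx reg, const_rtx clobber_high_rtx)
{
  return !reg_is_clobbered_by_clobber_high (reg, XEXP (clobber_high_rtx, 0));
}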

Alan.

2017-11-16  Alan Hayward  

* rtl.h (reg_is_clobbered_by_clobber_high): Add declarations.
* rtlanal.c (reg_is_clobbered_by_clobber_high): Add function.

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 
bdb05d00120e7fadeb7f2d29bd67afc7a77262c1..5a85eb42ea4455cf3a975b3adbdb9d0415441d3b
 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -3417,6 +3417,16 @@ extern bool tablejump_p (const rtx_insn *, rtx_insn **, 
rtx_jump_table_data **);
 extern int computed_jump_p (const rtx_insn *);
 extern bool tls_referenced_p (const_rtx);
 extern bool contains_mem_rtx_p (rtx x);
+extern bool reg_is_clobbered_by_clobber_high (unsigned int, machine_mode,
+ const_rtx);
+
+/* Convenient wrapper for reg_is_clobbered_by_clobber_high.  */
+inline bool
+reg_is_clobbered_by_clobber_high (const_rtx x, const_rtx clobber_high_op)
+{
+  return reg_is_clobbered_by_clobber_high (REGNO (x), GET_MODE (x),
+  clobber_high_op);
+}

 /* Overload for refers_to_regno_p for checking a single register.  */
 inline bool
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 
79a5ae197c14ba338240123f2fc912f2ea60e178..923e3314d25c05f9055907c61b4a24186701cc23
 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -6519,3 +6519,32 @@ tls_referenced_p (const_rtx x)
   return true;
   return false;
 }
+
+/* Return true if reg REGNO with mode REG_MODE would be clobbered by the
+   clobber_high operand in CLOBBER_HIGH_OP.  */
+
+bool
+reg_is_clobbered_by_clobber_high (unsigned int regno, machine_mode reg_mode,
+ const_rtx clobber_high_op)
+{
+  unsigned int clobber_regno = REGNO (clobber_high_op);
+  machine_mode clobber_mode = GET_MODE (clobber_high_op);
+  unsigned char regno_nregs = hard_regno_nregs (regno, reg_mode);
+
+  /* Clobber high should always span exactly one register.  */
+  gcc_assert (REG_NREGS (clobber_high_op) == 1);
+
+  /* Clobber high needs to match with one of the registers in X.  */
+  if (clobber_regno < regno || clobber_regno >= regno + regno_nregs)
+return false;
+
+  gcc_assert (reg_mode != BLKmode && clobber_mode != BLKmode);
+
+  if (reg_mode == VOIDmode)
+return clobber_mode != VOIDmode;
+
+  /* Clobber high will clobber if its size might be greater than the size of
+ register regno.  */
+  return may_gt (exact_div (GET_MODE_SIZE (reg_mode), regno_nregs),
+GET_MODE_SIZE (clobber_mode));
+}



[PATCH 2/7] Support >26 operands in generation code.

2017-11-16 Thread Alan Hayward
This patch adds support for CLOBBER_HIGH in the generation code.

aarch64 will require 31 clobber high expressions, plus two
clobbers.

The existing gen code restricts patterns to 26 vector operands by virtue
of using the characters [a-z]. This patch extends this to 52 by
supporting [a-zA-Z].
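
As a sketch of the idea (illustrative only, not the actual genextract.c
code, and the helper name is made up):

#include <stdlib.h>

/* Map a vector operand index to a single path character, allowing 52
   entries ([a-zA-Z]) rather than the previous 26 ([a-z]).  */
static char
path_char_for_index (int i)
{
  if (i < 26)
    return 'a' + i;
  if (i < 52)
    return 'A' + (i - 26);
  abort ();  /* More than 52 vector operands is not supported.  */
}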

Alan.

2017-11-16  Alan Hayward  

* emit-rtl.c (verify_rtx_sharing): Check CLOBBER_HIGH.
(copy_insn_1): Likewise.
(gen_hard_reg_clobber_high): Add function.
* genconfig.c (walk_insn_part): Check CLOBBER_HIGH.
* genemit.c (gen_exp): Likewise.
(gen_emit_seq): Pass thru info.
(gen_insn): Check CLOBBER_HIGH.
(gen_expand): Pass thru info.
(gen_split): Likewise.
(output_add_clobbers): Likewise.
* genextract.c (push_pathstr_operand): New function to
support [a-zA-Z].
(walk_rtx): Call push_pathstr_operand.
(print_path): Support [a-zA-Z].
* genrecog.c (validate_pattern): Check CLOBBER_HIGH.
(remove_clobbers): Likewise.
* rtl.h (gen_hard_reg_clobber_high): Add declaration.

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 
af4a038d75acf17c7b04ad58ab7467f7bd7cd129..64159a82e3f79b792d58166d4307db8751c88980
 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -2895,6 +2895,7 @@ verify_rtx_sharing (rtx orig, rtx insn)
   /* SCRATCH must be shared because they represent distinct values.  */
   return;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -3148,6 +3149,7 @@ repeat:
   /* SCRATCH must be shared because they represent distinct values.  */
   return;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -5707,6 +5709,7 @@ copy_insn_1 (rtx orig)
 case SIMPLE_RETURN:
   return orig;
 case CLOBBER:
+case CLOBBER_HIGH:
   /* Share clobbers of hard registers (like cc0), but do not share pseudo 
reg
  clobbers or clobbers of hard registers that originated as pseudos.
  This is needed to allow safe register renaming.  */
@@ -6529,6 +6532,19 @@ gen_hard_reg_clobber (machine_mode mode, unsigned int 
regno)
gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno)));
 }

+static GTY((deletable)) rtx
+hard_reg_clobbers_high[NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER];
+
+rtx
+gen_hard_reg_clobber_high (machine_mode mode, unsigned int regno)
+{
+  if (hard_reg_clobbers_high[mode][regno])
+return hard_reg_clobbers_high[mode][regno];
+  else
+return (hard_reg_clobbers_high[mode][regno]
+   = gen_rtx_CLOBBER_HIGH (VOIDmode, gen_rtx_REG (mode, regno)));
+}
+
 location_t prologue_location;
 location_t epilogue_location;

diff --git a/gcc/genconfig.c b/gcc/genconfig.c
index 
4ff36cb019d427f410d9f251777b9b05217fac36..4108e9c457fce5529ec9a3284d37f933736776ad
 100644
--- a/gcc/genconfig.c
+++ b/gcc/genconfig.c
@@ -72,6 +72,7 @@ walk_insn_part (rtx part, int recog_p, int non_pc_set_src)
   switch (code)
 {
 case CLOBBER:
+case CLOBBER_HIGH:
   clobbers_seen_this_insn++;
   break;

diff --git a/gcc/genemit.c b/gcc/genemit.c
index 
708da27221546c406030e88a4b07a51fb9df4a14..4e93b6b9831c65fc829fed6367881233b8eddcac
 100644
--- a/gcc/genemit.c
+++ b/gcc/genemit.c
@@ -79,7 +79,7 @@ gen_rtx_scratch (rtx x, enum rtx_code subroutine_type)
substituting any operand references appearing within.  */

 static void
-gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
+gen_exp (rtx x, enum rtx_code subroutine_type, char *used, md_rtx_info *info)
 {
   RTX_CODE code;
   int i;
@@ -123,7 +123,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
   for (i = 0; i < XVECLEN (x, 1); i++)
{
  printf (",\n\t\t");
- gen_exp (XVECEXP (x, 1, i), subroutine_type, used);
+ gen_exp (XVECEXP (x, 1, i), subroutine_type, used, info);
}
   printf (")");
   return;
@@ -137,7 +137,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
   for (i = 0; i < XVECLEN (x, 2); i++)
{
  printf (",\n\t\t");
- gen_exp (XVECEXP (x, 2, i), subroutine_type, used);
+ gen_exp (XVECEXP (x, 2, i), subroutine_type, used, info);
}
   printf (")");
   return;
@@ -163,12 +163,21 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
 case CLOBBER:
   if (REG_P (XEXP (x, 0)))
{
- printf ("gen_hard_reg_clobber (%smode, %i)", GET_MODE_NAME (GET_MODE 
(XEXP (x, 0))),
-REGNO (XEXP (x, 0

[PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

2017-11-16 Thread Alan Hayward
This is a set of patches aimed at supporting aarch64 SVE register
preservation around TLS calls.

Across a TLS call, Aarch64 SVE does not explicitly preserve the
SVE vector registers. However, the Neon vector registers are preserved.
Due to overlapping of registers, this means the lower 128bits of all
SVE vector registers will be preserved.

The existing GCC code will currently incorrectly assume preservation
of all of the SVE registers.

This patch introduces a CLOBBER_HIGH expression. This behaves a bit like
a CLOBBER expression. CLOBBER_HIGH can only refer to a single register.
The mode of the expression indicates the size of the lower bits which
will be preserved. If the register contains a value bigger than this
mode then the code will treat the register as clobbered.

This means that in order to evaluate whether a clobber high is relevant, we
need to ensure the mode of the existing value in a register is tracked.
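
As a rough sketch of the rule (illustrative plain C only; the real check is
added later in the series as reg_is_clobbered_by_clobber_high):

/* A value cached in a register survives a clobber_high that preserves
   PRESERVED_LOW_BITS only if the value itself is no wider than that.
   E.g. with (clobber_high (reg:TI Vn)), a 128-bit Neon value survives,
   while a wider SVE value must be treated as clobbered.  */
static int
value_survives_clobber_high (unsigned value_bits, unsigned preserved_low_bits)
{
  return value_bits <= preserved_low_bits;
}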

The following patches in this series add support for the CLOBBER_HIGH,
with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for
aarch64. The testing performed on these patches is also detailed in the
final patch.

These patches are based on top of the linaro-dev/sve branch.

A simpler alternative to this patch would be to assume all Neon and SVE
registers are clobbered across TLS calls, however this would be a
performance regression against all Aarch64 targets.

Alan.


2017-11-16  Alan Hayward  

* doc/rtl.texi (clobber_high): Add.
(parallel): Add in clobber high
* rtl.c (rtl_check_failed_code3): Add function.
* rtl.def (CLOBBER_HIGH): Add expression.
* rtl.h (RTL_CHECKC3): Add macro.
(rtl_check_failed_code3): Add declaration.
(XC3EXP): Add macro.


diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index 
f583940b9441b2111c8d65a00a064e89bdd2ffaf..951322258ddbb57900225bd501bd23a8a9970ead
 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -3209,6 +3209,18 @@ There is one other known use for clobbering a pseudo 
register in a
 clobbered by the insn.  In this case, using the same pseudo register in
 the clobber and elsewhere in the insn produces the expected results.

+@findex clobber_high
+@item (clobber_high @var{x})
+Represents the storing or possible storing of an unpredictable,
+undescribed value into the upper parts of @var{x}. The mode of the expression
+represents the lower parts of the register which will not be overwritten.
+@code{reg} must be a reg expression.
+
+One place this is used is when calling into functions where the registers are
+preserved, but only up to a given number of bits.  For example when using
+Aarch64 SVE, calling a TLS descriptor will cause only the lower 128 bits of
+each of the vector registers to be preserved.
+
 @findex use
 @item (use @var{x})
 Represents the use of the value of @var{x}.  It indicates that the
@@ -3262,7 +3274,8 @@ Represents several side effects performed in parallel.  
The square
 brackets stand for a vector; the operand of @code{parallel} is a
 vector of expressions.  @var{x0}, @var{x1} and so on are individual
 side effect expressions---expressions of code @code{set}, @code{call},
-@code{return}, @code{simple_return}, @code{clobber} or @code{use}.
+@code{return}, @code{simple_return}, @code{clobber} @code{use} or
+@code{clobber_high}.

 ``In parallel'' means that first all the values used in the individual
 side-effects are computed, and second all the actual side-effects are
diff --git a/gcc/rtl.c b/gcc/rtl.c
index 
3b2728be8b506fb3c14a20297cf92368caa5ca3b..6db84f99627bb8617c6e227892ca44076f4e729b
 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -860,6 +860,17 @@ rtl_check_failed_code2 (const_rtx r, enum rtx_code code1, 
enum rtx_code code2,
 }

 void
+rtl_check_failed_code3 (const_rtx r, enum rtx_code code1, enum rtx_code code2,
+   enum rtx_code code3, const char *file, int line,
+   const char *func)
+{
+  internal_error
+("RTL check: expected code '%s', '%s' or '%s', have '%s' in %s, at %s:%d",
+ GET_RTX_NAME (code1), GET_RTX_NAME (code2), GET_RTX_NAME (code3),
+ GET_RTX_NAME (GET_CODE (r)), func, trim_filename (file), line);
+}
+
+void
 rtl_check_failed_code_mode (const_rtx r, enum rtx_code code, machine_mode mode,
bool not_mode, const char *file, int line,
const char *func)
diff --git a/gcc/rtl.def b/gcc/rtl.def
index 
83bcfcaadcacc45cce352bf7fba33fbbc87ccd58..a6c4d4a46c4eb4f6cb0eca66a3f6a558f94acc8a
 100644
--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -312,6 +312,16 @@ DEF_RTL_EXPR(USE, "use", "e", RTX_EXTRA)
is considered undeletable before reload.  */
 DEF_RTL_EXPR(CLOBBER, "clobber", "e", RTX_EXTRA)

+/* Indicate that the upper parts of something are clobbered in a way that we
+   don't want to explain.  The MODE references the lower bits that will be
+   p

Re: [PATCH] Implement cond and induction cond reduction w/o REDUC_MAX_EXPR

2017-06-22 Thread Alan Hayward

> On 22 Jun 2017, at 12:54, Richard Biener  wrote:
> 
> On Thu, 22 Jun 2017, Alan Hayward wrote:
> 
>> 
>>> On 22 Jun 2017, at 09:52, Richard Biener  wrote:
>>> 
>>> On Thu, 22 Jun 2017, Richard Biener wrote:
>>> 
>>>> On Wed, 21 Jun 2017, Richard Biener wrote:
>>>> 
>>>>> 
>>>>> During my attempt to refactor reduction vectorization I ran across
>>>>> the special casing of inital values for INTEGER_INDUC_COND_REDUCTION
>>>>> and tried to see what it is about.  So I ended up implementing
>>>>> cond reduction support for targets w/o REDUC_MAX_EXPR by simply
>>>>> doing the reduction in scalar code -- while that results in an
>>>>> expensive epilogue the vector loop should be reasonably fast.
>>>>> 
>>>>> I still didn't run into any exec FAILs in vect.exp with removing
>>>>> the INTEGER_INDUC_COND_REDUCTION special case thus the following
>>>>> patch.
>>>>> 
>>>>> Alan -- is there a testcase (maybe full bootstrap & regtest will
>>>>> unconver one) that shows how this is necessary?
>>>> 
>>>> Testing has some fallout and I discovered what this special-case is
>>>> about.  I'll commit the below testcase to exercise the case.
>>> 
>>> After fixing the typo (sorry) the testcase also shows a latent
>>> wrong-code bug.  We use an induction vector starting at index 1
>>> for COND_REDUCTION to allow for a special value of zero but it
>>> seems the INTEGER_INDUC_COND_REDUCTION special-casing of the
>>> default value relies on the value of zero being special but
>>> in the testcase it is meaningful…
>>> 
>> 
>> Yes, looks like you’re right. Apologies.
>> 
>> 
>>> In fact, given that for INTEGER_INDUC_COND_REDUCTION the
>>> values, while based on an induction variable, do not have to
>>> match 1:1, we can modify the testcase to use last = 2*i; or
>>> last = i - 1;  Which means we have to add a second vector
>>> to specify whether a lane was set or not, basically use
>>> the COND_REDUCTION code?
>>> 
>> 
>> We could only allow loops that start from 1, but that would exclude the
>> most common uses.
>> 
>> I was wondering if we could special case “-1” as the default value index,
>> but that would break the max operation in the epilogue.
>> 
>> How about: in the loop, the vect cond could store index+offset instead of
>> index, where offset is the value required to bring the loop's initial
>> value to 1.
>> Then in the epilogue, the result is:
>> "max == 0 ? default : max - offset"
>> 
>> Using offset would also allow loops that start negative - 
>> for (int i = -2; i < N; i++)   would have offset set to 3.
>> 
>> is_nonwrapping_integer_induction() would have to be updated too.
> 
> What for
> 
>  last = -i;
> 
> then?  Or non-constant step?

is_nonwrapping_integer_induction currently rejects those,
and I wouldn’t suggest changing that.
It requires that the loop use positive increments with a fixed integer
base and step, and that it does not overflow.

If that fails, the standard COND_REDUCTION will kick in instead.

> 
> I guess we can look at the scalar evolution of the value
> stored and simply use a variable special value, like,
> for constant step, use CHREC_LEFT + (sign-of-CHREC_RIGHT * -1)
> if that value is < CHREC_LEFT (hmm, so only for positive
> constant CHREC_RIGHT).
> 
> The easiest fix is probably to use an extra induction variable
> and the same code as in the COND_REDUCTION case.
> 
> Or simply use a "flag" vector that is set to -1 when the condition
> is true, thus precompute the final != 0 vector compare value as IV.

Wouldn’t this result in similar code to the COND_REDUCTION case?

The INTEGER_INDUC_COND_REDUCTION cases will work fine with the
COND_REDUCTION code. Just need to disable the initial “condition
expression based on integer induction” handling in vectorizable_reduction.


> 
> So a quick fix would be to change is_nonwrapping_integer_induction
> to reject any CHREC that might become zero during the loop evaluation.
> But it will probably disable the most interesting cases...
> 
> Richard.
> 
>> 
>> Alan.
>> 
>> 
>>> Richard.
>>> 
>>>> Richard.
>>>> 
>>>> 2017-06-22  Richard Biener  
>>>> 
>>>>* gcc.dg/vect/pr65947-14.c: New testcase.
>>>> 
>>>> Index: gcc/testsuite/gcc.dg/vect/pr65947-14.c
>>

Re: [PATCH] Implement cond and induction cond reduction w/o REDUC_MAX_EXPR

2017-06-22 Thread Alan Hayward

> On 22 Jun 2017, at 09:52, Richard Biener  wrote:
> 
> On Thu, 22 Jun 2017, Richard Biener wrote:
> 
>> On Wed, 21 Jun 2017, Richard Biener wrote:
>> 
>>> 
>>> During my attempt to refactor reduction vectorization I ran across
>>> the special casing of initial values for INTEGER_INDUC_COND_REDUCTION
>>> and tried to see what it is about.  So I ended up implementing
>>> cond reduction support for targets w/o REDUC_MAX_EXPR by simply
>>> doing the reduction in scalar code -- while that results in an
>>> expensive epilogue the vector loop should be reasonably fast.
>>> 
>>> I still didn't run into any exec FAILs in vect.exp with removing
>>> the INTEGER_INDUC_COND_REDUCTION special case thus the following
>>> patch.
>>> 
>>> Alan -- is there a testcase (maybe full bootstrap & regtest will
>>> uncover one) that shows how this is necessary?
>> 
>> Testing has some fallout and I discovered what this special-case is
>> about.  I'll commit the below testcase to exercise the case.
> 
> After fixing the typo (sorry) the testcase also shows a latent
> wrong-code bug.  We use an induction vector starting at index 1
> for COND_REDUCTION to allow for a special value of zero but it
> seems the INTEGER_INDUC_COND_REDUCTION special-casing of the
> default value relies on the value of zero being special but
> in the testcase it is meaningful…
> 

Yes, looks like you’re right. Apologies.


> In fact, given that for INTEGER_INDUC_COND_REDUCTION the
> values, while based on an induction variable, do not have to
> match 1:1, we can modify the testcase to use last = 2*i; or
> last = i - 1;  Which means we have to add a second vector
> to specify whether a lane was set or not, basically use
> the COND_REDUCTION code?
> 

We could only allow loops that start from 1, but that would exclude the
most common uses.

I was wondering if we could special case “-1” as the default value index,
but that would break the max operation in the epilogue.

How about: in the loop, the vect cond could store index+offset instead of
index, where offset is the value required to bring the loop's initial
value to 1.
Then in the epilogue, the result is:
"max == 0 ? default : max - offset"

Using offset would also allow loops that start negative - 
for (int i = -2; i < N; i++)   would have offset set to 3.

is_nonwrapping_integer_induction() would have to be updated too.
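
For illustration, here is a scalar sketch of this index+offset scheme (not
GCC code; it assumes the simple "last = i" form of the reduction, and the
array size, default value and function name are made up):

#define N 16

int
condition_reduction (int *a /* N + 2 elements */, int min_v)
{
  int dflt = -42;        /* value returned when no iteration matches */
  int offset = 3;        /* chosen so the first stored index, -2 + 3, is 1 */
  int max = 0;           /* 0 is reserved to mean "no match" */

  for (int i = -2; i < N; i++)
    if (a[i + 2] > min_v)
      max = i + offset;  /* always >= 1; the vector code keeps per-lane maxima */

  /* Epilogue: "max == 0 ? default : max - offset".  */
  return max == 0 ? dflt : max - offset;
}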


Alan.


> Richard.
> 
>> Richard.
>> 
>> 2017-06-22  Richard Biener  
>> 
>>  * gcc.dg/vect/pr65947-14.c: New testcase.
>> 
>> Index: gcc/testsuite/gcc.dg/vect/pr65947-14.c
>> ===
>> --- gcc/testsuite/gcc.dg/vect/pr65947-14.c   (nonexistent)
>> +++ gcc/testsuite/gcc.dg/vect/pr65947-14.c   (working copy)
>> @@ -0,0 +1,44 @@
>> +/* { dg-require-effective-target vect_condition } */
>> +
>> +#include "tree-vect.h"
>> +
>> +extern void abort (void) __attribute__ ((noreturn));
>> +
>> +#define N 27
>> +
>> +/* Condition reduction with matches only in even lanes at runtime.  */
>> +
>> +int
>> +condition_reduction (int *a, int min_v)
>> +{
>> +  int last = N + 96;
>> +
>> +  for (int i = 0; i < N; i++)
>> +if (a[i] > min_v)
>> +  last = i;
>> +
>> +  return last;
>> +}
>> +
>> +int
>> +main (void)
>> +{
>> +  int a[N] = {
>> +  47, 12, 13, 14, 15, 16, 17, 18, 19, 20,
>> +  1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
>> +  21, 22, 23, 24, 25, 26, 27
>> +  };
>> +
>> +  check_vect ();
>> +
>> +  int ret = condition_reduction (a, 46);
>> +
>> +  /* loop should have found a value of 0, not default N + 96.  */
>> +  if (ret != 0)
>> +abort ();
>> +
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { xfail { ! 
>> vect_max_reduc } } } */
>> +/* { dg-final { scan-tree-dump-times "condition expression based on integer 
>> induction." 4 "vect" } } */
>> 
> 
> -- 
> Richard Biener 
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nuernberg)



Re: [PATCH] Fix PR81090, properly free niter estimates

2017-06-20 Thread Alan Hayward

> On 19 Jun 2017, at 13:35, Richard Biener  wrote:
> 
> On Mon, 19 Jun 2017, Christophe Lyon wrote:
> 
>> Hi Richard,
>> 
>> On 16 June 2017 at 14:18, Richard Biener  wrote:
>>> On Wed, 14 Jun 2017, Richard Biener wrote:
>>> 
 
 niter estimates are not kept up-to-date (they reference gimple stmts
 and trees) in the keep-loop-stuff infrastructure so similar to the
 SCEV cache we rely on people freeing it after passes.
 
 The following brings us a step closer to that by freeing them whenever
 SCEV is invalidated (we only compute them when SCEV is active) plus
 removing the odd record-bounds pass that just computes them, leaving
 scavenging to following passes.
 
 Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>>> 
>>> Some awkward interactions with peeling means I'm installing the
>>> following less aggressive variant.
>>> 
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>> 
>>> Richard.
>>> 
>>> 2017-06-16  Richard Biener  
>>> 
>>>PR tree-optimization/81090
>>>* passes.def (pass_record_bounds): Remove.
>>>* tree-pass.h (make_pass_record_bounds): Likewise.
>>>* tree-ssa-loop.c (pass_data_record_bounds, pass_record_bounds,
>>>make_pass_record_bounds): Likewise.
>>>* tree-ssa-loop-ivcanon.c (canonicalize_induction_variables): Do
>>>not free niter estimates at the beginning but at the end.
>>>* tree-scalar-evolution.c (scev_finalize): Free niter estimates.
>>> 
>>>* gcc.dg/graphite/pr81090.c: New testcase.
>>> 
>> 
>> Sorry to bother you again...
>> With this commit (r249249), I've noticed regressions on aarch64/arm:
>> FAIL:gcc.dg/vect/pr65947-9.c -flto -ffat-lto-objects
>> scan-tree-dump-not vect "LOOP VECTORIZED"
>> FAIL:gcc.dg/vect/pr65947-9.c scan-tree-dump-not vect "LOOP VECTORIZED"
> 
> So the testcase gets vectorized now (for whatever reason) and still passes
> execution.  Not sure why the testcase checked for not being vectorized.
> 
> Alan?
> 
> Richard.
> 

I’ve not looked at the new patch, but pr65947-9.c was added to test:

+ /* Condition reduction with maximum possible loop size.  Will fail to
+vectorize because the vectorisation requires a slot for default values.  */

So, in pr65947-9.c, if nothing passes the IF clause, then LAST needs to be
set to -72.


To vectorise pr65947-9.c the code will do: 

+  /* For cond reductions we want to create a new vector (INDEX_COND_EXPR)
+which is updated with the current index of the loop for every match of
+the original loop's cond_expr (VEC_STMT).  This results in a vector
+containing the last time the condition passed for that vector lane.
+The first match will be a 1 to allow 0 to be used for non-matching
+indexes.  If there are no matches at all then the vector will be all
+zeroes.  */

Which will then be used in the following way:

+  /* For condition reductions, we have a vector (NEW_PHI_RESULT) containing
+various data values where the condition matched and another vector
+(INDUCTION_INDEX) containing all the indexes of those matches.  We
+need to extract the last matching index (which will be the index with
+highest value) and use this to index into the data vector.
+For the case where there were no matches, the data vector will contain
+all default values and the index vector will be all zeros.  */


However, the INDUCTION_INDEX values will have a max size:

+  /* The additional index will be the same type as the condition.  Check
+that the loop can fit into this less one (because we'll use up the
+zero slot for when there are no matches).  */
+  tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type);
+  if (wi::geu_p (ni, wi::to_widest (max_index)))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"loop size is greater than data size.\n");
+ return false;
+   }
+}


Going back to pr65947-9.c,  the loop size is 255.

+ #define N 255

Given that the default value takes up value 0 of our index, it means the max
possible size of a value in INDUCTION_INDEX will be 256.

However, the type of INDUCTION_INDEX will match the type of A, which is char:

+char
+condition_reduction (char *a, char min_v)
+{
+  char last = -72;
+
+  for (int i = 0; i < N; i++)
+if (a[i] < min_v)
+  last = a[i];
+
+  return last;
+}

So, it’s not safe for the loop to be vectorised, and it should fail to do so
with the message "loop size is greater than data size”.
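
As a scalar model of the index scheme in the quoted comments (not GCC code;
the data handling is simplified): 0 is reserved for "no match", the first
match stores index 1, and the epilogue uses the highest stored index to pick
the data value.

#define N 255

char
condition_reduction (char *a, char min_v)
{
  char last = -72;           /* default if no iteration matches */
  unsigned int last_idx = 0; /* 0 means "no match"; a match at i stores i + 1 */

  for (int i = 0; i < N; i++)
    if (a[i] < min_v)
      last_idx = i + 1;      /* the vector code keeps the per-lane maximum */

  return last_idx == 0 ? last : a[last_idx - 1];
}

The vectorizer, however, has to hold last_idx in a char-sized index type, and
the quoted check requires the iteration count to fit in that type less one
(the zero slot being reserved); with N = 255 it does not, hence the
"loop size is greater than data size" message.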


(All the code snippets are taken from patch 
956442b598f3a7241119624edb9af0155a42bf6c
"Support for vectorizing conditional expressions” on 2015-10-23,
It’s probable some of the code may have changed since then.)


Hope that makes sense,

Alan.



>> Christophe
>> 
>>> Index: gcc/passes.def
>>> =

Re: [PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs

2016-08-17 Thread Alan Hayward


On 16/08/2016 10:01, "Alan Hayward"  wrote:

>
>
>On 16/08/2016 09:33, "Richard Biener"  wrote:
>
>>On Mon, Aug 15, 2016 at 4:16 PM, Alan Hayward 
>>wrote:
>>>
>>>
>>> On 15/08/2016 12:17, "Richard Biener" 
>>>wrote:
>>>
>>>>On Mon, Aug 15, 2016 at 11:48 AM, Alan Hayward 
>>>>wrote:
>>>>> The testcase pr71752.c was failing because the SLP code was
>>>>>generating
>>>>>an
>>>>> SLP
>>>>> vector using arguments from the SLP scalar stmts, but was using the
>>>>>wrong
>>>>> argument number.
>>>>>
>>>>> vect_get_slp_defs() is given a vector of operands. When calling down
>>>>>to
>>>>> vect_get_constant_vectors it uses i as op_num - making the assumption
>>>>>that
>>>>> the
>>>>> first op in the vector refers to the first argument in the SLP scalar
>>>>> statement, the second op refers to the second arg and so on.
>>>>>
>>>>> However, previously in vectorizable_reduction, the call to
>>>>> vect_get_vec_defs
>>>>> destroyed this ordering by potentially only passing op1.
>>>>>
>>>>> The solution is in vectorizable_reduction to create a vector of
>>>>>operands
>>>>> equal
>>>>> in size to the number of arguments in the SLP statements. We maintain
>>>>>the
>>>>> argument ordering and if we don't require defs for that argument we
>>>>>instead
>>>>> push NULL into the vector. In vect_get_slp_defs we need to handle
>>>>>cases
>>>>> where
>>>>> an op might be NULL.
>>>>>
>>>>> Tested with a check run on X86 and AArch64.
>>>>> Ok to commit?
>>>>>
>>>>
>>>>Ugh.  Note the logic in vect_get_slp_defs is incredibly fragile - I
>>>>think you can't
>>>>simply "skip" ops the way you do as you need to still increment
>>>>child_index
>>>>accordingly for ignored ops.
>>>
>>> Agreed, I should be maintaining the child_index.
>>> Looking at the code, I need a valid oprnd for that code to work.
>>>
>>> Given that the size of the ops vector is never more than 3, it probably
>>> makes
>>> sense to reset child_index each time and do a quick for loop through
>>>all
>>> the
>>> children.
>>
>>Hmm, yes - that should work.  It should also remove the need to keep
>>operand order in sync?  So it might no longer require the
>>vectorizable_reduction change.
>
>We still need to pass the correct index down to
>vect_get_constant_vectors(),
>which requires the operands in the original order for i to be correct.
>(We can’t use child_index for that).
>
>Another option would be to pass down an additional vector containing the
>indexes
>of the operands. But I think that’d be an even worse option.
>

Updated with a loop for calculating child_index.
Tested with a check run for x86 and aarch64.


diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c
b/gcc/testsuite/gcc.dg/vect/pr71752.c
new file mode 100644
index 
..8d26754b4fedf8b104caae8742a445dff
bf23f0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71752.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+
+unsigned int q4, yg;
+
+unsigned int
+w6 (unsigned int z5, unsigned int jv)
+{
+  unsigned int *f2 = &jv;
+
+  while (*f2 < 21)
+{
+  q4 -= jv;
+  z5 -= jv;
+  f2 = &yg;
+  ++(*f2);
+}
+  return z5;
+}
+
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a85
9ecd9b0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5364,7 +5364,7 @@ vectorizable_reduction (gimple *stmt,
gimple_stmt_iterator *gsi,
   auto_vec<tree> vect_defs;
   auto_vec<gimple *> phis;
   int vec_num;
-  tree def0, def1, tem, op0, op1 = NULL_TREE;
+  tree def0, def1, tem, op1 = NULL_TREE;
   bool first_p = true;
   tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE;
   gimple *cond_expr_induction_def_stmt = NULL;
@@ -5964,29 +5964,36 @@ vectorizable_reduction (gimple *stmt,
gimple_stmt_iterator *gsi,
   /* Handle uses.  */
   if (j == 0)
 {
-  op0 = ops[!reduc_index];
-  if (op_type == ternary_op)
-{
-  if (reduc_index == 0)
-op1 = ops[2];
-  else
-op1 = ops[1];
-}
+ if

Re: [PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs

2016-08-15 Thread Alan Hayward


On 15/08/2016 12:17, "Richard Biener"  wrote:

>On Mon, Aug 15, 2016 at 11:48 AM, Alan Hayward 
>wrote:
>> The testcase pr71752.c was failing because the SLP code was generating
>>an
>> SLP
>> vector using arguments from the SLP scalar stmts, but was using the
>>wrong
>> argument number.
>>
>> vect_get_slp_defs() is given a vector of operands. When calling down to
>> vect_get_constant_vectors it uses i as op_num - making the assumption
>>that
>> the
>> first op in the vector refers to the first argument in the SLP scalar
>> statement, the second op refers to the second arg and so on.
>>
>> However, previously in vectorizable_reduction, the call to
>> vect_get_vec_defs
>> destroyed this ordering by potentially only passing op1.
>>
>> The solution is in vectorizable_reduction to create a vector of operands
>> equal
>> in size to the number of arguments in the SLP statements. We maintain
>>the
>> argument ordering and if we don't require defs for that argument we
>>instead
>> push NULL into the vector. In vect_get_slp_defs we need to handle cases
>> where
>> an op might be NULL.
>>
>> Tested with a check run on X86 and AArch64.
>> Ok to commit?
>>
>
>Ugh.  Note the logic in vect_get_slp_defs is incredibly fragile - I
>think you can't
>simply "skip" ops the way you do as you need to still increment
>child_index
>accordingly for ignored ops.

Agreed, I should be maintaining the child_index.
Looking at the code, I need a valid oprnd for that code to work.

Given that the size of the ops vector is never more than 3, it probably
makes
sense to reset child_index each time and do a quick for loop through all
the
children.

>
>Why not let the function compute defs for all ops?  That said, the
>vectorizable_reduction
>part certainly is fixing a bug (I think I've seen similar issues
>elsewhere though).

If you let it compute defs for all ops then you can end up creating invalid
initial value SLP vectors which cause SSA failures (even though nothing
uses those
vectors).

In the test case, SLP is defined for _3. Op1 of this is q4_lsm.5_8, but
that is
a loop phi. Creating SLP vec refs for it results in vect_cst__67 and
vect_cst__68, which are clearly invalid:



:
  _58 = yg_lsm.7_23 + 1;
  _59 = _58 + 1;
  _60 = _58 + 2;
  vect_cst__61 = {yg_lsm.7_23, _58, _59, _60};
  vect_cst__62 = { 4, 4, 4, 4 };
  vect_cst__67 = {q4_lsm.5_8, z5_19, q4_lsm.5_8, z5_19};
  vect_cst__68 = {q4_lsm.5_8, z5_19, q4_lsm.5_8, z5_19};
  vect_cst__69 = {jv_9(D), jv_9(D), jv_9(D), jv_9(D)};
  vect_cst__70 = {jv_9(D), jv_9(D), jv_9(D), jv_9(D)};
  vect_cst__73 = {0, 0, 0, 0};
  vect_cst__74 = {q4_lsm.5_16, z5_10(D), 0, 0};
  vect_cst__91 = { 1, 1, 1, 1 };

  :
  # z5_19 = PHI 
  # q4_lsm.5_8 = PHI 
  # yg_lsm.7_17 = PHI 
  # vect_vec_iv_.14_63 = PHI 
  # vect__3.15_65 = PHI 
  # vect__3.15_66 = PHI 
  # ivtmp_94 = PHI <0(4), ivtmp_95(10)>
  vect_vec_iv_.14_64 = vect_vec_iv_.14_63 + vect_cst__62;
  _3 = q4_lsm.5_8 - jv_9(D);
  vect__3.15_71 = vect__3.15_65 - vect_cst__70;
  vect__3.15_72 = vect__3.15_66 - vect_cst__69;
  z5_13 = z5_19 - jv_9(D);
  vect__5.17_92 = vect_vec_iv_.14_63 + vect_cst__91;
  _5 = yg_lsm.7_17 + 1;
  ivtmp_95 = ivtmp_94 + 1;
  if (ivtmp_95 < bnd.11_18)
goto ;
  else
goto ;




>
>Richard.
>
>> Changelog:
>>
>> gcc/
>> * tree-vect-loop.c (vectorizable_reduction): Keep SLP operand
>>ordering.
>> * tree-vect-slp.c (vect_get_slp_defs): Handle null operands.
>>
>> gcc/testsuite/
>> * gcc.dg/vect/pr71752.c: New.
>>
>>
>>
>> Thanks,
>> Alan.
>>
>>
>>
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c
>> b/gcc/testsuite/gcc.dg/vect/pr71752.c
>> new file mode 100644
>> index
>> 
>>..8d26754b4fedf8b104caae8742a445d
>>ff
>> bf23f0a
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr71752.c
>> @@ -0,0 +1,19 @@
>> +/* { dg-do compile } */
>> +
>> +unsigned int q4, yg;
>> +
>> +unsigned int
>> +w6 (unsigned int z5, unsigned int jv)
>> +{
>> +  unsigned int *f2 = &jv;
>> +
>> +  while (*f2 < 21)
>> +{
>> +  q4 -= jv;
>> +  z5 -= jv;
>> +  f2 = &yg;
>> +  ++(*f2);
>> +}
>> +  return z5;
>> +}
>> +
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index
>> 
>>2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a
>>85
>> 9ecd9b0 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tr

[PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs

2016-08-15 Thread Alan Hayward
The testcase pr71752.c was failing because the SLP code was generating an
SLP vector using arguments from the SLP scalar stmts, but was using the wrong
argument number.

vect_get_slp_defs() is given a vector of operands. When calling down to
vect_get_constant_vectors it uses i as op_num - making the assumption that
the first op in the vector refers to the first argument in the SLP scalar
statement, the second op refers to the second arg and so on.

However, previously in vectorizable_reduction, the call to vect_get_vec_defs
destroyed this ordering by potentially only passing op1.

The solution is in vectorizable_reduction to create a vector of operands
equal in size to the number of arguments in the SLP statements. We maintain
the argument ordering and if we don't require defs for that argument we
instead push NULL into the vector. In vect_get_slp_defs we need to handle
cases where an op might be NULL.
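
To make the positional assumption concrete, here is a small mock (not the
real GCC API; the names and two-operand statement shape are invented) of why
a skipped operand needs a NULL placeholder rather than being dropped:

#include <stddef.h>
#include <stdio.h>

struct stmt { const char *arg[2]; };   /* mock scalar stmt: arg[0] OP arg[1] */

/* Mock of the assumption made in vect_get_slp_defs: ops[k] corresponds to
   argument k of every scalar stmt in the node.  */
static void
get_defs (const char *ops[], size_t nops, struct stmt *s)
{
  for (size_t k = 0; k < nops; k++)
    {
      if (ops[k] == NULL)
        continue;                      /* placeholder: no defs wanted here */
      printf ("arg %zu: op %s, stmt uses %s\n", k, ops[k], s->arg[k]);
    }
}

int
main (void)
{
  struct stmt s = { { "arg0", "arg1" } };
  const char *wrong[] = { "arg1" };        /* arg1 slides into slot 0 */
  const char *right[] = { NULL, "arg1" };  /* ordering preserved */
  get_defs (wrong, 1, &s);
  get_defs (right, 2, &s);
  return 0;
}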

Tested with a check run on X86 and AArch64.
Ok to commit?


Changelog:

gcc/
* tree-vect-loop.c (vectorizable_reduction): Keep SLP operand ordering.
* tree-vect-slp.c (vect_get_slp_defs): Handle null operands.

gcc/testsuite/
* gcc.dg/vect/pr71752.c: New.



Thanks,
Alan.




diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c
b/gcc/testsuite/gcc.dg/vect/pr71752.c
new file mode 100644
index 
..8d26754b4fedf8b104caae8742a445dff
bf23f0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71752.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+
+unsigned int q4, yg;
+
+unsigned int
+w6 (unsigned int z5, unsigned int jv)
+{
+  unsigned int *f2 = &jv;
+
+  while (*f2 < 21)
+{
+  q4 -= jv;
+  z5 -= jv;
+  f2 = &yg;
+  ++(*f2);
+}
+  return z5;
+}
+
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a85
9ecd9b0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5364,7 +5364,7 @@ vectorizable_reduction (gimple *stmt,
gimple_stmt_iterator *gsi,
   auto_vec<tree> vect_defs;
   auto_vec<gimple *> phis;
   int vec_num;
-  tree def0, def1, tem, op0, op1 = NULL_TREE;
+  tree def0, def1, tem, op1 = NULL_TREE;
   bool first_p = true;
   tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE;
   gimple *cond_expr_induction_def_stmt = NULL;
@@ -5964,29 +5964,36 @@ vectorizable_reduction (gimple *stmt,
gimple_stmt_iterator *gsi,
   /* Handle uses.  */
   if (j == 0)
 {
-  op0 = ops[!reduc_index];
-  if (op_type == ternary_op)
-{
-  if (reduc_index == 0)
-op1 = ops[2];
-  else
-op1 = ops[1];
-}
+ if (slp_node)
+   {
+ /* Get vec defs for all the operands except the reduction index,
+   ensuring the ordering of the ops in the vector is kept.  */
+ auto_vec<tree, 3> slp_ops;
+ auto_vec<vec<tree>, 3> vec_defs;

-  if (slp_node)
-vect_get_vec_defs (op0, op1, stmt, &vec_oprnds0, &vec_oprnds1,
-   slp_node, -1);
+ slp_ops.quick_push ((reduc_index == 0) ? NULL : ops[0]);
+ slp_ops.quick_push ((reduc_index == 1) ? NULL : ops[1]);
+ if (op_type == ternary_op)
+   slp_ops.quick_push ((reduc_index == 2) ? NULL : ops[2]);
+
+ vect_get_slp_defs (slp_ops, slp_node, &vec_defs, -1);
+
+ vec_oprnds0.safe_splice (vec_defs[(reduc_index == 0) ? 1 : 0]);
+ if (op_type == ternary_op)
+   vec_oprnds1.safe_splice (vec_defs[(reduc_index == 2) ? 1 : 2]);
+   }
   else
-{
+   {
   loop_vec_def0 = vect_get_vec_def_for_operand
(ops[!reduc_index],
 stmt);
   vec_oprnds0.quick_push (loop_vec_def0);
   if (op_type == ternary_op)
{
+op1 = (reduc_index == 0) ? ops[2] : ops[1];
  loop_vec_def1 = vect_get_vec_def_for_operand (op1, stmt);
  vec_oprnds1.quick_push (loop_vec_def1);
}
-}
+   }
 }
   else
 {
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 
fb325d54f1084461d44cd54a98e5b7f99541a188..7c480d59c823b5258255c8be047f050c8
3cc91fd 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -3200,10 +3200,19 @@ vect_get_slp_defs (vec ops, slp_tree
slp_node,
   vec<tree> vec_defs;
   tree oprnd;
   bool vectorized_defs;
+  bool first_iteration = true;

   first_stmt = SLP_TREE_SCALAR_STMTS (slp_node)[0];
   FOR_EACH_VEC_ELT (ops, i, oprnd)
 {
+  if (oprnd == NULL)
+   {
+ vec_defs = vNULL;
+ vec_defs.create (0);
+ vec_oprnds->quick_push (vec_defs);
+ continue;
+   }
+
   /* For each operand we check if it has vectorized definitions in a
child
 node or we need to create them (for

Re: PATCH: PR71818: Don't advance IVs with a variable step

2016-08-01 Thread Alan Hayward


On 01/08/2016 14:49, "Richard Biener"  wrote:

>On Mon, Aug 1, 2016 at 11:24 AM, Alan Hayward 
>wrote:
>> In the given test case, the loop is split into vectorised and non
>> vectorised
>> versions due to peeling. At the end of the loop the IVs are incremented
>>to
>> their latest value. This is achieved by taking the base of the loop
>>(g_21)
>> and
>> adding the iterations (240) multiplied by the step (_6):
>>
>>   :
>>   # _106 = PHI <_6(12)>
>>   _84 = _106 * 240;
>>   _85 = (char) _84;
>>   tmp.19_83 = g_21(D) + _85;
>>
>> However, the step (_6) varies within the loop and therefore the
>> calculation is
>> incorrect.
>>
>> This patch fixes the error by disallowing vectorization if the step of
>>the
>> IV
>> is not an invariant within the loop.
>>
>> Also added debug comment for when the optimisation fails due to chrec.
>>
>> Tested on x86.
>>
>> Ok to commit?
>
>Ok.
>
>To fix this we'd have to vectorize the induction variable itself, correct?


Yes, then extract the final value with BIT_FIELD_REF (like
vectorizable_live_operation).


Alan.




PATCH: PR71818: Don't advance IVs with a variable step

2016-08-01 Thread Alan Hayward
In the given test case, the loop is split into vectorised and non-vectorised
versions due to peeling. At the end of the loop the IVs are incremented to
their latest value. This is achieved by taking the base of the loop (g_21)
and adding the iterations (240) multiplied by the step (_6):

  :
  # _106 = PHI <_6(12)>
  _84 = _106 * 240;
  _85 = (char) _84;
  tmp.19_83 = g_21(D) + _85;

However, the step (_6) varies within the loop and therefore the calculation
is incorrect.
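
A scalar illustration of why that closed form is only valid for an invariant
step (not GCC code; the numbers are arbitrary):

#include <stdio.h>

int
main (void)
{
  int base = 0, niters = 240;

  /* Invariant step: advancing the IV as base + niters * step matches the
     value the loop itself would produce.  */
  int step = 3, iv = base;
  for (int i = 0; i < niters; i++)
    iv += step;
  printf ("invariant: loop %d, closed form %d\n", iv, base + niters * step);

  /* Step recomputed inside the loop: using the final value of the step in
     the same closed form no longer matches.  */
  int vstep = 0;
  iv = base;
  for (int i = 0; i < niters; i++)
    {
      vstep = i & 7;                  /* step varies within the loop */
      iv += vstep;
    }
  printf ("variable:  loop %d, closed form %d\n", iv, base + niters * vstep);
  return 0;
}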

This patch fixes the error by disallowing vectorization if the step of the
IV is not an invariant within the loop.

Also added debug comment for when the optimisation fails due to chrec.

Tested on x86.

Ok to commit?

Alan.


gcc/
PR tree-optimization/71818
* tree-vect-loop-manip.c (vect_can_advance_ivs_p): Don't advance IVs
with non-invariant evolutions.

testsuite/
PR tree-optimization/71818
* gcc.dg/vect/pr71818.c: New





diff --git a/gcc/testsuite/gcc.dg/vect/pr71818.c
b/gcc/testsuite/gcc.dg/vect/pr71818.c
new file mode 100644
index 
..2946551f8bb8c552565c2e79b16359ca3
9d13ed6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71818.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+
+char a;
+short b;
+int c, d;
+void fn1() {
+  char e = 75, g;
+  unsigned char *f = &e;
+  a = 21;
+  for (; a <= 48; a++) {
+for (; e <= 6;)
+  ;
+g -= e -= b || g <= c;
+  }
+  d = *f;
+}
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 
819abcda81a25c4ed25749c29b357110fca647d2..4d68f7143e1117085aae8d2168ed1425e
7e6aa08 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "tree-ssa-loop-ivopts.h"

 /*
   Simple Loop Peeling Utilities
@@ -1592,10 +1593,26 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
 }

   /* FORNOW: We do not transform initial conditions of IVs
+which evolution functions are not invariants in the loop.  */
+
+  if (!expr_invariant_in_loop_p (loop, evolution_part))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"evolution not invariant in loop.\n");
+ return false;
+   }
+
+  /* FORNOW: We do not transform initial conditions of IVs
 which evolution functions are a polynomial of degree >= 2.  */

   if (tree_is_chrec (evolution_part))
-   return false;
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"evolution is chrec.\n");
+ return false;
+   }
 }

   return true;







PING Re:[PATCH] PR 71667 - Handle live operations with DEBUG uses

2016-07-07 Thread Alan Hayward
Ping.


From: Alan Hayward 
To: "gcc-patches at gcc dot gnu dot org" 
Date: Wed, 29 Jun 2016 08:49:34 +0100
Subject: [PATCH] PR 71667 - Handle live operations with DEBUG uses
Authentication-results: sourceware.org; auth=none

In vectorizable_live_operation() we always assume uses of a live operation
will be PHIs. However, when using -g a use of a live operation might be a
DEBUG stmt.

This patch avoids adding any DEBUG statements to the worklist in
vectorizable_live_operation(). It also fixes a comment.

Tested on x86 and aarch64.
Ok to commit?

gcc/
PR tree-optimization/71667
* tree-vect-loop.c (vectorizable_live_operation): Ignore DEBUG stmts.

testsuite/gcc.dg/vect
PR tree-optimization/71667
* pr71667.c: New



diff --git a/gcc/testsuite/gcc.dg/vect/pr71667.c
b/gcc/testsuite/gcc.dg/vect/pr71667.c
new file mode 100644
index 
..e7012efa882a5497b0a6099c3d853f9eb
375cc53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71667.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-g" } */
+
+unsigned int mu;
+int pt;
+
+void
+qf (void)
+{
+  int gy;
+  long int vz;
+
+  for (;;)
+{
+  for (gy = 0; gy < 80; ++gy)
+  {
+   vz = mu;
+   ++mu;
+   pt = (vz != 0) && (pt != 0);
+  }
+  while (gy < 81)
+   while (gy < 83)
+ {
+   vz = (vz != 0) ? 0 : mu;
+   ++gy;
+ }
+  pt = vz;
+  ++mu;
+}
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
6c0337bbbcbebd6443fd3bcef45c1b23a7833486..2980a1b031cd3b919369b5e31dff7e066
5bc7578 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6352,11 +6352,12 @@ vectorizable_live_operation (gimple *stmt,
: gimple_get_lhs (stmt);
   lhs_type = TREE_TYPE (lhs);

-  /* Find all uses of STMT outside the loop - there should be exactly
one.  */
+  /* Find all uses of STMT outside the loop - there should be at least
one.  */
   auto_vec<gimple *> worklist;
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
-if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
-   worklist.safe_push (use_stmt);
+if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt))
+   && !is_gimple_debug (use_stmt))
+  worklist.safe_push (use_stmt);
   gcc_assert (worklist.length () >= 1);

   bitsize = TYPE_SIZE (TREE_TYPE (vectype));




Re: [PATCH], Pr 71667, Fix PowerPC ISA 3.0 DImode Altivec load/store

2016-06-29 Thread Alan Hayward
Hi,

This patch references the wrong bug. It should reference 71677 and not
71667
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71677

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71667


Thanks,
Alan.



As we discussed in the patch review, there were some issues with using %Y
for
the ISA 3.0 instructions LXSD and STXSD.

I have rewritten the patch so that we have a new memory constraint (%wY)
that
explicitly targets those instructions.  I went back to the mov{DF,DD}
patterns
and changed their use of %o to %wY.

I removed the test that was generated from the XalanNamespacesStack.cpp
source
that showed up the problem.

I have bootstrapped this on a little endian power8 system and there were no
regressions in the test suite.  I also built Spec 2006 for power9 with this
compiler, and the xalancbmk benchmark now builds.  I will kick off a big
endian
build on a power7 system.

Assuming there are no regressions in power7, are these patches ok to
install in
the trunk, and backport to GCC 6.2 after a burn-in period?

2016-06-28  Michael Meissner  

PR target/71667
* config/rs6000/constraints.md (wY constraint): New constraint to
match the requirements for the LXSD and STXSD instructions.
* config/rs6000/predicates.md (offsettable_mem_14bit_operand): New
predicate to match the requirements for the LXSD and STXSD
instructions.
* config/rs6000/rs6000.md (mov_hardfloat32, FMOVE64 case):
Use constraint wY for LXSD/STXSD instructions instead of 'o' or 'Y'
to make sure that the bottom 2 bits of offset are 0, the address
form is offsettable, and no updating is done in the address mode.
(mov_hardfloat64, FMOVE64 case): Likewise.
(movdi_internal32): Likewise
(movdi_internal64): Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797






[PATCH] PR 71667 - Handle live operations with DEBUG uses

2016-06-29 Thread Alan Hayward
In vectorizable_live_operation() we always assume uses of a live operation
will be PHIs. However, when using -g a use of a live operation might be a
DEBUG stmt.

This patch avoids adding any DEBUG statements to the worklist in
vectorizable_live_operation(). It also fixes a comment.

Tested on x86 and aarch64.
Ok to commit?

gcc/
PR tree-optimization/71667
* tree-vect-loop.c (vectorizable_live_operation): Ignore DEBUG stmts.

testsuite/gcc.dg/vect
PR tree-optimization/71667
* pr71667.c: New



diff --git a/gcc/testsuite/gcc.dg/vect/pr71667.c
b/gcc/testsuite/gcc.dg/vect/pr71667.c
new file mode 100644
index 
..e7012efa882a5497b0a6099c3d853f9eb
375cc53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71667.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-g" } */
+
+unsigned int mu;
+int pt;
+
+void
+qf (void)
+{
+  int gy;
+  long int vz;
+
+  for (;;)
+{
+  for (gy = 0; gy < 80; ++gy)
+  {
+   vz = mu;
+   ++mu;
+   pt = (vz != 0) && (pt != 0);
+  }
+  while (gy < 81)
+   while (gy < 83)
+ {
+   vz = (vz != 0) ? 0 : mu;
+   ++gy;
+ }
+  pt = vz;
+  ++mu;
+}
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
6c0337bbbcbebd6443fd3bcef45c1b23a7833486..2980a1b031cd3b919369b5e31dff7e066
5bc7578 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6352,11 +6352,12 @@ vectorizable_live_operation (gimple *stmt,
: gimple_get_lhs (stmt);
   lhs_type = TREE_TYPE (lhs);

-  /* Find all uses of STMT outside the loop - there should be exactly
one.  */
+  /* Find all uses of STMT outside the loop - there should be at least
one.  */
   auto_vec<gimple *> worklist;
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
-if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
-   worklist.safe_push (use_stmt);
+if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt))
+   && !is_gimple_debug (use_stmt))
+  worklist.safe_push (use_stmt);
   gcc_assert (worklist.length () >= 1);

   bitsize = TYPE_SIZE (TREE_TYPE (vectype));






[PATCH] PR 71439 - Only vectorize live PHIs that are inductions

2016-06-15 Thread Alan Hayward
For a PHI to be used outside the loop it needs to be vectorized. However,
the vectorizer currently will only vectorize PHIs that are inductions.

This patch fixes PR 71439 by only allowing a live PHI to be vectorized if
it
is an induction. In addition, live PHIs need to pass a
vectorizable_live_operation check.

Tested on x86 and aarch64.
Ok to commit?

Alan.

gcc/
PR tree-optimization/71439
* tree-vect-loop.c (vect_analyze_loop_operations): Additional check for
live PHIs.

testsuite/
PR tree-optimization/71439
* gcc.dg/vect/pr71439.c: New



diff --git a/gcc/testsuite/gcc.dg/vect/pr71439.c
b/gcc/testsuite/gcc.dg/vect/pr71439.c
new file mode 100644
index 
..95e4763bad6e9f301d53c20ffa160b96b
dad9a53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71439.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+int a, b, c;
+short fn1(int p1, int p2) { return p1 + p2; }
+
+int main() {
+  a = 0;
+  for (; a < 30; a = fn1(a, 4)) {
+c = b;
+b = 6;
+  }
+
+  if (c != 6)
+abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
1231b95f6a71337833e8c4b24884da9f96a7b5bf..90ade75bcd212b542ad680877a79df717
751ff4b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1669,7 +1669,8 @@ vect_analyze_loop_operations (loop_vec_info
loop_vinfo)

   gcc_assert (stmt_info);

-  if (STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
+  if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
+   || STMT_VINFO_LIVE_P (stmt_info))
   && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
 {
   /* A scalar-dependence cycle that we don't support.  */
@@ -1686,6 +1687,9 @@ vect_analyze_loop_operations (loop_vec_info
loop_vinfo)
 ok = vectorizable_induction (phi, NULL, NULL);
 }

+ if (ok && STMT_VINFO_LIVE_P (stmt_info))
+   ok = vectorizable_live_operation (phi, NULL, NULL, -1, NULL);
+
   if (!ok)
 {
   if (dump_enabled_p ())






[PATCH] PR 71483 - Fix live SLP operations

2016-06-14 Thread Alan Hayward
In the given testcase, g++ splits a live operation into two scalar
statements
and four vector statements.

_5 = _4 >> 2;
  _7 = (short int) _5;

is turned into:

vect__5.32_80 = vect__4.31_76 >> 2;
  vect__5.32_81 = vect__4.31_77 >> 2;
  vect__5.32_82 = vect__4.31_78 >> 2;
  vect__5.32_83 = vect__4.31_79 >> 2;
  vect__7.33_86 = VEC_PACK_TRUNC_EXPR ;
  vect__7.33_87 = VEC_PACK_TRUNC_EXPR ;

_5 is then accessed outside the loop.

This patch ensures that vectorizable_live_operation picks the correct
scalar
statement.
I removed the "three possibilites" comment because it was no longer accurate
(it's also possible to have more vector statements than scalar statements)
and the calculation is now much simpler.
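
As an illustration of the new calculation (the formulas are the ones added
in the patch below; the concrete nunits/num_vec/num_scalar values here are
hypothetical):

#include <stdio.h>

int
main (void)
{
  int nunits = 4;      /* lanes per vector */
  int num_vec = 2;     /* SLP_TREE_NUMBER_OF_VEC_STMTS */
  int num_scalar = 2;  /* number of SLP scalar stmts */

  for (int slp_index = 0; slp_index < num_scalar; slp_index++)
    {
      /* Last occurrence of this scalar stmt in the concatenation of all
         the SLP vectors.  */
      int pos = (num_vec * nunits) - num_scalar + slp_index;
      int vec_entry = pos / nunits;   /* which vectorized stmt to use */
      int vec_index = pos % nunits;   /* which lane within it */
      printf ("slp_index %d -> vector %d, lane %d\n",
              slp_index, vec_entry, vec_index);
    }
  return 0;
}

With these numbers, slp_index 0 maps to lane 2 and slp_index 1 to lane 3 of
the second vector, i.e. the last group of scalar results in the concatenated
vectors.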

Tested on x86 and aarch64.
Ok to commit?

gcc/
PR tree-optimization/71483
* tree-vect-loop.c (vectorizable_live_operation): Pick correct index
for slp

testsuite/g++.dg/vect
PR tree-optimization/71483
* pr71483.c: New


Alan.


diff --git a/gcc/testsuite/g++.dg/vect/pr71483.c
b/gcc/testsuite/g++.dg/vect/pr71483.c
new file mode 100644
index 
..77f879c9a89b8b41ef9dde3c343591857
2dc8d01
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr71483.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+int b, c, d;
+short *e;
+void fn1() {
+  for (; b; b--) {
+d = *e >> 2;
+*e++ = d;
+c = *e;
+*e++ = d;
+  }
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
4c8678505df6ec572b69fd7d12ac55cf4619ece6..a2413bf9c678d11cc2ffd22bc7d984e91
1831804 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6368,24 +6368,20 @@ vectorizable_live_operation (gimple *stmt,

   int num_scalar = SLP_TREE_SCALAR_STMTS (slp_node).length ();
   int num_vec = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-  int scalar_per_vec = num_scalar / num_vec;

-  /* There are three possibilites here:
-1: All scalar stmts fit in a single vector.
-2: All scalar stmts fit multiple times into a single vector.
-   We must choose the last occurence of stmt in the vector.
-3: Scalar stmts are split across multiple vectors.
-   We must choose the correct vector and mod the lane accordingly.  */
+  /* Get the last occurrence of the scalar index from the concatenation of
+     all the slp vectors.  Calculate which slp vector it is and the index
+     within.  */
+  int pos = (num_vec * nunits) - num_scalar + slp_index;
+  int vec_entry = pos / nunits;
+  int vec_index = pos % nunits;

   /* Get the correct slp vectorized stmt.  */
-  int vec_entry = slp_index / scalar_per_vec;
   vec_lhs = gimple_get_lhs (SLP_TREE_VEC_STMTS (slp_node)[vec_entry]);

   /* Get entry to use.  */
-  bitstart = build_int_cst (unsigned_type_node,
-   scalar_per_vec - (slp_index % scalar_per_vec));
+  bitstart = build_int_cst (unsigned_type_node, vec_index);
   bitstart = int_const_binop (MULT_EXPR, bitsize, bitstart);
-  bitstart = int_const_binop (MINUS_EXPR, vec_bitsize, bitstart);
 }
   else
 {






[PATCH] PR 71416 - allow more than one use of a live operation

2016-06-13 Thread Alan Hayward
vectorizable_live_operation checks that there is exactly one use of each
operation that has been marked live.

However, it is possible for the operation to be used more than once if the
usage PHIs are identical.

For example in 71416-1.c, _6 is used twice after the loop in bb 9.

  :
  # e.6_21 = PHI 
  _5 = g_16(D) >= 0;
  _6 = (int) _5;
  e.5_7 = (unsigned char) e.6_21;
  _8 = e.5_7 + 1;
  _9 = (char) _8;
  if (_9 != 0)
goto ;
  else
goto ;

  :
  # _12 = PHI <_6(8)>
  # _19 = PHI <_6(8)>
  goto ; 

This patch relaxes the restriction in vectorizable_live_operation, and now
makes sure there is at least one use of each operation that has been marked
live.

Tested on x86 and aarch64.
Ok to commit?

gcc/
PR tree-optimization/71416
* tree-vect-loop.c (vectorizable_live_operation): Let worklist have
multiple entries


Alan.


diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 
90ade75bcd212b542ad680877a79df717751ff4b..4c8678505df6ec572b69fd7d12ac55cf4
619ece6 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6355,7 +6355,7 @@ vectorizable_live_operation (gimple *stmt,
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
 if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
worklist.safe_push (use_stmt);
-  gcc_assert (worklist.length () == 1);
+  gcc_assert (worklist.length () >= 1);

   bitsize = TYPE_SIZE (TREE_TYPE (vectype));
   vec_bitsize = TYPE_SIZE (vectype);
@@ -6413,9 +6413,12 @@ vectorizable_live_operation (gimple *stmt,

   /* Replace all uses of the USE_STMT in the worklist with the newly
inserted
  statement.  */
-  use_stmt = worklist.pop ();
-  replace_uses_by (gimple_phi_result (use_stmt), new_tree);
-  update_stmt (use_stmt);
+  while (!worklist.is_empty ())
+{
+  use_stmt = worklist.pop ();
+  replace_uses_by (gimple_phi_result (use_stmt), new_tree);
+  update_stmt (use_stmt);
+}

   return true;
 }







[PATCH] Fix PR 71407 : Use correct types for live usage of live operations

2016-06-09 Thread Alan Hayward
This patch fixes PR 71407 by ensuring the BIT_FIELD_REF is created using the
given vectype and then cast to the result type.
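
A rough C analogue of the fix, using GNU vector extensions (illustrative
only, not the vectorizer's code): extract the lane in the vector's own
element type first, then convert to the scalar result type.

typedef int v4si __attribute__ ((vector_size (16)));

short
last_lane_as_short (v4si v)
{
  int lane = v[3];      /* the BIT_FIELD_REF is done in the vectype's
                           element type... */
  return (short) lane;  /* ...and only then cast to the result type */
}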

Tested on x86 and aarch64.
Note that the test vectorizes on x86 but does not vectorize on aarch64 or
power (due to a != statement failing to vectorize)

Ok to commit?



gcc/
PR tree-optimization/71407
PR tree-optimization/71416
* tree-vect-loop.c (vectorizable_live_operation): Use vectype for
BIT_FIELD_REF type.

Testsuite/
PR tree-optimization/71407
PR tree-optimization/71416
* gcc.dg/vect/pr71407.c: New
* gcc.dg/vect/pr71416-1.c: New
* gcc.dg/vect/pr71416-2.c: New


Alan.



pr71407.patch
Description: Binary data


Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-06-07 Thread Alan Hayward


On 07/06/2016 10:28, "Rainer Orth"  wrote:

>Alan Hayward  writes:
>
>> On 05/06/2016 12:00, "Andreas Schwab"  wrote:
>>
>>>Alan Hayward  writes:
>>>
>>>>* gcc.dg/vect/vect-live-2.c: New test.
>>>
>>>This test fails on powerpc64 (with -m64, but not with -m32):
>>>
>>>$ grep 'vectorized.*loops' ./vect-live-2.c.149t.vect
>>>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:10:1: note: vectorized 0
>>>loops
>>>in function.
>>>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:29:1: note: vectorized 0
>>>loops
>>>in function.
>>>
>>>
>>
>> "note: not vectorized: relevant stmt not supported: _1 = (long unsigned
>> int) j_24;"
>>
>>
>> This is failing because power does not support vectorising a cast from
>>int
>> to long.
>> (It works on power 32bit because longs are 32bit and therefore no need
>>to
>> cast).
>>
>> Can someone please suggest a target-supports define (or another method)
>>I
>> can use to
>> disable this test for power 64bit (but not 32bit) ?
>> I tried using vect_multiple_sizes, but that will also disable the test
>>on
>> x86 without
>> avx.
>
>I'm also seeing new FAILs on Solaris/SPARC:
>
>+FAIL: gcc.dg/vect/vect-live-2.c -flto -ffat-lto-objects
>scan-tree-dump-times v
>ect "vectorized 1 loops" 1
>+FAIL: gcc.dg/vect/vect-live-2.c scan-tree-dump-times vect "vectorized 1
>loops" 
>1
>
>32- and 64-bit:
>
>vect-live-2.c:16:3: note: not vectorized: relevant stmt not supported: _2
>= j.0_1 * 4;
>vect-live-2.c:48:7: note: not vectorized: control flow in loop.
>vect-live-2.c:35:3: note: not vectorized: loop contains function calls or
>data references that cannot be analyzed
>
>and
>
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects
>scan-tree-dump-times vect "vec_stmt_relevant_p: stmt live but not
>relevant" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects
>scan-tree-dump-times vect "vectorized 1 loops" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects
>scan-tree-dump-times vect "vectorizing stmts using SLP" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect
>"vec_stmt_relevant_p: stmt live but not relevant" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect
>"vectorized 1 loops" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect
>"vectorizing stmts using SLP" 4
>
>vect-live-slp-3.c:29:1: note: not vectorized: no vectype for stmt: n0_29
>= *_4;
>vect-live-slp-3.c:30:1: note: not vectorized: no vectype for stmt: n0_29
>= *_4;
>vect-live-slp-3.c:31:1: note: not vectorized: no vectype for stmt: n0_29
>= *_4;
>vect-live-slp-3.c:32:1: note: not vectorized: no vectype for stmt: n0_29
>= *_4;
>vect-live-slp-3.c:62:4: note: not vectorized: control flow in loop.
>vect-live-slp-3.c:45:3: note: not vectorized: loop contains function
>calls or data references that cannot be analyzed
>


I’ve been trying both these tests on x86, aarch64, power and sparc.

vect-live-slp-3.c
Fails on  power 64 (altivec & vsx), sparc 64 (vis 2 & 3)
  - due to long int unsupported
Pass on x86, aarch64, power 32 (altivec & vsx), sparc 32 (vis 2 & 3)

vect-live-2.c
Fails on power 64 (altivec & vsx), sparc 64 (vis 2 & 3)
  - due to long int unsupported
Fails on sparc 32 (vis 2)
  - due to multiply/shift not supported
Pass on x86, aarch64, power 32 (altivec & vsx), sparc 32 (vis 3)


Therefore I think both tests should be gated on “vect_long”.
In addition, vect-live-2.c should also be gated on “vect_shift”.

“vect_long” is not currently enabled for aarch64, but should be.

Also “vect_shift” is not currently enabled for sparc 32 (vis 3), but probably
should be. I leave this as a task for a sparc maintainer to add (as I’m
unable to test).
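
For reference, the gating described above would look something like this at
the top of vect-live-2.c (the directive spelling follows the existing vect
tests; the exact final form is in the patch below):

/* { dg-require-effective-target vect_long } */
/* { dg-require-effective-target vect_shift } */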




This patch fixes the targets for vect-live-slp-3.c and vect-live-2.c.
It also adds aarch64 to vect_long.

As a side consequence, the following vector tests are now enabled for
aarch64:
pr18425.c, pr30843.c, pr36493.c, pr42193.c and pr60656.c

Tested on aarch64 and x86.
Tested by inspection on power and sparc

Ok to commit?

testsuite/
* gcc.dg/vect/vect-live-2.c: Update effective target.
* gcc.dg/vect/vect-live-slp-3.c: Likewise.
* lib/target-supports.exp: Add aarch64 to vect_long.




diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-2.c
b/gcc/testsuite/gcc.dg/vect/vect-live-2.c
index 
53adc3fee006e0577a4cf2f9ba8fe091d2a09353..9460624a515945bdd72f98a0b1a6751fd
c7a75de 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-2.c
+++ b/gcc/testsuite/gcc.dg/vect

Re: [PATCH][vectorizer] Remove blank debug lines after dump_gimple_stmt

2016-06-07 Thread Alan Hayward

On 06/06/2016 18:56, "Jeff Law"  wrote:

>On 06/06/2016 06:46 AM, Alan Hayward wrote:
>> Lots of code calls dump_gimple_stmt and then prints a newline; however,
>> dump_gimple_stmt will print a newline itself. This makes the vectorizer
>> debug file messy. I think the confusion is because dump_generic_expr does
>> NOT print a newline. This patch removes all prints of a newline directly
>> after a dump_gimple_stmt.
>>
>> Tested by examining a selection of vect dump files.
>>
>>  gcc/
>>  * tree-vect-data-refs.c (vect_analyze_data_refs): Remove debug
>>  newline.
>>  * tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): likewise.
>>  (vect_can_advance_ivs_p): likewise.
>>  (vect_update_ivs_after_vectorizer): likewise.
>>  * tree-vect-loop.c (vect_determine_vectorization_factor): likewise.
>>  (vect_analyze_scalar_cycles_1): likewise.
>>  (vect_analyze_loop_operations): likewise.
>>  (report_vect_op): likewise.
>>  (vect_is_slp_reduction): likewise.
>>  (vect_is_simple_reduction): likewise.
>>  (get_initial_def_for_induction): likewise.
>>  (vect_transform_loop): likewise.
>>  * tree-vect-patterns.c (vect_recog_dot_prod_pattern): likewise.
>>  (vect_recog_sad_pattern): likewise.
>>  (vect_recog_widen_sum_pattern): likewise.
>>  (vect_recog_widening_pattern): likewise.
>>  (vect_recog_divmod_pattern): likewise.
>>  * tree-vect-slp.c (vect_build_slp_tree_1): likewise.
>>  (vect_analyze_slp_instance): likewise.
>>  (vect_transform_slp_perm_load): likewise.
>>  (vect_schedule_slp_instance): likewise.
>Wouldn't you also need to verify that the testsuite passes since it
>could potentially have tests which assume the extra newlines in the
>test's regexps?
>
>So I'm fine with this patch once the testsuite has been verified to run
>without regressions on a target which exercises the vectorizer.
>

Sorry, I should have said that I have done a check run on x86 and aarch64,
and there were no regressions.


Alan.




Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-06-07 Thread Alan Hayward


On 05/06/2016 12:00, "Andreas Schwab"  wrote:

>Alan Hayward  writes:
>
>>  * gcc.dg/vect/vect-live-2.c: New test.
>
>This test fails on powerpc64 (with -m64, but not with -m32):
>
>$ grep 'vectorized.*loops' ./vect-live-2.c.149t.vect
>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:10:1: note: vectorized 0 loops
>in function.
>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:29:1: note: vectorized 0 loops
>in function.
>
>

"note: not vectorized: relevant stmt not supported: _1 = (long unsigned
int) j_24;"


This is failing because power does not support vectorising a cast from int
to long.
(It works on power 32bit because longs are 32bit and therefore there is no
need to cast).

Can someone please suggest a target-supports define (or another method) I
can use to disable this test for power 64bit (but not 32bit)?
I tried using vect_multiple_sizes, but that will also disable the test on
x86 without avx.


Thanks,
Alan.




[PATCH][vectorizer] Remove blank debug lines after dump_gimple_stmt

2016-06-06 Thread Alan Hayward
Lots of code calls dump_gimple_stmt and then prints a newline; however,
dump_gimple_stmt will print a newline itself. This makes the vectorizer debug
file messy. I think the confusion is because dump_generic_expr does NOT print
a newline. This patch removes all prints of a newline directly after a
dump_gimple_stmt.

Tested by examining a selection of vect dump files.

gcc/
* tree-vect-data-refs.c (vect_analyze_data_refs): Remove debug newline.
* tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): likewise.
(vect_can_advance_ivs_p): likewise.
(vect_update_ivs_after_vectorizer): likewise.
* tree-vect-loop.c (vect_determine_vectorization_factor): likewise.
(vect_analyze_scalar_cycles_1): likewise.
(vect_analyze_loop_operations): likewise.
(report_vect_op): likewise.
(vect_is_slp_reduction): likewise.
(vect_is_simple_reduction): likewise.
(get_initial_def_for_induction): likewise.
(vect_transform_loop): likewise.
* tree-vect-patterns.c (vect_recog_dot_prod_pattern): likewise.
(vect_recog_sad_pattern): likewise.
(vect_recog_widen_sum_pattern): likewise.
(vect_recog_widening_pattern): likewise.
(vect_recog_divmod_pattern): likewise.
* tree-vect-slp.c (vect_build_slp_tree_1): likewise.
(vect_analyze_slp_instance): likewise.
(vect_transform_slp_perm_load): likewise.
(vect_schedule_slp_instance): likewise.


Alan




removenewlines.patch
Description: Binary data


Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-06-06 Thread Alan Hayward

On 03/06/2016 18:45, "Jakub Jelinek"  wrote:

>On Thu, Jun 02, 2016 at 05:11:15PM +0100, Alan Hayward wrote:
>>  * gcc.dg/vect/vect-live-1.c: New test.
>>  * gcc.dg/vect/vect-live-2.c: New test.
>>  * gcc.dg/vect/vect-live-5.c: New test.
>>  * gcc.dg/vect/vect-live-slp-1.c: New test.
>>  * gcc.dg/vect/vect-live-slp-2.c: New test.
>>  * gcc.dg/vect/vect-live-slp-3.c: New test.
>
>These tests all fail for me on i686-linux.  The problem is
>in the use of dg-options in gcc.dg/vect/, where it override all the
>various
>needed vectorization options that need to be enabled on various arches
>(e.g. -msse2 on i686).
>
>Fixed thusly, tested on x86_64-linux and i686-linux, ok for trunk?

Thanks for fixing this. However, I think you had a copy/paste error.
For vect-live-slp-1.c and vect-live-slp-3.c you used dg-options instead
of dg-additional-options, which causes the tests to fail for me as
UNRESOLVED.

Ok to commit?

Alan.


testsuite/
* gcc.dg/vect/vect-live-slp-1.c: Use dg-additional-options.
* gcc.dg/vect/vect-live-slp-3.c: Likewise.



diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
b/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
index 
82cd8dcabed8f18408e0e5cf295c2f15e0621ae6..7fefbff17d8c34bdfb1e8d697d2b9a70a
048ae16 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-fno-tree-scev-cprop" } */
+/* { dg-additional-options "-fno-tree-scev-cprop" } */

 #include "tree-vect.h"

diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
b/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
index 
9e4a59eca593289050086b4b0a1347be5d748723..aacf5cb98071f6fec1f4b522eeefeb669
6787334 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-fno-tree-scev-cprop" } */
+/* { dg-additional-options "-fno-tree-scev-cprop" } */

 #include "tree-vect.h"





[PATCH] Fix check_GNU_style.sh for BSD / Mac OS X

2016-06-03 Thread Alan Hayward
check_GNU_style.sh fails to detect lines >80 chars on BSD / Mac OS X
systems.

This is because paste is being called with an empty delimiter list.
Instead, '\0' should be used.

Tested on Ubuntu 14.04 and OS X 10.9.5

contrib/
* check_GNU_style.sh: Fix paste args for BSD


Alan


diff --git a/contrib/check_GNU_style.sh b/contrib/check_GNU_style.sh
index 
a7478f8f573132aef5ed1010f0cf5b13f08350d4..87a276c9cf47b5e07c4407f740ce05dce
1928c30 100755
--- a/contrib/check_GNU_style.sh
+++ b/contrib/check_GNU_style.sh
@@ -191,7 +191,7 @@ col (){
# Combine prefix back with long lines.
# Filter out empty lines.
local found=false
-   paste -d '' "$tmp2" "$tmp3" \
+   paste -d '\0' "$tmp2" "$tmp3" \
| grep -v '^[0-9][0-9]*:+$' \
> "$tmp" && found=true





Re: [PATCH][3/3] No need to vectorize simple only-live stmts

2016-06-02 Thread Alan Hayward

>Statements which are live but not relevant need marking to ensure they are
>vectorized.
>
>Live statements which are simple and all uses of them are invariant do not
>need
>to be vectorized.
>
>This patch adds a check to make sure those stmts which pass both the above
>checks are not vectorized and then discarded.
>
>Tested on x86 and aarch64.
>
>
>gcc/
>*tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize non
>live
>relevant stmts which are simple and invariant.
>
>testsuite/
>* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.


This version adds an addition relevance check in
vectorizable_live_operation.
It requires the "Remove duplicated GOMP SIMD LANE code” to work.

gcc/
* tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize non live
relevant stmts which are simple and invariant.
* tree-vect-loop.c (vectorizable_live_operation): Check relevance
instead of simple and invariant

testsuite/
* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.




Alan.



3of3relevant.patch
Description: Binary data


Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-06-02 Thread Alan Hayward


On 01/06/2016 10:51, "Richard Biener"  wrote:

>On Wed, Jun 1, 2016 at 10:46 AM, Alan Hayward 
>wrote:
>>
>>
>> On 30/05/2016 14:22, "Richard Biener" 
>>wrote:
>>
>>>On Fri, May 27, 2016 at 5:12 PM, Alan Hayward 
>>>wrote:
>>>>
>>>> On 27/05/2016 12:41, "Richard Biener" 
>>>>wrote:
>>>>
>>>>>On Fri, May 27, 2016 at 11:09 AM, Alan Hayward 
>>>>>wrote:
>>
>>>>
>>>>>
>>>>>The rest of the changes look ok to me.
>>>>
>>>> Does that include PATCH 1/3  ?
>>>
>>>I don't like how 1/3 ends up looking :/  So what was the alternative
>>>again?
>>>I looked into 1/3 and what it takes to remove the 'stmt' argument and
>>>instead pass in a vect_def_type.  It's a bit twisted and just adding
>>>another
>>>argument (the loop_vinfo) doesn't help things here.
>>>
>>>So - instead of 1/3 you might want to split out a function
>>>
>>>tree
>>>vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type
>>>dt, tree vectype)
>>>{
>>>  switch (dt)
>>>{
>>>...
>>>}
>>>}
>>>
>>>and for constant/external force vectype != NULL.
>>
>> I’m still a little confused as to exactly what you want here.
>>
>> From your two comments I think you wanted me to completely remove the
>> boolean type check and the vect_init_vector call. But if I remove that
>> then other code paths will break.
>>
>> However, I’ve just realised that in vectorized_live_operation I already
>> have the def stmt and I can easily get hold of dt from
>>STMT_VINFO_DEF_TYPE.
>> Which means I can call vect_get_vec_def_for_operand_1 from
>> vectorized_live_operation.
>>
>> I’ve put together a version where I have:
>>
>> tree
>> vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type dt)
>> {
>>
>>  switch (dt)
>>  {
>>case vect_internal_def || vect_external_def:
>>  gcc_unreachable ()
>>
>>.. code for for all other cases..
>>  }
>> }
>>
>> /* Used by existing code  */
>> tree
>> vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
>> {
>>   vect_is_simple_use(op, loop_vinfo, &def_stmt, &dt); ..and the dump
>>code
>>
>>
>>   If dt == internal_def || vect_external_def:
>>   .. Check for BOOLEAN_TYPE ..
>>   return vect_init_vector (stmt, op, vector_type, NULL);
>>
>>   else
>> vect_get_vec_def_for_operand_1 (def_stmt, dt)
>> }
>>
>>
>> Does that look better?
>
>Yes!
>
>Thanks,
>Richard.
>


This version of the patch addresses the simple+invariant issues
(and patch [3/3] optimizes it).

Also includes a small change to handle when the vect pattern has introduced
new pattern match statements (in vectorizable_live_operation if
STMT_VINFO_RELATED_STMT is not null then use it instead of stmt).

gcc/
* tree-vect-loop.c (vect_analyze_loop_operations): Allow live stmts.
(vectorizable_reduction): Check for new relevant state.
(vectorizable_live_operation): Vectorize live stmts using
BIT_FIELD_REF.  Remove special case for gimple assigns stmts.
* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New function.
(vect_stmt_relevant_p): Check for stmts which are only used live.
(process_use): Use of a stmt does not inherit its live value.
(vect_mark_stmts_to_be_vectorized): Simplify relevance inheritance.
(vect_analyze_stmt): Check for new relevant state.
* tree-vectorizer.h (vect_relevant): New entry for a stmt which is used
outside the loop, but not inside it.

testsuite/
* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
* gcc.dg/vect/vect-live-1.c: New test.
* gcc.dg/vect/vect-live-2.c: New test.
* gcc.dg/vect/vect-live-3.c: New test.
* gcc.dg/vect/vect-live-4.c: New test.
* gcc.dg/vect/vect-live-5.c: New test.
* gcc.dg/vect/vect-live-slp-1.c: New test.
* gcc.dg/vect/vect-live-slp-2.c: New test.
* gcc.dg/vect/vect-live-slp-3.c: New test.


Alan.



2of3live.patch
Description: Binary data


Re: [PATCH][1/3] Add loop_vinfo to vect_get_vec_def_for_operand

2016-06-02 Thread Alan Hayward

> This patch simply adds loop_vinfo as an extra argument to
> vect_get_vec_def_for_operand and only generates a stmt_vinfo if required.
> This is a required cleanup for patch [2/3].
> Tested on x86 and aarch64.
>
> gcc/
> * tree-vectorizer.h (vect_get_vec_def_for_operand): Pass loop_vinfo in.
> * tree-vect-stmts.c (vect_get_vec_def_for_operand): Pass loop_vinfo in.
> (vect_get_vec_defs): Pass down loop_vinfo.
> (vectorizable_mask_load_store): Likewise.
> (vectorizable_call): Likewise.
> (vectorizable_simd_clone_call): Likewise.
> (vect_get_loop_based_defs): Likewise.
> (vectorizable_conversion): Likewise.
> (vectorizable_operation): Likewise.
> (vectorizable_store): Likewise.
> (vectorizable_load): Likewise.
> (vectorizable_condition): Likewise.
> (vectorizable_comparison): Likewise.
> * tree-vect-loop.c (get_initial_def_for_induction): Likewise.
> (get_initial_def_for_reduction): Likewise.
> (vectorizable_reduction):  Likewise.

New version. I've removed the additional loop_vinfo arg.
Instead, I've split part of vect_get_vec_def_for_operand into a new
function vect_get_vec_def_for_operand_1.
My [2/3] patch will call vect_get_vec_def_for_operand_1 directly from
vectorizable_live_operation.

gcc/
* tree-vectorizer.h (vect_get_vec_def_for_operand_1): New.
* tree-vect-stmts.c (vect_get_vec_def_for_operand_1): New.
(vect_get_vec_def_for_operand): Split out code.






Alan.



1of3get_vec_def.patch
Description: Binary data


Remove duplicated GOMP SIMD LANE code

2016-06-02 Thread Alan Hayward
The IFN_GOMP_SIMD_LANE code in vectorizable_call is essentially a duplicate
of the code in vectorizable_live_operation. They both replace all uses
outside the loop with the constant VF - 1.

Note that my patch to vectorize inductions that are live after the loop
will also remove the IFN_GOMP_SIMD_LANE code in vectorizable_live_operation.

This patch is required in order for the follow-on optimisation (No need to
vectorize simple only-live stmts) to work.

Tested with libgomp on x86 and aarch64

gcc/
* tree-vect-stmts.c (vectorizable_call): Remove GOMP_SIMD_LANE code.

Alan.



duplicatedGOMP.patch
Description: Binary data


Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-06-01 Thread Alan Hayward


On 30/05/2016 14:22, "Richard Biener"  wrote:

>On Fri, May 27, 2016 at 5:12 PM, Alan Hayward 
>wrote:
>>
>> On 27/05/2016 12:41, "Richard Biener" 
>>wrote:
>>
>>>On Fri, May 27, 2016 at 11:09 AM, Alan Hayward 
>>>wrote:

>>
>>>
>>>The rest of the changes look ok to me.
>>
>> Does that include PATCH 1/3  ?
>
>I don't like how 1/3 ends up looking :/  So what was the alternative
>again?
>I looked into 1/3 and what it takes to remove the 'stmt' argument and
>instead pass in a vect_def_type.  It's a bit twisted and just adding
>another
>argument (the loop_vinfo) doesn't help things here.
>
>So - instead of 1/3 you might want to split out a function
>
>tree
>vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type
>dt, tree vectype)
>{
>  switch (dt)
>{
>...
>}
>}
>
>and for constant/external force vectype != NULL.

I’m still a little confused as to exactly what you want here.

From your two comments I think you wanted me to completely remove the
boolean type check and the vect_init_vector call. But if I remove that
then other code paths will break.

However, I’ve just realised that in vectorizable_live_operation I already
have the def stmt and I can easily get hold of dt from STMT_VINFO_DEF_TYPE.
Which means I can call vect_get_vec_def_for_operand_1 from
vectorizable_live_operation.

I’ve put together a version where I have:

tree
vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type dt)
{
  switch (dt)
    {
    case vect_internal_def:
    case vect_external_def:
      gcc_unreachable ();

    /* ... code for all other cases ...  */
    }
}

/* Used by existing code.  */
tree
vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
{
  vect_is_simple_use (op, loop_vinfo, &def_stmt, &dt);  /* ... and the dump code ...  */

  if (dt == vect_internal_def || dt == vect_external_def)
    {
      /* ... check for BOOLEAN_TYPE ...  */
      return vect_init_vector (stmt, op, vector_type, NULL);
    }
  else
    return vect_get_vec_def_for_operand_1 (def_stmt, dt);
}


Does that look better?


Alan.






Re: [PATCH][2/3] Vectorize inductions that are live after the loop

2016-05-27 Thread Alan Hayward

On 27/05/2016 12:41, "Richard Biener"  wrote:

>On Fri, May 27, 2016 at 11:09 AM, Alan Hayward 
>wrote:
>> This patch is a reworking of the previous vectorize inductions that are
>> live
>> after the loop patch.
>> It now supports SLP and an optimisation has been moved to patch [3/3].
>>
>>
>> Stmts which are live (ie: defined inside a loop and then used after the
>> loop)
>> are not currently supported by the vectorizer.  In many cases
>> vectorization can
>> still occur because the SCEV cprop pass will hoist the definition of the
>> stmt
>> outside of the loop before the vectorizor pass. However, there are
>>various
>> cases SCEV cprop cannot hoist, for example:
>>   for (i = 0; i < n; ++i)
>> {
>>   ret = x[i];
>>   x[i] = i;
>> }
>>return i;
>>
>> Currently stmts are marked live using a bool, and the relevant state
>>using
>> an
>> enum. Both these states are propagated to the definition of all uses of
>>the
>> stmt. Also, a stmt can be live but not relevant.
>>
>> This patch vectorizes a live stmt definition normally within the loop
>>and
>> then
>> after the loop uses BIT_FIELD_REF to extract the final scalar value from
>> the
>> vector.
>>
>> This patch adds a new relevant state (vect_used_only_live) for when a
>>stmt
>> is
>> used only outside the loop. The relevant state is still propagated to
>>all
>> it's
>> uses, but the live bool is not (this ensures that
>> vectorizable_live_operation
>> is only called with stmts that really are live).
>>
>> Tested on x86 and aarch64.
>
>+  /* If STMT is a simple assignment and its inputs are invariant, then
>it can
>+ remain in place, unvectorized.  The original last scalar value that
>it
>+ computes will be used.  */
>+  if (is_simple_and_all_uses_invariant (stmt, loop_vinfo))
> {
>
>so we can't use STMT_VINFO_RELEVANT or so?  I thought we somehow
>mark stmts we don't vectorize in the end.

It’s probably worth making clear that this check exists in the current
GCC head - today vectorizable_live_operation only supports the
simple+invariant case and the SSE2 case.
My patch simply moved the simple+invariant code into the new function
is_simple_and_all_uses_invariant.

Looking at this again, I think what we really care about is:
* If the stmt is live but not relevant, we need to mark it to ensure it is
vectorised.
* If a stmt is simple+invariant then a live usage of it can either use the
scalar or vectorized version.

So for a live stmt:
* If it is simple+invariant and not relevant, then it is more optimal to use
the scalar version.
* If it is simple+invariant and relevant then it is more optimal to use the
vectorized version.
* If it is not simple+invariant then we must always use the vectorized
version.

Therefore, the patch as it stands is correct but not optimal. In patch 3/3,
for the code above (vectorizable_live_operation), I can change the check to:
if not relevant then assert that it is not simple+invariant and return true.
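
Put as a tiny sketch (my own summary of the three cases above, with a
made-up helper name; this is not code from the patch):

#include <stdbool.h>

/* Should a live stmt use the vectorized value (i.e. be handled by
   vectorizable_live_operation), or can its scalar form simply be kept?  */
static bool
use_vectorized_value (bool relevant, bool simple_and_all_uses_invariant)
{
  if (!simple_and_all_uses_invariant)
    return true;        /* must always use the vectorized version */
  return relevant;      /* simple+invariant: vectorized only when relevant */
}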



>
>+  lhs = (is_a  (stmt)) ? gimple_phi_result (stmt)
>+   : gimple_get_lhs (stmt);
>+  lhs_type = TREE_TYPE (lhs);
>+
>+  /* Find all uses of STMT outside the loop.  */
>+  auto_vec worklist;
>+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
>+{
>+  basic_block bb = gimple_bb (use_stmt);
>+
>+  if (!flow_bb_inside_loop_p (loop, bb))
>+   worklist.safe_push (use_stmt);
> }
>+  gcc_assert (!worklist.is_empty ());
>
>as we are in loop-closed SSA there should be exactly one such use...?

Yes, I should change this to assert that worklist is of length 1.

>
>+  /* Get the correct slp vectorized stmt.  */
>+  vec_oprnds.create (num_vec);
>+  vect_get_slp_vect_defs (slp_node, &vec_oprnds);
>
>As you look at the vectorized stmt you can directly use the
>SLP_TREE_VEC_STMTS
>array (the stmts lhs, of course), no need to export this function.

Ok, I can change this.

>
>The rest of the changes look ok to me.

Does that include PATCH 1/3  ?

>
>Thanks,
>Richard.
>
>
>> gcc/
>> * tree-vect-loop.c (vect_analyze_loop_operations): Allow live
>> stmts.
>> (vectorizable_reduction): Check for new relevant state.
>> (vectorizable_live_operation): vectorize live stmts using
>> BIT_FIELD_REF.  Remove special case for gimple assigns stmts.
>> * tree-vect-stmts.c (is_simple_and_all_uses_invariant): New
>> function.
>> (vect_stmt_relevant_p): Check for stmts which are only used
>>live.
>> (process_use): Use of a stmt does not inherit 

[PATCH][3/3] No need to vectorize simple only-live stmts

2016-05-27 Thread Alan Hayward
Statements which are live but not relevant need marking to ensure they are
vectorized.

Live statements which are simple and all uses of them are invariant do not
need to be vectorized.

This patch adds a check to make sure those stmts which pass both the above
checks are not vectorized and then discarded.
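
To make that concrete, here is a sketch of the kind of loop meant (my own
illustration, not one of the patch's testcases): 'last' is live because it
is used after the loop, but it is a simple assignment whose only use ('k')
is loop invariant, so the scalar statement can simply be kept while the
rest of the loop is vectorized.

int
keep_scalar_live (int *x, int n, int k)
{
  int last = 0;
  for (int i = 0; i < n; ++i)
    {
      last = k;      /* simple; its only use (k) is loop invariant */
      x[i] = i;      /* the rest of the loop is still vectorized   */
    }
  return last;       /* 'last' is live: used after the loop        */
}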

Tested on x86 and aarch64.


gcc/
* tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize non-relevant
live stmts which are simple and invariant.

testsuite/
* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.



Alan.



live3.patch
Description: Binary data


[PATCH][2/3] Vectorize inductions that are live after the loop

2016-05-27 Thread Alan Hayward
This patch is a reworking of the previous "Vectorize inductions that are
live after the loop" patch.
It now supports SLP and an optimisation has been moved to patch [3/3].


Stmts which are live (i.e. defined inside a loop and then used after the
loop) are not currently supported by the vectorizer.  In many cases
vectorization can still occur because the SCEV cprop pass will hoist the
definition of the stmt outside of the loop before the vectorizer pass.
However, there are various cases SCEV cprop cannot hoist, for example:
  for (i = 0; i < n; ++i)
{
  ret = x[i];
  x[i] = i;
}
   return ret;

Currently stmts are marked live using a bool, and the relevant state using
an enum. Both these states are propagated to the definition of all uses of
the stmt. Also, a stmt can be live but not relevant.

This patch vectorizes a live stmt definition normally within the loop and
then after the loop uses BIT_FIELD_REF to extract the final scalar value
from the vector.
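
As a scalar model of that extraction (my own sketch with made-up names,
assuming VF == 4 and a loop count that is a multiple of 4; this is not GCC
output), applied to the example loop above:

int
liveloop_model (int *x, int n)
{
  int ret_lanes[4] = { 0, 0, 0, 0 };
  for (int i = 0; i < n; i += 4)
    for (int l = 0; l < 4; l++)
      {
        ret_lanes[l] = x[i + l];   /* 'ret = x[i]', one copy per lane */
        x[i + l] = i + l;          /* 'x[i] = i',   one copy per lane */
      }
  /* Epilogue: keep only the last lane, which is the role the
     BIT_FIELD_REF plays on the real vector.  */
  return ret_lanes[3];
}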

This patch adds a new relevant state (vect_used_only_live) for when a stmt
is used only outside the loop. The relevant state is still propagated to
all its uses, but the live bool is not (this ensures that
vectorizable_live_operation is only called with stmts that really are live).

Tested on x86 and aarch64.

gcc/
* tree-vect-loop.c (vect_analyze_loop_operations): Allow live
stmts.
(vectorizable_reduction): Check for new relevant state.
(vectorizable_live_operation): Vectorize live stmts using
BIT_FIELD_REF.  Remove special case for gimple assigns stmts.
* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New
function.
(vect_stmt_relevant_p): Check for stmts which are only used live.
(process_use): Use of a stmt does not inherit its live value.
(vect_mark_stmts_to_be_vectorized): Simplify relevance inheritance.
(vect_analyze_stmt): Check for new relevant state.
* tree-vect-slp.c (vect_get_slp_vect_defs): Make global.
* tree-vectorizer.h (vect_relevant): New entry for a stmt which is used
outside the loop, but not inside it.

testsuite/
* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
* gcc.dg/vect/vect-live-1.c: New test.
* gcc.dg/vect/vect-live-2.c: New test.
* gcc.dg/vect/vect-live-3.c: New test.
* gcc.dg/vect/vect-live-4.c: New test.
* gcc.dg/vect/vect-live-5.c: New test.
* gcc.dg/vect/vect-live-slp-1.c: New test.
* gcc.dg/vect/vect-live-slp-2.c: New test.
* gcc.dg/vect/vect-live-slp-3.c: New test.
* gcc.dg/vect/vect-live-slp-4.c: New test.



Alan.



live2.patch
Description: Binary data


[PATCH][1/3] Add loop_vinfo to vect_get_vec_def_for_operand

2016-05-27 Thread Alan Hayward
This patch simply adds loop_vinfo as an extra argument to
vect_get_vec_def_for_operand and only generates a stmt_vinfo if required.
This is a required cleanup for patch [2/3].
Tested on x86 and aarch64.

gcc/
* tree-vectorizer.h (vect_get_vec_def_for_operand): Pass loop_vinfo in.
* tree-vect-stmts.c (vect_get_vec_def_for_operand): Pass loop_vinfo in.
(vect_get_vec_defs): Pass down loop_vinfo.
(vectorizable_mask_load_store): Likewise.
(vectorizable_call): Likewise.
(vectorizable_simd_clone_call): Likewise.
(vect_get_loop_based_defs): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison): Likewise.
* tree-vect-loop.c (get_initial_def_for_induction): Likewise.
(get_initial_def_for_reduction): Likewise.
(vectorizable_reduction):  Likewise.


Alan.



live1.patch
Description: Binary data


Re: [PATCH] Vectorize inductions that are live after the loop.

2016-05-23 Thread Alan Hayward

Thanks for the review.

On 23/05/2016 11:35, "Richard Biener"  wrote:

>
>@@ -6332,79 +6324,81 @@ vectorizable_live_operation (gimple *stmt,
>   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>-  tree op;
>-  gimple *def_stmt;
>-  ssa_op_iter iter;
>+  imm_use_iterator imm_iter;
>+  tree lhs, lhs_type, vec_lhs;
>+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>+  int nunits = TYPE_VECTOR_SUBPARTS (vectype);
>+  int ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
>+  gimple *use_stmt;
>
>   gcc_assert (STMT_VINFO_LIVE_P (stmt_info));
>
>+  if (STMT_VINFO_TYPE (stmt_info) == reduc_vec_info_type)
>+return true;
>+
>
>This is an odd check - it says the stmt is handled by
>vectorizable_reduction.  And your
>return claims it is handled by vectorizable_live_operation ...

Previously this check was made to decide whether to call
vectorizable_live_operation, so it made sense to put this check inside the
function.

But, yes, I agree that the return value of the function no longer makes
sense.
I can revert this.

>
>You removed the SIMD lane handling?

The SIMD lane handling effectively checked for a special case, then added
code which would extract the final value of the vector.
The new code I’ve added does the exact same thing for more generic cases,
so the SIMD check can be removed and it’ll still be vectorized correctly.

>
>@@ -303,6 +335,16 @@ vect_stmt_relevant_p (gimple *stmt, loop_vec_info
>loop_vinfo,
>}
> }
>
>+  if (*live_p && *relevant == vect_unused_in_scope
>+  && !is_simple_and_all_uses_invariant (stmt, loop_vinfo))
>+{
>+  if (dump_enabled_p ())
>+   dump_printf_loc (MSG_NOTE, vect_location,
>+"vec_stmt_relevant_p: live and not all uses "
>+"invariant.\n");
>+  *relevant = vect_used_only_live;
>+}
>
>But that's a missed invariant motion / code sinking opportunity then.
>Did you have a
>testcase for this?

I don’t have a test case :(
It made sense that this was the correct action to do on the failure
(rather than assert).

>
>@@ -618,57 +660,31 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info
>loop_vinfo)
>}
>
>   /* Examine the USEs of STMT. For each USE, mark the stmt that
>defines it
>-(DEF_STMT) as relevant/irrelevant and live/dead according to the
>-liveness and relevance properties of STMT.  */
>+(DEF_STMT) as relevant/irrelevant according to the relevance
>property
>+of STMT.  */
>   stmt_vinfo = vinfo_for_stmt (stmt);
>   relevant = STMT_VINFO_RELEVANT (stmt_vinfo);
>-  live_p = STMT_VINFO_LIVE_P (stmt_vinfo);
>-
>-  /* Generally, the liveness and relevance properties of STMT are
>-propagated as is to the DEF_STMTs of its USEs:
>- live_p <-- STMT_VINFO_LIVE_P (STMT_VINFO)
>- relevant <-- STMT_VINFO_RELEVANT (STMT_VINFO)
>-
>-One exception is when STMT has been identified as defining a
>reduction
>-variable; in this case we set the liveness/relevance as follows:
>-  live_p = false
>-  relevant = vect_used_by_reduction
>-This is because we distinguish between two kinds of relevant
>stmts -
>-those that are used by a reduction computation, and those that
>are
>-(also) used by a regular computation.  This allows us later on to
>-identify stmts that are used solely by a reduction, and
>therefore the
>-order of the results that they produce does not have to be kept.
> */
>-
>-  def_type = STMT_VINFO_DEF_TYPE (stmt_vinfo);
>-  tmp_relevant = relevant;
>-  switch (def_type)
>+
>+  switch (STMT_VINFO_DEF_TYPE (stmt_vinfo))
> {
>
>you removed this comment.  Is it no longer valid?  Can you please
>instead update it?
>This is a tricky area.

I’ll replace with a new comment.

>
>
>@@ -1310,17 +1325,14 @@ vect_init_vector (gimple *stmt, tree val, tree
>type, gimple_stmt_iterator *gsi)
>In case OP is an invariant or constant, a new stmt that creates a
>vector def
>needs to be introduced.  VECTYPE may be used to specify a required
>type for
>vector invariant.  */
>-
>-tree
>-vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
>+static tree
>+vect_get_vec_def_for_operand_internal (tree op, gimple *stmt,
>+  loop_vec_info loop_vinfo, tree
>vectype)
> {
>   tree vec_oprnd;
>...
>
>+tree
>+vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
>+{
>+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
>+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
>+  return vect_get_vec_def_for_operand_internal (op, stmt, loop_vinfo,
>vectype);
>+}
>+
>+tree
>+vect_get_vec_def_for_operand_outside (tree op, loop_vec_info loop_vinfo)
>+{
>+  return vect_get_vec_def_for_operand_internal (op, NULL, loop_vinfo,
>NULL);
>+}
>
>I was trying to redu

[PATCH] Vectorize inductions that are live after the loop.

2016-05-23 Thread Alan Hayward
Vectorize inductions that are live after the loop.

Stmts which are live (i.e. defined inside a loop and then used after the
loop) are not currently supported by the vectorizer.  In many cases
vectorization can still occur because the SCEV cprop pass will hoist the
definition of the stmt outside of the loop before the vectorizer pass.
However, there are various cases SCEV cprop cannot hoist, for example:
  for (i = 0; i < n; ++i)
{
  ret = x[i];
  x[i] = i;
}
  return ret;

Currently stmts are marked live using a bool, and the relevant state using
an enum. Both these states are propagated to the definition of all uses of
the stmt. Also, a stmt can be live but not relevant.

This patch vectorizes a live stmt definition normally within the loop and
then after the loop uses BIT_FIELD_REF to extract the final scalar value
from the vector.

This patch adds a new relevant state (vect_used_only_live) for when a stmt
is used only outside the loop. The relevant state is still propagated to
all its uses, but the live bool is not (this ensures that
vectorizable_live_operation is only called with stmts that really are live).

Tested on x86 and aarch64.

gcc/
* tree-vect-loop.c (vect_analyze_loop_operations): Allow live stmts.
(vectorizable_reduction): Check for new relevant state.
(vectorizable_live_operation): Vectorize live stmts using
BIT_FIELD_REF.  Remove special case for gimple assigns stmts.
* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New function.
(vect_stmt_relevant_p): Check for stmts which are only used outside the
loop.
(process_use): Use of a stmt does not inherit its live value.
(vect_mark_stmts_to_be_vectorized): Simplify relevance inheritance.
(vect_get_vec_def_for_operand): Split body into...
(vect_get_vec_def_for_operand_internal): ...new function.
(vect_get_vec_def_for_operand_outside): New. Same as above but for
stmts outside a loop.
(vect_analyze_stmt): Check for new relevant state.
* tree-vectorizer.h (vect_relevant): New entry for a stmt which is used
outside the loop, but not inside it.

testsuite/
* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
* gcc.dg/vect/vect-live-1.c: New test.
* gcc.dg/vect/vect-live-2.c: New test.
* gcc.dg/vect/vect-live-3.c: New test.
* gcc.dg/vect/vect-live-4.c: New test.


Cheers,
Alan.



liveInductions.patch
Description: liveInductions.patch


Re: [PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage

2015-11-20 Thread Alan Hayward


On 20/11/2015 13:47, "Richard Biener"  wrote:

>On Fri, Nov 20, 2015 at 1:33 PM, Alan Hayward 
>wrote:
>>
>>
>>On 20/11/2015 11:00, "Richard Biener"  wrote:
>>
>>>On Fri, Nov 20, 2015 at 10:24 AM, Alan Hayward 
>>>wrote:
>>>>When vectorising a integer induction condition reduction,
>>>>is_nonwrapping_integer_induction ends up with different values for base
>>>>during the analysis and build phases. In the first it is an
>>>>INTEGER_CST,
>>>>in the second the loop has been vectorised out and the base is now a
>>>>variable.
>>>>
>>>>This results in the analysis and build stage detecting the
>>>>STMT_VINFO_VEC_REDUCTION_TYPE as different types.
>>>>
>>>>The easiest way to fix this is to only check for integer induction
>>>>conditions on the analysis stage.
>>>
>>>I don't like this.  For the evolution part we have added
>>>STMT_VINFO_LOOP_PHI_EVOLUTION_PART.  If you now need
>>>the original initial value as well then just save it.
>>>
>>>Or if you really want to go with the hack then please do not call
>>>is_nonwrapping_integer_induction with vec_stmt != NULL but
>>>initialize cond_expr_is_nonwrapping_integer_induction from
>>>STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
>>>
>>>The hack also lacks a comment.
>>>
>>
>>Ok. I've gone for a combination of both:
>>
>>I now cache the base in STMT_VINFO_LOOP_PHI_EVOLUTION_BASE.
>>
>>I've removed the vec_stmt != NULL checks.
>>
>>I've moved the call to is_nonwrapping_integer_induction until after the
>>vect_is_simple_reduction check. I never liked that I had
>>is_nonwrapping_integer_induction early in the function, and think this
>>looks better.
>
>It looks better but the comment for loop_phi_evolution_base is wrong.
>The value is _not_ "correct" after prologue peeling (unless you
>update it there to a non-constant expr).  It is conservatively "correct"
>for is_nonwrapping_integer_induction though.  Which is why I'd
>probably rename it to STMT_VINFO_LOOP_PHI_EVOLUTION_BASE_UNCHANGED
>to reflect this (and similar to STMT_VINFO_NITERS_UNCHANGED).
>
>Ok with that change.

Updated as requested and submitted.

Alan.



analysisonlycondcheck3.patch
Description: Binary data


Re: [PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage

2015-11-20 Thread Alan Hayward


On 20/11/2015 11:00, "Richard Biener"  wrote:

>On Fri, Nov 20, 2015 at 10:24 AM, Alan Hayward 
>wrote:
>> When vectorising a integer induction condition reduction,
>> is_nonwrapping_integer_induction ends up with different values for base
>> during the analysis and build phases. In the first it is an INTEGER_CST,
>> in the second the loop has been vectorised out and the base is now a
>> variable.
>>
>> This results in the analysis and build stage detecting the
>> STMT_VINFO_VEC_REDUCTION_TYPE as different types.
>>
>> The easiest way to fix this is to only check for integer induction
>> conditions on the analysis stage.
>
>I don't like this.  For the evolution part we have added
>STMT_VINFO_LOOP_PHI_EVOLUTION_PART.  If you now need
>the original initial value as well then just save it.
>
>Or if you really want to go with the hack then please do not call
>is_nonwrapping_integer_induction with vec_stmt != NULL but
>initialize cond_expr_is_nonwrapping_integer_induction from
>STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
>
>The hack also lacks a comment.
>

Ok. I've gone for a combination of both:

I now cache the base in STMT_VINFO_LOOP_PHI_EVOLUTION_BASE.

I've removed the vec_stmt != NULL checks.

I've moved the call to is_nonwrapping_integer_induction until after the
vect_is_simple_reduction check. I never liked that I had
is_nonwrapping_integer_induction early in the function, and think this
looks better.


Alan.



analysisonlycondcheck2.patch
Description: Binary data


[PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage

2015-11-20 Thread Alan Hayward
When vectorising an integer induction condition reduction,
is_nonwrapping_integer_induction ends up with different values for base
during the analysis and build phases. In the first it is an INTEGER_CST,
in the second the loop has been vectorised out and the base is now a
variable.

This results in the analysis and build stage detecting the
STMT_VINFO_VEC_REDUCTION_TYPE as different types.

The easiest way to fix this is to only check for integer induction
conditions on the analysis stage.


gcc/
PR tree-optimization/68413
* tree-vect-loop.c (vectorizable_reduction): Only check for
integer cond reduction on analysis stage.




Thanks,
Alan.



analysisonlycondcheck.patch
Description: Binary data


Re: [Patch] Optimize condition reductions where the result is an integer induction variable

2015-11-13 Thread Alan Hayward


On 12/11/2015 10:53, "Richard Biener"  wrote:

>On Wed, Nov 11, 2015 at 7:54 PM, Alan Hayward 
>wrote:
>>
>>
>> On 11/11/2015 13:25, "Richard Biener" 
>>wrote:
>>
>>>On Wed, Nov 11, 2015 at 1:22 PM, Alan Hayward 
>>>wrote:
>>>> Hi,
>>>> I hoped to post this in time for Monday’s cut off date, but
>>>>circumstances
>>>> delayed me until today. Hoping if possible this patch will still be
>>>>able
>>>> to go in.
>>>>
>>>>
>>>> This patch builds upon the change for PR65947, and reduces the amount
>>>>of
>>>> code produced in a vectorized condition reduction where operand 2 of
>>>>the
>>>> COND_EXPR is an assignment of a increasing integer induction variable
>>>>that
>>>> won't wrap.
>>>>
>>>>
>>>> For example (assuming all types are ints), this is a match:
>>>>
>>>> last = 5;
>>>> for (i = 0; i < N; i++)
>>>>   if (a[i] < min_v)
>>>> last = i;
>>>>
>>>> Whereas, this is not because the result is based off a memory access:
>>>> last = 5;
>>>> for (i = 0; i < N; i++)
>>>>   if (a[i] < min_v)
>>>> last = a[i];
>>>>
>>>> In the integer induction variable case we can just use a MAX reduction
>>>>and
>>>> skip all the code I added in my vectorized condition reduction patch -
>>>>the
>>>> additional induction variables in vectorizable_reduction () and the
>>>> additional checks in vect_create_epilog_for_reduction (). From the
>>>>patch
>>>> diff only, it's not immediately obvious that those parts will be
>>>>skipped
>>>> as there is no code changes in those areas.
>>>>
>>>> The initial value of the induction variable is force set to zero, as
>>>>any
>>>> other value could effect the result of the induction. At the end of
>>>>the
>>>> loop, if the result is zero, then we restore the original initial
>>>>value.
>>>
>>>+static bool
>>>+is_integer_induction (gimple *stmt, struct loop *loop)
>>>
>>>is_nonwrapping_integer_induction?
>>>
>>>+  tree lhs_max = TYPE_MAX_VALUE (TREE_TYPE (gimple_phi_result (stmt)));
>>>
>>>don't use TYPE_MAX_VALUE.
>>>
>>>+  /* Check that the induction increments.  */
>>>+  if (tree_int_cst_compare (step, size_zero_node) <= 0)
>>>+return false;
>>>
>>>tree_int_cst_sgn (step) == -1
>>>
>>>+  /* Check that the max size of the loop will not wrap.  */
>>>+
>>>+  if (! max_loop_iterations (loop, &ni))
>>>+return false;
>>>+  /* Convert backedges to iterations.  */
>>>+  ni += 1;
>>>
>>>just use max_stmt_executions (loop, &ni) which properly checks for
>>>overflow
>>>of the +1.
>>>
>>>+  max_loop_value = wi::add (wi::to_widest (base),
>>>+   wi::mul (wi::to_widest (step), ni));
>>>+
>>>+  if (wi::gtu_p (max_loop_value, wi::to_widest (lhs_max)))
>>>+return false;
>>>
>>>you miss a check for the wi::add / wi::mul to overflow.  You can use
>>>extra args to determine this.
>>>
>>>Instead of TYPE_MAX_VALUE use wi::max_value (precision, sign).
>>>
>>>I wonder if you want to skip all the overflow checks for
>>>TYPE_OVERFLOW_UNDEFINED
>>>IV types?
>>>
>>
>> Ok with all the above.
>>
>> Tried using max_value () but this gave me a wide_int instead of a
>> widest_int.
>> Instead I've replaced with min_precision and GET_MODE_BITSIZE.
>>
>> Added an extra check for when the type is TYPE_OVERFLOW_UNDEFINED.
>
>+ /* Set the loop-entry arg of the reduction-phi.  */
>+
>+ if (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
>+   == INTEGER_INDUC_COND_REDUCTION)
>
>extra vertical space
>
>+ tree zero = build_int_cst ( TREE_TYPE (vec_init_def_type),
>0);
>+ tree zero_vec = build_vector_from_val (vec_init_def_type,
>zero);
>+
>
>build_zero_cst (vec_init_def_type);
>
>+ else
>+   {
>+ add_phi_arg (as_a  (phi), vec_init_def,
>   loop_preheader_edge (loop), UNKNOWN_LOCATION);
>+   }
>
>no {}s around single stmts
>
>+

Re: [Patch] Optimize condition reductions where the result is an integer induction variable

2015-11-11 Thread Alan Hayward


On 11/11/2015 13:25, "Richard Biener"  wrote:

>On Wed, Nov 11, 2015 at 1:22 PM, Alan Hayward 
>wrote:
>> Hi,
>> I hoped to post this in time for Monday’s cut off date, but
>>circumstances
>> delayed me until today. Hoping if possible this patch will still be able
>> to go in.
>>
>>
>> This patch builds upon the change for PR65947, and reduces the amount of
>> code produced in a vectorized condition reduction where operand 2 of the
>> COND_EXPR is an assignment of a increasing integer induction variable
>>that
>> won't wrap.
>>
>>
>> For example (assuming all types are ints), this is a match:
>>
>> last = 5;
>> for (i = 0; i < N; i++)
>>   if (a[i] < min_v)
>> last = i;
>>
>> Whereas, this is not because the result is based off a memory access:
>> last = 5;
>> for (i = 0; i < N; i++)
>>   if (a[i] < min_v)
>> last = a[i];
>>
>> In the integer induction variable case we can just use a MAX reduction
>>and
>> skip all the code I added in my vectorized condition reduction patch -
>>the
>> additional induction variables in vectorizable_reduction () and the
>> additional checks in vect_create_epilog_for_reduction (). From the patch
>> diff only, it's not immediately obvious that those parts will be skipped
>> as there is no code changes in those areas.
>>
>> The initial value of the induction variable is force set to zero, as any
>> other value could effect the result of the induction. At the end of the
>> loop, if the result is zero, then we restore the original initial value.
>
>+static bool
>+is_integer_induction (gimple *stmt, struct loop *loop)
>
>is_nonwrapping_integer_induction?
>
>+  tree lhs_max = TYPE_MAX_VALUE (TREE_TYPE (gimple_phi_result (stmt)));
>
>don't use TYPE_MAX_VALUE.
>
>+  /* Check that the induction increments.  */
>+  if (tree_int_cst_compare (step, size_zero_node) <= 0)
>+return false;
>
>tree_int_cst_sgn (step) == -1
>
>+  /* Check that the max size of the loop will not wrap.  */
>+
>+  if (! max_loop_iterations (loop, &ni))
>+return false;
>+  /* Convert backedges to iterations.  */
>+  ni += 1;
>
>just use max_stmt_executions (loop, &ni) which properly checks for
>overflow
>of the +1.
>
>+  max_loop_value = wi::add (wi::to_widest (base),
>+   wi::mul (wi::to_widest (step), ni));
>+
>+  if (wi::gtu_p (max_loop_value, wi::to_widest (lhs_max)))
>+return false;
>
>you miss a check for the wi::add / wi::mul to overflow.  You can use
>extra args to determine this.
>
>Instead of TYPE_MAX_VALUE use wi::max_value (precision, sign).
>
>I wonder if you want to skip all the overflow checks for
>TYPE_OVERFLOW_UNDEFINED
>IV types?
>

Ok with all the above.

Tried using max_value () but this gave me a wide_int instead of a
widest_int.
Instead I've replaced with min_precision and GET_MODE_BITSIZE.

Added an extra check for when the type is TYPE_OVERFLOW_UNDEFINED.



Alan.



optimizeConditionReductions2.patch
Description: Binary data


[Patch] Optimize condition reductions where the result is an integer induction variable

2015-11-11 Thread Alan Hayward
Hi,
I hoped to post this in time for Monday’s cut off date, but circumstances
delayed me until today. Hoping if possible this patch will still be able
to go in.


This patch builds upon the change for PR65947, and reduces the amount of
code produced in a vectorized condition reduction where operand 2 of the
COND_EXPR is an assignment of an increasing integer induction variable that
won't wrap.
 

For example (assuming all types are ints), this is a match:

last = 5;
for (i = 0; i < N; i++)
  if (a[i] < min_v)
last = i;

Whereas, this is not because the result is based off a memory access:
last = 5;
for (i = 0; i < N; i++)
  if (a[i] < min_v)
last = a[i];

In the integer induction variable case we can just use a MAX reduction and
skip all the code I added in my vectorized condition reduction patch - the
additional induction variables in vectorizable_reduction () and the
additional checks in vect_create_epilog_for_reduction (). From the patch
diff only, it's not immediately obvious that those parts will be skipped
as there is no code changes in those areas.

The initial value of the induction variable is forced to zero, as any
other value could affect the result of the induction. At the end of the
loop, if the result is zero, then we restore the original initial value.




Cheers,
Alan.



optimizeConditionReductions.patch
Description: Binary data


Re: [Patch] Avoid is_simple_use bug in vectorizable_live_operation

2015-10-27 Thread Alan Hayward


On 27/10/2015 11:36, "Richard Biener"  wrote:

>On Mon, Oct 26, 2015 at 6:15 PM, Alan Hayward 
>wrote:
>>
>>
>> On 26/10/2015 13:35, "Richard Biener" 
>>wrote:
>>
>>>On Mon, Oct 26, 2015 at 1:33 PM, Alan Hayward 
>>>wrote:
>>>> There is a potential bug in vectorizable_live_operation.
>>>>
>>>> Consider the case where the first op for stmt is valid, but the second
>>>>is
>>>> null.
>>>> The first time through the for () loop, it will call out to
>>>> vect_is_simple_use () which will set dt.
>>>> The second time, because op is null, vect_is_simple_use () will not be
>>>> called.
>>>> However, dt is still set to a valid value, therefore the loop will
>>>> continue.
>>>>
>>>> Note this is different from the case where the first op is null, which
>>>> will cause the loop not call vect_is_simple_use () and then return
>>>>false.
>>>>
>>>> It is possible that this was intentional, however that is not clear
>>>>from
>>>> the code.
>>>>
>>>> The fix is to simply ensure dt is initialized to a default value on
>>>>each
>>>> iteration.
>>>
>>>I think the patch is a strict improvement, thus OK.  Still a NULL
>>>operand
>>>is not possible in GIMPLE so the op && check is not necessary.  The way
>>>it iterates over all stmt uses is a bit scary anyway.  As it is ok with
>>>all invariants it should probably simply do sth like
>>>
>>>   FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
>>> if (!vect_is_simple_use (op, ))
>>>
>>>and be done with that.  Unvisited uses can only be constants (ok).
>>>
>>>Care to rework the funtion like that if you are here?
>>>
>>
>> Ok, I’ve updated as requested.
>
>Ok.  Please remove
>
>  code = gimple_assign_rhs_code (stmt);
>  op_type = TREE_CODE_LENGTH (code);
>  rhs_class = get_gimple_rhs_class (code);
>  gcc_assert (rhs_class != GIMPLE_UNARY_RHS || op_type == unary_op);
>  gcc_assert (rhs_class != GIMPLE_BINARY_RHS || op_type == binary_op);
>
>and associated variables as I belive otherwise a --disable-checking build
>will fail with set-but-not-used warnings.  And the asserts are quite
>pointless
>given the now sanitized loop.
>

I did consider removing those asserts, but didn’t want to remove any
important checks. Didn’t think about the disable-checking build.
Anyway, new patch attached.


Cheers,
Alan.



avoid_issimple3.patch
Description: Binary data


Re: [Patch] Avoid is_simple_use bug in vectorizable_live_operation

2015-10-26 Thread Alan Hayward


On 26/10/2015 13:35, "Richard Biener"  wrote:

>On Mon, Oct 26, 2015 at 1:33 PM, Alan Hayward 
>wrote:
>> There is a potential bug in vectorizable_live_operation.
>>
>> Consider the case where the first op for stmt is valid, but the second
>>is
>> null.
>> The first time through the for () loop, it will call out to
>> vect_is_simple_use () which will set dt.
>> The second time, because op is null, vect_is_simple_use () will not be
>> called.
>> However, dt is still set to a valid value, therefore the loop will
>> continue.
>>
>> Note this is different from the case where the first op is null, which
>> will cause the loop not call vect_is_simple_use () and then return
>>false.
>>
>> It is possible that this was intentional, however that is not clear from
>> the code.
>>
>> The fix is to simply ensure dt is initialized to a default value on each
>> iteration.
>
>I think the patch is a strict improvement, thus OK.  Still a NULL operand
>is not possible in GIMPLE so the op && check is not necessary.  The way
>it iterates over all stmt uses is a bit scary anyway.  As it is ok with
>all invariants it should probably simply do sth like
>
>   FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE)
> if (!vect_is_simple_use (op, ))
>
>and be done with that.  Unvisited uses can only be constants (ok).
>
>Care to rework the funtion like that if you are here?
>

Ok, I’ve updated as requested.


Cheers,
Alan.



avoid_issimple2.patch
Description: Binary data


[Patch] Avoid is_simple_use bug in vectorizable_live_operation

2015-10-26 Thread Alan Hayward
There is a potential bug in vectorizable_live_operation.

Consider the case where the first op for stmt is valid, but the second is
null.
The first time through the for () loop, it will call out to
vect_is_simple_use () which will set dt.
The second time, because op is null, vect_is_simple_use () will not be
called.
However, dt is still set to a valid value, therefore the loop will
continue.

Note this is different from the case where the first op is null, which
will cause the loop not call vect_is_simple_use () and then return false.

It is possible that this was intentional, however that is not clear from
the code.

The fix is to simply ensure dt is initialized to a default value on each
iteration.
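
As a generic, self-contained illustration of that pattern (hypothetical
names, nothing to do with the actual GCC sources), the status variable
keeps its stale value whenever the call that would set it is skipped:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for vect_is_simple_use: classifies an operand via *dt.  */
static bool
classify (const char *op, int *dt)
{
  *dt = (int) op[0];
  return true;
}

int
main (void)
{
  const char *ops[2] = { "a", NULL };   /* second operand is null */
  int dt = -1;                          /* declared once, outside the loop */

  for (int i = 0; i < 2; i++)
    {
      if (ops[i] && !classify (ops[i], &dt))
        return 1;
      /* For i == 1 the call is skipped, yet dt still holds the value from
         i == 0, so the loop carries on as if the operand were valid.  */
      printf ("operand %d: dt = %d\n", i, dt);
    }
  return 0;
}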


2015-09-07  Alan Hayward  alan.hayw...@arm.com
* tree-vect-loop.c (vectorizable_live_operation): Localize variable.


Cheers,
Alan



avoid_issimple.patch
Description: Binary data


[PATCH] Fix VEC_COND_EXPR types when vectorizing conditional expressions

2015-10-26 Thread Alan Hayward
VEC_COND_EXPRs have been changed to use boolean vectors for the first
argument.
This changes the vectorization of conditional expressions to use this new
format, fixing the compiler failures for the 65947-*.c tests.


2015-10-26  Alan Hayward 

gcc/
PR tree-optimization/65947
* tree-vect-loop.c (vect_create_epilog_for_reduction):
Fix VEC_COND_EXPR types.


Cheers,
Alan.





fix_veccondexprtypes.patch
Description: Binary data


[COMMITTED] Add myself to MAINTAINERS (Write After Approval)

2015-10-23 Thread Alan Hayward
Hi All,

I've just added myself to Write After Approval maintainers.
Committed r229216


Cheers,
Alan.

Index: ChangeLog
===
--- ChangeLog   (revision 229215)
+++ ChangeLog   (revision 229216)
@@ -1,3 +1,7 @@
+2015-10-23  Alan Hayward 
+
+   * MAINTAINERS  (Write After Approval): Add myself.
+
 2015-10-20  H.J. Lu  

Sync with binutils-gdb:
Index: MAINTAINERS
===
--- MAINTAINERS (revision 229215)
+++ MAINTAINERS (revision 229216)
@@ -419,6 +419,7 @@
 Stuart Hastings
 Michael Haubenwallner  

 Pat Haugen     
+Alan Hayward   
 Mark Heffernan 
 George Helffrich   
 Daniel Hellstrom   




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-22 Thread Alan Hayward


On 22/10/2015 15:15, "Alan Lawrence"  wrote:

>Just one very small point...
>
>On 19/10/15 09:17, Alan Hayward wrote:
>
> > -  if (check_reduction
> > -  && (!commutative_tree_code (code) || !associative_tree_code
>(code)))
> > +  if (check_reduction)
> >  {
> > -  if (dump_enabled_p ())
> > -report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> > -   "reduction: not commutative/associative: ");
> > -  return NULL;
> > +  if (code != COND_EXPR
> > + && (!commutative_tree_code (code) || !associative_tree_code
>(code)))
> > +   {
> > + if (dump_enabled_p ())
> > +   report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> > +   "reduction: not commutative/associative: ");
> > + return NULL;
> > +   }
> > +
> > +  if (code == COND_EXPR)
> > +   *v_reduc_type = COND_REDUCTION;
>
>Wouldn't this be easier written as
>
>if (code == COND_EXPR)
>   *v_reduc_type = COND_REDUCTION;
>else if (!commutative_tree_code (code) || !associative_tree_code (code))
>   {...}
>
>? Your call!
>
>Cheers, Alan


Good spot! I suspect that’s slipped through when I’ve rewritten bits.
I’ll add this in with Richard’s suggestion of the change around the call
to vectorizable_condition.


Alan.




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-01 Thread Alan Hayward


On 30/09/2015 13:45, "Richard Biener"  wrote:

>On Wed, Sep 23, 2015 at 5:51 PM, Alan Hayward 
>wrote:
>>
>>
>> On 18/09/2015 14:53, "Alan Hayward"  wrote:
>>
>>>
>>>
>>>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>>>
>>>>On 18/09/15 13:17, Richard Biener wrote:
>>>>>
>>>>> Ok, I see.
>>>>>
>>>>> That this case is already vectorized is because it implements
>>>>>MAX_EXPR,
>>>>> modifying it slightly to
>>>>>
>>>>> int foo (int *a)
>>>>> {
>>>>>int val = 0;
>>>>>for (int i = 0; i < 1024; ++i)
>>>>>  if (a[i] > val)
>>>>>val = a[i] + 1;
>>>>>return val;
>>>>> }
>>>>>
>>>>> makes it no longer handled by current code.
>>>>>
>>>>
>>>>Yes. I believe the idea for the patch is to handle arbitrary
>>>>expressions
>>>>like
>>>>
>>>>int foo (int *a)
>>>>{
>>>>int val = 0;
>>>>for (int i = 0; i < 1024; ++i)
>>>>  if (some_expression (i))
>>>>val = another_expression (i);
>>>>return val;
>>>>}
>>>
>>>Yes, that’s correct. Hopefully my new test cases should cover
>>>everything.
>>>
>>
>> Attached is a new version of the patch containing all the changes
>> requested by Richard.
>
>+  /* Compare the max index vector to the vector of found indexes to
>find
>+the postion of the max value.  This will result in either a
>single
>+match or all of the values.  */
>+  tree vec_compare = make_ssa_name (index_vec_type_signed);
>+  gimple vec_compare_stmt = gimple_build_assign (vec_compare,
>EQ_EXPR,
>+induction_index,
>+max_index_vec);
>
>I'm not sure all targets can handle this.  If I deciper the code
>correctly then we do
>
>  mask = induction_index == max_index_vec;
>  vec_and = mask & vec_data;
>
>plus some casts.  So this is basically
>
>  vec_and = induction_index == max_index_vec ? vec_data : {0, 0, ... };
>
>without the need to relate the induction index vector type to the data
>vector type.
>I believe this is also the form all targets support.


Ok, I’ll replace this.

>
>I am missing a comment before all this code-generation that shows the
>transform
>result with the variable names used in the code-gen.  I have a hard
>time connecting
>things here.

Ok, I’ll add some comments.

>
>+  tree matched_data_reduc_cast = build1 (VIEW_CONVERT_EXPR,
>scalar_type,
>+matched_data_reduc);
>+  epilog_stmt = gimple_build_assign (new_scalar_dest,
>+matched_data_reduc_cast);
>+  new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
>+  gimple_assign_set_lhs (epilog_stmt, new_temp);
>
>this will leave the stmt unsimplified.  scalar sign-changes should use
>NOP_EXPR,
>not VIEW_CONVERT_EXPR.  The easiest fix is to use fold_convert instead.
>Also just do like before - first make_ssa_name and then directly use it
>in the
>gimple_build_assign.

We need the VIEW_CONVERT_EXPR for the cases where we have float data
values. The index is always integer.


>
>The patch is somewhat hard to parse with all the indentation changes.  A
>context
>diff would be much easier to read in those contexts.

Ok, I’ll make the next patch like that

>
>+  if (v_reduc_type == COND_REDUCTION)
>+{
>+  widest_int ni;
>+
>+  if (! max_loop_iterations (loop, &ni))
>+   {
>+ if (dump_enabled_p ())
>+   dump_printf_loc (MSG_NOTE, vect_location,
>+"loop count not known, cannot create cond "
>+"reduction.\n");
>
>ugh.  That's bad.
>
>+  /* The additional index will be the same type as the condition.
>Check
>+that the loop can fit into this less one (because we'll use up
>the
>+zero slot for when there are no matches).  */
>+  tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type);
>+  if (wi::geu_p (ni, wi::to_widest (max_index)))
>+   {
>+ if (dump_enabled_p ())
>+   dump_printf_loc (MSG_NOTE, vect_location,
>+"loop size is greater than data size.\n");
>+

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-23 Thread Alan Hayward


On 18/09/2015 14:53, "Alan Hayward"  wrote:

>
>
>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>
>>On 18/09/15 13:17, Richard Biener wrote:
>>>
>>> Ok, I see.
>>>
>>> That this case is already vectorized is because it implements MAX_EXPR,
>>> modifying it slightly to
>>>
>>> int foo (int *a)
>>> {
>>>int val = 0;
>>>for (int i = 0; i < 1024; ++i)
>>>  if (a[i] > val)
>>>val = a[i] + 1;
>>>return val;
>>> }
>>>
>>> makes it no longer handled by current code.
>>>
>>
>>Yes. I believe the idea for the patch is to handle arbitrary expressions
>>like
>>
>>int foo (int *a)
>>{
>>int val = 0;
>>for (int i = 0; i < 1024; ++i)
>>  if (some_expression (i))
>>val = another_expression (i);
>>return val;
>>}
>
>Yes, that’s correct. Hopefully my new test cases should cover everything.
>

Attached is a new version of the patch containing all the changes
requested by Richard.


Thanks,
Alan.




0001-Support-for-vectorizing-conditional-expressions.patch
Description: Binary data


Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-18 Thread Alan Hayward

On 18/09/2015 14:26, "Alan Lawrence"  wrote:

>On 18/09/15 13:17, Richard Biener wrote:
>>
>> Ok, I see.
>>
>> That this case is already vectorized is because it implements MAX_EXPR,
>> modifying it slightly to
>>
>> int foo (int *a)
>> {
>>int val = 0;
>>for (int i = 0; i < 1024; ++i)
>>  if (a[i] > val)
>>val = a[i] + 1;
>>return val;
>> }
>>
>> makes it no longer handled by current code.
>>
>
>Yes. I believe the idea for the patch is to handle arbitrary expressions
>like
>
>int foo (int *a)
>{
>int val = 0;
>for (int i = 0; i < 1024; ++i)
>  if (some_expression (i))
>val = another_expression (i);
>return val;
>}


Yes, that’s correct. Hopefully my new test cases should cover everything.


Alan.




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-15 Thread Alan Hayward


On 15/09/2015 13:09, "Richard Biener"  wrote:

>
>Now comments on the patch itself.
>
>+  if (code == COND_EXPR)
>+   *v_reduc_type = COND_REDUCTION;
>
>so why not simply use COND_EXPR instead of the new v_reduc_type?

v_reduc_type is also dependent on check_reduction (which comes from
!nested_cycle in vectorizable_reduction).
It seemed messy to keep on checking for both those things throughout.

In my patch to catch simpler condition reductions, I’ll be adding another
value to this enum too. v_reduc_type will be set to this new value based
on the same properties for COND_REDUCTION plus some additional constraints.

>
>+  if (check_reduction && code != COND_EXPR &&
>+  vect_is_slp_reduction (loop_info, phi, def_stmt))
>
>&&s go to the next line

ok

>
>+ /* Reduction of the max index and a reduction of the found
>+values.  */
>+ epilogue_cost += add_stmt_cost (target_cost_data, 1,
>+ vec_to_scalar, stmt_info, 0,
>+ vect_epilogue);
>
>vec_to_scalar once isn't what the comment suggests.  Instead the
>comment suggests twice what a regular reduction would do
>but I guess we can "hide" the vec_to_scalar cost and "merge" it
>with the broadcast.  Thus make the above two vector_stmt costs?
>
>+ /* A broadcast of the max value.  */
>+ epilogue_cost += add_stmt_cost (target_cost_data, 2,
>+ scalar_to_vec, stmt_info, 0,
>+ vect_epilogue);
>
>comment suggests a single broadcast.

I’ve made a copy/paste error here. Just need to swap the 1 and the 2.


>
>@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi)
>  the final vector of induction results:  */
>   exit_phi = NULL;
>   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg)
>-{
>+   {
>  gimple use_stmt = USE_STMT (use_p);
>  if (is_gimple_debug (use_stmt))
>continue;
>
>please avoid unrelated whitespace changes.

Ok. I was changing “8 spaces” to a tab, but happy to revert.

>
>+  case COND_EXPR:
>+   if (v_reduc_type == COND_REDUCTION)
>+ {
>...
>+   /* Fall through.  */
>+
>   case MIN_EXPR:
>   case MAX_EXPR:
>-  case COND_EXPR:
>
>aww, so we could already handle COND_EXPR reductions?  How do they
>differ from what you add?  Did you see if that path is even exercised
>today?

Today, COND_EXPRs are only supported when they are nested inside a loop.
See the vect-cond-*.c tests.
For example:

for (j = 0; j < M; j++)
{
  x = x_in[j];
  curr_a = a[0];

for (i = 0; i < N; i++)
  {
next_a = a[i+1];
curr_a = x > c[i] ? curr_a : next_a;
  }
  x_out[j] = curr_a;
}


In that case, only the outer loop is vectorised.

>
>+   /* Create a vector of {init_value, 0, 0, 0...}.  */
>+   vec *v;
>+   vec_alloc (v, nunits);
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val);
>+   if (SCALAR_FLOAT_TYPE_P (scalar_type))
>+ for (i = 1; i < nunits; ++i)
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>+   build_real (scalar_type,
>dconst0));
>+   else
>+ for (i = 1; i < nunits; ++i)
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>+   build_int_cst (scalar_type, 0));
>+   init_def = build_constructor (vectype, v);
>
>you can unify the float/int case by using build_zero_cst (scalar_type).
>Note that you should build a vector constant instead of a constructor
>if init_val is a constant.  The convenient way is to build the vector
>elements into a tree[] array and use build_vector_stat in that case.

Ok, will switch to build_zero_cst.
Also, I will switch my vector to  {init_value, init_value, init_value…}.
I had {init_value, 0, 0, 0…} because I was going have the option of
ADD_REDUC_EXPR,
But that got removed along the way.

>
>+  /* Find maximum value from the vector of found indexes.  */
>+  tree max_index = make_temp_ssa_name (index_scalar_type, NULL, "");
>
>just use make_ssa_name (index_scalar_type);

Ok

>
>+  /* Convert the vector of data to the same type as the EQ.  */
>+  tree vec_data_cast;
>+  if ( TYPE_UNSIGNED (index_vec_type))
>+   {
>
>How come it never happens the element
>sizes do not match?  (int index type and double data type?)

This was a little unclear.
The induction index is originally created as an unsigned version of the
same type as the data vector.
(see the definition of cr_index_vector_type in vectorizable_reduction(),
which is then used to create cond_name)

I will remove the if and replace with a gcc_checking_assert(TYPE_UNSIGNED
(index_vec_type))


>
>+  /* Where the max index occured, use the value from the data
>vector.  */
>+  tree vec_and = make_temp_ssa_name (index_vec_type_signed, NUL

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Alan Hayward
Hi Bill,

I’d be a bit worried about asking the backend for the cost of a
COND_REDUCTION, as that will rely on the backend understanding the
implementation the vectorizer is using - every time the vectorizer
changed, the backends would need to be updated too. I’m hoping soon to get
together a patch to reduce the stmts produced on the simpler cases, which
would require a different set of costings. I can also imagine further
improvements being added for other special cases over time. Having the
backends understand every variation would be a little cumbersome.

As it stands today, we correctly exit the optimisation if max reduction
isn’t supported in hardware, which is what the cost model is expecting.


If power wanted to use this implementation, then I think it’d probably
need some code in tree-vect-generic.c to implement an emulated max
reduction, which would then require updates to the cost modelling of
anything that uses max reduction (not just cond reduction). All of that is
outside the scope of this patch.


Thanks,
Alan.

On 10/09/2015 23:14, "Bill Schmidt"  wrote:

>Hi Alan,
>
>The cost modeling of the epilogue code seems pretty target-specific ("An
>EQ stmt and an AND stmt, reduction of the max index and a reduction of
>the found values, a broadcast of the max value," resulting in two
>vector_stmts, one vec_to_scalar, and two scalar_to_vecs).  On powerpc,
>this will not represent the cost accurately, and the cost will indeed be
>quite different depending on the mode (logarithmic in the number of
>elements).  I think that you need to create a new entry in
>vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
>each target to calculate the cost appropriately.
>
>(Powerpc doesn't have a max-reduction hardware instruction, but because
>the reduction will be only in the epilogue code, it may still be
>profitable for us to generate the somewhat expensive reduction sequence
>in order to vectorize the loop.  But we definitely want to model it as
>costly in and of itself.  Also, the sequence will produce the maximum
>value in all positions without a separate broadcast.)
>
>Thanks,
>Bill
>
>On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
>> Hi,
>> This patch (attached) adds support for vectorizing conditional
>>expressions
>> (PR 65947), for example:
>> 
>> int condition_reduction (int *a, int min_v)
>> {
>>   int last = 0;
>>   for (int i = 0; i < N; i++)
>> if (a[i] < min_v)
>>   last = a[i];
>>   return last;
>> }
>> 
>> To do this the loop is vectorised to create a vector of data results (ie
>> of matching a[i] values). Using an induction variable, an additional
>> vector is added containing the indexes where the matches occured. In the
>> function epilogue this is reduced to a single max value and then used to
>> index into the vector of data results.
>> When no values are matched in the loop, the indexes vector will contain
>> all zeroes, eventually matching the first entry in the data results
>>vector.
>> 
>> To vectorize sucessfully, support is required for REDUC_MAX_EXPR. This
>>is
>> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that
>> REDUC_MAX_EXPR is not supported for the required modes, failing the
>> vectorization. On mips it complains that the required vcond expression
>>is
>> not supported. It is suggested the relevant backend experts add the
>> required backend support.
>> 
>> Using a simple testcase based around a large number of N and run on an
>> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of
>> it's original time.
>> 
>> This patch caused binary differences in three spec2006 binaries on
>>aarch64
>> - 4.16.gamess, 435.gromacs and 456.hmmer. Running them on a juno board
>> showed no improvement or degregation in runtime.
>> 
>> 
>> In the near future I hope to submit a further patch (as PR 66558) which
>> optimises the case where the result is simply the index of the loop, for
>> example:
>> int condition_reduction (int *a, int min_v)
>> {
>>   int last = 0;
>>   for (int i = 0; i < N; i++)
>> if (a[i] < min_v)
>>   last = i;
>>   return last;
>> }
>> In this case a lot of the new code can be optimized away.
>> 
>> I have run check for aarch64, arm and x86 and have seen no regressions.
>> 
>> 
>> Changelog:
>> 
>> 2015-08-28  Alan Hayward 
>> 
>> PR tree-optimization/65947
>> * tree-vect-loop.c
>> (vect_is_simple_reduction_1): Find condition reductions.
>> (vec

[PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-10 Thread Alan Hayward
Hi,
This patch (attached) adds support for vectorizing conditional expressions
(PR 65947), for example:

int condition_reduction (int *a, int min_v)
{
  int last = 0;
  for (int i = 0; i < N; i++)
if (a[i] < min_v)
  last = a[i];
  return last;
}

To do this the loop is vectorised to create a vector of data results (i.e.
of matching a[i] values). Using an induction variable, an additional
vector is added containing the indexes where the matches occurred. In the
function epilogue this is reduced to a single max value and then used to
index into the vector of data results.
When no values are matched in the loop, the indexes vector will contain
all zeroes, eventually matching the first entry in the data results vector.
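
To make the shape of the transformation concrete, here is a purely
illustrative scalar model in C. The four-lane width and the "index plus
one" encoding are assumptions of this sketch, and the epilogue below
locates the winning lane directly rather than via the compare/broadcast
sequence the patch actually generates:

#define N 32

int
condition_reduction_model (int *a, int min_v)
{
  int data[4] = { 0, 0, 0, 0 };  /* per-lane last matching a[i] value  */
  int idx[4] = { 0, 0, 0, 0 };   /* per-lane index of that match, plus 1  */

  /* Loop body: one "vector" iteration handles four elements.  */
  for (int i = 0; i < N; i += 4)
    for (int lane = 0; lane < 4; lane++)
      if (a[i + lane] < min_v)
        {
          data[lane] = a[i + lane];
          idx[lane] = i + lane + 1;  /* zero is reserved for "no match"  */
        }

  /* Epilogue: REDUC_MAX_EXPR over the index vector picks the lane that
     saw the latest match...  */
  int max_idx = 0, max_lane = 0;
  for (int lane = 0; lane < 4; lane++)
    if (idx[lane] > max_idx)
      {
        max_idx = idx[lane];
        max_lane = lane;
      }

  /* ...and that lane's data value is the result.  If nothing matched,
     every index is zero and we fall back to lane 0, whose data slot
     still holds the initial value of last (0).  */
  return data[max_lane];
}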

To vectorize successfully, support is required for REDUC_MAX_EXPR. This is
supported by aarch64 and arm. On x86 and powerpc, gcc will complain that
REDUC_MAX_EXPR is not supported for the required modes, failing the
vectorization. On mips it complains that the required vcond expression is
not supported. It is suggested the relevant backend experts add the
required backend support.

Using a simple testcase based around a large value of N and run on an
aarch64 Juno board, with the patch in use, the runtime reduced to 0.8 of
its original time.

This patch caused binary differences in three spec2006 binaries on aarch64
- 416.gamess, 435.gromacs and 456.hmmer. Running them on a Juno board
showed no improvement or degradation in runtime.


In the near future I hope to submit a further patch (as PR 66558) which
optimises the case where the result is simply the index of the loop, for
example:
int condition_reduction (int *a, int min_v)
{
  int last = 0;
  for (int i = 0; i < N; i++)
if (a[i] < min_v)
  last = i;
  return last;
}
In this case a lot of the new code can be optimized away.

I have run check for aarch64, arm and x86 and have seen no regressions.


Changelog:

    2015-08-28  Alan Hayward 

PR tree-optimization/65947
* tree-vect-loop.c
(vect_is_simple_reduction_1): Find condition reductions.
(vect_model_reduction_cost): Add condition reduction costs.
(get_initial_def_for_reduction): Add condition reduction initial
var.
(vect_create_epilog_for_reduction): Add condition reduction epilog.
(vectorizable_reduction): Condition reduction support.
* tree-vect-stmts.c
(vectorizable_condition): Add vect reduction arg
* doc/sourcebuild.texi (Vector-specific attributes): Document
vect_max_reduc

testsuite/Changelog:

PR tree-optimization/65947
* lib/target-supports.exp
(check_effective_target_vect_max_reduc): Add.
* gcc.dg/vect/pr65947-1.c: New test.
* gcc.dg/vect/pr65947-2.c: New test.
* gcc.dg/vect/pr65947-3.c: New test.
* gcc.dg/vect/pr65947-4.c: New test.
* gcc.dg/vect/pr65947-5.c: New test.
* gcc.dg/vect/pr65947-6.c: New test.
* gcc.dg/vect/pr65947-7.c: New test.
* gcc.dg/vect/pr65947-8.c: New test.
* gcc.dg/vect/pr65947-9.c: New test.
* gcc.dg/vect/pr65947-10.c: New test.
* gcc.dg/vect/pr65947-11.c: New test.



Thanks,
Alan




0001-Support-for-vectorizing-conditional-expressions.patch
Description: Binary data


[PATCH][rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1

2014-12-12 Thread Alan Hayward
[Cleaning this thread up to submit patch again, with better explanation]

This patch causes subreg_get_info() to exit early in the simple cases
where we are extracting whole registers from a multi-register value.

In aarch64 for Big Endian we were producing a subreg of an OImode (256 bits)
from a CImode (384 bits). This would hit the following assert in
subreg_get_info:

gcc_assert ((GET_MODE_SIZE (xmode) % GET_MODE_SIZE (ymode)) == 0);


This is a rule we should be able to relax a little - if the subreg we want
covers a whole number of registers then it is a valid result and can easily
be detected earlier in the function.

This has the bonus that we should slightly reduce the execution time for
more common cases, for example a subreg of 64 bits from 256 bits.
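
A worked example of the arithmetic involved (the byte offset is just an
assumed value for illustration; aarch64 Q registers are 16 bytes):

/* Taking an OImode (32-byte, 2-register) subreg of a CImode (48-byte,
   3-register) value at byte offset 16.  Previously this tripped the
   assert above, since 48 % 32 != 0.  With the early exit:

     regsize_xmode = 48 / 3 = 16
     regsize_ymode = 32 / 2 = 16
     offset % regsize_ymode = 16 % 16 = 0   -> early exit taken

     info->representable_p = true
     info->nregs  = nregs_ymode = 2
     info->offset = 16 / 16 = 1              (start one hard register in)

   and the added assert holds: 1 + 2 <= nregs_xmode = 3.  */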

This patch is required for the second part of the patch, which is aarch64
specific, and fixes up aarch64 Big Endian movoi/ci/xi. This second part
has already been approved.

This patch will apply cleanly by itself and no regressions were seen when
testing aarch64 and x86_64 on make check.

Cheers,
Alan


Changelog:

2014-11-14  Alan Hayward  

* rtlanal.c
(subreg_get_info): Exit early for simple and common cases



---
gcc/rtlanal.c | 13 +
1 file changed, 13 insertions(+)

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index c9bf69c..a3f7b78 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -3561,6 +3561,19 @@ subreg_get_info (unsigned int xregno, machine_mode xmode,
  info->offset = offset / regsize_xmode;
  return;
}
+  /* Quick exit for the simple and common case of extracting whole
+     subregisters from a multiregister value.  */
+  if (!rknown
+      && WORDS_BIG_ENDIAN == REG_WORDS_BIG_ENDIAN
+      && regsize_xmode == regsize_ymode
+      && (offset % regsize_ymode) == 0)
+    {
+      info->representable_p = true;
+      info->nregs = nregs_ymode;
+      info->offset = offset / regsize_ymode;
+      gcc_assert (info->offset + info->nregs <= nregs_xmode);
+      return;
+    }
 }
   /* Lowpart subregs are otherwise valid.  */
--
1.9.1



0001-BE-fix-load-stores.-Common-code.patch
Description: Binary data


Re: [PATCH][rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1

2014-12-10 Thread Alan Hayward

On 02/12/2014 12:36, "Alan Hayward"  wrote:

>
>On 21/11/2014 14:08, "Alan Hayward"  wrote:
>
>>
>>On 14/11/2014 16:48, "Alan Hayward"  wrote:
>>
>>>This is a new version of my BE patch from a few weeks ago.
>>>This is part 1 and covers rtlanal.c. The second part will be aarch64
>>>specific.
>>>
>>>When combined with the second patch, it fixes up movoi/ci/xi for Big
>>>Endian, so that the lsb of a big-endian integer ends up in the low byte
>>>of the highest-numbered register.
>>>
>>>This will apply cleanly by itself and no regressions were seen when
>>>testing aarch64 and x86_64 on make check.
>>>
>>>
>>>Changelog:
>>>
>>>2014-11-14  Alan Hayward  
>>>
>>>* rtlanal.c
>>>(subreg_get_info): Exit early for simple and common cases
>>>
>>>
>>>Alan.
>>
>>Hi,
>>
>>The second part to this patch (aarch64 specific) has been approved.
>>
>>
>>Could someone review this one please.
>>
>>
>>Thanks,
>>Alan.
>
>
>Ping.
>
>
>Thanks,
>Alan.


Ping ping.

Thanks,
Alan.


0001-BE-fix-load-stores.-Common-code.patch
Description: Binary data


Re: [rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1

2014-12-02 Thread Alan Hayward

On 21/11/2014 14:08, "Alan Hayward"  wrote:

>
>On 14/11/2014 16:48, "Alan Hayward"  wrote:
>
>>This is a new version of my BE patch from a few weeks ago.
>>This is part 1 and covers rtlanal.c. The second part will be aarch64
>>specific.
>>
>>When combined with the second patch, it fixes up movoi/ci/xi for Big
>>Endian, so that the lsb of a big-endian integer ends up in the low byte
>>of the highest-numbered register.
>>
>>This will apply cleanly by itself and no regressions were seen when
>>testing aarch64 and x86_64 on make check.
>>
>>
>>Changelog:
>>
>>2014-11-14  Alan Hayward  
>>
>>* rtlanal.c
>>(subreg_get_info): Exit early for simple and common cases
>>
>>
>>Alan.
>
>Hi,
>
>The second part to this patch (aarch64 specific) has been approved.
>
>
>Could someone review this one please.
>
>
>Thanks,
>Alan.


Ping.


Thanks,
Alan.


0001-BE-fix-load-stores.-Common-code.patch
Description: Binary data


Re: [rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1

2014-11-21 Thread Alan Hayward

On 14/11/2014 16:48, "Alan Hayward"  wrote:

>This is a new version of my BE patch from a few weeks ago.
>This is part 1 and covers rtlanal.c. The second part will be aarch64
>specific.
>
>When combined with the second patch, it fixes up movoi/ci/xi for Big
>Endian, so that the lsb of a big-endian integer ends up in the low byte
>of the highest-numbered register.
>
>This will apply cleanly by itself and no regressions were seen when
>testing aarch64 and x86_64 on make check.
>
>
>Changelog:
>
>2014-11-14  Alan Hayward  
>
>* rtlanal.c
>(subreg_get_info): Exit early for simple and common cases
>
>
>Alan.

Hi,

The second part to this patch (aarch64 specific) has been approved.


Could someone review this one please.


Thanks,
Alan.





FW: [Aarch64][BE][2/2] Fix vector load/stores to not use ld1/st1

2014-11-21 Thread Alan Hayward

On 20/11/2014 18:13, "Marcus Shawcroft"  wrote:

>On 14 November 2014 16:48, Alan Hayward  wrote:
>>This is a new version of my BE patch from a few weeks ago.
>>This is part 2 and covers all the aarch64 changes.
>>
>>When combined with the first patch, it fixes up movoi/ci/xi for Big
>>Endian, so that the lsb of a big-endian integer ends up in the low byte
>>of the highest-numbered register.
>>
>>This patch requires part 1 and David Sherwood’s patch:
>>  [AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.
>>
>>When tested with David’s patch and [1/2] of this patch, no regressions
>>were seen when testing aarch64 and x86_64 on make check.
>>
>>
>>Changelog:
>>2014-11-14  Alan Hayward  
>>
>> * config/aarch64/aarch64.c
>> (aarch64_classify_address): Allow extra addressing modes for BE.
>> (aarch64_print_operand): new operand for printing a q
>>register+1.
>
>
>Just a bunch of ChangeLog nits.

>
>+void aarch64_simd_emit_reg_reg_move (rtx *operands, enum machine_mode mode,
>+ unsigned int count);
>
>Drop the formal argument names.
>
>
>Can you respin with these changes please.
>
>/Marcus


New version. Identical to previous version of the patch except for:
* removal of parameter names in aarch64-protos.h
* new changelog


2014-11-21  Alan Hayward  

PR 57233
PR 59810
* config/aarch64/aarch64.c
(aarch64_classify_address): Allow extra addressing modes for BE.
(aarch64_print_operand): New operand for printing a q register+1.
(aarch64_simd_emit_reg_reg_move): Define.
(aarch64_simd_disambiguate_copy): Remove.
* config/aarch64/aarch64-protos.h
(aarch64_simd_emit_reg_reg_move): Define.
(aarch64_simd_disambiguate_copy): Remove.
* config/aarch64/aarch64-simd.md
(define_split): Use aarch64_simd_emit_reg_reg_move.
(define_expand "mov"): Less restrictive predicates.
(define_insn "*aarch64_mov"): Simplify and only allow for LE.
(define_insn "*aarch64_be_movoi"): Define.
(define_insn "*aarch64_be_movci"): Define.
(define_insn "*aarch64_be_movxi"): Define.
(define_split): OI mov.  Use aarch64_simd_emit_reg_reg_move.
(define_split): CI mov.  Use aarch64_simd_emit_reg_reg_move.
(define_split): XI mov.  Use aarch64_simd_emit_reg_reg_move.


Alan.




0001-BE-fix-load-stores.-Aarch64-code.-v2.patch
Description: Binary data


[Aarch64][BE][2/2] Fix vector load/stores to not use ld1/st1

2014-11-14 Thread Alan Hayward
This is a new version of my BE patch from a few weeks ago.
This is part 2 and covers all the aarch64 changes.

When combined with the first patch, it fixes up movoi/ci/xi for Big
Endian, so that the lsb of a big-endian integer ends up in the low byte
of the highest-numbered register.

This patch requires part 1 and David Sherwood’s patch:
 [AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.

When tested with David’s patch and [1/2] of this patch, no regressions
were seen when testing aarch64 and x86_64 on make check.


Changelog:
2014-11-14  Alan Hayward  

* config/aarch64/aarch64.c
(aarch64_classify_address): Allow extra addressing modes for BE.
(aarch64_print_operand): New operand for printing a q register+1.
(aarch64_simd_emit_reg_reg_move): Replacement for
aarch64_simd_disambiguate_copy that plants the required mov.
* config/aarch64/aarch64-protos.h
(aarch64_simd_emit_reg_reg_move): Replacement for
aarch64_simd_disambiguate_copy.
* config/aarch64/aarch64-simd.md
(define_split): Use new aarch64_simd_emit_reg_reg_move.
(define_expand "mov"): Less restrictive predicates.
(define_insn "*aarch64_mov"): Simplify and only allow for LE.
(define_insn "*aarch64_be_movoi"): New.  BE only.  Plant ldp or stp.
(define_insn "*aarch64_be_movci"): New.  BE only.  No instructions.
(define_insn "*aarch64_be_movxi"): New.  BE only.  No instructions.
(define_split): OI mov.  Use new aarch64_simd_emit_reg_reg_move.
(define_split): CI mov.  Use new aarch64_simd_emit_reg_reg_move.
On BE plant movs for reg to/from mem case.
(define_split): XI mov.  Use new aarch64_simd_emit_reg_reg_move.
On BE plant movs for reg to/from mem case.

Alan.



0001-BE-fix-load-stores.-Aarch64-code.patch
Description: Binary data


[rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1

2014-11-14 Thread Alan Hayward
This is a new version of my BE patch from a few weeks ago.
This is part 1 and covers rtlanal.c. The second part will be aarch64
specific.

When combined with the second patch, it fixes up movoi/ci/xi for Big
Endian, so that the lsb of a big-endian integer ends up in the low byte
of the highest-numbered register.

This will apply cleanly by itself and no regressions were seen when
testing aarch64 and x86_64 on make check.


Changelog:

2014-11-14  Alan Hayward  

* rtlanal.c
(subreg_get_info): Exit early for simple and common cases


Alan.



0001-BE-fix-load-stores.-Common-code.patch
Description: Binary data


Re: [AArch64] [BE] Fix vector load/stores to not use ld1/st1

2014-11-06 Thread Alan Hayward


On 06/11/2014 09:09, "Yangfei (Felix)"  wrote:

>> On 10 October 2014 16:19, Alan Hayward  wrote:
>> > This patch is dependent on "[AArch64] [BE] [1/2] Make large opaque
>> > integer modes endianness-safe.”
>> >
>> > It fixes up movoi/ci/xi for Big Endian, so that the lsb of a big-endian
>> > integer ends up in the low byte of the highest-numbered register.
>> >
>> > movoi uses stp/ldp.
>> > movxi needs a split version (which is shared with register->register
>> > movxi), which just splits to two new movs.  movci can then split to
>> > three movs. A future patch will instead split to an movoi and a movti.
>> >
>> >
>> > There are no changes for LE.
>> >
>> > Ran whole of check with both parts of "Make large opaque integer modes
>> > endianness-safe”. No regressions.
>> >
>> >
>> > ChangeLog:
>> >
>> > gcc/:
>> > 2014-10-10  Alan Hayward  
>> > * config/aarch64/aarch64.c
>> > (aarch64_classify_address): Allow extra addressing modes for
>>BE.
>> > (aarch64_print_operand): new operand for printing a q
>>register+1.
>> > (aarch64_simd_emit_reg_reg_move): replacement for
>> > aarch64_simd_disambiguate_copy that plants the required mov.
>> > * config/aarch64/aarch64-protos.h
>> > (aarch64_simd_emit_reg_reg_move): replacement for
>> > aarch64_simd_disambiguate_copy.
>> > * config/aarch64/aarch64-simd.md
>> > (define_split) Use new aarch64_simd_emit_reg_reg_move.
>> > (define_expand "mov") less restrictive predicates.
>> > (define_insn "*aarch64_mov") Simplify and only allow
>>for LE.
>> > (define_insn "*aarch64_be_movoi") New.  BE only.  Plant ldp or
>> stp.
>> > (define_insn "*aarch64_be_movci") New.  BE only.  No
>> instructions.
>> > (define_insn "*aarch64_be_movxi") New.  BE only.  No
>> instructions.
>> > (define_split) OI mov.  Use new
>> aarch64_simd_emit_reg_reg_move.
>> > (define_split) CI mov.  Use new
>> aarch64_simd_emit_reg_reg_move.
>> > On BE
>> > plant movs for reg to/from mem case.
>> > (define_split) XI mov.  Use new
>> aarch64_simd_emit_reg_reg_move.
>> > On BE
>> > plant movs for reg to/from mem case.
>> >
>> 
>> OK /Marcus
>
>
>Hello, 
>
>  It seems that this patch breaks the compilation of some testcases under
>  big-endian.
>  One example: testsuite/gcc.target/aarch64/advsimd-intrinsics/vldX.c
>  Anything I missed? Please confirm. Thanks.
>
>$ aarch64_be-linux-gnu-gcc vldX.c
>
>vldX.c: In function 'exec_vldX':
>vldX.c:686:1: internal compiler error: in change_address_1, at
>emit-rtl.c:2096
> }
> ^
>0x73b18c change_address_1
>/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/emit-rtl.c:2096
>0xdbd278 gen_split_2991(rtx_insn*, rtx_def**)
>
>/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/config/aarch64/aarch64-simd
>.md:4436
>0x743aa1 try_split(rtx_def*, rtx_def*, int)
>/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/emit-rtl.c:3639
>0x9b8c93 split_insn
>/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/recog.c:2901
>0x9b8ee7 split_all_insns_noflow()
>/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/recog.c:3042
>Please submit a full bug report,
>with preprocessed source if appropriate.
>Please include the complete backtrace with any bug report.
>See <http://gcc.gnu.org/bugs.html> for instructions.
>


Did you try my patch with or without its dependency?
"[AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.”
(David Sherwood)

Without the above patch I would expect some tests to fail.


In addition, it looks like David has had some problems merging that patch
onto the latest head - I'd suggest not using this patch until David has
sorted his out.

Alan.





[AArch64] [BE] Fix vector load/stores to not use ld1/st1

2014-10-10 Thread Alan Hayward
This patch is dependent on "[AArch64] [BE] [1/2] Make large opaque integer
modes endianness-safe.”

It fixes up movoi/ci/xi for Big Endian, so that the lsb of a big-endian
integer ends up in the low byte of the highest-numbered register.

movoi uses stp/ldp.
movxi needs a split version (which is shared with register->register
movxi), which just splits to two new movs.
movci can then split to three movs. A future patch will instead split to
an movoi and a movti.
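
The register-to-register side of those splits has to pick a copy
direction, so that an overlapping destination never clobbers a source
register that is still needed (the job aarch64_simd_disambiguate_copy
used to do). A rough C sketch of a helper in that spirit follows; this is
an assumed shape written against GCC's RTL emission API, not necessarily
the function in the attached patch:

/* Split an N-register vector move into single-register movs, copying
   upwards or downwards depending on how the register ranges overlap.  */
static void
emit_multireg_move (rtx *operands, enum machine_mode mode,
                    unsigned int count)
{
  unsigned int rdest = REGNO (operands[0]);
  unsigned int rsrc = REGNO (operands[1]);

  if (rdest < rsrc)
    /* Destination starts below the source: copy from the bottom up.  */
    for (unsigned int i = 0; i < count; i++)
      emit_move_insn (gen_rtx_REG (mode, rdest + i),
                      gen_rtx_REG (mode, rsrc + i));
  else
    /* Destination starts above the source: copy from the top down.  */
    for (unsigned int i = count; i-- > 0;)
      emit_move_insn (gen_rtx_REG (mode, rdest + i),
                      gen_rtx_REG (mode, rsrc + i));
}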


There are no changes for LE.

Ran the whole of make check with both parts of "Make large opaque integer
modes endianness-safe". No regressions.


ChangeLog:

gcc/:
    2014-10-10  Alan Hayward  
* config/aarch64/aarch64.c
(aarch64_classify_address): Allow extra addressing modes for BE.
(aarch64_print_operand): New operand for printing a q register+1.
(aarch64_simd_emit_reg_reg_move): Replacement for
aarch64_simd_disambiguate_copy that plants the required mov.
* config/aarch64/aarch64-protos.h
(aarch64_simd_emit_reg_reg_move): Replacement for
aarch64_simd_disambiguate_copy.
* config/aarch64/aarch64-simd.md
(define_split): Use new aarch64_simd_emit_reg_reg_move.
(define_expand "mov"): Less restrictive predicates.
(define_insn "*aarch64_mov"): Simplify and only allow for LE.
(define_insn "*aarch64_be_movoi"): New.  BE only.  Plant ldp or stp.
(define_insn "*aarch64_be_movci"): New.  BE only.  No instructions.
(define_insn "*aarch64_be_movxi"): New.  BE only.  No instructions.
(define_split): OI mov.  Use new aarch64_simd_emit_reg_reg_move.
(define_split): CI mov.  Use new aarch64_simd_emit_reg_reg_move.
On BE plant movs for reg to/from mem case.
(define_split): XI mov.  Use new aarch64_simd_emit_reg_reg_move.
On BE plant movs for reg to/from mem case.



0001-be-mov.patch
Description: Binary data