Re: [RFC] [Patch] [Debug] Add new NOTE to be used for debugging.
(Apologies this isn’t using the reply id.) FYI, I tried the patch below and
I’m happy that it fixes the GDB issue in PR debug/88432.

In the meantime, GDB HEAD has a workaround to disable stack protection in
the GDB testsuite, and a KFAILed test for the exact bug. See:
https://sourceware.org/ml/gdb-patches/2019-01/msg00425.html

Alan.

Re: [RFC] [Patch] [Debug] Add new FUNCTION_BEG NOTE to be used for debugging.

• From: Matthew Malcomson
• To: "gcc-patches at gcc dot gnu dot org"
• Cc: "jason at redhat dot com", nd, "ccoutant at gmail dot com", "wilson at tuliptree dot org", Eric Botcazou
• Date: Fri, 18 Jan 2019 15:45:58 +
• Subject: Re: [RFC] [Patch] [Debug] Add new FUNCTION_BEG NOTE to be used for debugging.
• References:

Ping.

(Note -- when running the GDB testsuite while ensuring that
-fstack-protector-all is used for compiling each testcase, this patch
fixes over 1500 FAILs.)

On 10/01/19 13:28, Matthew Malcomson wrote:
> At the moment NOTE_INSN_FUNCTION_BEG is used for three different purposes.
> The first is as a marker just before the first insn coming from a
> "source code statement" of the function.
> Bug 88432 is due to the fact that the note does not accurately point to
> this logical position in a function -- in that case the stack protect
> prologue is directly after NOTE_INSN_FUNCTION_BEG.
>
> The second is (I believe) to make assumptions about what values are in
> the parameter passing registers (in alias.c and calls.c).
> (I'm not sure about this second use; if I am correctly reading this code
> then it seems like a bug -- e.g. asan_emit_stack_protect inserts insns
> in the stream that break the assumption that seems to be made.)
>
> The third is as a marker to determine where to put extra code later in
> sjlj_emit_function_enter from except.c, where to insert profiling code
> for a function in final.c, and where to insert variable expansion code
> in pass_expand::execute from cfgexpand.c.
>
> These three uses seem to be at odds with each other -- insns that change
> the values in the parameter passing registers can come from automatically
> inserted code like stack protection, and some requirements on where
> instructions should get inserted have moved the position of this NOTE
> (e.g. see bugzilla bug 81186).
>
> This patch splits the current note into two different notes, one to
> retain uses 2 and 3 above, and one for use in generating debug
> information.
>
> The first two uses are still attached to NOTE_INSN_FUNCTION_BEG, while
> the debugging use is now implemented with NOTE_INSN_DEBUG_FUNCTION_BEG.
>
> These two notes are put into the function's insn chain in different
> places during the expand pass, and can hence satisfy their respective
> uses.
>
> Bootstrapped and regtested on aarch64.
> TODO -- Manual tests done on resulting debug information -- yet to be
> automated.
>
> gcc/ChangeLog:
>
> 2019-01-10  Matthew Malcomson
>
> 	PR debug/88432
> 	* cfgexpand.c (pass_expand::execute): Insert
> 	NOTE_INSN_DEBUG_FUNCTION_BEG.
> 	* function.c (thread_prologue_and_epilogue_insns): Account
> 	for NOTE_INSN_DEBUG_FUNCTION_BEG.
> 	* cfgrtl.c (duplicate_insn_chain): Account for new NOTE.
> 	* doc/rtl.texi: Document new NOTE.
> 	* dwarf2out.c (dwarf2out_source_line): Change comment to
> 	reference new NOTE.
> 	* final.c (asm_show_source): As above.
> 	(final_scan_insn_1): Split action on NOTE_INSN_FUNCTION_BEG into
> 	two, and move debugging info action to trigger on
> 	NOTE_INSN_DEBUG_FUNCTION_BEG.
> 	* insn-notes.def (INSN_NOTE): Add new NOTE.
>
> ### Attachment also inlined for ease of reply ###
>
> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
> index 60c1cfb4556e1a659db19f6719adccc1dab0fe46..491f441d01de226ba5aff2af8c71680b78648a12 100644
> --- a/gcc/cfgexpand.c
> +++ b/gcc/cfgexpand.c
> @@ -6476,6 +6476,12 @@ pass_expand::execute (function *fun)
>    if (crtl->stack_protect_guard && targetm.stack_protect_runtime_enabled_p ())
>      stack_protect_prologue ();
>
> +  /* Insert a NOTE that marks the end of "generated code" and the start
> +     of code that comes from the user.  This is the point which
> +     dwarf2out.c will treat as the beginning of the users code in this
> +     function.  e.g. GDB will stop just after this note when breaking on
> +     entry to the function.  */
> +  emit_note (NOTE_INSN_DEBUG_FUNCTION_BEG);
> +
>    expand_phi_nodes (&SA);
>
>    /* Release any stale SSA redirection data.  */
> diff --git a/gcc/cfgrtl.c b/gcc/cfgrtl.c
> index 172bdf585d036e27bcf53dba89c1ffc1b6cb84c7..d0cbca84aa3f14002a568a65e70016c3e15d6b9c 100644
> --- a/gcc/cfgrtl.c
> +++ b/gcc/cfgrtl.c
> @@ -4215,6 +4215,7 @@ duplicate_insn_chain (rtx_insn *from, rtx_insn *to)
> 	case NOTE_INSN_DELETED_DEBUG_LABEL:
> 	  /* No problem to strip thes
PING [PATCH v2 0/7] Support partial clobbers around TLS calls on Aarch64 SVE
> On 26 Jul 2018, at 10:13, Alan Hayward wrote:
>
> This is a rebasing of the patch posted in November.
> Its aim is to support AArch64 SVE register preservation around TLS calls
> by adding a CLOBBER_HIGH expression.
>
> Across a TLS call, AArch64 SVE does not explicitly preserve the
> SVE vector registers. However, the Neon vector registers are preserved.
> Due to overlapping of registers, this means the lower 128 bits of all
> SVE vector registers will be preserved.
>
> The existing GCC code assumes all Neon and SVE registers are clobbered
> across TLS calls.
>
> This patch introduces a CLOBBER_HIGH expression. This behaves a bit like
> a CLOBBER expression. CLOBBER_HIGH can only refer to a single register.
> The mode of the expression indicates the size of the lower bits which
> will be preserved. If the register contains a value bigger than this
> mode then the code will treat the register as clobbered, otherwise the
> register remains untouched.
>
> This means that, in order to evaluate whether a clobber high is
> relevant, we need to ensure the mode of the existing value in a register
> is tracked.
>
> The first two patches introduce CLOBBER_HIGH and generation support.
> The next patch adds a helper function for easily determining if a
> register gets clobbered by a CLOBBER_HIGH.
> The next three patches add clobber high checks to all of the passes. I
> couldn't think of a better way of splitting this up (maybe it needs
> dividing into smaller patches?).
> Finally, the last patch adds the CLOBBER_HIGHs around a TLS call for
> AArch64 SVE, and some test cases.
>
> Alan Hayward (7):
>   Add CLOBBER_HIGH expression
>   Generation support for CLOBBER_HIGH
>   Add func to check if register is clobbered by clobber_high
>   lra support for clobber_high
>   cse support for clobber_high
>   Remaining support for clobber high
>   Enable clobber high for tls descs on Aarch64
>
>  gcc/alias.c                                        |  11 ++
>  gcc/cfgexpand.c                                    |   1 +
>  gcc/combine-stack-adj.c                            |   1 +
>  gcc/combine.c                                      |  38 -
>  gcc/config/aarch64/aarch64.md                      |  69 ++--
>  gcc/cse.c                                          | 187 ++---
>  gcc/cselib.c                                       |  42 +++--
>  gcc/cselib.h                                       |   2 +-
>  gcc/dce.c                                          |  11 +-
>  gcc/df-scan.c                                      |   6 +
>  gcc/doc/rtl.texi                                   |  15 +-
>  gcc/dwarf2out.c                                    |   1 +
>  gcc/emit-rtl.c                                     |  16 ++
>  gcc/genconfig.c                                    |   1 +
>  gcc/genemit.c                                      |  51 +++---
>  gcc/genrecog.c                                     |   3 +-
>  gcc/haifa-sched.c                                  |   3 +
>  gcc/ira-build.c                                    |   5 +
>  gcc/ira-costs.c                                    |   7 +
>  gcc/ira.c                                          |   6 +-
>  gcc/jump.c                                         |   1 +
>  gcc/lra-eliminations.c                             |  11 ++
>  gcc/lra-int.h                                      |   2 +
>  gcc/lra-lives.c                                    |  31 ++--
>  gcc/lra.c                                          |  66 +---
>  gcc/postreload-gcse.c                              |  21 ++-
>  gcc/postreload.c                                   |  25 ++-
>  gcc/print-rtl.c                                    |   1 +
>  gcc/recog.c                                        |   9 +-
>  gcc/regcprop.c                                     |  10 +-
>  gcc/reginfo.c                                      |   4 +
>  gcc/reload1.c                                      |  16 +-
>  gcc/reorg.c                                        |  27 ++-
>  gcc/resource.c                                     |  24 ++-
>  gcc/rtl.c                                          |  15 ++
>  gcc/rtl.def                                        |  10 ++
>  gcc/rtl.h                                          |  27 ++-
>  gcc/rtlanal.c                                      |  47 +-
>  gcc/sched-deps.c                                   |  15 +-
>  .../gcc.target/aarch64/sve_tls_preserve_1.c        |  19 +++
>  .../gcc.target/aarch64/sve_tls_preserve_2.c        |  24 +++
>  .../gcc.target/aarch64/sve_tls_preserve_3.c        |  24 +++
>  42 files changed, 725 insertions(+), 180 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c
>
> --
> 2.15.2 (Apple Git-101.1)
>
[PATCH v2 6/7] Remaining support for clobber high
Add the remainder of the clobber high checks. Happy to split this into
smaller patches if required (there didn't seem to be anything obvious to
split into).

2018-07-25  Alan Hayward

	* alias.c (record_set): Check for clobber high.
	* cfgexpand.c (expand_gimple_stmt): Likewise.
	* combine-stack-adj.c (single_set_for_csa): Likewise.
	* combine.c (find_single_use_1): Likewise.
	(set_nonzero_bits_and_sign_copies): Likewise.
	(get_combine_src_dest): Likewise.
	(is_parallel_of_n_reg_sets): Likewise.
	(try_combine): Likewise.
	(record_dead_and_set_regs_1): Likewise.
	(reg_dead_at_p_1): Likewise.
	(reg_dead_at_p): Likewise.
	* dce.c (deletable_insn_p): Likewise.
	(mark_nonreg_stores_1): Likewise.
	(mark_nonreg_stores_2): Likewise.
	* df-scan.c (df_find_hard_reg_defs): Likewise.
	(df_uses_record): Likewise.
	(df_get_call_refs): Likewise.
	* dwarf2out.c (mem_loc_descriptor): Likewise.
	* haifa-sched.c (haifa_classify_rtx): Likewise.
	* ira-build.c (create_insn_allocnos): Likewise.
	* ira-costs.c (scan_one_insn): Likewise.
	* ira.c (equiv_init_movable_p): Likewise.
	(rtx_moveable_p): Likewise.
	(interesting_dest_for_shprep): Likewise.
	* jump.c (mark_jump_label_1): Likewise.
	* postreload-gcse.c (record_opr_changes): Likewise.
	* postreload.c (reload_cse_simplify): Likewise.
	(struct reg_use): Add source expr.
	(reload_combine): Check for clobber high.
	(reload_combine_note_use): Likewise.
	(reload_cse_move2add): Likewise.
	(move2add_note_store): Likewise.
	* print-rtl.c (print_pattern): Likewise.
	* recog.c (decode_asm_operands): Likewise.
	(store_data_bypass_p): Likewise.
	(if_test_bypass_p): Likewise.
	* regcprop.c (kill_clobbered_value): Likewise.
	(kill_set_value): Likewise.
	* reginfo.c (reg_scan_mark_refs): Likewise.
	* reload1.c (maybe_fix_stack_asms): Likewise.
	(eliminate_regs_1): Likewise.
	(elimination_effects): Likewise.
	(mark_not_eliminable): Likewise.
	(scan_paradoxical_subregs): Likewise.
	(forget_old_reloads_1): Likewise.
	* reorg.c (find_end_label): Likewise.
	(try_merge_delay_insns): Likewise.
	(redundant_insn): Likewise.
	(own_thread_p): Likewise.
	(fill_simple_delay_slots): Likewise.
	(fill_slots_from_thread): Likewise.
	(dbr_schedule): Likewise.
	* resource.c (update_live_status): Likewise.
	(mark_referenced_resources): Likewise.
	(mark_set_resources): Likewise.
	* rtl.c (copy_rtx): Likewise.
	* rtlanal.c (reg_referenced_p): Likewise.
	(single_set_2): Likewise.
	(noop_move_p): Likewise.
	(note_stores): Likewise.
	* sched-deps.c (sched_analyze_reg): Likewise.
	(sched_analyze_insn): Likewise.
---
 gcc/alias.c             | 11 +++
 gcc/cfgexpand.c         |  1 +
 gcc/combine-stack-adj.c |  1 +
 gcc/combine.c           | 38 +-
 gcc/dce.c               | 11 +--
 gcc/df-scan.c           |  6 ++
 gcc/dwarf2out.c         |  1 +
 gcc/haifa-sched.c       |  3 +++
 gcc/ira-build.c         |  5 +
 gcc/ira-costs.c         |  7 +++
 gcc/ira.c               |  6 +-
 gcc/jump.c              |  1 +
 gcc/postreload-gcse.c   | 21 -
 gcc/postreload.c        | 25 -
 gcc/print-rtl.c         |  1 +
 gcc/recog.c             |  9 ++---
 gcc/regcprop.c          | 10 --
 gcc/reginfo.c           |  4
 gcc/reload1.c           | 16 +++-
 gcc/reorg.c             | 27 ++-
 gcc/resource.c          | 24 +---
 gcc/rtl.c               |  4
 gcc/rtlanal.c           | 18 +++---
 gcc/sched-deps.c        | 15 ++-
 24 files changed, 225 insertions(+), 40 deletions(-)

diff --git a/gcc/alias.c b/gcc/alias.c
index 2091dfbf3d7..748da2b6951 100644
--- a/gcc/alias.c
+++ b/gcc/alias.c
@@ -1554,6 +1554,17 @@ record_set (rtx dest, const_rtx set, void *data ATTRIBUTE_UNUSED)
	  new_reg_base_value[regno] = 0;
	  return;
	}
+      /* A CLOBBER_HIGH only wipes out the old value if the mode of the old
+	 value is greater than that of the clobber.  */
+      else if (GET_CODE (set) == CLOBBER_HIGH)
+	{
+	  if (new_reg_base_value[regno] != 0
+	      && reg_is_clobbered_by_clobber_high (
+		   regno, GET_MODE (new_reg_base_value[regno]), XEXP (set, 0)))
+	    new_reg_base_value[regno] = 0;
+	  return;
+	}
+
       src = SET_SRC (set);
     }
   else
diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index d6e3c382085..39db6ed435f 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -3750,6 +3750,7 @@ expand_gimple_stmt
[PATCH v2 7/7] Enable clobber high for tls descs on Aarch64
Add the clobber high expressions to tls_desc for aarch64. It also adds
three tests.

In addition, I also tested by taking the gcc torture test suite and making
all global variables __thread. I then amended the suite to compile with
-fpic, save the .s file, and run only for one given -O level. I ran this
before and after the patch and compared the resulting .s files, ensuring
that there were no ASM changes. I discarded the 10% of tests that failed
to compile (due to the code in the test now being invalid C). I did this
for -O0, -O2 and -O3 on both x86 and aarch64 and observed no difference
between the ASM files before and after the patch.

Alan.

2018-07-25  Alan Hayward

gcc/
	* config/aarch64/aarch64.md: Add clobber highs to tls_desc.

gcc/testsuite/
	* gcc.target/aarch64/sve_tls_preserve_1.c: New test.
	* gcc.target/aarch64/sve_tls_preserve_2.c: New test.
	* gcc.target/aarch64/sve_tls_preserve_3.c: New test.
---
 gcc/config/aarch64/aarch64.md               | 69 ++
 .../gcc.target/aarch64/sve_tls_preserve_1.c | 19 ++
 .../gcc.target/aarch64/sve_tls_preserve_2.c | 24
 .../gcc.target/aarch64/sve_tls_preserve_3.c | 24
 4 files changed, 124 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index e9c16f9697b..a41d6e15bc8 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -57,14 +57,36 @@
     (LR_REGNUM		30)
     (SP_REGNUM		31)
     (V0_REGNUM		32)
+    (V1_REGNUM		33)
+    (V2_REGNUM		34)
+    (V3_REGNUM		35)
     (V4_REGNUM		36)
+    (V5_REGNUM		37)
+    (V6_REGNUM		38)
+    (V7_REGNUM		39)
     (V8_REGNUM		40)
+    (V9_REGNUM		41)
+    (V10_REGNUM		42)
+    (V11_REGNUM		43)
     (V12_REGNUM		44)
+    (V13_REGNUM		45)
+    (V14_REGNUM		46)
     (V15_REGNUM		47)
     (V16_REGNUM		48)
+    (V17_REGNUM		49)
+    (V18_REGNUM		50)
+    (V19_REGNUM		51)
     (V20_REGNUM		52)
+    (V21_REGNUM		53)
+    (V22_REGNUM		54)
+    (V23_REGNUM		55)
     (V24_REGNUM		56)
+    (V25_REGNUM		57)
+    (V26_REGNUM		58)
+    (V27_REGNUM		59)
     (V28_REGNUM		60)
+    (V29_REGNUM		61)
+    (V30_REGNUM		62)
     (V31_REGNUM		63)
     (LAST_SAVED_REGNUM	63)
     (SFP_REGNUM		64)
@@ -6302,24 +6324,47 @@
   [(set_attr "type" "call")
    (set_attr "length" "16")])

-;; For SVE, model tlsdesc calls as clobbering all vector and predicate
-;; registers, on top of the usual R0 and LR.  In reality the calls
-;; preserve the low 128 bits of the vector registers, but we don't
-;; yet have a way of representing that in the instruction pattern.
+;; For SVE, model tlsdesc calls as clobbering the lower 128 bits of
+;; all vector registers, and clobber all predicate registers, on
+;; top of the usual R0 and LR.
 (define_insn "tlsdesc_small_sve_<mode>"
   [(set (reg:PTR R0_REGNUM)
	 (unspec:PTR [(match_operand 0 "aarch64_valid_symref" "S")]
		     UNSPEC_TLSDESC))
    (clobber (reg:DI LR_REGNUM))
    (clobber (reg:CC CC_REGNUM))
-   (clobber (reg:XI V0_REGNUM))
-   (clobber (reg:XI V4_REGNUM))
-   (clobber (reg:XI V8_REGNUM))
-   (clobber (reg:XI V12_REGNUM))
-   (clobber (reg:XI V16_REGNUM))
-   (clobber (reg:XI V20_REGNUM))
-   (clobber (reg:XI V24_REGNUM))
-   (clobber (reg:XI V28_REGNUM))
+   (clobber_high (reg:TI V0_REGNUM))
+   (clobber_high (reg:TI V1_REGNUM))
+   (clobber_high (reg:TI V2_REGNUM))
+   (clobber_high (reg:TI V3_REGNUM))
+   (clobber_high (reg:TI V4_REGNUM))
+   (clobber_high (reg:TI V5_REGNUM))
+   (clobber_high (reg:TI V6_REGNUM))
+   (clobber_high (reg:TI V7_REGNUM))
+   (clobber_high (reg:TI V8_REGNUM))
+   (clobber_high (reg:TI V9_REGNUM))
+   (clobber_high (reg:TI V10_REGNUM))
+   (clobber_high (reg:TI V11_REGNUM))
+   (clobber_high (reg:TI V12_REGNUM))
+   (clobber_high (reg:TI V13_REGNUM))
+   (clobber_high (reg:TI V14_REGNUM))
+   (clobber_high (reg:TI V15_REGNUM))
+   (clobber_high (reg:TI V16_REGNUM))
+   (clobber_high (reg:TI V17_REGNUM))
+   (clobber_high (reg:TI V18_REGNUM))
+   (clobber_high (reg:TI V19_REGNUM))
+   (clobber_high (reg:TI V20_REGNUM))
+   (clobber_high (reg:TI V21_REGNUM))
+   (clobber_high (reg:TI V22_REGNUM))
+   (clobber_high (reg:TI V23_REGNUM))
+   (clobber_high (reg:TI V24_REGNUM))
+   (clobber_high (reg:TI V25_REGNUM))
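The before/after .s comparison described in the cover text above can be scripted. The sketch below only demonstrates the diff loop on placeholder files; the directory names and .s contents are made up for the example, and in the real run they would hold the saved compiler output from the unpatched and patched gcc.

```shell
# Sketch of the before/after ASM comparison workflow (paths are placeholders).
set -e
demo=$(mktemp -d)
mkdir -p "$demo/before" "$demo/after"

# Stand-ins for the .s files saved from the unpatched and patched compilers.
printf 'mov\tx0, 1\n' > "$demo/before/t1.s"
printf 'mov\tx0, 1\n' > "$demo/after/t1.s"

changed=0
for f in "$demo"/before/*.s; do
  # Compare each saved .s file against its post-patch counterpart.
  cmp -s "$f" "$demo/after/${f##*/}" || changed=1
done

if [ "$changed" -eq 0 ]; then
  echo "no ASM differences"
else
  echo "ASM changed"
fi
```

Since CLOBBER_HIGH is only meant to refine what the compiler believes is preserved, any code-generation difference in this comparison would indicate a bug in the series.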
[PATCH v2 5/7] cse support for clobber_high
The cse specific changes for clobber_high.

2018-07-25  Alan Hayward

	* cse.c (invalidate_reg): New function extracted from...
	(invalidate): ...here.
	(canonicalize_insn): Check for clobber high.
	(invalidate_from_clobbers): Invalidate clobber highs.
	(invalidate_from_sets_and_clobbers): Likewise.
	(count_reg_usage): Check for clobber high.
	(insn_live_p): Likewise.
	* cselib.c (cselib_expand_value_rtx_1): Likewise.
	(cselib_invalidate_regno): Check for clobber in setter.
	(cselib_invalidate_rtx): Pass through setter.
	(cselib_invalidate_rtx_note_stores):
	(cselib_process_insn): Check for clobber high.
	* cselib.h (cselib_invalidate_rtx): Add operand.
---
 gcc/cse.c    | 187 +++
 gcc/cselib.c |  42 ++
 gcc/cselib.h |   2 +-
 3 files changed, 156 insertions(+), 75 deletions(-)

diff --git a/gcc/cse.c b/gcc/cse.c
index 4e94152b380..3d7888b7093 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -559,6 +559,7 @@ static struct table_elt *insert_with_costs (rtx, struct table_elt *, unsigned,
 static struct table_elt *insert (rtx, struct table_elt *, unsigned,
				 machine_mode);
 static void merge_equiv_classes (struct table_elt *, struct table_elt *);
+static void invalidate_reg (rtx, bool);
 static void invalidate (rtx, machine_mode);
 static void remove_invalid_refs (unsigned int);
 static void remove_invalid_subreg_refs (unsigned int, poly_uint64,
@@ -1818,7 +1819,85 @@ check_dependence (const_rtx x, rtx exp, machine_mode mode, rtx addr)
     }
   return false;
 }
-
+
+/* Remove from the hash table, or mark as invalid, all expressions whose
+   values could be altered by storing in register X.
+
+   CLOBBER_HIGH is set if X was part of a CLOBBER_HIGH expression.  */
+
+static void
+invalidate_reg (rtx x, bool clobber_high)
+{
+  gcc_assert (GET_CODE (x) == REG);
+
+  /* If X is a register, dependencies on its contents are recorded
+     through the qty number mechanism.  Just change the qty number of
+     the register, mark it as invalid for expressions that refer to it,
+     and remove it itself.  */
+  unsigned int regno = REGNO (x);
+  unsigned int hash = HASH (x, GET_MODE (x));
+
+  /* Remove REGNO from any quantity list it might be on and indicate
+     that its value might have changed.  If it is a pseudo, remove its
+     entry from the hash table.
+
+     For a hard register, we do the first two actions above for any
+     additional hard registers corresponding to X.  Then, if any of these
+     registers are in the table, we must remove any REG entries that
+     overlap these registers.  */
+
+  delete_reg_equiv (regno);
+  REG_TICK (regno)++;
+  SUBREG_TICKED (regno) = -1;
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      gcc_assert (!clobber_high);
+      remove_pseudo_from_table (x, hash);
+    }
+  else
+    {
+      HOST_WIDE_INT in_table = TEST_HARD_REG_BIT (hard_regs_in_table, regno);
+      unsigned int endregno = END_REGNO (x);
+      unsigned int rn;
+      struct table_elt *p, *next;
+
+      CLEAR_HARD_REG_BIT (hard_regs_in_table, regno);
+
+      for (rn = regno + 1; rn < endregno; rn++)
+	{
+	  in_table |= TEST_HARD_REG_BIT (hard_regs_in_table, rn);
+	  CLEAR_HARD_REG_BIT (hard_regs_in_table, rn);
+	  delete_reg_equiv (rn);
+	  REG_TICK (rn)++;
+	  SUBREG_TICKED (rn) = -1;
+	}
+
+      if (in_table)
+	for (hash = 0; hash < HASH_SIZE; hash++)
+	  for (p = table[hash]; p; p = next)
+	    {
+	      next = p->next_same_hash;
+
+	      if (!REG_P (p->exp) || REGNO (p->exp) >= FIRST_PSEUDO_REGISTER)
+		continue;
+
+	      if (clobber_high)
+		{
+		  if (reg_is_clobbered_by_clobber_high (p->exp, x))
+		    remove_from_table (p, hash);
+		}
+	      else
+		{
+		  unsigned int tregno = REGNO (p->exp);
+		  unsigned int tendregno = END_REGNO (p->exp);
+		  if (tendregno > regno && tregno < endregno)
+		    remove_from_table (p, hash);
+		}
+	    }
+    }
+}
+
 /* Remove from the hash table, or mark as invalid, all expressions whose
    values could be altered by storing in X.
    X is a register, a subreg, or a memory reference with nonvarying
    address (because, when a memory
@@ -1841,65 +1920,7 @@ invalidate (rtx x, machine_mode full_mode)
   switch (GET_CODE (x))
     {
     case REG:
-      {
-	/* If X is a register, dependencies on its contents are recorded
-	   through the qty number mechanism.  Just change the qty number of
-	   the register, mark it as invalid for expressions that refer to it,
-	   and remove it itself.  */
-	unsigned int regno = REGNO (x);
-	unsigned int hash = HASH (x, G
[PATCH v2 2/7] Generation support for CLOBBER_HIGH
Ensure clobber high is a register expression. Info is passed through for
the error case.

2018-07-25  Alan Hayward

	* emit-rtl.c (verify_rtx_sharing): Check for CLOBBER_HIGH.
	(copy_insn_1): Likewise.
	(gen_hard_reg_clobber_high): New gen function.
	* genconfig.c (walk_insn_part): Check for CLOBBER_HIGH.
	* genemit.c (gen_exp): Likewise.
	(gen_emit_seq): Pass through info.
	(gen_insn): Check for CLOBBER_HIGH.
	(gen_expand): Pass through info.
	(gen_split): Likewise.
	(output_add_clobbers): Likewise.
	* genrecog.c (validate_pattern): Check for CLOBBER_HIGH.
	(remove_clobbers): Likewise.
	* rtl.h (gen_hard_reg_clobber_high): New declaration.
---
 gcc/emit-rtl.c  | 16
 gcc/genconfig.c |  1 +
 gcc/genemit.c   | 51 +++
 gcc/genrecog.c  |  3 ++-
 gcc/rtl.h       |  1 +
 5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index e4b070486e8..6a32bcbdaf6 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -2865,6 +2865,7 @@ verify_rtx_sharing (rtx orig, rtx insn)
       /* SCRATCH must be shared because they represent distinct values.  */
       return;
     case CLOBBER:
+    case CLOBBER_HIGH:
       /* Share clobbers of hard registers (like cc0), but do not share
	 pseudo reg clobbers or clobbers of hard registers that originated
	 as pseudos.  This is needed to allow safe register renaming.  */
@@ -3118,6 +3119,7 @@ repeat:
       /* SCRATCH must be shared because they represent distinct values.  */
       return;
     case CLOBBER:
+    case CLOBBER_HIGH:
       /* Share clobbers of hard registers (like cc0), but do not share
	 pseudo reg clobbers or clobbers of hard registers that originated
	 as pseudos.  This is needed to allow safe register renaming.  */
@@ -5690,6 +5692,7 @@ copy_insn_1 (rtx orig)
     case SIMPLE_RETURN:
       return orig;
     case CLOBBER:
+    case CLOBBER_HIGH:
       /* Share clobbers of hard registers (like cc0), but do not share
	 pseudo reg clobbers or clobbers of hard registers that originated
	 as pseudos.  This is needed to allow safe register renaming.  */
@@ -6508,6 +6511,19 @@ gen_hard_reg_clobber (machine_mode mode, unsigned int regno)
	    gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno)));
 }

+static GTY((deletable)) rtx
+hard_reg_clobbers_high[NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER];
+
+rtx
+gen_hard_reg_clobber_high (machine_mode mode, unsigned int regno)
+{
+  if (hard_reg_clobbers_high[mode][regno])
+    return hard_reg_clobbers_high[mode][regno];
+  else
+    return (hard_reg_clobbers_high[mode][regno]
+	    = gen_rtx_CLOBBER_HIGH (VOIDmode, gen_rtx_REG (mode, regno)));
+}
+
 location_t prologue_location;
 location_t epilogue_location;

diff --git a/gcc/genconfig.c b/gcc/genconfig.c
index c1bfde8d54b..745d5374b39 100644
--- a/gcc/genconfig.c
+++ b/gcc/genconfig.c
@@ -72,6 +72,7 @@ walk_insn_part (rtx part, int recog_p, int non_pc_set_src)
   switch (code)
     {
     case CLOBBER:
+    case CLOBBER_HIGH:
       clobbers_seen_this_insn++;
       break;

diff --git a/gcc/genemit.c b/gcc/genemit.c
index f4179e2b631..86e792f7396 100644
--- a/gcc/genemit.c
+++ b/gcc/genemit.c
@@ -79,7 +79,7 @@ gen_rtx_scratch (rtx x, enum rtx_code subroutine_type)
    substituting any operand references appearing within.  */

 static void
-gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
+gen_exp (rtx x, enum rtx_code subroutine_type, char *used, md_rtx_info *info)
 {
   RTX_CODE code;
   int i;
@@ -123,7 +123,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
       for (i = 0; i < XVECLEN (x, 1); i++)
	 {
	   printf (",\n\t\t");
-	  gen_exp (XVECEXP (x, 1, i), subroutine_type, used);
+	  gen_exp (XVECEXP (x, 1, i), subroutine_type, used, info);
	 }
       printf (")");
       return;
@@ -137,7 +137,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
       for (i = 0; i < XVECLEN (x, 2); i++)
	 {
	   printf (",\n\t\t");
-	  gen_exp (XVECEXP (x, 2, i), subroutine_type, used);
+	  gen_exp (XVECEXP (x, 2, i), subroutine_type, used, info);
	 }
       printf (")");
       return;
@@ -163,12 +163,21 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used)
     case CLOBBER:
       if (REG_P (XEXP (x, 0)))
	 {
-	  printf ("gen_hard_reg_clobber (%smode, %i)", GET_MODE_NAME (GET_MODE (XEXP (x, 0))),
-		  REGNO (XEXP (x, 0)));
+	  printf ("gen_hard_reg_clobber (%smode, %i)",
+		  GET_MODE_NAME (GET_MODE (XEXP (x, 0))),
+		  REGNO (XEXP (x, 0)));
	   return;
	 }
       break;
-
+    case CLOBBER_HIGH:
+      if (!REG_P (XEXP (x, 0)))
+
[PATCH v2 1/7] Add CLOBBER_HIGH expression
Includes documentation.

2018-07-25  Alan Hayward

	* doc/rtl.texi (clobber_high): Add.
	(parallel): Add in clobber high.
	* rtl.c (rtl_check_failed_code3): Add function.
	* rtl.def (CLOBBER_HIGH): Add expression.
	* rtl.h (RTL_CHECKC3): Add macro.
	(rtl_check_failed_code3): Add declaration.
	(XC3EXP): Add macro.
---
 gcc/doc/rtl.texi | 15 ++-
 gcc/rtl.c        | 11 +++
 gcc/rtl.def      | 10 ++
 gcc/rtl.h        | 16 +++-
 4 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index a37d9ac5389..20c57732679 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -3296,6 +3296,18 @@ There is one other known use for clobbering a pseudo register in a
 clobbered by the insn.  In this case, using the same pseudo register in
 the clobber and elsewhere in the insn produces the expected results.

+@findex clobber_high
+@item (clobber_high @var{x})
+Represents the storing or possible storing of an unpredictable,
+undescribed value into the upper parts of @var{x}.  The mode of the
+expression represents the lower parts of the register which will not be
+overwritten.  @code{reg} must be a reg expression.
+
+One place this is used is when calling into functions where the registers
+are preserved, but only up to a given number of bits.  For example when
+using AArch64 SVE, calling a TLS descriptor will cause only the lower 128
+bits of each of the vector registers to be preserved.
+
 @findex use
 @item (use @var{x})
 Represents the use of the value of @var{x}.  It indicates that the
@@ -3349,7 +3361,8 @@ Represents several side effects performed in parallel.  The square
 brackets stand for a vector; the operand of @code{parallel} is a
 vector of expressions.  @var{x0}, @var{x1} and so on are individual
 side effect expressions---expressions of code @code{set}, @code{call},
-@code{return}, @code{simple_return}, @code{clobber} or @code{use}.
+@code{return}, @code{simple_return}, @code{clobber}, @code{use} or
+@code{clobber_high}.

 ``In parallel'' means that first all the values used in the individual
 side-effects are computed, and second all the actual side-effects are
diff --git a/gcc/rtl.c b/gcc/rtl.c
index 90bbc7c6861..985db1c14f0 100644
--- a/gcc/rtl.c
+++ b/gcc/rtl.c
@@ -856,6 +856,17 @@ rtl_check_failed_code2 (const_rtx r, enum rtx_code code1, enum rtx_code code2,
      func, trim_filename (file), line);
 }

+void
+rtl_check_failed_code3 (const_rtx r, enum rtx_code code1, enum rtx_code code2,
+			enum rtx_code code3, const char *file, int line,
+			const char *func)
+{
+  internal_error
+    ("RTL check: expected code '%s', '%s' or '%s', have '%s' in %s, at %s:%d",
+     GET_RTX_NAME (code1), GET_RTX_NAME (code2), GET_RTX_NAME (code3),
+     GET_RTX_NAME (GET_CODE (r)), func, trim_filename (file), line);
+}
+
 void
 rtl_check_failed_code_mode (const_rtx r, enum rtx_code code, machine_mode mode,
			    bool not_mode, const char *file, int line,
diff --git a/gcc/rtl.def b/gcc/rtl.def
index 2578a0ccb9e..0ed27505545 100644
--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -312,6 +312,16 @@ DEF_RTL_EXPR(USE, "use", "e", RTX_EXTRA)
    is considered undeletable before reload.  */
 DEF_RTL_EXPR(CLOBBER, "clobber", "e", RTX_EXTRA)

+/* Indicate that the upper parts of something are clobbered in a way that we
+   don't want to explain.  The MODE references the lower bits that will be
+   preserved.  Anything above that size will be clobbered.
+
+   CLOBBER_HIGH only occurs as the operand of a PARALLEL rtx.  It cannot
+   appear in other contexts, and unlike CLOBBER, it cannot appear on its
+   own.  CLOBBER_HIGH can only be used with fixed register rtxes.  */
+
+DEF_RTL_EXPR(CLOBBER_HIGH, "clobber_high", "e", RTX_EXTRA)
+
 /* Call a subroutine.
    Operand 1 is the address to call.
    Operand 2 is the number of arguments.  */
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 565ce3abbe4..5e07e9bee80 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -1100,6 +1100,14 @@ is_a_helper ::test (rtx_insn *insn)
			       __FUNCTION__); \
      &_rtx->u.fld[_n]; }))

+#define RTL_CHECKC3(RTX, N, C1, C2, C3) __extension__			\
+(*({ __typeof (RTX) const _rtx = (RTX); const int _n = (N);		\
+     const enum rtx_code _code = GET_CODE (_rtx);			\
+     if (_code != (C1) && _code != (C2) && _code != (C3))		\
+       rtl_check_failed_code3 (_rtx, (C1), (C2), (C3), __FILE__,	\
+			       __LINE__, __FUNCTION__);			\
+     &_rtx->u.fld[_n]; }))
+
 #define RTVEC_ELT(RTVEC, I) __extension__				\
 (*({ __typeof (RTVEC) const _rtvec = (RTVEC); const int _i = (I);	\
[PATCH v2 3/7] Add func to check if register is clobbered by clobber_high
Given a CLOBBER_HIGH expression and a register, it checks if the register
will be clobbered. A second version exists for the cases where the
expressions are not available.

The function will be used throughout the following patches.

2018-07-25  Alan Hayward

	* rtl.h (reg_is_clobbered_by_clobber_high): Add declarations.
	* rtlanal.c (reg_is_clobbered_by_clobber_high): Add function.
---
 gcc/rtl.h     | 10 ++
 gcc/rtlanal.c | 29 +
 2 files changed, 39 insertions(+)

diff --git a/gcc/rtl.h b/gcc/rtl.h
index f42d749511d..d549b0aad0b 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -3467,6 +3467,16 @@ extern bool tablejump_p (const rtx_insn *, rtx_insn **,
			 rtx_jump_table_data **);
 extern int computed_jump_p (const rtx_insn *);
 extern bool tls_referenced_p (const_rtx);
 extern bool contains_mem_rtx_p (rtx x);
+extern bool reg_is_clobbered_by_clobber_high (unsigned int, machine_mode,
+					      const_rtx);
+
+/* Convenient wrapper for reg_is_clobbered_by_clobber_high.  */
+inline bool
+reg_is_clobbered_by_clobber_high (const_rtx x, const_rtx clobber_high_op)
+{
+  return reg_is_clobbered_by_clobber_high (REGNO (x), GET_MODE (x),
+					   clobber_high_op);
+}

 /* Overload for refers_to_regno_p for checking a single register.  */
 inline bool
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 9f84d7f2a8c..1cab1545744 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -6551,3 +6551,32 @@ tls_referenced_p (const_rtx x)
       return true;
   return false;
 }
+
+/* Return true if reg REGNO with mode REG_MODE would be clobbered by the
+   clobber_high operand in CLOBBER_HIGH_OP.  */
+
+bool
+reg_is_clobbered_by_clobber_high (unsigned int regno, machine_mode reg_mode,
+				  const_rtx clobber_high_op)
+{
+  unsigned int clobber_regno = REGNO (clobber_high_op);
+  machine_mode clobber_mode = GET_MODE (clobber_high_op);
+  unsigned char regno_nregs = hard_regno_nregs (regno, reg_mode);
+
+  /* Clobber high should always span exactly one register.  */
+  gcc_assert (REG_NREGS (clobber_high_op) == 1);
+
+  /* Clobber high needs to match with one of the registers in X.  */
+  if (clobber_regno < regno || clobber_regno >= regno + regno_nregs)
+    return false;
+
+  gcc_assert (reg_mode != BLKmode && clobber_mode != BLKmode);
+
+  if (reg_mode == VOIDmode)
+    return clobber_mode != VOIDmode;
+
+  /* Clobber high will clobber if its size might be greater than the size of
+     register regno.  */
+  return maybe_gt (exact_div (GET_MODE_SIZE (reg_mode), regno_nregs),
+		   GET_MODE_SIZE (clobber_mode));
+}
--
2.15.2 (Apple Git-101.1)
[PATCH v2 4/7] lra support for clobber_high
The lra specific changes for clobber_high.

2018-07-25  Alan Hayward

	* lra-eliminations.c (lra_eliminate_regs_1): Check for clobber high.
	(mark_not_eliminable): Likewise.
	* lra-int.h (struct lra_insn_reg): Add clobber high marker.
	* lra-lives.c (process_bb_lives): Check for clobber high.
	* lra.c (new_insn_reg): Remember clobber highs.
	(collect_non_operand_hard_regs): Check for clobber high.
	(lra_set_insn_recog_data): Likewise.
	(add_regs_to_insn_regno_info): Likewise.
	(lra_update_insn_regno_info): Likewise.
---
 gcc/lra-eliminations.c | 11 ++
 gcc/lra-int.h          |  2 ++
 gcc/lra-lives.c        | 31
 gcc/lra.c              | 66 +++---
 4 files changed, 80 insertions(+), 30 deletions(-)

diff --git a/gcc/lra-eliminations.c b/gcc/lra-eliminations.c
index f5f104020b3..d0cfaa8714a 100644
--- a/gcc/lra-eliminations.c
+++ b/gcc/lra-eliminations.c
@@ -654,6 +654,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode mem_mode,
       return x;

     case CLOBBER:
+    case CLOBBER_HIGH:
     case SET:
       gcc_unreachable ();

@@ -806,6 +807,16 @@ mark_not_eliminable (rtx x, machine_mode mem_mode)
	  setup_can_eliminate (ep, false);
       return;

+    case CLOBBER_HIGH:
+      gcc_assert (REG_P (XEXP (x, 0)));
+      gcc_assert (REGNO (XEXP (x, 0)) < FIRST_PSEUDO_REGISTER);
+      for (ep = reg_eliminate;
+	   ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
+	   ep++)
+	if (reg_is_clobbered_by_clobber_high (ep->to_rtx, XEXP (x, 0)))
+	  setup_can_eliminate (ep, false);
+      return;
+
     case SET:
       if (SET_DEST (x) == stack_pointer_rtx
	   && GET_CODE (SET_SRC (x)) == PLUS
diff --git a/gcc/lra-int.h b/gcc/lra-int.h
index 86e103b7480..5267b53c5e3 100644
--- a/gcc/lra-int.h
+++ b/gcc/lra-int.h
@@ -168,6 +168,8 @@ struct lra_insn_reg
   /* True if there is an early clobber alternative for this operand.  */
   unsigned int early_clobber : 1;
+  /* True if the reg is clobber highed by the operand.  */
+  unsigned int clobber_high : 1;
   /* The corresponding regno of the register.  */
   int regno;
   /* Next reg info of the same insn.  */
diff --git a/gcc/lra-lives.c b/gcc/lra-lives.c
index 920fd02b997..433c819d9e3 100644
--- a/gcc/lra-lives.c
+++ b/gcc/lra-lives.c
@@ -658,7 +658,7 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p)
       bool call_p;
       int n_alt, dst_regno, src_regno;
       rtx set;
-      struct lra_insn_reg *reg;
+      struct lra_insn_reg *reg, *hr;

       if (!NONDEBUG_INSN_P (curr_insn))
	 continue;
@@ -690,11 +690,12 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p)
		 break;
	       }
	   for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
-	    if (reg->type != OP_IN)
+	    if (reg->type != OP_IN && !reg->clobber_high)
	       {
		 remove_p = false;
		 break;
	       }
+
	   if (remove_p && ! volatile_refs_p (PATTERN (curr_insn)))
	     {
	       dst_regno = REGNO (SET_DEST (set));
@@ -812,14 +813,24 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p)
	  unused values because they still conflict with quantities
	  that are live at the time of the definition.  */
       for (reg = curr_id->regs; reg != NULL; reg = reg->next)
-	if (reg->type != OP_IN)
-	  {
-	    need_curr_point_incr
-	      |= mark_regno_live (reg->regno, reg->biggest_mode,
-				  curr_point);
-	    check_pseudos_live_through_calls (reg->regno,
-					      last_call_used_reg_set);
-	  }
+	{
+	  if (reg->type != OP_IN)
+	    {
+	      need_curr_point_incr
+		|= mark_regno_live (reg->regno, reg->biggest_mode,
+				    curr_point);
+	      check_pseudos_live_through_calls (reg->regno,
+						last_call_used_reg_set);
+	    }
+
+	  if (reg->regno >= FIRST_PSEUDO_REGISTER)
+	    for (hr = curr_static_id->hard_regs; hr != NULL; hr = hr->next)
+	      if (hr->clobber_high
+		  && maybe_gt (GET_MODE_SIZE (PSEUDO_REGNO_MODE (reg->regno)),
+			       GET_MODE_SIZE (hr->biggest_mode)))
+		SET_HARD_REG_BIT (lra_reg_info[reg->regno].conflict_hard_regs,
+				  hr->regno);
+	}

       for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next)
	 if (reg->type != OP_IN)
diff --git a/gcc/lra.c b/gcc/lra.c
index b410b90f126..aa768fb2a23 100644
--- a/gcc/lra.c
+++ b/gcc/lra.c
@@ -535
[PATCH v2 0/7] Support partial clobbers around TLS calls on Aarch64 SVE
This is a rebase of the patch posted in November. Its aim is to support aarch64 SVE register preservation around TLS calls by adding a CLOBBER_HIGH expression. Across a TLS call, Aarch64 SVE does not explicitly preserve the SVE vector registers. However, the Neon vector registers are preserved. Because the registers overlap, this means the lower 128 bits of all SVE vector registers will be preserved. The existing GCC code assumes all Neon and SVE registers are clobbered across TLS calls. This patch introduces a CLOBBER_HIGH expression, which behaves a bit like a CLOBBER expression. CLOBBER_HIGH can only refer to a single register. The mode of the expression indicates the size of the lower bits which will be preserved. If the register contains a value bigger than this mode then the code will treat the register as clobbered; otherwise the register remains untouched. This means that, in order to evaluate whether a clobber high is relevant, we need to ensure the mode of the existing value in a register is tracked. The first two patches introduce CLOBBER_HIGH and generation support. The next patch adds a helper function for easily determining if a register gets clobbered by a CLOBBER_HIGH. The next three patches add clobber high checks to all of the passes. I couldn't think of a better way of splitting this up (maybe it needs dividing into smaller patches?). Finally, the last patch adds the CLOBBER_HIGHs around a TLS call for aarch64 SVE, plus some test cases.
Alan Hayward (7): Add CLOBBER_HIGH expression Generation support for CLOBBER_HIGH Add func to check if register is clobbered by clobber_high lra support for clobber_high cse support for clobber_high Remaining support for clobber high Enable clobber high for tls descs on Aarch64 gcc/alias.c| 11 ++ gcc/cfgexpand.c| 1 + gcc/combine-stack-adj.c| 1 + gcc/combine.c | 38 - gcc/config/aarch64/aarch64.md | 69 ++-- gcc/cse.c | 187 ++--- gcc/cselib.c | 42 +++-- gcc/cselib.h | 2 +- gcc/dce.c | 11 +- gcc/df-scan.c | 6 + gcc/doc/rtl.texi | 15 +- gcc/dwarf2out.c| 1 + gcc/emit-rtl.c | 16 ++ gcc/genconfig.c| 1 + gcc/genemit.c | 51 +++--- gcc/genrecog.c | 3 +- gcc/haifa-sched.c | 3 + gcc/ira-build.c| 5 + gcc/ira-costs.c| 7 + gcc/ira.c | 6 +- gcc/jump.c | 1 + gcc/lra-eliminations.c | 11 ++ gcc/lra-int.h | 2 + gcc/lra-lives.c| 31 ++-- gcc/lra.c | 66 +--- gcc/postreload-gcse.c | 21 ++- gcc/postreload.c | 25 ++- gcc/print-rtl.c| 1 + gcc/recog.c| 9 +- gcc/regcprop.c | 10 +- gcc/reginfo.c | 4 + gcc/reload1.c | 16 +- gcc/reorg.c| 27 ++- gcc/resource.c | 24 ++- gcc/rtl.c | 15 ++ gcc/rtl.def| 10 ++ gcc/rtl.h | 27 ++- gcc/rtlanal.c | 47 +- gcc/sched-deps.c | 15 +- .../gcc.target/aarch64/sve_tls_preserve_1.c| 19 +++ .../gcc.target/aarch64/sve_tls_preserve_2.c| 24 +++ .../gcc.target/aarch64/sve_tls_preserve_3.c| 24 +++ 42 files changed, 725 insertions(+), 180 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_3.c -- 2.15.2 (Apple Git-101.1)
Re: [PATCH 2/7] Support >26 operands in generation code.
> On 6 Feb 2018, at 17:03, Richard Sandiford > wrote: > > Alan Hayward writes: >> This patch adds support for CLOBBER_HIGH in the generation code. >> >> An aarch64 will require 31 clobber high expressions, plus two >> clobbers. >> >> The exisiting gen code restricts to 26 vector operands by virtue >> of using the operators [a-z]. This patch extends this to 52 by >> supporting [a-zA-Z]. > > Although the main CLOBBER_HIGH patch will obviously need to wait > until GCC 9 now, we can work around it for GCC 8 by making the > tlsdesc pattern clobber all vector and predicate registers for SVE. > We'll still need the support for more than 26 operands in order > to do that though. > >> 2017-11-16 Alan Hayward >> >> [...] >> * genextract.c (push_pathstr_operand): New function to >> support [a-zA-Z]. >> (walk_rtx): Call push_pathstr_operand. >> (print_path): Support [a-zA-Z]. >> [...] >> diff --git a/gcc/genextract.c b/gcc/genextract.c >> index >> 258d234d2729bf16b152b90bb1833a37a6eb0bdc..e1fb716e459b9bd219e89cf36c30556d520305a2 >> 100644 >> --- a/gcc/genextract.c >> +++ b/gcc/genextract.c >> @@ -33,9 +33,10 @@ along with GCC; see the file COPYING3. If not see >> >>The string for each operand describes that path to the operand and >>contains `0' through `9' when going into an expression and `a' through >> - `z' when going into a vector. We assume here that only the first operand >> - of an rtl expression is a vector. genrecog.c makes the same assumption >> - (and uses the same representation) and it is currently true. */ >> + `z' then 'A' through to 'Z' when going into a vector. We assume here >> that >> + only the first operand of an rtl expression is a vector. genrecog.c >> makes >> + the same assumption (and uses the same representation) and it is >> currently >> + true. */ >> >> typedef char *locstr; >> >> @@ -80,6 +81,22 @@ struct accum_extract >> /* Forward declarations. 
*/ >> static void walk_rtx (md_rtx_info *, rtx, struct accum_extract *); >> >> +#define UPPER_OFFSET ('A' - ('z' - 'a' + 1)) >> + >> +/* Convert OPERAND into a character - either into [a-zA-Z] for vector >> operands >> + or [0-9] for integer operands - and push onto the end of the path ACC. >> */ >> +static void >> +push_pathstr_operand (int operand, bool is_vector, >> + struct accum_extract *acc) >> +{ >> + if (is_vector && 'a' + operand > 'z') >> +acc->pathstr.safe_push (operand + UPPER_OFFSET); >> + else if (is_vector) >> +acc->pathstr.safe_push (operand + 'a'); >> + else >> +acc->pathstr.safe_push (operand + '0'); >> +} >> + >> static void >> gen_insn (md_rtx_info *info) >> { >> @@ -98,7 +115,7 @@ gen_insn (md_rtx_info *info) >> else >> for (i = XVECLEN (insn, 1) - 1; i >= 0; i--) >> { >> -acc.pathstr.safe_push ('a' + i); >> +push_pathstr_operand (i, true, &acc); >> walk_rtx (info, XVECEXP (insn, 1, i), &acc); >> acc.pathstr.pop (); >> } >> @@ -208,7 +225,7 @@ static void >> walk_rtx (md_rtx_info *info, rtx x, struct accum_extract *acc) >> { >> RTX_CODE code; >> - int i, len, base; >> + int i, len; >> const char *fmt; >> >> if (x == 0) >> @@ -234,10 +251,9 @@ walk_rtx (md_rtx_info *info, rtx x, struct >> accum_extract *acc) >> VEC_safe_set_locstr (info, &acc->oplocs, XINT (x, 0), >> VEC_char_to_string (acc->pathstr)); >> >> - base = (code == MATCH_OPERATOR ? '0' : 'a'); >> for (i = XVECLEN (x, 2) - 1; i >= 0; i--) >> { >> - acc->pathstr.safe_push (base + i); >> + push_pathstr_operand (i, code != MATCH_OPERATOR, acc); >>walk_rtx (info, XVECEXP (x, 2, i), acc); >>acc->pathstr.pop (); >> } >> @@ -252,10 +268,9 @@ walk_rtx (md_rtx_info *info, rtx x, struct >> accum_extract *acc) >> if (code == MATCH_DUP) >> break; >> >> - base = (code == MATCH_OP_DUP ? '0' : 'a'); >> for (i = XVECLEN (x, 1) - 1; i >= 0; i--) >> { >> - acc->pathstr.safe_push (base + i); >> + push_pathstr_operand (i, code != MATCH_OP
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Ping. Any comments on this? The one line summary is that using self sets instead of clobber high would result in a patch roughly the same, but with different condition checks. It depends if people think it really is useful for self sets to not be live. Given that we are at stage 4 now, and this can’t go in until stage 1, I’m happy to leave the discussion until stage 1, but would appreciate any suggestions before then. Thanks, Alan. > On 12 Jan 2018, at 11:58, Alan Hayward wrote: > > > >> On 19 Dec 2017, at 16:27, Jeff Law wrote: >> >> On 12/19/2017 03:12 AM, Alan Hayward wrote: >>> Ping ping. >>> I think there should be enough information in the first test to show that >>> any "set to self” >>> registers become live. Let me know if there’s anything I’ve missed. >> I think that both Richi and I would like you investigate fixing the df >> infrastructure so that a self-set doesn't make things go live. Sorry if >> we weren't clear about that. ie, level of complexity and likely fallout >> so that we can evaluate that vs CLOBBER_HIGH. >> >> >> >> jeff >> > > > Right, sorry, I misunderstood. Ok, so I’ve been looking at trying to do this. > > To summarise: To do this we need to check for 1) All reg sets to self where > 2) the existing value of the register fits within the mode of the new set. In > these cases we want then to (in effect) pretend the set doesn’t exist. > To test this, I add a set to self in a tls_desc call for every V register ( > eg: (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)), (set (reg:TI V1_REGNUM) > (reg:TI V1_REGNUM)) etc etc). If the patch works, then these registers will > not be backed up around a tls call. > > First added some checks into df-scan and successfully stopped the sets to > self from being added to the live and uses lists. [ To simplify the issue > for now I ignored the existing value of the register, and just assumed it > fits ]. > However, running my test case, the code still results in my vector registers > being backed up around tls. 
Even though the dumps show that my vector > registers are no longer live. > > Debugging further, finds that the register backing up is happening as part of > combine.c. Ok, I can add a check for reg set to self in here too….. > But as part of my clobber_high patch I already had code in clobber.c that > checked for CLOBBER_HIGH and then checked the mode of the previous value in > the register. My new code ends up looking very similar to my clobber high > patch. > > Instead of: > > if (GET_CODE (setter) == CLOBBER_HIGH >&& reg_is_clobbered_by_clobber_high(REGNO(dest), GET_MODE > (rsp->last_set_value)) > > Now becomes something like: > > if (GET_CODE (setter) == SET >&& REG_P (dest) && HARD_REGISTER_P (dest) && REG_P (src) && REGNO(dst) == > REGNO(src) >&& reg_is_clobbered_by_self_set(REGNO(dest), GET_MODE > (rsp->last_set_value)) > > I then need to find the next pass that has similar checks…. and again it’ll > be the same places I already have clobber high code. > > I suspect in the end I’ll be able to remove the df-scan changes, because as I > effectively found out with clobber high, they aren’t causing any register > backups to happen. > > Ok, in the new patch we do save a bit of code because there is no new > expression to add. But that was a small part of the full patch. > > I could rewrite the patch in this way, but personally I feel it’s now > exploiting a side effect of a set to self, rather than being explicit in what > it is trying to do. > > > Alan. >
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 19 Dec 2017, at 16:27, Jeff Law wrote: > > On 12/19/2017 03:12 AM, Alan Hayward wrote: >> Ping ping. >> I think there should be enough information in the first test to show that >> any "set to self” >> registers become live. Let me know if there’s anything I’ve missed. > I think that both Richi and I would like you investigate fixing the df > infrastructure so that a self-set doesn't make things go live. Sorry if > we weren't clear about that. ie, level of complexity and likely fallout > so that we can evaluate that vs CLOBBER_HIGH. > > > > jeff > Right, sorry, I misunderstood. Ok, so I’ve been looking at trying to do this. To summarise: to do this we need to check for 1) all reg sets to self where 2) the existing value of the register fits within the mode of the new set. In these cases we then want to (in effect) pretend the set doesn’t exist. To test this, I added a set to self in a tls_desc call for every V register (eg: (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)), (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM)) etc etc). If the patch works, then these registers will not be backed up around a tls call. First, I added some checks into df-scan and successfully stopped the sets to self from being added to the live and uses lists. [To simplify the issue for now I ignored the existing value of the register, and just assumed it fits.] However, running my test case, the code still results in my vector registers being backed up around tls.
Instead of:

  if (GET_CODE (setter) == CLOBBER_HIGH
      && reg_is_clobbered_by_clobber_high (REGNO (dest),
                                           GET_MODE (rsp->last_set_value)))

it now becomes something like:

  if (GET_CODE (setter) == SET
      && REG_P (dest) && HARD_REGISTER_P (dest)
      && REG_P (src) && REGNO (dest) == REGNO (src)
      && reg_is_clobbered_by_self_set (REGNO (dest),
                                       GET_MODE (rsp->last_set_value)))

I then need to find the next pass that has similar checks…. and again it’ll be the same places I already have clobber high code. I suspect in the end I’ll be able to remove the df-scan changes, because, as I effectively found out with clobber high, they aren’t causing any register backups to happen. Ok, in the new patch we do save a bit of code because there is no new expression to add. But that was a small part of the full patch. I could rewrite the patch in this way, but personally I feel it’s now exploiting a side effect of a set to self, rather than being explicit in what it is trying to do. Alan.
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Ping ping. I think there should be enough information in the first test to show that any "set to self” registers become live. Let me know if there’s anything I’ve missed. Thanks, Alan. > On 12 Dec 2017, at 11:11, Alan Hayward wrote: > > Ping. > >> On 30 Nov 2017, at 11:03, Alan Hayward wrote: >> >> >>> On 27 Nov 2017, at 17:29, Jeff Law wrote: >>> >>> On 11/23/2017 04:11 AM, Alan Hayward wrote: >>>> >>>>> On 22 Nov 2017, at 17:33, Jeff Law wrote: >>>>> >>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote: >>>>>> >>>>>>> On 21 Nov 2017, at 03:13, Jeff Law wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd >>>>>>>>> totally forgotten about it. And in fact it seems to come pretty close >>>>>>>>> to what you need… >>>>>>>> >>>>>>>> Yes, some of the code is similar to the way >>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the >>>>>>>> CLOBBER expr code served as a starting point for writing the patch. >>>>>>>> The main difference >>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a >>>>>>>> specific Instruction, >>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied >>>>>>>> to an expression (tls_desc). >>>>>>>> It meant there wasn’t really any opportunity to resume any existing >>>>>>>> code. >>>>>>> Understood. Though your first patch mentions that you're trying to >>>>>>> describe partial preservation "around TLS calls". Presumably those are >>>>>>> represented as normal insns, not call_insn. >>>>>>> >>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg >>>>>>> to itself using whatever mode is wide enough to cover the neon part of >>>>>>> the register. 
>>>>>>> >>>>>>> That should tell the generic parts of the compiler that you're just >>>>>>> clobbering the upper part and at least in theory you can implement in >>>>>>> the aarch64 backend and the rest of the compiler should "just work" >>>>>>> because that's the existing semantics of a subreg store. >>>>>>> >>>>>>> The only worry would be if a pass tried to get overly smart and >>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply >>>>>>> wrong given the semantics of a partial store. >>>>>>> >>>>>> >>>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X). >>>>>> It’s something we considered, and then dismissed. >>>>>> >>>>>> The problem then is you are now using SET semantics on those registers, >>>>>> and it >>>>>> would make the register live around the function, which might not be the >>>>>> case. >>>>>> Whereas clobber semantics will just make the register dead - which is >>>>>> exactly >>>>>> what we want (but only conditionally). >>>>> ?!? A set of the subreg is the *exact* semantics you want. It says the >>>>> low part is preserved while the upper part is clobbered across the TLS >>>>> insns. >>>>> >>>>> jeff >>>> >>>> Consider where the TLS call is inside a loop. The compiler would normally >>>> want >>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of >>>> the tls_desc we >>>> are now making x live across the loop, x is dependant on the value from >>>> the previous >>>> iteration, and the tls_desc can no longer be hoisted. >>> Hmm. I think I see the problem you're trying to point out. Let me >>> restate it and see if you agree. >>> >>> The low subreg set does clearly indicate the upper part of the SVE >>> register is clobbered. The problem is from a liveness standpoint the >>> compiler is considering the low part li
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Ping. > On 30 Nov 2017, at 11:03, Alan Hayward wrote: > > >> On 27 Nov 2017, at 17:29, Jeff Law wrote: >> >> On 11/23/2017 04:11 AM, Alan Hayward wrote: >>> >>>> On 22 Nov 2017, at 17:33, Jeff Law wrote: >>>> >>>> On 11/22/2017 04:31 AM, Alan Hayward wrote: >>>>> >>>>>> On 21 Nov 2017, at 03:13, Jeff Law wrote: >>>>>>> >>>>>>>> >>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd >>>>>>>> totally forgotten about it. And in fact it seems to come pretty close >>>>>>>> to what you need… >>>>>>> >>>>>>> Yes, some of the code is similar to the way >>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the >>>>>>> CLOBBER expr code served as a starting point for writing the patch. The >>>>>>> main difference >>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a >>>>>>> specific Instruction, >>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied >>>>>>> to an expression (tls_desc). >>>>>>> It meant there wasn’t really any opportunity to resume any existing >>>>>>> code. >>>>>> Understood. Though your first patch mentions that you're trying to >>>>>> describe partial preservation "around TLS calls". Presumably those are >>>>>> represented as normal insns, not call_insn. >>>>>> >>>>>> That brings me back to Richi's idea of exposing a set of the low subreg >>>>>> to itself using whatever mode is wide enough to cover the neon part of >>>>>> the register. >>>>>> >>>>>> That should tell the generic parts of the compiler that you're just >>>>>> clobbering the upper part and at least in theory you can implement in >>>>>> the aarch64 backend and the rest of the compiler should "just work" >>>>>> because that's the existing semantics of a subreg store. >>>>>> >>>>>> The only worry would be if a pass tried to get overly smart and >>>>>> considered that kind of set a nop -- but I think I'd argue that's simply >>>>>> wrong given the semantics of a partial store. 
>>>>>> >>>>> >>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X). >>>>> It’s something we considered, and then dismissed. >>>>> >>>>> The problem then is you are now using SET semantics on those registers, >>>>> and it >>>>> would make the register live around the function, which might not be the >>>>> case. >>>>> Whereas clobber semantics will just make the register dead - which is >>>>> exactly >>>>> what we want (but only conditionally). >>>> ?!? A set of the subreg is the *exact* semantics you want. It says the >>>> low part is preserved while the upper part is clobbered across the TLS >>>> insns. >>>> >>>> jeff >>> >>> Consider where the TLS call is inside a loop. The compiler would normally >>> want >>> to hoist that out of the loop. By adding a set(x,x) into the parallel of >>> the tls_desc we >>> are now making x live across the loop, x is dependant on the value from the >>> previous >>> iteration, and the tls_desc can no longer be hoisted. >> Hmm. I think I see the problem you're trying to point out. Let me >> restate it and see if you agree. >> >> The low subreg set does clearly indicate the upper part of the SVE >> register is clobbered. The problem is from a liveness standpoint the >> compiler is considering the low part live, even though it's a self-set. >> > > Yes. > >> In fact, if that is the case, then a single TLS call (independent of a >> loop) would make the low part of the register globally live. This >> should be testable. Include one of these low part self sets on the >> existing TLS calls and compile a little test function and let's look at >> the liveness data. >> > > > Ok, so I’ve put > (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)) > (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM)) >
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 27 Nov 2017, at 17:29, Jeff Law wrote: > > On 11/23/2017 04:11 AM, Alan Hayward wrote: >> >>> On 22 Nov 2017, at 17:33, Jeff Law wrote: >>> >>> On 11/22/2017 04:31 AM, Alan Hayward wrote: >>>> >>>>> On 21 Nov 2017, at 03:13, Jeff Law wrote: >>>>>> >>>>>>> >>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd >>>>>>> totally forgotten about it. And in fact it seems to come pretty close >>>>>>> to what you need… >>>>>> >>>>>> Yes, some of the code is similar to the way >>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the >>>>>> CLOBBER expr code served as a starting point for writing the patch. The >>>>>> main difference >>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a >>>>>> specific Instruction, >>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to >>>>>> an expression (tls_desc). >>>>>> It meant there wasn’t really any opportunity to resume any existing code. >>>>> Understood. Though your first patch mentions that you're trying to >>>>> describe partial preservation "around TLS calls". Presumably those are >>>>> represented as normal insns, not call_insn. >>>>> >>>>> That brings me back to Richi's idea of exposing a set of the low subreg >>>>> to itself using whatever mode is wide enough to cover the neon part of >>>>> the register. >>>>> >>>>> That should tell the generic parts of the compiler that you're just >>>>> clobbering the upper part and at least in theory you can implement in >>>>> the aarch64 backend and the rest of the compiler should "just work" >>>>> because that's the existing semantics of a subreg store. >>>>> >>>>> The only worry would be if a pass tried to get overly smart and >>>>> considered that kind of set a nop -- but I think I'd argue that's simply >>>>> wrong given the semantics of a partial store. >>>>> >>>> >>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X). >>>> It’s something we considered, and then dismissed. 
>>>> >>>> The problem then is you are now using SET semantics on those registers, >>>> and it >>>> would make the register live around the function, which might not be the >>>> case. >>>> Whereas clobber semantics will just make the register dead - which is >>>> exactly >>>> what we want (but only conditionally). >>> ?!? A set of the subreg is the *exact* semantics you want. It says the >>> low part is preserved while the upper part is clobbered across the TLS >>> insns. >>> >>> jeff >> >> Consider where the TLS call is inside a loop. The compiler would normally >> want >> to hoist that out of the loop. By adding a set(x,x) into the parallel of the >> tls_desc we >> are now making x live across the loop, x is dependant on the value from the >> previous >> iteration, and the tls_desc can no longer be hoisted. > Hmm. I think I see the problem you're trying to point out. Let me > restate it and see if you agree. > > The low subreg set does clearly indicate the upper part of the SVE > register is clobbered. The problem is from a liveness standpoint the > compiler is considering the low part live, even though it's a self-set. > Yes. > In fact, if that is the case, then a single TLS call (independent of a > loop) would make the low part of the register globally live. This > should be testable. Include one of these low part self sets on the > existing TLS calls and compile a little test function and let's look at > the liveness data. > Ok, so I’ve put (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM)) (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM)) etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers. In an ideal world, this change would do nothing as neon reg size is TI. First I compiled up sve_tls_preserve_1.c (Test1 below) Without the SET changes, gcc does not backup any neon registers before the tlsdesccall (good). With the SET changes, gcc backs up all the neon registers (not ideal). 
Without the SET changes, dump logs give: cse1: ;; regs ever l
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 22 Nov 2017, at 17:33, Jeff Law wrote: > > On 11/22/2017 04:31 AM, Alan Hayward wrote: >> >>> On 21 Nov 2017, at 03:13, Jeff Law wrote: >>>> >>>>> >>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd >>>>> totally forgotten about it. And in fact it seems to come pretty close >>>>> to what you need… >>>> >>>> Yes, some of the code is similar to the way >>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the >>>> CLOBBER expr code served as a starting point for writing the patch. The >>>> main difference >>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a >>>> specific Instruction, >>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to >>>> an expression (tls_desc). >>>> It meant there wasn’t really any opportunity to resume any existing code. >>> Understood. Though your first patch mentions that you're trying to >>> describe partial preservation "around TLS calls". Presumably those are >>> represented as normal insns, not call_insn. >>> >>> That brings me back to Richi's idea of exposing a set of the low subreg >>> to itself using whatever mode is wide enough to cover the neon part of >>> the register. >>> >>> That should tell the generic parts of the compiler that you're just >>> clobbering the upper part and at least in theory you can implement in >>> the aarch64 backend and the rest of the compiler should "just work" >>> because that's the existing semantics of a subreg store. >>> >>> The only worry would be if a pass tried to get overly smart and >>> considered that kind of set a nop -- but I think I'd argue that's simply >>> wrong given the semantics of a partial store. >>> >> >> So, the instead of using clobber_high(reg X), to use set(reg X, reg X). >> It’s something we considered, and then dismissed. >> >> The problem then is you are now using SET semantics on those registers, and >> it >> would make the register live around the function, which might not be the >> case. 
>> Whereas clobber semantics will just make the register dead - which is exactly >> what we want (but only conditionally). > ?!? A set of the subreg is the *exact* semantics you want. It says the > low part is preserved while the upper part is clobbered across the TLS > insns. > > jeff Consider where the TLS call is inside a loop. The compiler would normally want to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we are now making x live across the loop: x is dependent on the value from the previous iteration, and the tls_desc can no longer be hoisted. Or consider a stream of code containing two tls_desc calls (ok, the compiler might optimise one of the tls calls away, but this approach should be reusable for other exprs). Between the two set(x,x)’s, x is considered live, so the register allocator can’t use that register. Given that we are applying this to all the neon registers, the register allocator now throws an ICE because it can’t find any free hard neon registers to use. Alan.
Re: [PATCH] Fix result for conditional reductions matching at index 0
> On 22 Nov 2017, at 16:57, Kilian Verhetsel > wrote: > > > Thank you both for your comments. > > I have added the check to ensure the index vector won't cause an > overflow. I also added tests to the testsuite in order to check that the > loop is vectorized for UINT_MAX - 1 iterations but not UINT_MAX > iterations. I was not able to write code that triggers > INTEGER_INDUC_COND_REDUCTION when using char or other smaller types > (changing the types of last, min_v and a to something else causes > COND_REDUCTION to be used instead), so these tests are only compiled and > not executed. I had similar problems when playing around with -14.c, but didn’t have a chance to investigate why. Possibly worth raising a bug mentioning there is a missed optimisation. It’d be nice to figure out why. > > I also moved an instruction that generates a vector of zeroes (used for > COND_REDUCTION) in the branch of code run only for COND_REDUCTION, this > should remove the unused vector that Alan noticed. Patch is ok for me. Alan.
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 21 Nov 2017, at 03:13, Jeff Law wrote: >> >>> >>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd >>> totally forgotten about it. And in fact it seems to come pretty close >>> to what you need… >> >> Yes, some of the code is similar to the way >> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the >> CLOBBER expr code served as a starting point for writing the patch. The main >> difference >> here, is that _PART_CLOBBERED is around all calls and is not tied to a >> specific Instruction, >> it’s part of the calling abi. Whereas clobber_high is explicitly tied to an >> expression (tls_desc). >> It meant there wasn’t really any opportunity to resume any existing code. > Understood. Though your first patch mentions that you're trying to > describe partial preservation "around TLS calls". Presumably those are > represented as normal insns, not call_insn. > > That brings me back to Richi's idea of exposing a set of the low subreg > to itself using whatever mode is wide enough to cover the neon part of > the register. > > That should tell the generic parts of the compiler that you're just > clobbering the upper part and at least in theory you can implement in > the aarch64 backend and the rest of the compiler should "just work" > because that's the existing semantics of a subreg store. > > The only worry would be if a pass tried to get overly smart and > considered that kind of set a nop -- but I think I'd argue that's simply > wrong given the semantics of a partial store. > So, the suggestion is, instead of using clobber_high(reg X), to use set(reg X, reg X). It’s something we considered, and then dismissed. The problem then is you are now using SET semantics on those registers, and it would make the register live around the function, which might not be the case. Whereas clobber semantics will just make the register dead - which is exactly what we want (but only conditionally). Alan.
Re: [PATCH] Fix result for conditional reductions matching at index 0
> On 22 Nov 2017, at 09:14, Richard Biener wrote: > > On Tue, Nov 21, 2017 at 5:43 PM, Kilian Verhetsel > wrote: >> >>> This is PR81179 I think, please mention that in the changelog. >> >> Correct, my bad for missing that. >> >>> This unconditionally pessimizes code even if there is no valid index >>> zero, right? >> >> Almost, since for a loop such as: >> >> #define OFFSET 1 >> unsigned int find(const unsigned int *a, unsigned int v) { >> unsigned int result = 120; >> for (unsigned int i = OFFSET; i < 32+OFFSET; i++) { >> if (a[i-OFFSET] == v) result = i; >> } >> return result; >> } >> >> the index i will match the contents of the index vector used here --- >> but this does indeed pessimize the code generated for, say, OFFSET >> = 2. It is probably more sensible to use the existing code path in those >> situations. >> >>> The issue with the COND_REDUCTION index vector is overflow IIRC. >> >> Does that mean such overflows can already manifest themselves for >> regular COND_REDUCTION? I had assumed sufficient checks were already in >> place because of the presence of the is_nonwrapping_integer_induction >> test. > > But only if we need the index vector? The patch looked like you're changing > how other modes are handled (in my look I didn't make myself familiar with > the various modes again...). Anyway, Alan hopefully remembers what he > coded so I'll defer to him here. > > If Alan is happy with the patch consider it approved. > Richard’s right with his question. The optimisation needs to fail if the number of iterations of the loop + 1 doesn’t fit into the data type used for the result. I took the test pr65947-14.c. First I set N to 0x-1. This compiled and vectorised. That’s ok. Now if I set N to 0x it still vectorises, but this should fail. Compare to pr65947-14.c where we set last = a[i]; inside the if. When I set N to 0x-1, it compiled and vectorised. That’s ok. When I set N to 0x it fails to vectorise with the message "loop size is greater than data size”. 
Looks like you might just need to add the one check. Also see pr65947-9.c which uses the slightly more useful char indexes. Alan.
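The overflow hazard being discussed can be shown with a scalar sketch (a hypothetical model, not code from the patch or the testsuite):

```c
#include <assert.h>

/* Hypothetical scalar model of an index-type overflow in a conditional
   reduction: the stored index lives in an unsigned char, so once the
   loop can run for more than 255 iterations, two different match
   positions may produce the same stored value.  This ambiguity is why
   the vectoriser must reject such loops ("loop size is greater than
   data size").  */
static unsigned char
find_last (const unsigned int *a, unsigned int v, unsigned int n)
{
  unsigned char last = 0;
  for (unsigned int i = 0; i < n; i++)
    if (a[i] == v)
      last = (unsigned char) (i + 1);   /* wraps once i + 1 > 255 */
  return last;
}
```

With 257 iterations, a match at index 256 is indistinguishable from a match at index 0, so the result data type must be able to hold the iteration count plus one.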
Re: [PATCH] Fix result for conditional reductions matching at index 0
> On 21 Nov 2017, at 16:43, Kilian Verhetsel > wrote: > > >> This is PR81179 I think, please mention that in the changelog. > > Correct, my bad for missing that. > >> This unconditionally pessimizes code even if there is no valid index >> zero, right? > > Almost, since for a loop such as: > > #define OFFSET 1 > unsigned int find(const unsigned int *a, unsigned int v) { > unsigned int result = 120; > for (unsigned int i = OFFSET; i < 32+OFFSET; i++) { > if (a[i-OFFSET] == v) result = i; > } > return result; > } > > the index i will match the contents of the index vector used here --- > but this does indeed pessimize the code generated for, say, OFFSET > = 2. It is probably more sensible to use the existing code path in those > situations. > Looking at the final asm on aarch64 for -14.c, the code has only grown by a single instruction in the epilogue, which is good given the vector pass dump for this patch is quite a bit longer. In the vector dump, there is a vector of 0’s _57 = { 0, 0, 0, 0 }; created which is then never used (and later gets optimised away). It’d be nice if that could avoid getting created. I’ve not had a chance to scrutinise the patch yet to see where that is created. >> The issue with the COND_REDUCTION index vector is overflow IIRC. > > Does that mean such overflows can already manifest themselves for > regular COND_REDUCTION? I had assumed sufficient checks were already in > place because of the presence of the is_nonwrapping_integer_induction > test. As an aside, it’s possibly worth mentioning that this issue will go away for aarch64-sve when Richard Sandiford’s SVE patch goes in, as that’ll support CLASTB reduction.
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 17 Nov 2017, at 19:31, Jeff Law wrote: > > On 11/16/2017 11:50 AM, Alan Hayward wrote: >> >>> On 16 Nov 2017, at 18:24, Richard Biener wrote: >>> >>> On November 16, 2017 7:05:30 PM GMT+01:00, Jeff Law wrote: >>>> On 11/16/2017 05:34 AM, Alan Hayward wrote: >>>>> This is a set of patches aimed at supporting aarch64 SVE register >>>>> preservation around TLS calls. >>>>> >>>>> Across a TLS call, Aarch64 SVE does not explicitly preserve the >>>>> SVE vector registers. However, the Neon vector registers are >>>> preserved. >>>>> Due to overlapping of registers, this means the lower 128bits of all >>>>> SVE vector registers will be preserved. >>>>> >>>>> The existing GCC code will currently incorrectly assume preservation >>>>> of all of the SVE registers. >>>>> >>>>> This patch introduces a CLOBBER_HIGH expression. This behaves a bit >>>> like >>>>> a CLOBBER expression. CLOBBER_HIGH can only refer to a single >>>> register. >>>>> The mode of the expression indicates the size of the lower bits which >>>>> will be preserved. If the register contains a value bigger than this >>>>> mode then the code will treat the register as clobbered. >>>>> >>>>> The means in order to evaluate if a clobber high is relevant, we need >>>> to ensure >>>>> the mode of the existing value in a register is tracked. >>>>> >>>>> The following patches in this series add support for the >>>> CLOBBER_HIGH, >>>>> with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for >>>>> aarch64. The testing performed on these patches is also detailed in >>>> the >>>>> final patch. >>>>> >>>>> These patches are based on top of the linaro-dev/sve branch. >>>>> >>>>> A simpler alternative to this patch would be to assume all Neon and >>>> SVE >>>>> registers are clobbered across TLS calls, however this would be a >>>>> performance regression against all Aarch64 targets. >>>> So just a couple design questions. 
>>>> >>>> Presumably there's no reasonable way to set up GCC's view of the >>>> register file to avoid this problem? ISTM that if the SVE register was >>>> split into two, one for the part that overlapped with the neon register >>>> and one that did not, then this could be handled via standard >>>> mechanisms? >>>> >> >> Yes, that was an early alternative option for the patch. >> >> With that it would affect every operation that uses SVE registers. A simple >> add of two registers now has 4 inputs and two outputs. It would get in the >> way when debugging any sve dumps and be generally annoying. >> Possibly the code for that would all be in the aarch64 target >> (making everyone else happy!) But I suspect that there would still be >> strange dependency issues that’d need sorting in the common code. >> >> Whereas with this patch, there are no new oddities in non-tls compiles/dumps. >> Although the patch touches a lot of files, the changes are mostly restricted >> to places where standard clobbers were already being checked. > I'm not entirely sure that it would require doubling the number of > inputs/outputs. It's not conceptually much different than how we > describe DImode operations on 32bit targets. The mode selects one or > more consecutive registers, so you don't actually need anything weird in > your patterns. This is pretty standard stuff. Ok, fair enough. > > > It would be painful in that the Neon regs would have to interleave with > the upper part of the SVE regs in terms of register numbers. It would > also mean that you couldn't tie together multiple neon regs into > something wider. I'm not sure if the latter would be an issue or not. And there’s also the weirdness that the register would not be split evenly - it’ll be a TI reg followed by a reg of the size of multiple TIs. All of that has the potential to complicate all non-sve aarch64 code. > > You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd > totally forgotten about it. 
And in fact it seems to come pretty close > to what you need… Yes, some of the code is similar to the way TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the CLOBBER expr code served as a starting point for writing the patch. The main difference here is that _PART_CLOBBERED is around all calls and is not tied to a specific instruction, it’s part of the calling ABI. Whereas clobber_high is explicitly tied to an expression (tls_desc). It meant there wasn’t really any opportunity to reuse any existing code. Alan.
Re: [PATCH 7/7]: Enable clobber high for tls descs on Aarch64
> On 16 Nov 2017, at 19:32, Andrew Pinski wrote: > > On Thu, Nov 16, 2017 at 4:35 AM, Alan Hayward wrote: >> This final patch adds the clobber high expressions to tls_desc for aarch64. >> It also adds three tests. >> >> In addition I also tested by taking the gcc torture test suite and making >> all global variables __thread. Then amended the suite to compile with -fpic, >> save the .s file and only for one given O level. >> I ran this before and after the patch and compared the resulting .s files, >> ensuring that there were no ASM changes. >> I discarded the 10% of tests that failed to compile (due to the code in >> the test now being invalid C). >> I did this for O0,O2,O3 on both x86 and aarch64 and observed no difference >> between ASM files before and after the patch. > > Isn't the ABI defined as non-clobbering the lower 64bits for normal > function calls? Or is the TLS function "special" in that it > saves/restores the 128bit registers; is that documented anywhere? The > main reason why I am asking is because glibc is not the only libc out > there and someone could have a slightly different ABI here. > In NEON, all of the SIMD registers are preserved around TLS calls - all 128bits of each register. That’s standard ABI behaviour for NEON. SVE doesn’t have any explicit preserving of its SIMD registers. However, the NEON and SVE registers share the same silicon - the lower 128bits of each SVE register are the same as the corresponding NEON register. The side effect of this is that the lower 128bits of the SVE registers are getting backed up. Neither glibc nor any other libraries need updating to support this. But compilers do need to be aware of this. Alan.
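The register overlap Alan describes can be modelled with a toy sketch (sizes hypothetical, assuming a 256-bit SVE implementation; not code from the patch):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the overlap: the low 128 bits of each SVE Z register
   alias the corresponding NEON V register, so a callee that saves and
   restores only its V registers implicitly preserves the low 128 bits
   of the Z registers -- and nothing above them.  */
typedef struct { unsigned char b[32]; } zreg;   /* one 256-bit Z register */
typedef struct { unsigned char b[16]; } vreg;   /* its 128-bit V view */

/* Simulate a TLS descriptor call: per the discussion above, the callee
   preserves each full V register but may trash everything else.  */
static void
call_preserving_neon_only (zreg *z)
{
  vreg saved;
  memcpy (saved.b, z->b, 16);           /* save the V (low 128-bit) view */
  memset (z->b, 0xAA, sizeof z->b);     /* call clobbers the whole Z reg */
  memcpy (z->b, saved.b, 16);           /* restore the V view only */
}
```

After such a call the low 128 bits are intact while the upper bits are undefined, which is exactly the state CLOBBER_HIGH is meant to describe.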
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
> On 16 Nov 2017, at 18:24, Richard Biener wrote: > > On November 16, 2017 7:05:30 PM GMT+01:00, Jeff Law wrote: >> On 11/16/2017 05:34 AM, Alan Hayward wrote: >>> This is a set of patches aimed at supporting aarch64 SVE register >>> preservation around TLS calls. >>> >>> Across a TLS call, Aarch64 SVE does not explicitly preserve the >>> SVE vector registers. However, the Neon vector registers are >> preserved. >>> Due to overlapping of registers, this means the lower 128bits of all >>> SVE vector registers will be preserved. >>> >>> The existing GCC code will currently incorrectly assume preservation >>> of all of the SVE registers. >>> >>> This patch introduces a CLOBBER_HIGH expression. This behaves a bit >> like >>> a CLOBBER expression. CLOBBER_HIGH can only refer to a single >> register. >>> The mode of the expression indicates the size of the lower bits which >>> will be preserved. If the register contains a value bigger than this >>> mode then the code will treat the register as clobbered. >>> >>> This means, in order to evaluate if a clobber high is relevant, we need >> to ensure >>> the mode of the existing value in a register is tracked. >>> >>> The following patches in this series add support for the >> CLOBBER_HIGH, >>> with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for >>> aarch64. The testing performed on these patches is also detailed in >> the >>> final patch. >>> >>> These patches are based on top of the linaro-dev/sve branch. >>> >>> A simpler alternative to this patch would be to assume all Neon and >> SVE >>> registers are clobbered across TLS calls, however this would be a >>> performance regression against all Aarch64 targets. >> So just a couple design questions. >> >> Presumably there's no reasonable way to set up GCC's view of the >> register file to avoid this problem? 
ISTM that if the SVE register was >> split into two, one for the part that overlapped with the neon register >> and one that did not, then this could be handled via standard >> mechanisms? >> Yes, that was an early alternative option for the patch. With that it would affect every operation that uses SVE registers. A simple add of two registers now has 4 inputs and two outputs. It would get in the way when debugging any sve dumps and be generally annoying. Possibly the code for that would all be in the aarch64 target (making everyone else happy!) But I suspect that there would still be strange dependency issues that’d need sorting in the common code. Whereas with this patch, there are no new oddities in non-tls compiles/dumps. Although the patch touches a lot of files, the changes are mostly restricted to places where standard clobbers were already being checked. >> Alternately would it be easier to clobber a subreg representing the >> high >> part of the register? Hmm, probably not. > > I thought of a set of the preserved part to itself that leaves the upper part > undefined. Not sure if we have such a thing or if it would work in all places > that a clobber does. I’ve not seen such a thing in the code. But it would need specific handling in all the existing clobber code. Alan.
[PATCH 7/7]: Enable clobber high for tls descs on Aarch64
This final patch adds the clobber high expressions to tls_desc for aarch64. It also adds three tests. In addition I also tested by taking the gcc torture test suite and making all global variables __thread. Then amended the suite to compile with -fpic, save the .s file and only for one given O level. I ran this before and after the patch and compared the resulting .s files, ensuring that there were no ASM changes. I discarded the 10% of tests that failed to compile (due to the code in the test now being invalid C). I did this for O0,O2,O3 on both x86 and aarch64 and observed no difference between ASM files before and after the patch. Alan. 2017-11-16 Alan Hayward gcc/ * config/aarch64/aarch64.md: Add clobber highs to tls_desc. gcc/testsuite/ * gcc.target/aarch64/sve_tls_preserve_1.c: New test. * gcc.target/aarch64/sve_tls_preserve_2.c: New test. * gcc.target/aarch64/sve_tls_preserve_3.c: New test. diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 6a15ff0b61d775cf30189b8503cfa45987701228..1f332b254fe0e37954efbe92982f214100d7046f 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -57,7 +57,36 @@ (LR_REGNUM 30) (SP_REGNUM 31) (V0_REGNUM 32) +(V1_REGNUM 33) +(V2_REGNUM 34) +(V3_REGNUM 35) +(V4_REGNUM 36) +(V5_REGNUM 37) +(V6_REGNUM 38) +(V7_REGNUM 39) +(V8_REGNUM 40) +(V9_REGNUM 41) +(V10_REGNUM 42) +(V11_REGNUM 43) +(V12_REGNUM 44) +(V13_REGNUM 45) +(V14_REGNUM 46) (V15_REGNUM 47) +(V16_REGNUM 48) +(V17_REGNUM 49) +(V18_REGNUM 50) +(V19_REGNUM 51) +(V20_REGNUM 52) +(V21_REGNUM 53) +(V22_REGNUM 54) +(V23_REGNUM 55) +(V24_REGNUM 56) +(V25_REGNUM 57) +(V26_REGNUM 58) +(V27_REGNUM 59) +(V28_REGNUM 60) +(V29_REGNUM 61) +(V30_REGNUM 62) (V31_REGNUM 63) (LAST_SAVED_REGNUM 63) (SFP_REGNUM 64) @@ -5745,6 +5774,38 @@ UNSPEC_TLSDESC)) (clobber (reg:DI LR_REGNUM)) (clobber (reg:CC CC_REGNUM)) + (clobber_high (reg:TI V0_REGNUM)) + (clobber_high (reg:TI V1_REGNUM)) + (clobber_high (reg:TI V2_REGNUM)) + (clobber_high (reg:TI V3_REGNUM)) + 
(clobber_high (reg:TI V4_REGNUM)) + (clobber_high (reg:TI V5_REGNUM)) + (clobber_high (reg:TI V6_REGNUM)) + (clobber_high (reg:TI V7_REGNUM)) + (clobber_high (reg:TI V8_REGNUM)) + (clobber_high (reg:TI V9_REGNUM)) + (clobber_high (reg:TI V10_REGNUM)) + (clobber_high (reg:TI V11_REGNUM)) + (clobber_high (reg:TI V12_REGNUM)) + (clobber_high (reg:TI V13_REGNUM)) + (clobber_high (reg:TI V14_REGNUM)) + (clobber_high (reg:TI V15_REGNUM)) + (clobber_high (reg:TI V16_REGNUM)) + (clobber_high (reg:TI V17_REGNUM)) + (clobber_high (reg:TI V18_REGNUM)) + (clobber_high (reg:TI V19_REGNUM)) + (clobber_high (reg:TI V20_REGNUM)) + (clobber_high (reg:TI V21_REGNUM)) + (clobber_high (reg:TI V22_REGNUM)) + (clobber_high (reg:TI V23_REGNUM)) + (clobber_high (reg:TI V24_REGNUM)) + (clobber_high (reg:TI V25_REGNUM)) + (clobber_high (reg:TI V26_REGNUM)) + (clobber_high (reg:TI V27_REGNUM)) + (clobber_high (reg:TI V28_REGNUM)) + (clobber_high (reg:TI V29_REGNUM)) + (clobber_high (reg:TI V30_REGNUM)) + (clobber_high (reg:TI V31_REGNUM)) (clobber (match_scratch:DI 1 "=r"))] "TARGET_TLS_DESC" "adrp\\tx0, %A0\;ldr\\t%1, [x0, #%L0]\;add\\t0, 0, %L0\;.tlsdesccall\\t%0\;blr\\t%1" diff --git a/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c new file mode 100644 index ..5bad829568130181ef1ab386545bd3ee164c322e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_1.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -fpic -march=armv8-a+sve" } */ + +/* Clobber highs do not need to be spilled around tls usage. 
*/ + +typedef float v4si __attribute__ ((vector_size (16))); + +__thread v4si tx; + +v4si foo (v4si a, v4si b, v4si c) +{ + v4si y; + + y = a + tx + b + c; + + return y + 7; +} + +/* { dg-final { scan-assembler-not {\tstr\t} } } */ + diff --git a/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c new file mode 100644 index ..69e8829287b8418c28f8c227391c4f8d2186ea63 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve_tls_preserve_2.c @@ -0,0 +1
[PATCH 5/7]: cse support for clobber_high
This patch simply adds the cse specific changes for clobber_high. Alan. 2017-11-16 Alan Hayward * cse.c (invalidate_reg): New function extracted from... (invalidate): ...here. (canonicalize_insn): Check for clobber high. (invalidate_from_clobbers): invalidate clobber highs. (invalidate_from_sets_and_clobbers): Likewise. (count_reg_usage): Check for clobber high. (insn_live_p): Likewise. * cselib.c (cselib_expand_value_rtx_1):Likewise. (cselib_invalidate_regno): Check for clobber in setter. (cselib_invalidate_rtx): Pass through setter. (cselib_invalidate_rtx_note_stores): (cselib_process_insn): Check for clobber high. * cselib.h (cselib_invalidate_rtx): Add operand. diff --git a/gcc/cse.c b/gcc/cse.c index e3c0710215df0acca924ce74ffa54582125d0136..f15ae8693fbb243323dd049e21648a49546a4608 100644 --- a/gcc/cse.c +++ b/gcc/cse.c @@ -559,6 +559,7 @@ static struct table_elt *insert_with_costs (rtx, struct table_elt *, unsigned, static struct table_elt *insert (rtx, struct table_elt *, unsigned, machine_mode); static void merge_equiv_classes (struct table_elt *, struct table_elt *); +static void invalidate_reg (rtx, bool); static void invalidate (rtx, machine_mode); static void remove_invalid_refs (unsigned int); static void remove_invalid_subreg_refs (unsigned int, poly_uint64, @@ -1818,7 +1819,85 @@ check_dependence (const_rtx x, rtx exp, machine_mode mode, rtx addr) } return false; } - + +/* Remove from the hash table, or mark as invalid, all expressions whose + values could be altered by storing in register X. + + CLOBBER_HIGH is set if X was part of a CLOBBER_HIGH expression. */ + +static void +invalidate_reg (rtx x, bool clobber_high) +{ + gcc_assert (GET_CODE (x) == REG); + + /* If X is a register, dependencies on its contents are recorded + through the qty number mechanism. Just change the qty number of + the register, mark it as invalid for expressions that refer to it, + and remove it itself. 
*/ + unsigned int regno = REGNO (x); + unsigned int hash = HASH (x, GET_MODE (x)); + + /* Remove REGNO from any quantity list it might be on and indicate + that its value might have changed. If it is a pseudo, remove its + entry from the hash table. + + For a hard register, we do the first two actions above for any + additional hard registers corresponding to X. Then, if any of these + registers are in the table, we must remove any REG entries that + overlap these registers. */ + + delete_reg_equiv (regno); + REG_TICK (regno)++; + SUBREG_TICKED (regno) = -1; + + if (regno >= FIRST_PSEUDO_REGISTER) +{ + gcc_assert (!clobber_high); + remove_pseudo_from_table (x, hash); +} + else +{ + HOST_WIDE_INT in_table = TEST_HARD_REG_BIT (hard_regs_in_table, regno); + unsigned int endregno = END_REGNO (x); + unsigned int rn; + struct table_elt *p, *next; + + CLEAR_HARD_REG_BIT (hard_regs_in_table, regno); + + for (rn = regno + 1; rn < endregno; rn++) + { + in_table |= TEST_HARD_REG_BIT (hard_regs_in_table, rn); + CLEAR_HARD_REG_BIT (hard_regs_in_table, rn); + delete_reg_equiv (rn); + REG_TICK (rn)++; + SUBREG_TICKED (rn) = -1; + } + + if (in_table) + for (hash = 0; hash < HASH_SIZE; hash++) + for (p = table[hash]; p; p = next) + { + next = p->next_same_hash; + + if (!REG_P (p->exp) || REGNO (p->exp) >= FIRST_PSEUDO_REGISTER) + continue; + + if (clobber_high) + { + if (reg_is_clobbered_by_clobber_high (p->exp, x)) + remove_from_table (p, hash); + } + else + { + unsigned int tregno = REGNO (p->exp); + unsigned int tendregno = END_REGNO (p->exp); + if (tendregno > regno && tregno < endregno) + remove_from_table (p, hash); + } + } +} +} + /* Remove from the hash table, or mark as invalid, all expressions whose values could be altered by storing in X. 
X is a register, a subreg, or a memory reference with nonvarying address (because, when a memory @@ -1841,65 +1920,7 @@ invalidate (rtx x, machine_mode full_mode) switch (GET_CODE (x)) { case REG: - { - /* If X is a register, dependencies on its contents are recorded - through the qty number mechanism. Just change the qty number of - the register, mark it as invalid for expressions that refer to it, - and remove it itself. */ - unsigned int regno = REGNO (x); - unsigned int hash = HASH (x, GET_MODE (x)); - - /* Remove REGNO from any quantity list it might be on and indicate - th
[PATCH 6/7]: Remaining support for clobber high
This patch simply adds the remainder of clobber high checks. Happy to split this into smaller patches if required (there didn't seem anything obvious to split into). Alan. 2017-11-16 Alan Hayward * alias.c (record_set): Check for clobber high. * cfgexpand.c (expand_gimple_stmt): Likewise. * combine-stack-adj.c (single_set_for_csa): Likewise. * combine.c (find_single_use_1): Likewise. (set_nonzero_bits_and_sign_copies): Likewise. (get_combine_src_dest): Likewise. (is_parallel_of_n_reg_sets): Likewise. (try_combine): Likewise. (record_dead_and_set_regs_1): Likewise. (reg_dead_at_p_1): Likewise. (reg_dead_at_p): Likewise. * dce.c (deletable_insn_p): Likewise. (mark_nonreg_stores_1): Likewise. (mark_nonreg_stores_2): Likewise. * df-scan.c (df_find_hard_reg_defs): Likewise. (df_uses_record): Likewise. (df_get_call_refs): Likewise. * dwarf2out.c (mem_loc_descriptor): Likewise. * haifa-sched.c (haifa_classify_rtx): Likewise. * ira-build.c (create_insn_allocnos): Likewise. * ira-costs.c (scan_one_insn): Likewise. * ira.c (equiv_init_movable_p): Likewise. (rtx_moveable_p): Likewise. (interesting_dest_for_shprep): Likewise. * jump.c (mark_jump_label_1): Likewise. * postreload-gcse.c (record_opr_changes): Likewise. * postreload.c (reload_cse_simplify): Likewise. (struct reg_use): Add source expr. (reload_combine): Check for clobber high. (reload_combine_note_use): Likewise. (reload_cse_move2add): Likewise. (move2add_note_store): Likewise. * print-rtl.c (print_pattern): Likewise. * recog.c (decode_asm_operands): Likewise. (store_data_bypass_p): Likewise. (if_test_bypass_p): Likewise. * regcprop.c (kill_clobbered_value): Likewise. (kill_set_value): Likewise. * reginfo.c (reg_scan_mark_refs): Likewise. * reload1.c (maybe_fix_stack_asms): Likewise. (eliminate_regs_1): Likewise. (elimination_effects): Likewise. (mark_not_eliminable): Likewise. (scan_paradoxical_subregs): Likewise. (forget_old_reloads_1): Likewise. * reorg.c (find_end_label): Likewise. 
(try_merge_delay_insns): Likewise. (redundant_insn): Likewise. (own_thread_p): Likewise. (fill_simple_delay_slots): Likewise. (fill_slots_from_thread): Likewise. (dbr_schedule): Likewise. * resource.c (update_live_status): Likewise. (mark_referenced_resources): Likewise. (mark_set_resources): Likewise. * rtl.c (copy_rtx): Likewise. * rtlanal.c (reg_referenced_p): Likewise. (single_set_2): Likewise. (noop_move_p): Likewise. (note_stores): Likewise. * sched-deps.c (sched_analyze_reg): Likewise. (sched_analyze_insn): Likewise. diff --git a/gcc/alias.c b/gcc/alias.c index c69ef410edac2ab0ab93e8ec9fe4c89a7078c001..6a6734bd7d5732c255c009be47e68aa073a9ebb1 100644 --- a/gcc/alias.c +++ b/gcc/alias.c @@ -1554,6 +1554,17 @@ record_set (rtx dest, const_rtx set, void *data ATTRIBUTE_UNUSED) new_reg_base_value[regno] = 0; return; } + /* A CLOBBER_HIGH only wipes out the old value if the mode of the old +value is greater than that of the clobber. */ + else if (GET_CODE (set) == CLOBBER_HIGH) + { + if (new_reg_base_value[regno] != 0 + && reg_is_clobbered_by_clobber_high ( + regno, GET_MODE (new_reg_base_value[regno]), XEXP (set, 0))) + new_reg_base_value[regno] = 0; + return; + } + src = SET_SRC (set); } else diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 06a8af8a1663c9e518a8169650a0c9969990df1f..ea6fc265f757543cff635a805fd4045a10add23e 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -3803,6 +3803,7 @@ expand_gimple_stmt (gimple *stmt) /* If we want exceptions for non-call insns, any may_trap_p instruction may throw. 
*/ && GET_CODE (PATTERN (insn)) != CLOBBER + && GET_CODE (PATTERN (insn)) != CLOBBER_HIGH && GET_CODE (PATTERN (insn)) != USE && insn_could_throw_p (insn)) make_reg_eh_region_note (insn, 0, lp_nr); diff --git a/gcc/combine-stack-adj.c b/gcc/combine-stack-adj.c index 09f0be814f98922b6926a929401894809a890f61..595e83c73760a97e0f8ebd99e12b1853d1d52b92 100644 --- a/gcc/combine-stack-adj.c +++ b/gcc/combine-stack-adj.c @@ -133,6 +133,7 @@ single_set_for_csa (rtx_insn *insn) && SET_SRC (this_rtx) == SET_DEST (this_rtx)) ; else if (GET_CODE (this_rtx) != CLOBBER + && GET_CODE (this_rtx) != CLOBBER_HIGH && GET_CODE (this_rtx)
[PATCH 4/7]: lra support for clobber_high
This patch simply adds the lra specific changes for clobber_high. Alan. 2017-11-16 Alan Hayward * lra-eliminations.c (lra_eliminate_regs_1): Check for clobber high. (mark_not_eliminable): Likewise. * lra-int.h (struct lra_insn_reg): Add clobber high marker. * lra-lives.c (process_bb_lives): Check for clobber high. * lra.c (new_insn_reg): Remember clobber highs. (collect_non_operand_hard_regs): Check for clobber high. (lra_set_insn_recog_data): Likewise. (add_regs_to_insn_regno_info): Likewise. (lra_update_insn_regno_info): Likewise. diff --git a/gcc/lra-eliminations.c b/gcc/lra-eliminations.c index bea8b023b7cb7a512f7482a2f014647c30462870..251e539530456722e3f4231b928c2124f9d602b6 100644 --- a/gcc/lra-eliminations.c +++ b/gcc/lra-eliminations.c @@ -654,6 +654,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode mem_mode, return x; case CLOBBER: +case CLOBBER_HIGH: case SET: gcc_unreachable (); @@ -806,6 +807,16 @@ mark_not_eliminable (rtx x, machine_mode mem_mode) setup_can_eliminate (ep, false); return; +case CLOBBER_HIGH: + gcc_assert (REG_P (XEXP (x, 0))); + gcc_assert (REGNO (XEXP (x, 0)) < FIRST_PSEUDO_REGISTER); + for (ep = reg_eliminate; + ep < ®_eliminate[NUM_ELIMINABLE_REGS]; + ep++) + if (reg_is_clobbered_by_clobber_high (ep->to_rtx, XEXP (x, 0))) + setup_can_eliminate (ep, false); + return; + case SET: if (SET_DEST (x) == stack_pointer_rtx && GET_CODE (SET_SRC (x)) == PLUS diff --git a/gcc/lra-int.h b/gcc/lra-int.h index 6c219eacee3940054ed480a44cda1ed07993ad16..be439b95e9cedb358e9ba6c63f8a9490af5c816d 100644 --- a/gcc/lra-int.h +++ b/gcc/lra-int.h @@ -166,6 +166,8 @@ struct lra_insn_reg /* True if there is an early clobber alternative for this operand. */ unsigned int early_clobber : 1; + /* True if the reg is clobber highed by the operand. */ + unsigned int clobber_high : 1; /* The corresponding regno of the register. */ int regno; /* Next reg info of the same insn. 
*/ diff --git a/gcc/lra-lives.c b/gcc/lra-lives.c index df7e2537dd09a4abd5ce517f4bb556cc32000fa0..82a1811f837b425a326753c0d956149382d752cb 100644 --- a/gcc/lra-lives.c +++ b/gcc/lra-lives.c @@ -655,7 +655,7 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p) bool call_p; int n_alt, dst_regno, src_regno; rtx set; - struct lra_insn_reg *reg; + struct lra_insn_reg *reg, *hr; if (!NONDEBUG_INSN_P (curr_insn)) continue; @@ -687,11 +687,12 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p) break; } for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next) - if (reg->type != OP_IN) + if (reg->type != OP_IN && !reg->clobber_high) { remove_p = false; break; } + if (remove_p && ! volatile_refs_p (PATTERN (curr_insn))) { dst_regno = REGNO (SET_DEST (set)); @@ -809,14 +810,24 @@ process_bb_lives (basic_block bb, int &curr_point, bool dead_insn_p) unused values because they still conflict with quantities that are live at the time of the definition. */ for (reg = curr_id->regs; reg != NULL; reg = reg->next) - if (reg->type != OP_IN) - { - need_curr_point_incr - |= mark_regno_live (reg->regno, reg->biggest_mode, - curr_point); - check_pseudos_live_through_calls (reg->regno, - last_call_used_reg_set); - } + { + if (reg->type != OP_IN) + { + need_curr_point_incr + |= mark_regno_live (reg->regno, reg->biggest_mode, + curr_point); + check_pseudos_live_through_calls (reg->regno, + last_call_used_reg_set); + } + + if (reg->regno >= FIRST_PSEUDO_REGISTER) + for (hr = curr_static_id->hard_regs; hr != NULL; hr = hr->next) + if (hr->clobber_high + && may_gt (GET_MODE_SIZE (PSEUDO_REGNO_MODE (reg->regno)), +GET_MODE_SIZE (hr->biggest_mode))) + SET_HARD_REG_BIT (lra_reg_info[reg->regno].conflict_hard_regs, + hr->regno); + } for (reg = curr_static_id->hard_regs; reg != NULL; reg = reg->next) if (reg->type != OP_IN) diff --git a/gcc/lra.c b/gcc/lra.c index 8d44c75b0b4f89ff9fe94d0b8dfb2e77d43fee26..d6775d629655700484760ab85f78b9fd16189ca0 
100644 --- a/gcc/lra.c +++ b/gcc/lra.c @@ -535
[PATCH 3/7] Add func to check if register is clobbered by clobber_high
This patch adds the function reg_is_clobbered_by_clobber_high. Given a CLOBBER_HIGH expression and a register, it checks if the register will be clobbered. A second version exists for the cases where the expressions are not available. The function will be used throughout the following patches. Alan. 2017-11-16 Alan Hayward * rtl.h (reg_is_clobbered_by_clobber_high): Add declarations. * rtlanal.c (reg_is_clobbered_by_clobber_high): Add function. diff --git a/gcc/rtl.h b/gcc/rtl.h index bdb05d00120e7fadeb7f2d29bd67afc7a77262c1..5a85eb42ea4455cf3a975b3adbdb9d0415441d3b 100644 --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -3417,6 +3417,16 @@ extern bool tablejump_p (const rtx_insn *, rtx_insn **, rtx_jump_table_data **); extern int computed_jump_p (const rtx_insn *); extern bool tls_referenced_p (const_rtx); extern bool contains_mem_rtx_p (rtx x); +extern bool reg_is_clobbered_by_clobber_high (unsigned int, machine_mode, + const_rtx); + +/* Convenient wrapper for reg_is_clobbered_by_clobber_high. */ +inline bool +reg_is_clobbered_by_clobber_high (const_rtx x, const_rtx clobber_high_op) +{ + return reg_is_clobbered_by_clobber_high (REGNO (x), GET_MODE (x), + clobber_high_op); +} /* Overload for refers_to_regno_p for checking a single register. */ inline bool diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c index 79a5ae197c14ba338240123f2fc912f2ea60e178..923e3314d25c05f9055907c61b4a24186701cc23 100644 --- a/gcc/rtlanal.c +++ b/gcc/rtlanal.c @@ -6519,3 +6519,32 @@ tls_referenced_p (const_rtx x) return true; return false; } + +/* Return true if reg REGNO with mode REG_MODE would be clobbered by the + clobber_high operand in CLOBBER_HIGH_OP. 
*/ + +bool +reg_is_clobbered_by_clobber_high (unsigned int regno, machine_mode reg_mode, + const_rtx clobber_high_op) +{ + unsigned int clobber_regno = REGNO (clobber_high_op); + machine_mode clobber_mode = GET_MODE (clobber_high_op); + unsigned char regno_nregs = hard_regno_nregs (regno, reg_mode); + + /* Clobber high should always span exactly one register. */ + gcc_assert (REG_NREGS (clobber_high_op) == 1); + + /* Clobber high needs to match with one of the registers in X. */ + if (clobber_regno < regno || clobber_regno >= regno + regno_nregs) +return false; + + gcc_assert (reg_mode != BLKmode && clobber_mode != BLKmode); + + if (reg_mode == VOIDmode) +return clobber_mode != VOIDmode; + + /* Clobber high will clobber if its size might be greater than the size of + register regno. */ + return may_gt (exact_div (GET_MODE_SIZE (reg_mode), regno_nregs), +GET_MODE_SIZE (clobber_mode)); +}
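The predicate added above can be restated as a standalone model (names and sizes hypothetical; this mirrors the logic of reg_is_clobbered_by_clobber_high, not the GCC code itself):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the clobber_high check: a clobber_high names a single hard
   register CLOBBER_REGNO and preserves its low CLOBBER_BYTES bytes.
   A value of REG_BYTES bytes held in REGNO_NREGS consecutive registers
   starting at REGNO is clobbered only if the named register falls
   inside that span and the per-register size of the value exceeds the
   preserved low part.  */
static bool
model_clobbered_p (unsigned regno, unsigned regno_nregs, unsigned reg_bytes,
                   unsigned clobber_regno, unsigned clobber_bytes)
{
  /* The clobber_high must hit one of the registers holding the value.  */
  if (clobber_regno < regno || clobber_regno >= regno + regno_nregs)
    return false;
  /* Clobbered only if the value is wider than the preserved low part.  */
  return reg_bytes / regno_nregs > clobber_bytes;
}
```

So a 32-byte SVE value is clobbered by a TImode (16-byte) clobber_high of its register, while a 16-byte NEON value in the same register is preserved.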
[PATCH 2/7] Support >26 operands in generation code.
This patch adds support for CLOBBER_HIGH in the generation code. AArch64 will require 31 clobber high expressions, plus two clobbers. The existing gen code restricts to 26 vector operands by virtue of using the characters [a-z]. This patch extends this to 52 by supporting [a-zA-Z]. Alan. 2017-11-16 Alan Hayward * emit-rtl.c (verify_rtx_sharing): Check CLOBBER_HIGH. (copy_insn_1): Likewise. (gen_hard_reg_clobber_high): Add function. * genconfig.c (walk_insn_part): Check CLOBBER_HIGH. * genemit.c (gen_exp): Likewise. (gen_emit_seq): Pass thru info. (gen_insn): Check CLOBBER_HIGH. (gen_expand): Pass thru info. (gen_split): Likewise. (output_add_clobbers): Likewise. * genextract.c (push_pathstr_operand): New function to support [a-zA-Z]. (walk_rtx): Call push_pathstr_operand. (print_path): Support [a-zA-Z]. * genrecog.c (validate_pattern): Check CLOBBER_HIGH. (remove_clobbers): Likewise. * rtl.h (gen_hard_reg_clobber_high): Add declaration. diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c index af4a038d75acf17c7b04ad58ab7467f7bd7cd129..64159a82e3f79b792d58166d4307db8751c88980 100644 --- a/gcc/emit-rtl.c +++ b/gcc/emit-rtl.c @@ -2895,6 +2895,7 @@ verify_rtx_sharing (rtx orig, rtx insn) /* SCRATCH must be shared because they represent distinct values. */ return; case CLOBBER: +case CLOBBER_HIGH: /* Share clobbers of hard registers (like cc0), but do not share pseudo reg clobbers or clobbers of hard registers that originated as pseudos. This is needed to allow safe register renaming. */ @@ -3148,6 +3149,7 @@ repeat: /* SCRATCH must be shared because they represent distinct values. */ return; case CLOBBER: +case CLOBBER_HIGH: /* Share clobbers of hard registers (like cc0), but do not share pseudo reg clobbers or clobbers of hard registers that originated as pseudos. This is needed to allow safe register renaming.
*/ @@ -5707,6 +5709,7 @@ copy_insn_1 (rtx orig) case SIMPLE_RETURN: return orig; case CLOBBER: +case CLOBBER_HIGH: /* Share clobbers of hard registers (like cc0), but do not share pseudo reg clobbers or clobbers of hard registers that originated as pseudos. This is needed to allow safe register renaming. */ @@ -6529,6 +6532,19 @@ gen_hard_reg_clobber (machine_mode mode, unsigned int regno) gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno))); } +static GTY((deletable)) rtx +hard_reg_clobbers_high[NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER]; + +rtx +gen_hard_reg_clobber_high (machine_mode mode, unsigned int regno) +{ + if (hard_reg_clobbers_high[mode][regno]) +return hard_reg_clobbers_high[mode][regno]; + else +return (hard_reg_clobbers_high[mode][regno] + = gen_rtx_CLOBBER_HIGH (VOIDmode, gen_rtx_REG (mode, regno))); +} + location_t prologue_location; location_t epilogue_location; diff --git a/gcc/genconfig.c b/gcc/genconfig.c index 4ff36cb019d427f410d9f251777b9b05217fac36..4108e9c457fce5529ec9a3284d37f933736776ad 100644 --- a/gcc/genconfig.c +++ b/gcc/genconfig.c @@ -72,6 +72,7 @@ walk_insn_part (rtx part, int recog_p, int non_pc_set_src) switch (code) { case CLOBBER: +case CLOBBER_HIGH: clobbers_seen_this_insn++; break; diff --git a/gcc/genemit.c b/gcc/genemit.c index 708da27221546c406030e88a4b07a51fb9df4a14..4e93b6b9831c65fc829fed6367881233b8eddcac 100644 --- a/gcc/genemit.c +++ b/gcc/genemit.c @@ -79,7 +79,7 @@ gen_rtx_scratch (rtx x, enum rtx_code subroutine_type) substituting any operand references appearing within. 
*/ static void -gen_exp (rtx x, enum rtx_code subroutine_type, char *used) +gen_exp (rtx x, enum rtx_code subroutine_type, char *used, md_rtx_info *info) { RTX_CODE code; int i; @@ -123,7 +123,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used) for (i = 0; i < XVECLEN (x, 1); i++) { printf (",\n\t\t"); - gen_exp (XVECEXP (x, 1, i), subroutine_type, used); + gen_exp (XVECEXP (x, 1, i), subroutine_type, used, info); } printf (")"); return; @@ -137,7 +137,7 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used) for (i = 0; i < XVECLEN (x, 2); i++) { printf (",\n\t\t"); - gen_exp (XVECEXP (x, 2, i), subroutine_type, used); + gen_exp (XVECEXP (x, 2, i), subroutine_type, used, info); } printf (")"); return; @@ -163,12 +163,21 @@ gen_exp (rtx x, enum rtx_code subroutine_type, char *used) case CLOBBER: if (REG_P (XEXP (x, 0))) { - printf ("gen_hard_reg_clobber (%smode, %i)", GET_MODE_NAME (GET_MODE (XEXP (x, 0))), -REGNO (XEXP (x, 0
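The widened operand-letter encoding can be sketched in isolation. This is a hypothetical model of the mapping only, not the actual genextract.c code (the exact scheme there may differ): operands 0-25 keep letters 'a'-'z' as before, and operands 26-51 take 'A'-'Z'.

```c
/* Hypothetical sketch of the 52-letter operand path encoding.  */
static char
encode_operand (unsigned int n)
{
  /* 0..25 -> 'a'..'z', 26..51 -> 'A'..'Z'.  */
  return n < 26 ? 'a' + n : 'A' + (n - 26);
}

static unsigned int
decode_operand (char c)
{
  return (c >= 'a' && c <= 'z') ? (unsigned int) (c - 'a')
                                : (unsigned int) (c - 'A') + 26;
}
```

This comfortably covers the 33 operands an AArch64 TLS pattern would need (31 clobber highs plus two clobbers) while staying within single-character paths.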
[PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
This is a set of patches aimed at supporting AArch64 SVE register preservation around TLS calls. Across a TLS call, AArch64 SVE does not explicitly preserve the SVE vector registers. However, the Neon vector registers are preserved. Due to overlapping of registers, this means the lower 128 bits of all SVE vector registers will be preserved. The existing GCC code will currently incorrectly assume preservation of all of the SVE registers. This patch introduces a CLOBBER_HIGH expression. This behaves a bit like a CLOBBER expression. CLOBBER_HIGH can only refer to a single register. The mode of the expression indicates the size of the lower bits which will be preserved. If the register contains a value bigger than this mode then the code will treat the register as clobbered. This means that, in order to evaluate whether a clobber high is relevant, we need to ensure that the mode of the existing value in a register is tracked. The following patches in this series add support for CLOBBER_HIGH, with the final patch adding CLOBBER_HIGHs around TLS_DESC calls for aarch64. The testing performed on these patches is also detailed in the final patch. These patches are based on top of the linaro-dev/sve branch. A simpler alternative to this patch would be to assume all Neon and SVE registers are clobbered across TLS calls, however this would be a performance regression against all AArch64 targets. Alan. 2017-11-16 Alan Hayward * doc/rtl.texi (clobber_high): Add. (parallel): Add in clobber high * rtl.c (rtl_check_failed_code3): Add function. * rtl.def (CLOBBER_HIGH): Add expression. * rtl.h (RTL_CHECKC3): Add macro. (rtl_check_failed_code3): Add declaration. (XC3EXP): Add macro. diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi index f583940b9441b2111c8d65a00a064e89bdd2ffaf..951322258ddbb57900225bd501bd23a8a9970ead 100644 --- a/gcc/doc/rtl.texi +++ b/gcc/doc/rtl.texi @@ -3209,6 +3209,18 @@ There is one other known use for clobbering a pseudo register in a clobbered by the insn.
In this case, using the same pseudo register in the clobber and elsewhere in the insn produces the expected results. +@findex clobber_high +@item (clobber_high @var{x}) +Represents the storing or possible storing of an unpredictable, +undescribed value into the upper parts of @var{x}. The mode of the expression +represents the lower parts of the register which will not be overwritten. +@code{reg} must be a reg expression. + +One place this is used is when calling into functions where the registers are +preserved, but only up to a given number of bits. For example when using +Aarch64 SVE, calling a TLS descriptor will cause only the lower 128 bits of +each of the vector registers to be preserved. + @findex use @item (use @var{x}) Represents the use of the value of @var{x}. It indicates that the @@ -3262,7 +3274,8 @@ Represents several side effects performed in parallel. The square brackets stand for a vector; the operand of @code{parallel} is a vector of expressions. @var{x0}, @var{x1} and so on are individual side effect expressions---expressions of code @code{set}, @code{call}, -@code{return}, @code{simple_return}, @code{clobber} or @code{use}. +@code{return}, @code{simple_return}, @code{clobber} @code{use} or +@code{clobber_high}. 
``In parallel'' means that first all the values used in the individual side-effects are computed, and second all the actual side-effects are diff --git a/gcc/rtl.c b/gcc/rtl.c index 3b2728be8b506fb3c14a20297cf92368caa5ca3b..6db84f99627bb8617c6e227892ca44076f4e729b 100644 --- a/gcc/rtl.c +++ b/gcc/rtl.c @@ -860,6 +860,17 @@ rtl_check_failed_code2 (const_rtx r, enum rtx_code code1, enum rtx_code code2, } void +rtl_check_failed_code3 (const_rtx r, enum rtx_code code1, enum rtx_code code2, + enum rtx_code code3, const char *file, int line, + const char *func) +{ + internal_error +("RTL check: expected code '%s', '%s' or '%s', have '%s' in %s, at %s:%d", + GET_RTX_NAME (code1), GET_RTX_NAME (code2), GET_RTX_NAME (code3), + GET_RTX_NAME (GET_CODE (r)), func, trim_filename (file), line); +} + +void rtl_check_failed_code_mode (const_rtx r, enum rtx_code code, machine_mode mode, bool not_mode, const char *file, int line, const char *func) diff --git a/gcc/rtl.def b/gcc/rtl.def index 83bcfcaadcacc45cce352bf7fba33fbbc87ccd58..a6c4d4a46c4eb4f6cb0eca66a3f6a558f94acc8a 100644 --- a/gcc/rtl.def +++ b/gcc/rtl.def @@ -312,6 +312,16 @@ DEF_RTL_EXPR(USE, "use", "e", RTX_EXTRA) is considered undeletable before reload. */ DEF_RTL_EXPR(CLOBBER, "clobber", "e", RTX_EXTRA) +/* Indicate that the upper parts of something are clobbered in a way that we + don't want to explain. The MODE references the lower bits that will be + p
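As a concrete illustration of the intended use: a TLS descriptor call pattern could wrap the call in a parallel whose clobber_high expressions state that only the low 128 bits (TImode) of each vector register survive. The fragment below is a hypothetical sketch only; the register names, unspec, and operand details are illustrative and not taken from the actual aarch64 pattern in the final patch.

```
;; Hypothetical sketch of a TLS descriptor call using clobber_high.
(parallel
  [(set (reg:DI R0) (unspec:DI [(match_operand 0 "" "")] UNSPEC_TLSDESC))
   (clobber (reg:DI LR))
   (clobber_high (reg:TI V0))
   ;; ... one clobber_high per remaining vector register ...
  ])
```

Because each clobber_high has TImode, a register holding a value no wider than 128 bits is unaffected, while a wider SVE value is treated as clobbered.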
Re: [PATCH] Implement cond and induction cond reduction w/o REDUC_MAX_EXPR
> On 22 Jun 2017, at 12:54, Richard Biener wrote: > > On Thu, 22 Jun 2017, Alan Hayward wrote: > >> >>> On 22 Jun 2017, at 09:52, Richard Biener wrote: >>> >>> On Thu, 22 Jun 2017, Richard Biener wrote: >>> >>>> On Wed, 21 Jun 2017, Richard Biener wrote: >>>> >>>>> >>>>> During my attempt to refactor reduction vectorization I ran across >>>>> the special casing of inital values for INTEGER_INDUC_COND_REDUCTION >>>>> and tried to see what it is about. So I ended up implementing >>>>> cond reduction support for targets w/o REDUC_MAX_EXPR by simply >>>>> doing the reduction in scalar code -- while that results in an >>>>> expensive epilogue the vector loop should be reasonably fast. >>>>> >>>>> I still didn't run into any exec FAILs in vect.exp with removing >>>>> the INTEGER_INDUC_COND_REDUCTION special case thus the following >>>>> patch. >>>>> >>>>> Alan -- is there a testcase (maybe full bootstrap & regtest will >>>>> unconver one) that shows how this is necessary? >>>> >>>> Testing has some fallout and I discovered what this special-case is >>>> about. I'll commit the below testcase to exercise the case. >>> >>> After fixing the typo (sorry) the testcase also shows a latent >>> wrong-code bug. We use an induction vector starting at index 1 >>> for COND_REDUCTION to allow for a special value of zero but it >>> seems the INTEGER_INDUC_COND_REDUCTION special-casing of the >>> default value relies on the value of zero being special but >>> in the testcase it is meaningful… >>> >> >> Yes, looks like you’re right. Apologies. >> >> >>> In fact, given that for INTEGER_INDUC_COND_REDUCTION the >>> values, while based on an induction variable, do not have to >>> match 1:1, we can modify the testcase to use last = 2*i; or >>> last = i - 1; Which means we have to add a second vector >>> to specify whether a lane was set or not, basically use >>> the COND_REDUCTION code? >>> >> >> We could only allow loops that start from 1, but that would exclude the >> most common uses. 
>> >> I was wondering if we could special case “-1” as the default value index, >> but that would break the max operation in the epilogue. >> >> How about, in the loop the vectcond, instead of storing index, could store >> index+offset where offset is the value required to get the loop initial >> value to 1. >> Then in the epilogue, the result is: >> “If max == 0 ? default : max - offset" >> >> Using offset would also allow loops that start negative - >> for (int i = -2; i < N; i++) would have offset set to 3. >> >> is_nonwrapping_integer_induction() would have to be updated too. > > What for > > last = -i; > > then? Or non-constant step? is_nonwrapping_integer_induction currently rejects those, and I wouldn’t suggest changing that. It requires that a loop use positive increments with a fixed integer base and step, and that it will not overflow. If that fails, the standard COND_REDUCTION will kick in instead. > > I guess we can look at the scalar evolution of the value > stored and simply use a variable special value, like, > for constant step, use CHREC_LEFT + (sign-of-CHREC_RIGHT * -1) > if that value is < CHREC_LEFT (hmm, so only for positive > constant CHREC_RIGHT). > > The easiest fix is probably to use an extra induction variable > and the same code as in the COND_REDUCTION case. > > Or simply use a "flag" vector that is set to -1 when the condition > is true, thus precompute the final != 0 vector compare value as IV. Wouldn’t this result in similar code to the COND_REDUCTION case? The INTEGER_INDUC_COND_REDUCTION cases will work fine with the COND_REDUCTION code. Just need to disable the initial “condition expression based on integer induction” in vectorizable_reduction. > > So a quick fix would be to change is_nonwrapping_integer_induction > to reject any CHREC that might become zero during the loop evaluation. > But it will probably disable the most interesting cases... > > Richard. > >> >> Alan. >> >> >>> Richard. >>> >>>> Richard.
>>>> >>>> 2017-06-22 Richard Biener >>>> >>>>* gcc.dg/vect/pr65947-14.c: New testcase. >>>> >>>> Index: gcc/testsuite/gcc.dg/vect/pr65947-14.c >>
Re: [PATCH] Implement cond and induction cond reduction w/o REDUC_MAX_EXPR
> On 22 Jun 2017, at 09:52, Richard Biener wrote: > > On Thu, 22 Jun 2017, Richard Biener wrote: > >> On Wed, 21 Jun 2017, Richard Biener wrote: >> >>> >>> During my attempt to refactor reduction vectorization I ran across >>> the special casing of inital values for INTEGER_INDUC_COND_REDUCTION >>> and tried to see what it is about. So I ended up implementing >>> cond reduction support for targets w/o REDUC_MAX_EXPR by simply >>> doing the reduction in scalar code -- while that results in an >>> expensive epilogue the vector loop should be reasonably fast. >>> >>> I still didn't run into any exec FAILs in vect.exp with removing >>> the INTEGER_INDUC_COND_REDUCTION special case thus the following >>> patch. >>> >>> Alan -- is there a testcase (maybe full bootstrap & regtest will >>> unconver one) that shows how this is necessary? >> >> Testing has some fallout and I discovered what this special-case is >> about. I'll commit the below testcase to exercise the case. > > After fixing the typo (sorry) the testcase also shows a latent > wrong-code bug. We use an induction vector starting at index 1 > for COND_REDUCTION to allow for a special value of zero but it > seems the INTEGER_INDUC_COND_REDUCTION special-casing of the > default value relies on the value of zero being special but > in the testcase it is meaningful… > Yes, looks like you’re right. Apologies. > In fact, given that for INTEGER_INDUC_COND_REDUCTION the > values, while based on an induction variable, do not have to > match 1:1, we can modify the testcase to use last = 2*i; or > last = i - 1; Which means we have to add a second vector > to specify whether a lane was set or not, basically use > the COND_REDUCTION code? > We could only allow loops that start from 1, but that would exclude the most common uses. I was wondering if we could special case “-1” as the default value index, but that would break the max operation in the epilogue. 
How about: in the loop, the vectorized cond expression, instead of storing the index, could store index+offset, where offset is the value required to bring the loop's initial value to 1. Then in the epilogue, the result is: “If max == 0 ? default : max - offset" Using offset would also allow loops that start negative - for (int i = -2; i < N; i++) would have offset set to 3. is_nonwrapping_integer_induction() would have to be updated too. Alan. > Richard. > >> Richard. >> >> 2017-06-22 Richard Biener >> >> * gcc.dg/vect/pr65947-14.c: New testcase. >> >> Index: gcc/testsuite/gcc.dg/vect/pr65947-14.c >> === >> --- gcc/testsuite/gcc.dg/vect/pr65947-14.c (nonexistent) >> +++ gcc/testsuite/gcc.dg/vect/pr65947-14.c (working copy) >> @@ -0,0 +1,44 @@ >> +/* { dg-require-effective-target vect_condition } */ >> + >> +#include "tree-vect.h" >> + >> +extern void abort (void) __attribute__ ((noreturn)); >> + >> +#define N 27 >> + >> +/* Condition reduction with matches only in even lanes at runtime. */ >> + >> +int >> +condition_reduction (int *a, int min_v) >> +{ >> + int last = N + 96; >> + >> + for (int i = 0; i < N; i++) >> +if (a[i] > min_v) >> + last = i; >> + >> + return last; >> +} >> + >> +int >> +main (void) >> +{ >> + int a[N] = { >> + 47, 12, 13, 14, 15, 16, 17, 18, 19, 20, >> + 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, >> + 21, 22, 23, 24, 25, 26, 27 >> + }; >> + >> + check_vect (); >> + >> + int ret = condition_reduction (a, 46); >> + >> + /* loop should have found a value of 0, not default N + 96. */ >> + if (ret != 0) >> +abort (); >> + >> + return 0; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { xfail { ! >> vect_max_reduc } } } */ >> +/* { dg-final { scan-tree-dump-times "condition expression based on integer >> induction." 4 "vect" } } */ >> > > -- > Richard Biener > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB > 21284 (AG Nuernberg)
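The index+offset scheme proposed above can be modelled in scalar form. This is an illustrative sketch of the idea only, not vectorizer code; the function name and parameters are hypothetical. For a loop starting at i = -2, offset is 3, so the first iteration stores index 1, leaving 0 free as the "no match" sentinel.

```c
/* Scalar model of the proposal: store i + offset on each match, keep the
   running maximum (what the vector lanes would hold), then undo the offset
   in the epilogue, with 0 meaning "no lane matched".  */
static int
cond_reduction_with_offset (const int *a, int n, int min_v, int dflt)
{
  const int first = -2;
  const int offset = 1 - first;   /* maps the first i to index 1 */
  int max = 0;

  for (int i = first; i < n; i++)
    if (a[i - first] > min_v)
      {
        int idx = i + offset;
        if (idx > max)
          max = idx;
      }

  /* Epilogue: "If max == 0 ? default : max - offset".  */
  return max == 0 ? dflt : max - offset;
}
```

With data {1, 9, 2, 9, 3} and min_v = 5, the last matching i is 1, and the epilogue recovers it as max (4) minus offset (3); with no matches the sentinel 0 selects the default.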
Re: [PATCH] Fix PR81090, properly free niter estimates
> On 19 Jun 2017, at 13:35, Richard Biener wrote: > > On Mon, 19 Jun 2017, Christophe Lyon wrote: > >> Hi Richard, >> >> On 16 June 2017 at 14:18, Richard Biener wrote: >>> On Wed, 14 Jun 2017, Richard Biener wrote: >>> niter estimates are not kept up-to-date (they reference gimple stmts and trees) in the keep-loop-stuff infrastructure so similar to the SCEV cache we rely on people freeing it after passes. The following brings us a step closer to that by freeing them whenever SCEV is invalidated (we only compute them when SCEV is active) plus removing the odd record-bounds pass that just computes them, leaving scavenging to following passes. Bootstrap and regtest running on x86_64-unknown-linux-gnu. >>> >>> Some awkward interactions with peeling means I'm installing the >>> following less aggressive variant. >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. >>> >>> Richard. >>> >>> 2017-06-16 Richard Biener >>> >>>PR tree-optimization/81090 >>>* passes.def (pass_record_bounds): Remove. >>>* tree-pass.h (make_pass_record_bounds): Likewise. >>>* tree-ssa-loop.c (pass_data_record_bounds, pass_record_bounds, >>>make_pass_record_bounds): Likewise. >>>* tree-ssa-loop-ivcanon.c (canonicalize_induction_variables): Do >>>not free niter estimates at the beginning but at the end. >>>* tree-scalar-evolution.c (scev_finalize): Free niter estimates. >>> >>>* gcc.dg/graphite/pr81090.c: New testcase. >>> >> >> Sorry to bother you again... >> With this commit (r249249), I've noticed regressions on aarch64/arm: >> FAIL:gcc.dg/vect/pr65947-9.c -flto -ffat-lto-objects >> scan-tree-dump-not vect "LOOP VECTORIZED" >> FAIL:gcc.dg/vect/pr65947-9.c scan-tree-dump-not vect "LOOP VECTORIZED" > > So the testcase gets vectorized now (for whatever reason) and still passes > execution. Not sure why the testcase checked for not being vectorized. > > Alan? > > Richard. 
> I’ve not looked at the new patch, but pr65947-9.c was added to test: + /* Condition reduction with maximum possible loop size. Will fail to +vectorize because the vectorisation requires a slot for default values. */ So, in the pr65947-9.c, if nothing passes the IF clause, then LAST needs to be set to -72. To vectorise pr65947-9.c the code will do: + /* For cond reductions we want to create a new vector (INDEX_COND_EXPR) +which is updated with the current index of the loop for every match of +the original loop's cond_expr (VEC_STMT). This results in a vector +containing the last time the condition passed for that vector lane. +The first match will be a 1 to allow 0 to be used for non-matching +indexes. If there are no matches at all then the vector will be all +zeroes. */ Which will then be used in the following way: + /* For condition reductions, we have a vector (NEW_PHI_RESULT) containing +various data values where the condition matched and another vector +(INDUCTION_INDEX) containing all the indexes of those matches. We +need to extract the last matching index (which will be the index with +highest value) and use this to index into the data vector. +For the case where there were no matches, the data vector will contain +all default values and the index vector will be all zeros. */ However, the INDUCTION_INDEX values will have a max size: + /* The additional index will be the same type as the condition. Check +that the loop can fit into this less one (because we'll use up the +zero slot for when there are no matches). */ + tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type); + if (wi::geu_p (ni, wi::to_widest (max_index))) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, +"loop size is greater than data size.\n"); + return false; + } +} Going back to pr65947-9.c, the loop size is 255. 
+ #define N 255 Given that the default value takes up value 0 of our index, it means the maximum possible size of a value in INDUCTION_INDEX will be 256. However, the type of INDUCTION_INDEX will match the type of A, which is char: +char +condition_reduction (char *a, char min_v) +{ + char last = -72; + + for (int i = 0; i < N; i++) +if (a[i] < min_v) + last = a[i]; + + return last; +} So, it’s not safe for the loop to be vectorised, and it should fail to do so with the message "loop size is greater than data size”. (All the code snippets are taken from patch 956442b598f3a7241119624edb9af0155a42bf6c "Support for vectorizing conditional expressions” on 2015-10-23; it’s probable some of the code may have changed since then.) Hope that makes sense, Alan. >> Christophe >> >>> Index: gcc/passes.def >>> =
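The capacity check quoted above reduces to a single comparison, sketched here with plain integers (an illustrative model, not the vectorizer code): the zero slot is reserved for "no match", so the loop's iteration count must stay strictly below the index element type's maximum value.

```c
/* Model of the "loop size is greater than data size" check: NI iterations
   need indexes 1..NI with 0 reserved, so vectorization is rejected unless
   NI is strictly less than the index type's maximum value.  */
static int
loop_fits_index_type (unsigned long ni, unsigned long max_index)
{
  return ni < max_index;
}
```

For pr65947-9.c, N is 255, so the check fails whether char is signed (maximum 127) or unsigned (maximum 255), which is why the testcase originally scanned for the loop not being vectorized.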
Re: [PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs
On 16/08/2016 10:01, "Alan Hayward" wrote: > > >On 16/08/2016 09:33, "Richard Biener" wrote: > >>On Mon, Aug 15, 2016 at 4:16 PM, Alan Hayward >>wrote: >>> >>> >>> On 15/08/2016 12:17, "Richard Biener" >>>wrote: >>> >>>>On Mon, Aug 15, 2016 at 11:48 AM, Alan Hayward >>>>wrote: >>>>> The testcase pr71752.c was failing because the SLP code was >>>>>generating >>>>>an >>>>> SLP >>>>> vector using arguments from the SLP scalar stmts, but was using the >>>>>wrong >>>>> argument number. >>>>> >>>>> vect_get_slp_defs() is given a vector of operands. When calling down >>>>>to >>>>> vect_get_constant_vectors it uses i as op_num - making the assumption >>>>>that >>>>> the >>>>> first op in the vector refers to the first argument in the SLP scalar >>>>> statement, the second op refers to the second arg and so on. >>>>> >>>>> However, previously in vectorizable_reduction, the call to >>>>> vect_get_vec_defs >>>>> destroyed this ordering by potentially only passing op1. >>>>> >>>>> The solution is in vectorizable_reduction to create a vector of >>>>>operands >>>>> equal >>>>> in size to the number of arguments in the SLP statements. We maintain >>>>>the >>>>> argument ordering and if we don't require defs for that argument we >>>>>instead >>>>> push NULL into the vector. In vect_get_slp_defs we need to handle >>>>>cases >>>>> where >>>>> an op might be NULL. >>>>> >>>>> Tested with a check run on X86 and AArch64. >>>>> Ok to commit? >>>>> >>>> >>>>Ugh. Note the logic in vect_get_slp_defs is incredibly fragile - I >>>>think you can't >>>>simply "skip" ops the way you do as you need to still increment >>>>child_index >>>>accordingly for ignored ops. >>> >>> Agreed, I should be maintaining the child_index. >>> Looking at the code, I need a valid oprnd for that code to work. >>> >>> Given that the size of the ops vector is never more than 3, it probably >>> makes >>> sense to reset child_index each time and do a quick for loop through >>>all >>> the >>> children. 
>> >>Hmm, yes - that should work. It should also remove the need to keep >>operand order in sync? So it might no longer require the >>vectorizable_reduction change. > >We still need to pass the correct index down to >vect_get_constant_vectors(), >which requires the operands in the original order for i to be correct. >(We can’t use child_index for that). > >Another option would be to pass down an additional vector containing the >indexes >of the operands. But I think that’d be an even worse option. > Updated with a loop for calculating child_index. Tested with a check run for x86 and aarch64. diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c b/gcc/testsuite/gcc.dg/vect/pr71752.c new file mode 100644 index ..8d26754b4fedf8b104caae8742a445dff bf23f0a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr71752.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ + +unsigned int q4, yg; + +unsigned int +w6 (unsigned int z5, unsigned int jv) +{ + unsigned int *f2 = &jv; + + while (*f2 < 21) +{ + q4 -= jv; + z5 -= jv; + f2 = &yg; + ++(*f2); +} + return z5; +} + diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a85 9ecd9b0 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -5364,7 +5364,7 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi, auto_vec vect_defs; auto_vec phis; int vec_num; - tree def0, def1, tem, op0, op1 = NULL_TREE; + tree def0, def1, tem, op1 = NULL_TREE; bool first_p = true; tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE; gimple *cond_expr_induction_def_stmt = NULL; @@ -5964,29 +5964,36 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi, /* Handle uses. */ if (j == 0) { - op0 = ops[!reduc_index]; - if (op_type == ternary_op) -{ - if (reduc_index == 0) -op1 = ops[2]; - else -op1 = ops[1]; -} + if
Re: [PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs
On 15/08/2016 12:17, "Richard Biener" wrote: >On Mon, Aug 15, 2016 at 11:48 AM, Alan Hayward >wrote: >> The testcase pr71752.c was failing because the SLP code was generating >>an >> SLP >> vector using arguments from the SLP scalar stmts, but was using the >>wrong >> argument number. >> >> vect_get_slp_defs() is given a vector of operands. When calling down to >> vect_get_constant_vectors it uses i as op_num - making the assumption >>that >> the >> first op in the vector refers to the first argument in the SLP scalar >> statement, the second op refers to the second arg and so on. >> >> However, previously in vectorizable_reduction, the call to >> vect_get_vec_defs >> destroyed this ordering by potentially only passing op1. >> >> The solution is in vectorizable_reduction to create a vector of operands >> equal >> in size to the number of arguments in the SLP statements. We maintain >>the >> argument ordering and if we don't require defs for that argument we >>instead >> push NULL into the vector. In vect_get_slp_defs we need to handle cases >> where >> an op might be NULL. >> >> Tested with a check run on X86 and AArch64. >> Ok to commit? >> > >Ugh. Note the logic in vect_get_slp_defs is incredibly fragile - I >think you can't >simply "skip" ops the way you do as you need to still increment >child_index >accordingly for ignored ops. Agreed, I should be maintaining the child_index. Looking at the code, I need a valid oprnd for that code to work. Given that the size of the ops vector is never more than 3, it probably makes sense to reset child_index each time and do a quick for loop through all the children. > >Why not let the function compute defs for all ops? That said, the >vectorizable_reduction >part certainly is fixing a bug (I think I've seen similar issues >elsewhere though). If you let it compute defs for all ops then you can end up creating invalid initial value SLP vectors which cause SSA failures (even though nothing uses those vectors). 
In the test case, SLP is defined for _3. Op1 of this is q4_lsm.5_8, but that is a loop phi. Creating SLP vec refs for it results in vect_cst__67 and vect_cst__68, which are clearly invalid: : _58 = yg_lsm.7_23 + 1; _59 = _58 + 1; _60 = _58 + 2; vect_cst__61 = {yg_lsm.7_23, _58, _59, _60}; vect_cst__62 = { 4, 4, 4, 4 }; vect_cst__67 = {q4_lsm.5_8, z5_19, q4_lsm.5_8, z5_19}; vect_cst__68 = {q4_lsm.5_8, z5_19, q4_lsm.5_8, z5_19}; vect_cst__69 = {jv_9(D), jv_9(D), jv_9(D), jv_9(D)}; vect_cst__70 = {jv_9(D), jv_9(D), jv_9(D), jv_9(D)}; vect_cst__73 = {0, 0, 0, 0}; vect_cst__74 = {q4_lsm.5_16, z5_10(D), 0, 0}; vect_cst__91 = { 1, 1, 1, 1 }; : # z5_19 = PHI # q4_lsm.5_8 = PHI # yg_lsm.7_17 = PHI # vect_vec_iv_.14_63 = PHI # vect__3.15_65 = PHI # vect__3.15_66 = PHI # ivtmp_94 = PHI <0(4), ivtmp_95(10)> vect_vec_iv_.14_64 = vect_vec_iv_.14_63 + vect_cst__62; _3 = q4_lsm.5_8 - jv_9(D); vect__3.15_71 = vect__3.15_65 - vect_cst__70; vect__3.15_72 = vect__3.15_66 - vect_cst__69; z5_13 = z5_19 - jv_9(D); vect__5.17_92 = vect_vec_iv_.14_63 + vect_cst__91; _5 = yg_lsm.7_17 + 1; ivtmp_95 = ivtmp_94 + 1; if (ivtmp_95 < bnd.11_18) goto ; else goto ; > >Richard. > >> Changelog: >> >> gcc/ >> * tree-vect-loop.c (vectorizable_reduction): Keep SLP operand >>ordering. >> * tree-vect-slp.c (vect_get_slp_defs): Handle null operands. >> >> gcc/testsuite/ >> * gcc.dg/vect/pr71752.c: New. >> >> >> >> Thanks, >> Alan. 
>> >> >> >> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c >> b/gcc/testsuite/gcc.dg/vect/pr71752.c >> new file mode 100644 >> index >> >>..8d26754b4fedf8b104caae8742a445d >>ff >> bf23f0a >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/pr71752.c >> @@ -0,0 +1,19 @@ >> +/* { dg-do compile } */ >> + >> +unsigned int q4, yg; >> + >> +unsigned int >> +w6 (unsigned int z5, unsigned int jv) >> +{ >> + unsigned int *f2 = &jv; >> + >> + while (*f2 < 21) >> +{ >> + q4 -= jv; >> + z5 -= jv; >> + f2 = &yg; >> + ++(*f2); >> +} >> + return z5; >> +} >> + >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c >> index >> >>2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a >>85 >> 9ecd9b0 100644 >> --- a/gcc/tree-vect-loop.c >> +++ b/gcc/tr
[PATCH] PR71752 - SLP: Maintain operand ordering when creating vec defs
The testcase pr71752.c was failing because the SLP code was generating an SLP vector using arguments from the SLP scalar stmts, but was using the wrong argument number. vect_get_slp_defs() is given a vector of operands. When calling down to vect_get_constant_vectors it uses i as op_num - making the assumption that the first op in the vector refers to the first argument in the SLP scalar statement, the second op refers to the second arg and so on. However, previously in vectorizable_reduction, the call to vect_get_vec_defs destroyed this ordering by potentially only passing op1. The solution is in vectorizable_reduction to create a vector of operands equal in size to the number of arguments in the SLP statements. We maintain the argument ordering and if we don't require defs for that argument we instead push NULL into the vector. In vect_get_slp_defs we need to handle cases where an op might be NULL. Tested with a check run on X86 and AArch64. Ok to commit? Changelog: gcc/ * tree-vect-loop.c (vectorizable_reduction): Keep SLP operand ordering. * tree-vect-slp.c (vect_get_slp_defs): Handle null operands. gcc/testsuite/ * gcc.dg/vect/pr71752.c: New. Thanks, Alan. 
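The shape of the fix can be sketched with a placeholder array (illustrative only; the real code uses auto_vec of tree operands): one slot per statement argument, with NULL marking the operand whose defs are not needed, so slot i always corresponds to argument i.

```c
#include <stddef.h>

#define MAX_OPS 3

/* Sketch: build the operand list passed down for def generation, keeping
   every argument in its original position and using NULL for the operand
   whose defs are not required (the reduction operand).  */
static void
build_ordered_ops (const char *ops[], size_t n_ops, size_t reduc_index,
                   const char *out[MAX_OPS])
{
  for (size_t i = 0; i < n_ops; i++)
    out[i] = (i == reduc_index) ? NULL : ops[i];
}
```

Compare this with the old code, which passed only the surviving operands and so shifted argument 2 into slot 1 whenever the reduction operand came first.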
diff --git a/gcc/testsuite/gcc.dg/vect/pr71752.c b/gcc/testsuite/gcc.dg/vect/pr71752.c
new file mode 100644
index ..8d26754b4fedf8b104caae8742a445dffbf23f0a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71752.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+
+unsigned int q4, yg;
+
+unsigned int
+w6 (unsigned int z5, unsigned int jv)
+{
+  unsigned int *f2 = &jv;
+
+  while (*f2 < 21)
+    {
+      q4 -= jv;
+      z5 -= jv;
+      f2 = &yg;
+      ++(*f2);
+    }
+  return z5;
+}
+
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 2a7e0c6661bc1ba82c9f03720e550749f2252a7c..826481af3d1d8b29bcdbd7d81c0fd5a859ecd9b0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -5364,7 +5364,7 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
   auto_vec vect_defs;
   auto_vec phis;
   int vec_num;
-  tree def0, def1, tem, op0, op1 = NULL_TREE;
+  tree def0, def1, tem, op1 = NULL_TREE;
   bool first_p = true;
   tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE;
   gimple *cond_expr_induction_def_stmt = NULL;
@@ -5964,29 +5964,36 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
       /* Handle uses.  */
       if (j == 0)
 	{
-	  op0 = ops[!reduc_index];
-	  if (op_type == ternary_op)
-	    {
-	      if (reduc_index == 0)
-		op1 = ops[2];
-	      else
-		op1 = ops[1];
-	    }
+	  if (slp_node)
+	    {
+	      /* Get vec defs for all the operands except the reduction index,
+		 ensuring the ordering of the ops in the vector is kept.  */
+	      auto_vec<tree, 3> slp_ops;
+	      auto_vec<vec<tree>, 3> vec_defs;
 
-	  if (slp_node)
-	    vect_get_vec_defs (op0, op1, stmt, &vec_oprnds0, &vec_oprnds1,
-			       slp_node, -1);
+	      slp_ops.quick_push ((reduc_index == 0) ? NULL : ops[0]);
+	      slp_ops.quick_push ((reduc_index == 1) ? NULL : ops[1]);
+	      if (op_type == ternary_op)
+		slp_ops.quick_push ((reduc_index == 2) ? NULL : ops[2]);
+
+	      vect_get_slp_defs (slp_ops, slp_node, &vec_defs, -1);
+
+	      vec_oprnds0.safe_splice (vec_defs[(reduc_index == 0) ? 1 : 0]);
+	      if (op_type == ternary_op)
+		vec_oprnds1.safe_splice (vec_defs[(reduc_index == 2) ? 1 : 2]);
+	    }
 	  else
-	    {
+	    {
 	      loop_vec_def0 = vect_get_vec_def_for_operand (ops[!reduc_index],
 							    stmt);
 	      vec_oprnds0.quick_push (loop_vec_def0);
 	      if (op_type == ternary_op)
 		{
+		  op1 = (reduc_index == 0) ? ops[2] : ops[1];
 		  loop_vec_def1 = vect_get_vec_def_for_operand (op1, stmt);
 		  vec_oprnds1.quick_push (loop_vec_def1);
 		}
-	    }
+	    }
 	}
       else
 	{
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index fb325d54f1084461d44cd54a98e5b7f99541a188..7c480d59c823b5258255c8be047f050c83cc91fd 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -3200,10 +3200,19 @@ vect_get_slp_defs (vec<tree> ops, slp_tree slp_node,
   vec<tree> vec_defs;
   tree oprnd;
   bool vectorized_defs;
+  bool first_iteration = true;
 
   first_stmt = SLP_TREE_SCALAR_STMTS (slp_node)[0];
   FOR_EACH_VEC_ELT (ops, i, oprnd)
     {
+      if (oprnd == NULL)
+	{
+	  vec_defs = vNULL;
+	  vec_defs.create (0);
+	  vec_oprnds->quick_push (vec_defs);
+	  continue;
+	}
+
       /* For each operand we check if it has vectorized definitions in a child
	 node or we need to create them (for
Re: PATCH: PR71818: Don't advance IVs with a variable step
On 01/08/2016 14:49, "Richard Biener" wrote:
>On Mon, Aug 1, 2016 at 11:24 AM, Alan Hayward wrote:
>> In the given test case, the loop is split into vectorised and non
>> vectorised versions due to peeling. At the end of the loop the IVs are
>> incremented to their latest value. This is achieved by taking the base of
>> the loop (g_21) and adding the iterations (240) multiplied by the step
>> (_6):
>>
>>   :
>>   # _106 = PHI <_6(12)>
>>   _84 = _106 * 240;
>>   _85 = (char) _84;
>>   tmp.19_83 = g_21(D) + _85;
>>
>> However, the step (_6) varies within the loop and therefore the
>> calculation is incorrect.
>>
>> This patch fixes the error by disallowing vectorization if the step of
>> the IV is not an invariant within the loop.
>>
>> Also added debug comment for when the optimisation fails due to chrec.
>>
>> Tested on x86.
>>
>> Ok to commit?
>
>Ok.
>
>To fix this we'd have to vectorize the induction variable itself, correct?

Yes, then extract final value with BIT_FIELD_REF (like
vectorise_live_operation).

Alan.
PATCH: PR71818: Don't advance IVs with a variable step
In the given test case, the loop is split into vectorised and non vectorised
versions due to peeling. At the end of the loop the IVs are incremented to
their latest value. This is achieved by taking the base of the loop (g_21)
and adding the iterations (240) multiplied by the step (_6):

  :
  # _106 = PHI <_6(12)>
  _84 = _106 * 240;
  _85 = (char) _84;
  tmp.19_83 = g_21(D) + _85;

However, the step (_6) varies within the loop and therefore the calculation
is incorrect.

This patch fixes the error by disallowing vectorization if the step of the
IV is not an invariant within the loop.

Also added debug comment for when the optimisation fails due to chrec.

Tested on x86.

Ok to commit?

Alan.

gcc/
	PR tree-optimization/71818
	* tree-vect-loop-manip.c (vect_can_advance_ivs_p): Don't advance
	IVs with non invariant evolutions

testsuite/
	PR tree-optimization/71818
	* gcc.dg/vect/pr71818.c: New

diff --git a/gcc/testsuite/gcc.dg/vect/pr71818.c b/gcc/testsuite/gcc.dg/vect/pr71818.c
new file mode 100644
index ..2946551f8bb8c552565c2e79b16359ca39d13ed6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71818.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+
+char a;
+short b;
+int c, d;
+void fn1() {
+  char e = 75, g;
+  unsigned char *f = &e;
+  a = 21;
+  for (; a <= 48; a++) {
+    for (; e <= 6;)
+      ;
+    g -= e -= b || g <= c;
+  }
+  d = *f;
+}
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 819abcda81a25c4ed25749c29b357110fca647d2..4d68f7143e1117085aae8d2168ed1425e7e6aa08 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* Simple Loop Peeling Utilities
@@ -1592,10 +1593,26 @@ vect_can_advance_ivs_p (loop_vec_info loop_vinfo)
 	}
 
+      /* FORNOW: We do not transform initial conditions of IVs
+	 which evolution functions are not invariants in the loop.  */
+
+      if (!expr_invariant_in_loop_p (loop, evolution_part))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "evolution not invariant in loop.\n");
+	  return false;
+	}
+
       /* FORNOW: We do not transform initial conditions of IVs
	 which evolution functions are a polynomial of degree >= 2.  */
 
       if (tree_is_chrec (evolution_part))
-	return false;
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "evolution is chrec.\n");
+	  return false;
+	}
     }
 
   return true;
PING Re: [PATCH] PR 71667 - Handle live operations with DEBUG uses
Ping.

From: Alan Hayward
To: "gcc-patches at gcc dot gnu dot org"
Date: Wed, 29 Jun 2016 08:49:34 +0100
Subject: [PATCH] PR 71667 - Handle live operations with DEBUG uses
Authentication-results: sourceware.org; auth=none

In vectorizable_live_operation () we always assume uses of a live operation
will be PHIs. However, when using -g a use of a live operation might be a
DEBUG stmt.

This patch avoids adding any DEBUG statements to the worklist in
vectorizable_live_operation (). Also fixes comment.

Tested on x86 and aarch64.

Ok to commit?

gcc/
	PR tree-optimization/71667
	* tree-vect-loop.c (vectorizable_live_operation): ignore DEBUG stmts

testsuite/gcc.dg/vect
	PR tree-optimization/71667
	* pr71667.c: New

diff --git a/gcc/testsuite/gcc.dg/vect/pr71667.c b/gcc/testsuite/gcc.dg/vect/pr71667.c
new file mode 100644
index ..e7012efa882a5497b0a6099c3d853f9eb375cc53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71667.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-g" } */
+
+unsigned int mu;
+int pt;
+
+void
+qf (void)
+{
+  int gy;
+  long int vz;
+
+  for (;;)
+    {
+      for (gy = 0; gy < 80; ++gy)
+	{
+	  vz = mu;
+	  ++mu;
+	  pt = (vz != 0) && (pt != 0);
+	}
+      while (gy < 81)
+	while (gy < 83)
+	  {
+	    vz = (vz != 0) ? 0 : mu;
+	    ++gy;
+	  }
+      pt = vz;
+      ++mu;
+    }
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 6c0337bbbcbebd6443fd3bcef45c1b23a7833486..2980a1b031cd3b919369b5e31dff7e0665bc7578 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6352,11 +6352,12 @@ vectorizable_live_operation (gimple *stmt,
 	       : gimple_get_lhs (stmt);
   lhs_type = TREE_TYPE (lhs);
 
-  /* Find all uses of STMT outside the loop - there should be exactly one.  */
+  /* Find all uses of STMT outside the loop - there should be at least one.  */
   auto_vec worklist;
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
-    if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
-      worklist.safe_push (use_stmt);
+    if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt))
+	&& !is_gimple_debug (use_stmt))
+      worklist.safe_push (use_stmt);
   gcc_assert (worklist.length () >= 1);
 
   bitsize = TYPE_SIZE (TREE_TYPE (vectype));
Re: [PATCH], Pr 71667, Fix PowerPC ISA 3.0 DImode Altivec load/store
Hi,

This patch references the wrong bug. It should reference 71677 and not 71667.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71677
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71667

Thanks,
Alan.

As we discussed in the patch review, there were some issues with using %Y for
the ISA 3.0 instructions LXSD and STXSD. I have rewritten the patch so that
we have a new memory constraint (%wY) that explicitly targets those
instructions. I went back to the mov{DF,DD} patterns and changed their use of
%o to %wY.

I removed the test that was generated from the XalanNamespacesStack.cpp
source that showed up the problem.

I have bootstrapped this on a little endian power8 system and there were no
regressions in the test suite. I also built Spec 2006 for power9 with this
compiler, and the xalancbmk benchmark now builds. I will kick off a big
endian build on a power7 system. Assuming there are no regressions in power7,
are these patches ok to install in the trunk, and backport to GCC 6.2 after a
burn-in period?

2016-06-28  Michael Meissner

	PR target/71667
	* config/rs6000/constraints.md (wY constraint): New constraint to
	match the requirements for the LXSD and STXSD instructions.
	* config/rs6000/predicates.md (offsettable_mem_14bit_operand): New
	predicate to match the requirements for the LXSD and STXSD
	instructions.
	* config/rs6000/rs6000.md (mov_hardfloat32, FMOVE64 case): Use
	constraint wY for LXSD/STXSD instructions instead of 'o' or 'Y' to
	make sure that the bottom 2 bits of offset are 0, the address form
	is offsettable, and no updating is done in the address mode.
	(mov_hardfloat64, FMOVE64 case): Likewise.
	(movdi_internal32): Likewise.
	(movdi_internal64): Likewise.

--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
[PATCH] PR 71667 - Handle live operations with DEBUG uses
In vectorizable_live_operation () we always assume uses of a live operation
will be PHIs. However, when using -g a use of a live operation might be a
DEBUG stmt.

This patch avoids adding any DEBUG statements to the worklist in
vectorizable_live_operation (). Also fixes comment.

Tested on x86 and aarch64.

Ok to commit?

gcc/
	PR tree-optimization/71667
	* tree-vect-loop.c (vectorizable_live_operation): ignore DEBUG stmts

testsuite/gcc.dg/vect
	PR tree-optimization/71667
	* pr71667.c: New

diff --git a/gcc/testsuite/gcc.dg/vect/pr71667.c b/gcc/testsuite/gcc.dg/vect/pr71667.c
new file mode 100644
index ..e7012efa882a5497b0a6099c3d853f9eb375cc53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71667.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-g" } */
+
+unsigned int mu;
+int pt;
+
+void
+qf (void)
+{
+  int gy;
+  long int vz;
+
+  for (;;)
+    {
+      for (gy = 0; gy < 80; ++gy)
+	{
+	  vz = mu;
+	  ++mu;
+	  pt = (vz != 0) && (pt != 0);
+	}
+      while (gy < 81)
+	while (gy < 83)
+	  {
+	    vz = (vz != 0) ? 0 : mu;
+	    ++gy;
+	  }
+      pt = vz;
+      ++mu;
+    }
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 6c0337bbbcbebd6443fd3bcef45c1b23a7833486..2980a1b031cd3b919369b5e31dff7e0665bc7578 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6352,11 +6352,12 @@ vectorizable_live_operation (gimple *stmt,
 	       : gimple_get_lhs (stmt);
   lhs_type = TREE_TYPE (lhs);
 
-  /* Find all uses of STMT outside the loop - there should be exactly one.  */
+  /* Find all uses of STMT outside the loop - there should be at least one.  */
   auto_vec worklist;
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
-    if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
-      worklist.safe_push (use_stmt);
+    if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt))
+	&& !is_gimple_debug (use_stmt))
+      worklist.safe_push (use_stmt);
   gcc_assert (worklist.length () >= 1);
 
   bitsize = TYPE_SIZE (TREE_TYPE (vectype));
[PATCH] PR 71439 - Only vectorize live PHIs that are inductions
For a PHI to be used outside the loop it needs to be vectorized. However the
vectorizer currently will only vectorize PHIs that are an induction. This
patch fixes PR 71439 by only allowing a live PHI to be vectorized if it is
an induction. In addition, live PHIs need to pass a
vectorizable_live_operation check.

Tested on x86 and aarch64. Ok to commit?

Alan.

gcc/
	PR tree-optimization/71439
	* tree-vect-loop.c (vect_analyze_loop_operations): Additional
	check for live PHIs.

testsuite/
	PR tree-optimization/71439
	* gcc.dg/vect/pr71439.c: New

diff --git a/gcc/testsuite/gcc.dg/vect/pr71439.c b/gcc/testsuite/gcc.dg/vect/pr71439.c
new file mode 100644
index ..95e4763bad6e9f301d53c20ffa160b96bdad9a53
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr71439.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+int a, b, c;
+short fn1(int p1, int p2) { return p1 + p2; }
+
+int main() {
+  a = 0;
+  for (; a < 30; a = fn1(a, 4)) {
+    c = b;
+    b = 6;
+  }
+
+  if (c != 6)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 1231b95f6a71337833e8c4b24884da9f96a7b5bf..90ade75bcd212b542ad680877a79df717751ff4b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1669,7 +1669,8 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
 
           gcc_assert (stmt_info);
 
-          if (STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
+          if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
+	       || STMT_VINFO_LIVE_P (stmt_info))
              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
             {
              /* A scalar-dependence cycle that we don't support.  */
@@ -1686,6 +1687,9 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
                 ok = vectorizable_induction (phi, NULL, NULL);
             }
 
+	  if (ok && STMT_VINFO_LIVE_P (stmt_info))
+	    ok = vectorizable_live_operation (phi, NULL, NULL, -1, NULL);
+
           if (!ok)
             {
               if (dump_enabled_p ())
[PATCH] PR 71483 - Fix live SLP operations
In the given testcase, g++ splits a live operation into two scalar statements
and four vector statements.

  _5 = _4 >> 2;
  _7 = (short int) _5;

Is turned into:

  vect__5.32_80 = vect__4.31_76 >> 2;
  vect__5.32_81 = vect__4.31_77 >> 2;
  vect__5.32_82 = vect__4.31_78 >> 2;
  vect__5.32_83 = vect__4.31_79 >> 2;
  vect__7.33_86 = VEC_PACK_TRUNC_EXPR ;
  vect__7.33_87 = VEC_PACK_TRUNC_EXPR ;

_5 is then accessed outside the loop. This patch ensures that
vectorizable_live_operation picks the correct scalar statement.

I removed the "three possibilites" comment because it was no longer accurate
(it's also possible to have more vector statements than scalar statements)
and the calculation is now much simpler.

Tested on x86 and aarch64. Ok to commit?

gcc/
	PR tree-optimization/71483
	* tree-vect-loop.c (vectorizable_live_operation): Pick correct index
	for slp

testsuite/g++.dg/vect
	PR tree-optimization/71483
	* pr71483.c: New

Alan.

diff --git a/gcc/testsuite/g++.dg/vect/pr71483.c b/gcc/testsuite/g++.dg/vect/pr71483.c
new file mode 100644
index ..77f879c9a89b8b41ef9dde3c3435918572dc8d01
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr71483.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+int b, c, d;
+short *e;
+void fn1() {
+  for (; b; b--) {
+    d = *e >> 2;
+    *e++ = d;
+    c = *e;
+    *e++ = d;
+  }
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 4c8678505df6ec572b69fd7d12ac55cf4619ece6..a2413bf9c678d11cc2ffd22bc7d984e911831804 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6368,24 +6368,20 @@ vectorizable_live_operation (gimple *stmt,
       int num_scalar = SLP_TREE_SCALAR_STMTS (slp_node).length ();
       int num_vec = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-      int scalar_per_vec = num_scalar / num_vec;
 
-      /* There are three possibilites here:
-	 1: All scalar stmts fit in a single vector.
-	 2: All scalar stmts fit multiple times into a single vector.
-	    We must choose the last occurence of stmt in the vector.
-	 3: Scalar stmts are split across multiple vectors.
-	    We must choose the correct vector and mod the lane accordingly.  */
+      /* Get the last occurrence of the scalar index from the concatenation of
+	 all the slp vectors. Calculate which slp vector it is and the index
+	 within.  */
+      int pos = (num_vec * nunits) - num_scalar + slp_index;
+      int vec_entry = pos / nunits;
+      int vec_index = pos % nunits;
 
       /* Get the correct slp vectorized stmt.  */
-      int vec_entry = slp_index / scalar_per_vec;
       vec_lhs = gimple_get_lhs (SLP_TREE_VEC_STMTS (slp_node)[vec_entry]);
 
       /* Get entry to use.  */
-      bitstart = build_int_cst (unsigned_type_node,
-				scalar_per_vec - (slp_index % scalar_per_vec));
+      bitstart = build_int_cst (unsigned_type_node, vec_index);
       bitstart = int_const_binop (MULT_EXPR, bitsize, bitstart);
-      bitstart = int_const_binop (MINUS_EXPR, vec_bitsize, bitstart);
     }
   else
     {
[PATCH] PR 71416 - allow more than one use of a live operation
vectorizable_live_operation checks that there is exactly one use of each
operation that has been marked live. However, it is possible for the
operation to be used more than once if the usage PHIs are identical.

For example in 71416-1.c, _6 is used twice after the loop in bb 9.

  :
  # e.6_21 = PHI
  _5 = g_16(D) >= 0;
  _6 = (int) _5;
  e.5_7 = (unsigned char) e.6_21;
  _8 = e.5_7 + 1;
  _9 = (char) _8;
  if (_9 != 0)
    goto ;
  else
    goto ;

  :
  # _12 = PHI <_6(8)>
  # _19 = PHI <_6(8)>
  goto ;

This patch relaxes the restriction in vectorizable_live_operation, and now
makes sure there is at least one use of each operation that has been marked
live.

Tested on x86 and aarch64. Ok to commit?

gcc/
	PR tree-optimization/71416
	* tree-vect-loop.c (vectorizable_live_operation): Let worklist have
	multiple entries

Alan.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 90ade75bcd212b542ad680877a79df717751ff4b..4c8678505df6ec572b69fd7d12ac55cf4619ece6 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6355,7 +6355,7 @@ vectorizable_live_operation (gimple *stmt,
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
     if (!flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
       worklist.safe_push (use_stmt);
-  gcc_assert (worklist.length () == 1);
+  gcc_assert (worklist.length () >= 1);
 
   bitsize = TYPE_SIZE (TREE_TYPE (vectype));
   vec_bitsize = TYPE_SIZE (vectype);
@@ -6413,9 +6413,12 @@ vectorizable_live_operation (gimple *stmt,
 
   /* Replace all uses of the USE_STMT in the worklist with the newly inserted
      statement.  */
-  use_stmt = worklist.pop ();
-  replace_uses_by (gimple_phi_result (use_stmt), new_tree);
-  update_stmt (use_stmt);
+  while (!worklist.is_empty ())
+    {
+      use_stmt = worklist.pop ();
+      replace_uses_by (gimple_phi_result (use_stmt), new_tree);
+      update_stmt (use_stmt);
+    }
 
   return true;
 }
[PATCH] Fix PR 71407 : Use correct types for live usage of live operations
This patch fixes PR 71407 by ensuring the BIT_FIELD_REF is created using the
given vectype and then cast to the result type.

Tested on x86 and aarch64. Note that the test vectorizes on x86 but does not
vectorize on aarch64 or power (due to a != statement failing to vectorize).

Ok to commit?

gcc/
	PR tree-optimization/71407
	PR tree-optimization/71416
	* tree-vect-loop.c (vectorizable_live_operation): Use vectype for
	BIT_FIELD_REF type.

testsuite/
	PR tree-optimization/71407
	PR tree-optimization/71416
	* gcc.dg/vect/pr71407.c: New
	* gcc.dg/vect/pr71416-1.c: New
	* gcc.dg/vect/pr71416-2.c: New

Alan.

pr71407.patch
Description: Binary data
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 07/06/2016 10:28, "Rainer Orth" wrote:
>Alan Hayward writes:
>
>> On 05/06/2016 12:00, "Andreas Schwab" wrote:
>>
>>>Alan Hayward writes:
>>>
>>>>* gcc.dg/vect/vect-live-2.c: New test.
>>>
>>>This test fails on powerpc64 (with -m64, but not with -m32):
>>>
>>>$ grep 'vectorized.*loops' ./vect-live-2.c.149t.vect
>>>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:10:1: note: vectorized 0 loops in function.
>>>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:29:1: note: vectorized 0 loops in function.
>>>
>>
>> "note: not vectorized: relevant stmt not supported: _1 = (long unsigned
>> int) j_24;"
>>
>> This is failing because power does not support vectorising a cast from
>> int to long.
>> (It works on power 32bit because longs are 32bit and therefore no need
>> to cast).
>>
>> Can someone please suggest a target-supports define (or another method)
>> I can use to disable this test for power 64bit (but not 32bit) ?
>> I tried using vect_multiple_sizes, but that will also disable the test
>> on x86 without avx.
>
>I'm also seeing new FAILs on Solaris/SPARC:
>
>+FAIL: gcc.dg/vect/vect-live-2.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1
>+FAIL: gcc.dg/vect/vect-live-2.c scan-tree-dump-times vect "vectorized 1 loops" 1
>
>32- and 64-bit:
>
>vect-live-2.c:16:3: note: not vectorized: relevant stmt not supported: _2 = j.0_1 * 4;
>vect-live-2.c:48:7: note: not vectorized: control flow in loop.
>vect-live-2.c:35:3: note: not vectorized: loop contains function calls or data references that cannot be analyzed
>
>and
>
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects scan-tree-dump-times vect "vec_stmt_relevant_p: stmt live but not relevant" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorizing stmts using SLP" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect "vec_stmt_relevant_p: stmt live but not relevant" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
>+FAIL: gcc.dg/vect/vect-live-slp-3.c scan-tree-dump-times vect "vectorizing stmts using SLP" 4
>
>vect-live-slp-3.c:29:1: note: not vectorized: no vectype for stmt: n0_29 = *_4;
>vect-live-slp-3.c:30:1: note: not vectorized: no vectype for stmt: n0_29 = *_4;
>vect-live-slp-3.c:31:1: note: not vectorized: no vectype for stmt: n0_29 = *_4;
>vect-live-slp-3.c:32:1: note: not vectorized: no vectype for stmt: n0_29 = *_4;
>vect-live-slp-3.c:62:4: note: not vectorized: control flow in loop.
>vect-live-slp-3.c:45:3: note: not vectorized: loop contains function calls or data references that cannot be analyzed

I’ve been trying both these tests on x86, aarch64, power and sparc.

vect-live-slp-3.c
  Fails on power 64 (altivec & vsx), sparc 64 (vis 2 & 3) - due to long int
  unsupported
  Pass on x86, aarch64, power 32 (altivec & vsx), sparc 32 (vis 2 & 3)

vect-live-2.c
  Fails on power 64 (altivec & vsx), sparc 64 (vis 2 & 3) - due to long int
  unsupported
  Fails on sparc 32 (vis 2) - due to multiply/shift not supported
  Pass on x86, aarch64, power 32 (altivec & vsx), sparc 32 (vis 3)

Therefore I think both tests should be gated on “vect_long”. In addition,
vect-live-2.c should also be gated on “vect_shift”.

“vect_long” is not currently enabled for aarch64, but should be.
Also “vect_shift” is not currently enabled for sparc 32 (vis 3), but probably
should be. I leave this as a task for a sparc maintainer to add (as I’m
unable to test).

This patch fixes the targets for vect-live-slp-3.c and vect-live-2.c. It also
adds aarch64 to vect_long. As a side consequence, the following vector tests
are now enabled for aarch64: pr18425.c, pr30843.c, pr36493.c, pr42193.c and
pr60656.c.

Tested on aarch64 and x86. Tested by inspection on power and sparc.

Ok to commit?

testsuite/
	* gcc.dg/vect/vect-live-2.c : Likewise
	* gcc.dg/vect/vect-live-slp-3.c : Update effective target
	* lib/target-supports.exp : Add aarch64 to vect_long

diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-2.c b/gcc/testsuite/gcc.dg/vect/vect-live-2.c
index 53adc3fee006e0577a4cf2f9ba8fe091d2a09353..9460624a515945bdd72f98a0b1a6751fdc7a75de 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-2.c
+++ b/gcc/testsuite/gcc.dg/vect
Re: [PATCH][vectorizer] Remove blank debug lines after dump_gimple_stmt
On 06/06/2016 18:56, "Jeff Law" wrote:
>On 06/06/2016 06:46 AM, Alan Hayward wrote:
>> Lots of code calls dump_gimple_stmt then print a newline, however
>> dump_gimple_stmt will print a newline itself. This makes the vectorizer
>> debug file messy. I think the confusion is because dump_generic_expr does
>> NOT print a newline. This patch removes all prints of a newline directly
>> after a dump_gimple_stmt.
>>
>> Tested by examining a selection of vect dump files.
>>
>> gcc/
>>	* tree-vect-data-refs.c (vect_analyze_data_refs): Remove debug
>>	newline.
>>	* tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): likewise.
>>	(vect_can_advance_ivs_p): likewise.
>>	(vect_update_ivs_after_vectorizer): likewise.
>>	* tree-vect-loop.c (vect_determine_vectorization_factor): likewise.
>>	(vect_analyze_scalar_cycles_1): likewise.
>>	(vect_analyze_loop_operations): likewise.
>>	(report_vect_op): likewise.
>>	(vect_is_slp_reduction): likewise.
>>	(vect_is_simple_reduction): likewise.
>>	(get_initial_def_for_induction): likewise.
>>	(vect_transform_loop): likewise.
>>	* tree-vect-patterns.c (vect_recog_dot_prod_pattern): likewise.
>>	(vect_recog_sad_pattern): likewise.
>>	(vect_recog_widen_sum_pattern): likewise.
>>	(vect_recog_widening_pattern): likewise.
>>	(vect_recog_divmod_pattern): likewise.
>>	* tree-vect-slp.c (vect-build-slp_tree_1): likewise.
>>	(vect_analyze_slp_instance): likewise.
>>	(vect_transform_slp_perm_load): likewise.
>>	(vect_schedule_slp_instance): likewise.
>Wouldn't you also need to verify that the testsuite passes since it could
>potentially have tests which assume the extra newlines in the test's
>regexps?
>
>So I'm fine with this patch once the testsuite has been verified to run
>without regressions on a target which exercises the vectorizer.

Sorry, I should have said that I have done a check run on x86 and aarch64,
and there were no regressions.

Alan.
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 05/06/2016 12:00, "Andreas Schwab" wrote:
>Alan Hayward writes:
>
>> * gcc.dg/vect/vect-live-2.c: New test.
>
>This test fails on powerpc64 (with -m64, but not with -m32):
>
>$ grep 'vectorized.*loops' ./vect-live-2.c.149t.vect
>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:10:1: note: vectorized 0 loops in function.
>../gcc/testsuite/gcc.dg/vect/vect-live-2.c:29:1: note: vectorized 0 loops in function.

"note: not vectorized: relevant stmt not supported: _1 = (long unsigned int) j_24;"

This is failing because power does not support vectorising a cast from int
to long.
(It works on power 32bit because longs are 32bit and therefore no need to
cast).

Can someone please suggest a target-supports define (or another method) I
can use to disable this test for power 64bit (but not 32bit)?
I tried using vect_multiple_sizes, but that will also disable the test on
x86 without avx.

Thanks,
Alan.
[PATCH][vectorizer] Remove blank debug lines after dump_gimple_stmt
Lots of code calls dump_gimple_stmt then prints a newline, however
dump_gimple_stmt will print a newline itself. This makes the vectorizer
debug file messy. I think the confusion is because dump_generic_expr does
NOT print a newline. This patch removes all prints of a newline directly
after a dump_gimple_stmt.

Tested by examining a selection of vect dump files.

gcc/
	* tree-vect-data-refs.c (vect_analyze_data_refs): Remove debug
	newline.
	* tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): likewise.
	(vect_can_advance_ivs_p): likewise.
	(vect_update_ivs_after_vectorizer): likewise.
	* tree-vect-loop.c (vect_determine_vectorization_factor): likewise.
	(vect_analyze_scalar_cycles_1): likewise.
	(vect_analyze_loop_operations): likewise.
	(report_vect_op): likewise.
	(vect_is_slp_reduction): likewise.
	(vect_is_simple_reduction): likewise.
	(get_initial_def_for_induction): likewise.
	(vect_transform_loop): likewise.
	* tree-vect-patterns.c (vect_recog_dot_prod_pattern): likewise.
	(vect_recog_sad_pattern): likewise.
	(vect_recog_widen_sum_pattern): likewise.
	(vect_recog_widening_pattern): likewise.
	(vect_recog_divmod_pattern): likewise.
	* tree-vect-slp.c (vect-build-slp_tree_1): likewise.
	(vect_analyze_slp_instance): likewise.
	(vect_transform_slp_perm_load): likewise.
	(vect_schedule_slp_instance): likewise.

Alan

removenewlines.patch
Description: Binary data
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 03/06/2016 18:45, "Jakub Jelinek" wrote:
>On Thu, Jun 02, 2016 at 05:11:15PM +0100, Alan Hayward wrote:
>> * gcc.dg/vect/vect-live-1.c: New test.
>> * gcc.dg/vect/vect-live-2.c: New test.
>> * gcc.dg/vect/vect-live-5.c: New test.
>> * gcc.dg/vect/vect-live-slp-1.c: New test.
>> * gcc.dg/vect/vect-live-slp-2.c: New test.
>> * gcc.dg/vect/vect-live-slp-3.c: New test.
>
>These tests all fail for me on i686-linux. The problem is in the use of
>dg-options in gcc.dg/vect/, where it overrides all the various needed
>vectorization options that need to be enabled on various arches (e.g.
>-msse2 on i686).
>
>Fixed thusly, tested on x86_64-linux and i686-linux, ok for trunk?

Thanks for fixing this.
However, I think you had a copy/paste error. For vect-live-slp-1.c and
vect-live-slp-3.c you used dg-options instead of dg-additional-options,
which causes the tests to fail for me as UNRESOLVED.

Ok to commit?

Alan.

testsuite/
	* gcc.dg/vect/vect-live-1.c: Use additional-options.
	* gcc.dg/vect/vect-live-3.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c b/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
index 82cd8dcabed8f18408e0e5cf295c2f15e0621ae6..7fefbff17d8c34bdfb1e8d697d2b9a70a048ae16 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-live-slp-1.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-fno-tree-scev-cprop" } */
+/* { dg-additional-options "-fno-tree-scev-cprop" } */
 
 #include "tree-vect.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c b/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
index 9e4a59eca593289050086b4b0a1347be5d748723..aacf5cb98071f6fec1f4b522eeefeb6696787334 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-live-slp-3.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-fno-tree-scev-cprop" } */
+/* { dg-additional-options "-fno-tree-scev-cprop" } */
 
 #include "tree-vect.h"
[PATCH] Fix check_GNU_style.sh for BSD / Mac OS X
check_GNU_style.sh fails to detect lines >80 chars on BSD / Mac OS X systems.
This is because paste is being called with an empty delimiter list. Instead
'\0' should be used.

Tested on Ubuntu 14.04 and OS X 10.9.5.

contrib/
	* check_GNU_style.sh: Fix paste args for BSD

Alan

diff --git a/contrib/check_GNU_style.sh b/contrib/check_GNU_style.sh
index a7478f8f573132aef5ed1010f0cf5b13f08350d4..87a276c9cf47b5e07c4407f740ce05dce1928c30 100755
--- a/contrib/check_GNU_style.sh
+++ b/contrib/check_GNU_style.sh
@@ -191,7 +191,7 @@ col (){
     # Combine prefix back with long lines.
     # Filter out empty lines.
     local found=false
-    paste -d '' "$tmp2" "$tmp3" \
+    paste -d '\0' "$tmp2" "$tmp3" \
 	| grep -v '^[0-9][0-9]*:+$' \
 	> "$tmp" && found=true
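The portable behaviour the fix relies on can be checked directly in a shell.
POSIX defines the sequence '\0' in the -d list as "no delimiter", whereas an
empty delimiter list is only accepted by GNU paste; BSD/macOS paste rejects
it. The demo files below are temporaries created just for illustration, not
part of the patch:

```shell
# Two demo input files.
tmp1=$(mktemp)
tmp2=$(mktemp)
printf 'a\nb\n' > "$tmp1"
printf '1\n2\n' > "$tmp2"

# '\0' joins corresponding lines with no delimiter on both GNU and BSD paste.
paste -d '\0' "$tmp1" "$tmp2"

rm -f "$tmp1" "$tmp2"
```

On both implementations this prints the lines joined pairwise with nothing
between them (a1, then b2), which is exactly what the long-line check in
check_GNU_style.sh needs.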
Re: [PATCH][3/3] No need to vectorize simple only-live stmts
>Statements which are live but not relevant need marking to ensure they are
>vectorized.
>
>Live statements which are simple and all uses of them are invariant do not
>need to be vectorized.
>
>This patch adds a check to make sure those stmts which pass both the above
>checks are not vectorized and then discarded.
>
>Tested on x86 and aarch64.
>
>gcc/
>	* tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize non
>	live relevant stmts which are simple and invariant.
>
>testsuite/
>	* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.

This version adds an additional relevance check in
vectorizable_live_operation. It requires the "Remove duplicated GOMP SIMD
LANE code" patch to work.

gcc/
	* tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize non
	live relevant stmts which are simple and invariant.
	* tree-vect-loop.c (vectorizable_live_operation): Check relevance
	instead of simple and invariant

testsuite/
	* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.

Alan.

3of3relevant.patch
Description: Binary data
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 01/06/2016 10:51, "Richard Biener" wrote: >On Wed, Jun 1, 2016 at 10:46 AM, Alan Hayward >wrote: >> >> >> On 30/05/2016 14:22, "Richard Biener" >>wrote: >> >>>On Fri, May 27, 2016 at 5:12 PM, Alan Hayward >>>wrote: >>>> >>>> On 27/05/2016 12:41, "Richard Biener" >>>>wrote: >>>> >>>>>On Fri, May 27, 2016 at 11:09 AM, Alan Hayward >>>>>wrote: >> >>>> >>>>> >>>>>The rest of the changes look ok to me. >>>> >>>> Does that include PATCH 1/3 ? >>> >>>I don't like how 1/3 ends up looking :/ So what was the alternative >>>again? >>>I looked into 1/3 and what it takes to remove the 'stmt' argument and >>>instead pass in a vect_def_type. It's a bit twisted and just adding >>>another >>>argument (the loop_vinfo) doesn't help things here. >>> >>>So - instead of 1/3 you might want to split out a function >>> >>>tree >>>vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type >>>dt, tree vectype) >>>{ >>> switch (dt) >>>{ >>>... >>>} >>>} >>> >>>and for constant/external force vectype != NULL. >> >> I’m still a little confused as to exactly what you want here. >> >> From your two comments I think you wanted me to completely remove the >> boolean type check and the vect_init_vector call. But if I remove that >> then other code paths will break. >> >> However, I’ve just realised that in vectorized_live_operation I already >> have the def stmt and I can easily get hold of dt from >>STMT_VINFO_DEF_TYPE. >> Which means I can call vect_get_vec_def_for_operand_1 from >> vectorized_live_operation. >> >> I’ve put together a version where I have: >> >> tree >> vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type dt) >> { >> >> switch (dt) >> { >>case vect_internal_def || vect_external_def: >> gcc_unreachable () >> >>.. code for for all other cases.. 
>> }
>> }
>>
>> /* Used by existing code */
>> tree
>> vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
>> {
>>   vect_is_simple_use (op, loop_vinfo, &def_stmt, &dt); ..and the dump code
>>
>>   If dt == internal_def || vect_external_def:
>>     .. Check for BOOLEAN_TYPE ..
>>     return vect_init_vector (stmt, op, vector_type, NULL);
>>
>>   else
>>     vect_get_vec_def_for_operand_1 (def_stmt, dt)
>> }
>>
>> Does that look better?
>
>Yes!
>
>Thanks,
>Richard.

This version of the patch addresses the simple+invariant issues (and
patch [3/3] optimizes it).

Also includes a small change to handle when the vect pattern has
introduced new pattern match statements (in
vectorizable_live_operation, if STMT_VINFO_RELATED_STMT is not null
then use it instead of stmt).

gcc/
	* tree-vect-loop.c (vect_analyze_loop_operations): Allow live
	stmts.
	(vectorizable_reduction): Check for new relevant state.
	(vectorizable_live_operation): Vectorize live stmts using
	BIT_FIELD_REF.  Remove special case for gimple assign stmts.
	* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New
	function.
	(vect_stmt_relevant_p): Check for stmts which are only used
	live.
	(process_use): Use of a stmt does not inherit its live value.
	(vect_mark_stmts_to_be_vectorized): Simplify relevance
	inheritance.
	(vect_analyze_stmt): Check for new relevant state.
	* tree-vectorizer.h (vect_relevant): New entry for a stmt which
	is used outside the loop, but not inside it.

testsuite/
	* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
	* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
	* gcc.dg/vect/vect-live-1.c: New test.
	* gcc.dg/vect/vect-live-2.c: New test.
	* gcc.dg/vect/vect-live-3.c: New test.
	* gcc.dg/vect/vect-live-4.c: New test.
	* gcc.dg/vect/vect-live-5.c: New test.
	* gcc.dg/vect/vect-live-slp-1.c: New test.
	* gcc.dg/vect/vect-live-slp-2.c: New test.
	* gcc.dg/vect/vect-live-slp-3.c: New test.

Alan.

2of3live.patch
Description: Binary data
Re: [PATCH][1/3] Add loop_vinfo to vect_get_vec_def_for_operand
> This patch simply adds loop_vinfo as an extra argument to
> vect_get_vec_def_for_operand and only generates a stmt_vinfo if
> required.  This is a required cleanup for patch [2/3].
> Tested on x86 and aarch64.
>
> gcc/
>	* tree-vectorizer.h (vect_get_vec_def_for_operand): Pass
>	loop_vinfo in.
>	* tree-vect-stmts.c (vect_get_vec_def_for_operand): Pass
>	loop_vinfo in.
>	(vect_get_vec_defs): Pass down loop_vinfo.
>	(vectorizable_mask_load_store): Likewise.
>	(vectorizable_call): Likewise.
>	(vectorizable_simd_clone_call): Likewise.
>	(vect_get_loop_based_defs): Likewise.
>	(vectorizable_conversion): Likewise.
>	(vectorizable_operation): Likewise.
>	(vectorizable_store): Likewise.
>	(vectorizable_load): Likewise.
>	(vectorizable_condition): Likewise.
>	(vectorizable_comparison): Likewise.
>	* tree-vect-loop.c (get_initial_def_for_induction): Likewise.
>	(get_initial_def_for_reduction): Likewise.
>	(vectorizable_reduction): Likewise.

New version. I've removed the additional loop_vinfo arg. Instead, I've
split part of vect_get_vec_def_for_operand into a new function,
vect_get_vec_def_for_operand_1. My [2/3] patch will call
vect_get_vec_def_for_operand_1 directly from
vectorizable_live_operation.

gcc/
	* tree-vectorizer.h (vect_get_vec_def_for_operand_1): New.
	* tree-vect-stmts.c (vect_get_vec_def_for_operand_1): New.
	(vect_get_vec_def_for_operand): Split out code.

Alan.

1of3get_vec_def.patch
Description: Binary data
Remove duplicated GOMP SIMD LANE code
The IFN_GOMP_SIMD_LANE code in vectorizable_call is essentially a
duplicate of the code in vectorizable_live_operation.  They both
replace all uses outside the loop with the constant VF - 1.

Note that my patch to vectorize inductions that are live after the loop
will also remove the IFN_GOMP_SIMD_LANE code in
vectorizable_live_operation.

This patch is required in order for the follow-on optimisation (no need
to vectorize simple only-live stmts) to work.

Tested with libgomp on x86 and aarch64.

gcc/
	* tree-vect-stmts.c (vectorizable_call): Remove GOMP_SIMD_LANE
	code.

Alan.

duplicatedGOMP.patch
Description: Binary data
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 30/05/2016 14:22, "Richard Biener" wrote:
>On Fri, May 27, 2016 at 5:12 PM, Alan Hayward wrote:
>>
>> On 27/05/2016 12:41, "Richard Biener" wrote:
>>
>>>On Fri, May 27, 2016 at 11:09 AM, Alan Hayward wrote:
>>
>>>
>>>The rest of the changes look ok to me.
>>
>> Does that include PATCH 1/3 ?
>
>I don't like how 1/3 ends up looking :/ So what was the alternative
>again?
>I looked into 1/3 and what it takes to remove the 'stmt' argument and
>instead pass in a vect_def_type.  It's a bit twisted and just adding
>another argument (the loop_vinfo) doesn't help things here.
>
>So - instead of 1/3 you might want to split out a function
>
>tree
>vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type
>dt, tree vectype)
>{
>  switch (dt)
>    {
>    ...
>    }
>}
>
>and for constant/external force vectype != NULL.

I’m still a little confused as to exactly what you want here.

From your two comments I think you wanted me to completely remove the
boolean type check and the vect_init_vector call.  But if I remove that
then other code paths will break.

However, I’ve just realised that in vectorizable_live_operation I
already have the def stmt and I can easily get hold of dt from
STMT_VINFO_DEF_TYPE.  Which means I can call
vect_get_vec_def_for_operand_1 from vectorizable_live_operation.

I’ve put together a version where I have:

tree
vect_get_vec_def_for_operand_1 (gimple *def_stmt, enum vect_def_type dt)
{
  switch (dt)
    {
    case vect_internal_def || vect_external_def:
      gcc_unreachable ()

    .. code for all other cases ..
    }
}

/* Used by existing code */
tree
vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype)
{
  vect_is_simple_use (op, loop_vinfo, &def_stmt, &dt); ..and the dump code

  If dt == internal_def || vect_external_def:
    .. Check for BOOLEAN_TYPE ..
    return vect_init_vector (stmt, op, vector_type, NULL);

  else
    vect_get_vec_def_for_operand_1 (def_stmt, dt)
}

Does that look better?

Alan.
Re: [PATCH][2/3] Vectorize inductions that are live after the loop
On 27/05/2016 12:41, "Richard Biener" wrote: >On Fri, May 27, 2016 at 11:09 AM, Alan Hayward >wrote: >> This patch is a reworking of the previous vectorize inductions that are >> live >> after the loop patch. >> It now supports SLP and an optimisation has been moved to patch [3/3]. >> >> >> Stmts which are live (ie: defined inside a loop and then used after the >> loop) >> are not currently supported by the vectorizer. In many cases >> vectorization can >> still occur because the SCEV cprop pass will hoist the definition of the >> stmt >> outside of the loop before the vectorizor pass. However, there are >>various >> cases SCEV cprop cannot hoist, for example: >> for (i = 0; i < n; ++i) >> { >> ret = x[i]; >> x[i] = i; >> } >>return i; >> >> Currently stmts are marked live using a bool, and the relevant state >>using >> an >> enum. Both these states are propagated to the definition of all uses of >>the >> stmt. Also, a stmt can be live but not relevant. >> >> This patch vectorizes a live stmt definition normally within the loop >>and >> then >> after the loop uses BIT_FIELD_REF to extract the final scalar value from >> the >> vector. >> >> This patch adds a new relevant state (vect_used_only_live) for when a >>stmt >> is >> used only outside the loop. The relevant state is still propagated to >>all >> it's >> uses, but the live bool is not (this ensures that >> vectorizable_live_operation >> is only called with stmts that really are live). >> >> Tested on x86 and aarch64. > >+ /* If STMT is a simple assignment and its inputs are invariant, then >it can >+ remain in place, unvectorized. The original last scalar value that >it >+ computes will be used. */ >+ if (is_simple_and_all_uses_invariant (stmt, loop_vinfo)) > { > >so we can't use STMT_VINFO_RELEVANT or so? I thought we somehow >mark stmts we don't vectorize in the end. 
It’s probably worth making clear that this check exists in the current
GCC head - today vectorizable_live_operation only supports the
simple+invariant case and the SSE2 case.  My patch simply moved the
simple+invariant code into the new function
is_simple_and_all_uses_invariant.

Looking at this again, I think what we really care about is:

* If the stmt is live but not relevant, we need to mark it to ensure it
  is vectorised.
* If a stmt is simple+invariant then a live usage of it can either use
  the scalar or vectorized version.

So for a live stmt:

* If it is simple+invariant and not relevant, then it is more optimal
  to use the scalar version.
* If it is simple+invariant and relevant, then it is more optimal to
  use the vectorized version.
* If it is not simple+invariant, then we must always use the vectorized
  version.

Therefore, the patch as it stands is correct but not optimal.  In patch
3/3, for the code above (vectorizable_live_operation), I can change the
check to: if not relevant, then assert that it is not simple+invariant
and return true.

>
>+  lhs = (is_a (stmt)) ? gimple_phi_result (stmt)
>+			: gimple_get_lhs (stmt);
>+  lhs_type = TREE_TYPE (lhs);
>+
>+  /* Find all uses of STMT outside the loop.  */
>+  auto_vec worklist;
>+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
>+    {
>+      basic_block bb = gimple_bb (use_stmt);
>+
>+      if (!flow_bb_inside_loop_p (loop, bb))
>+	worklist.safe_push (use_stmt);
>    }
>+  gcc_assert (!worklist.is_empty ());
>
>as we are in loop-closed SSA there should be exactly one such use...?

Yes, I should change this to assert that worklist is of length 1.

>
>+  /* Get the correct slp vectorized stmt.  */
>+  vec_oprnds.create (num_vec);
>+  vect_get_slp_vect_defs (slp_node, &vec_oprnds);
>
>As you look at the vectorized stmt you can directly use the
>SLP_TREE_VEC_STMTS array (the stmts lhs, of course), no need to export
>this function.

Ok, I can change this.

>
>The rest of the changes look ok to me.

Does that include PATCH 1/3 ?

>
>Thanks,
>Richard.
> > >> gcc/ >> * tree-vect-loop.c (vect_analyze_loop_operations): Allow live >> stmts. >> (vectorizable_reduction): Check for new relevant state. >> (vectorizable_live_operation): vectorize live stmts using >> BIT_FIELD_REF. Remove special case for gimple assigns stmts. >> * tree-vect-stmts.c (is_simple_and_all_uses_invariant): New >> function. >> (vect_stmt_relevant_p): Check for stmts which are only used >>live. >> (process_use): Use of a stmt does not inherit
[PATCH][3/3] No need to vectorize simple only-live stmts
Statements which are live but not relevant need marking to ensure they
are vectorized.

Live statements which are simple and all uses of them are invariant do
not need to be vectorized.

This patch adds a check to make sure those stmts which pass both the
above checks are not vectorized and then discarded.

Tested on x86 and aarch64.

gcc/
	* tree-vect-stmts.c (vect_stmt_relevant_p): Do not vectorize
	non-live relevant stmts which are simple and invariant.

testsuite/
	* gcc.dg/vect/vect-live-slp-5.c: Remove dg check.

Alan.

live3.patch
Description: Binary data
[PATCH][2/3] Vectorize inductions that are live after the loop
This patch is a reworking of the previous vectorize inductions that are
live after the loop patch.  It now supports SLP, and an optimisation
has been moved to patch [3/3].

Stmts which are live (i.e. defined inside a loop and then used after
the loop) are not currently supported by the vectorizer.  In many cases
vectorization can still occur because the SCEV cprop pass will hoist
the definition of the stmt outside of the loop before the vectorizer
pass.  However, there are various cases SCEV cprop cannot hoist, for
example:

  for (i = 0; i < n; ++i)
    {
      ret = x[i];
      x[i] = i;
    }
  return ret;

Currently stmts are marked live using a bool, and the relevant state
using an enum.  Both these states are propagated to the definition of
all uses of the stmt.  Also, a stmt can be live but not relevant.

This patch vectorizes a live stmt definition normally within the loop
and then after the loop uses BIT_FIELD_REF to extract the final scalar
value from the vector.

This patch adds a new relevant state (vect_used_only_live) for when a
stmt is used only outside the loop.  The relevant state is still
propagated to all its uses, but the live bool is not (this ensures that
vectorizable_live_operation is only called with stmts that really are
live).

Tested on x86 and aarch64.

gcc/
	* tree-vect-loop.c (vect_analyze_loop_operations): Allow live
	stmts.
	(vectorizable_reduction): Check for new relevant state.
	(vectorizable_live_operation): Vectorize live stmts using
	BIT_FIELD_REF.  Remove special case for gimple assign stmts.
	* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New
	function.
	(vect_stmt_relevant_p): Check for stmts which are only used
	live.
	(process_use): Use of a stmt does not inherit its live value.
	(vect_mark_stmts_to_be_vectorized): Simplify relevance
	inheritance.
	(vect_analyze_stmt): Check for new relevant state.
	* tree-vect-slp.c (vect_get_slp_vect_defs): Make global.
	* tree-vectorizer.h (vect_relevant): New entry for a stmt which
	is used outside the loop, but not inside it.

testsuite/
	* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
	* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
	* gcc.dg/vect/vect-live-1.c: New test.
	* gcc.dg/vect/vect-live-2.c: New test.
	* gcc.dg/vect/vect-live-3.c: New test.
	* gcc.dg/vect/vect-live-4.c: New test.
	* gcc.dg/vect/vect-live-5.c: New test.
	* gcc.dg/vect/vect-live-slp-1.c: New test.
	* gcc.dg/vect/vect-live-slp-2.c: New test.
	* gcc.dg/vect/vect-live-slp-3.c: New test.
	* gcc.dg/vect/vect-live-slp-4.c: New test.

Alan.

live2.patch
Description: Binary data
[PATCH][1/3] Add loop_vinfo to vect_get_vec_def_for_operand
This patch simply adds loop_vinfo as an extra argument to
vect_get_vec_def_for_operand and only generates a stmt_vinfo if
required.  This is a required cleanup for patch [2/3].

Tested on x86 and aarch64.

gcc/
	* tree-vectorizer.h (vect_get_vec_def_for_operand): Pass
	loop_vinfo in.
	* tree-vect-stmts.c (vect_get_vec_def_for_operand): Pass
	loop_vinfo in.
	(vect_get_vec_defs): Pass down loop_vinfo.
	(vectorizable_mask_load_store): Likewise.
	(vectorizable_call): Likewise.
	(vectorizable_simd_clone_call): Likewise.
	(vect_get_loop_based_defs): Likewise.
	(vectorizable_conversion): Likewise.
	(vectorizable_operation): Likewise.
	(vectorizable_store): Likewise.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Likewise.
	(vectorizable_comparison): Likewise.
	* tree-vect-loop.c (get_initial_def_for_induction): Likewise.
	(get_initial_def_for_reduction): Likewise.
	(vectorizable_reduction): Likewise.

Alan.

live1.patch
Description: Binary data
Re: [PATCH] Vectorize inductions that are live after the loop.
Thanks for the review. On 23/05/2016 11:35, "Richard Biener" wrote: > >@@ -6332,79 +6324,81 @@ vectorizable_live_operation (gimple *stmt, > stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info); > struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); >- tree op; >- gimple *def_stmt; >- ssa_op_iter iter; >+ imm_use_iterator imm_iter; >+ tree lhs, lhs_type, vec_lhs; >+ tree vectype = STMT_VINFO_VECTYPE (stmt_info); >+ int nunits = TYPE_VECTOR_SUBPARTS (vectype); >+ int ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits; >+ gimple *use_stmt; > > gcc_assert (STMT_VINFO_LIVE_P (stmt_info)); > >+ if (STMT_VINFO_TYPE (stmt_info) == reduc_vec_info_type) >+return true; >+ > >This is an odd check - it says the stmt is handled by >vectorizable_reduction. And your >return claims it is handled by vectorizable_live_operation ... Previously this check was made to decide whether to call vectorizable_live_operation, So it made sense to put this check inside the function. But, yes, I agree that the return value of the function no longer makes sense. I can revert this. > >You removed the SIMD lane handling? The SIMD lane handling effectively checked for a special case, then added code which would extract the final value of the vector. The new code I’ve added does the exact same thing for more generic cases, so the SIMD check can be removed and it’ll still be vectorized correctly. > >@@ -303,6 +335,16 @@ vect_stmt_relevant_p (gimple *stmt, loop_vec_info >loop_vinfo, >} > } > >+ if (*live_p && *relevant == vect_unused_in_scope >+ && !is_simple_and_all_uses_invariant (stmt, loop_vinfo)) >+{ >+ if (dump_enabled_p ()) >+ dump_printf_loc (MSG_NOTE, vect_location, >+"vec_stmt_relevant_p: live and not all uses " >+"invariant.\n"); >+ *relevant = vect_used_only_live; >+} > >But that's a missed invariant motion / code sinking opportunity then. >Did you have a >testcase for this? 
I don’t have a test case :( It made sense that this was the correct action to do on the failure (rather than assert). > >@@ -618,57 +660,31 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info >loop_vinfo) >} > > /* Examine the USEs of STMT. For each USE, mark the stmt that >defines it >-(DEF_STMT) as relevant/irrelevant and live/dead according to the >-liveness and relevance properties of STMT. */ >+(DEF_STMT) as relevant/irrelevant according to the relevance >property >+of STMT. */ > stmt_vinfo = vinfo_for_stmt (stmt); > relevant = STMT_VINFO_RELEVANT (stmt_vinfo); >- live_p = STMT_VINFO_LIVE_P (stmt_vinfo); >- >- /* Generally, the liveness and relevance properties of STMT are >-propagated as is to the DEF_STMTs of its USEs: >- live_p <-- STMT_VINFO_LIVE_P (STMT_VINFO) >- relevant <-- STMT_VINFO_RELEVANT (STMT_VINFO) >- >-One exception is when STMT has been identified as defining a >reduction >-variable; in this case we set the liveness/relevance as follows: >- live_p = false >- relevant = vect_used_by_reduction >-This is because we distinguish between two kinds of relevant >stmts - >-those that are used by a reduction computation, and those that >are >-(also) used by a regular computation. This allows us later on to >-identify stmts that are used solely by a reduction, and >therefore the >-order of the results that they produce does not have to be kept. > */ >- >- def_type = STMT_VINFO_DEF_TYPE (stmt_vinfo); >- tmp_relevant = relevant; >- switch (def_type) >+ >+ switch (STMT_VINFO_DEF_TYPE (stmt_vinfo)) > { > >you removed this comment. Is it no longer valid? Can you please >instead update it? >This is a tricky area. I’ll replace with a new comment. > > >@@ -1310,17 +1325,14 @@ vect_init_vector (gimple *stmt, tree val, tree >type, gimple_stmt_iterator *gsi) >In case OP is an invariant or constant, a new stmt that creates a >vector def >needs to be introduced. VECTYPE may be used to specify a required >type for >vector invariant. 
*/ >- >-tree >-vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype) >+static tree >+vect_get_vec_def_for_operand_internal (tree op, gimple *stmt, >+ loop_vec_info loop_vinfo, tree >vectype) > { > tree vec_oprnd; >... > >+tree >+vect_get_vec_def_for_operand (tree op, gimple *stmt, tree vectype) >+{ >+ stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt); >+ loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo); >+ return vect_get_vec_def_for_operand_internal (op, stmt, loop_vinfo, >vectype); >+} >+ >+tree >+vect_get_vec_def_for_operand_outside (tree op, loop_vec_info loop_vinfo) >+{ >+ return vect_get_vec_def_for_operand_internal (op, NULL, loop_vinfo, >NULL); >+} > >I was trying to redu
[PATCH] Vectorize inductions that are live after the loop.
Vectorize inductions that are live after the loop.

Stmts which are live (i.e. defined inside a loop and then used after
the loop) are not currently supported by the vectorizer.  In many cases
vectorization can still occur because the SCEV cprop pass will hoist
the definition of the stmt outside of the loop before the vectorizer
pass.  However, there are various cases SCEV cprop cannot hoist, for
example:

  for (i = 0; i < n; ++i)
    {
      ret = x[i];
      x[i] = i;
    }
  return ret;

Currently stmts are marked live using a bool, and the relevant state
using an enum.  Both these states are propagated to the definition of
all uses of the stmt.  Also, a stmt can be live but not relevant.

This patch vectorizes a live stmt definition normally within the loop
and then after the loop uses BIT_FIELD_REF to extract the final scalar
value from the vector.

This patch adds a new relevant state (vect_used_only_live) for when a
stmt is used only outside the loop.  The relevant state is still
propagated to all its uses, but the live bool is not (this ensures that
vectorizable_live_operation is only called with stmts that really are
live).

Tested on x86 and aarch64.

gcc/
	* tree-vect-loop.c (vect_analyze_loop_operations): Allow live
	stmts.
	(vectorizable_reduction): Check for new relevant state.
	(vectorizable_live_operation): Vectorize live stmts using
	BIT_FIELD_REF.  Remove special case for gimple assign stmts.
	* tree-vect-stmts.c (is_simple_and_all_uses_invariant): New
	function.
	(vect_stmt_relevant_p): Check for stmts which are only used
	outside the loop.
	(process_use): Use of a stmt does not inherit its live value.
	(vect_mark_stmts_to_be_vectorized): Simplify relevance
	inheritance.
	(vect_get_vec_def_for_operand): Split body into...
	(vect_get_vec_def_for_operand_internal): ...new function.
	(vect_get_vec_def_for_operand_outside): New.  Same as above but
	for stmts outside a loop.
	(vect_analyze_stmt): Check for new relevant state.
	* tree-vectorizer.h (vect_relevant): New entry for a stmt which
	is used outside the loop, but not inside it.

testsuite/
	* gcc.dg/tree-ssa/pr64183.c: Ensure test does not vectorize.
	* gcc.dg/vect/no-scevccp-vect-iv-2.c: Remove xfail.
	* gcc.dg/vect/vect-live-1.c: New test.
	* gcc.dg/vect/vect-live-2.c: New test.
	* gcc.dg/vect/vect-live-3.c: New test.
	* gcc.dg/vect/vect-live-4.c: New test.

Cheers,
Alan.

liveInductions.patch
Description: liveInductions.patch
Re: [PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage
On 20/11/2015 13:47, "Richard Biener" wrote:
>On Fri, Nov 20, 2015 at 1:33 PM, Alan Hayward wrote:
>>
>> On 20/11/2015 11:00, "Richard Biener" wrote:
>>
>>>On Fri, Nov 20, 2015 at 10:24 AM, Alan Hayward wrote:
>>>>When vectorising a integer induction condition reduction,
>>>>is_nonwrapping_integer_induction ends up with different values for
>>>>base during the analysis and build phases.  In the first it is an
>>>>INTEGER_CST, in the second the loop has been vectorised out and the
>>>>base is now a variable.
>>>>
>>>>This results in the analysis and build stage detecting the
>>>>STMT_VINFO_VEC_REDUCTION_TYPE as different types.
>>>>
>>>>The easiest way to fix this is to only check for integer induction
>>>>conditions on the analysis stage.
>>>
>>>I don't like this.  For the evolution part we have added
>>>STMT_VINFO_LOOP_PHI_EVOLUTION_PART.  If you now need the original
>>>initial value as well then just save it.
>>>
>>>Or if you really want to go with the hack then please do not call
>>>is_nonwrapping_integer_induction with vec_stmt != NULL but
>>>initialize cond_expr_is_nonwrapping_integer_induction from
>>>STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
>>>
>>>The hack also lacks a comment.
>>>
>>
>>Ok. I've gone for a combination of both:
>>
>>I now cache the base in STMT_VINFO_LOOP_PHI_EVOLUTION_BASE.
>>
>>I've removed the vec_stmt != NULL checks.
>>
>>I've moved the call to is_nonwrapping_integer_induction until after
>>the vect_is_simple_reduction check.  I never liked that I had
>>is_nonwrapping_integer_induction early in the function, and think this
>>looks better.
>
>It looks better but the comment for loop_phi_evolution_base is wrong.
>The value is _not_ "correct" after prologue peeling (unless you update
>it there to a non-constant expr).  It is conservatively "correct" for
>is_nonwrapping_integer_induction though.  Which is why I'd probably
>rename it to STMT_VINFO_LOOP_PHI_EVOLUTION_BASE_UNCHANGED to reflect
>this (and similar to STMT_VINFO_NITERS_UNCHANGED).
>
>Ok with that change.

Updated as requested and submitted.

Alan.

analysisonlycondcheck3.patch
Description: Binary data
Re: [PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage
On 20/11/2015 11:00, "Richard Biener" wrote:
>On Fri, Nov 20, 2015 at 10:24 AM, Alan Hayward wrote:
>> When vectorising a integer induction condition reduction,
>> is_nonwrapping_integer_induction ends up with different values for
>> base during the analysis and build phases.  In the first it is an
>> INTEGER_CST, in the second the loop has been vectorised out and the
>> base is now a variable.
>>
>> This results in the analysis and build stage detecting the
>> STMT_VINFO_VEC_REDUCTION_TYPE as different types.
>>
>> The easiest way to fix this is to only check for integer induction
>> conditions on the analysis stage.
>
>I don't like this.  For the evolution part we have added
>STMT_VINFO_LOOP_PHI_EVOLUTION_PART.  If you now need the original
>initial value as well then just save it.
>
>Or if you really want to go with the hack then please do not call
>is_nonwrapping_integer_induction with vec_stmt != NULL but initialize
>cond_expr_is_nonwrapping_integer_induction from
>STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
>
>The hack also lacks a comment.
>

Ok.  I've gone for a combination of both:

I now cache the base in STMT_VINFO_LOOP_PHI_EVOLUTION_BASE.

I've removed the vec_stmt != NULL checks.

I've moved the call to is_nonwrapping_integer_induction until after the
vect_is_simple_reduction check.  I never liked that I had
is_nonwrapping_integer_induction early in the function, and think this
looks better.

Alan.

analysisonlycondcheck2.patch
Description: Binary data
[PATCH] PR tree-optimization/68413 : Only check for integer cond reduction on analysis stage
When vectorising an integer induction condition reduction,
is_nonwrapping_integer_induction ends up with different values for base
during the analysis and build phases.  In the first it is an
INTEGER_CST; in the second the loop has been vectorised out and the
base is now a variable.

This results in the analysis and build stage detecting the
STMT_VINFO_VEC_REDUCTION_TYPE as different types.

The easiest way to fix this is to only check for integer induction
conditions on the analysis stage.

gcc/
	PR tree-optimization/68413
	* tree-vect-loop.c (vectorizable_reduction): Only check for
	integer cond reduction on analysis stage.

Thanks,
Alan.

analysisonlycondcheck.patch
Description: Binary data
Re: [Patch] Optimize condition reductions where the result is an integer induction variable
On 12/11/2015 10:53, "Richard Biener" wrote: >On Wed, Nov 11, 2015 at 7:54 PM, Alan Hayward >wrote: >> >> >> On 11/11/2015 13:25, "Richard Biener" >>wrote: >> >>>On Wed, Nov 11, 2015 at 1:22 PM, Alan Hayward >>>wrote: >>>> Hi, >>>> I hoped to post this in time for Monday’s cut off date, but >>>>circumstances >>>> delayed me until today. Hoping if possible this patch will still be >>>>able >>>> to go in. >>>> >>>> >>>> This patch builds upon the change for PR65947, and reduces the amount >>>>of >>>> code produced in a vectorized condition reduction where operand 2 of >>>>the >>>> COND_EXPR is an assignment of a increasing integer induction variable >>>>that >>>> won't wrap. >>>> >>>> >>>> For example (assuming all types are ints), this is a match: >>>> >>>> last = 5; >>>> for (i = 0; i < N; i++) >>>> if (a[i] < min_v) >>>> last = i; >>>> >>>> Whereas, this is not because the result is based off a memory access: >>>> last = 5; >>>> for (i = 0; i < N; i++) >>>> if (a[i] < min_v) >>>> last = a[i]; >>>> >>>> In the integer induction variable case we can just use a MAX reduction >>>>and >>>> skip all the code I added in my vectorized condition reduction patch - >>>>the >>>> additional induction variables in vectorizable_reduction () and the >>>> additional checks in vect_create_epilog_for_reduction (). From the >>>>patch >>>> diff only, it's not immediately obvious that those parts will be >>>>skipped >>>> as there is no code changes in those areas. >>>> >>>> The initial value of the induction variable is force set to zero, as >>>>any >>>> other value could effect the result of the induction. At the end of >>>>the >>>> loop, if the result is zero, then we restore the original initial >>>>value. >>> >>>+static bool >>>+is_integer_induction (gimple *stmt, struct loop *loop) >>> >>>is_nonwrapping_integer_induction? >>> >>>+ tree lhs_max = TYPE_MAX_VALUE (TREE_TYPE (gimple_phi_result (stmt))); >>> >>>don't use TYPE_MAX_VALUE. >>> >>>+ /* Check that the induction increments. 
*/ >>>+ if (tree_int_cst_compare (step, size_zero_node) <= 0) >>>+return false; >>> >>>tree_int_cst_sgn (step) == -1 >>> >>>+ /* Check that the max size of the loop will not wrap. */ >>>+ >>>+ if (! max_loop_iterations (loop, &ni)) >>>+return false; >>>+ /* Convert backedges to iterations. */ >>>+ ni += 1; >>> >>>just use max_stmt_executions (loop, &ni) which properly checks for >>>overflow >>>of the +1. >>> >>>+ max_loop_value = wi::add (wi::to_widest (base), >>>+ wi::mul (wi::to_widest (step), ni)); >>>+ >>>+ if (wi::gtu_p (max_loop_value, wi::to_widest (lhs_max))) >>>+return false; >>> >>>you miss a check for the wi::add / wi::mul to overflow. You can use >>>extra args to determine this. >>> >>>Instead of TYPE_MAX_VALUE use wi::max_value (precision, sign). >>> >>>I wonder if you want to skip all the overflow checks for >>>TYPE_OVERFLOW_UNDEFINED >>>IV types? >>> >> >> Ok with all the above. >> >> Tried using max_value () but this gave me a wide_int instead of a >> widest_int. >> Instead I've replaced with min_precision and GET_MODE_BITSIZE. >> >> Added an extra check for when the type is TYPE_OVERFLOW_UNDEFINED. > >+ /* Set the loop-entry arg of the reduction-phi. */ >+ >+ if (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) >+ == INTEGER_INDUC_COND_REDUCTION) > >extra vertical space > >+ tree zero = build_int_cst ( TREE_TYPE (vec_init_def_type), >0); >+ tree zero_vec = build_vector_from_val (vec_init_def_type, >zero); >+ > >build_zero_cst (vec_init_def_type); > >+ else >+ { >+ add_phi_arg (as_a (phi), vec_init_def, > loop_preheader_edge (loop), UNKNOWN_LOCATION); >+ } > >no {}s around single stmts > >+
Re: [Patch] Optimize condition reductions where the result is an integer induction variable
On 11/11/2015 13:25, "Richard Biener" wrote: >On Wed, Nov 11, 2015 at 1:22 PM, Alan Hayward >wrote: >> Hi, >> I hoped to post this in time for Monday’s cut off date, but >>circumstances >> delayed me until today. Hoping if possible this patch will still be able >> to go in. >> >> >> This patch builds upon the change for PR65947, and reduces the amount of >> code produced in a vectorized condition reduction where operand 2 of the >> COND_EXPR is an assignment of a increasing integer induction variable >>that >> won't wrap. >> >> >> For example (assuming all types are ints), this is a match: >> >> last = 5; >> for (i = 0; i < N; i++) >> if (a[i] < min_v) >> last = i; >> >> Whereas, this is not because the result is based off a memory access: >> last = 5; >> for (i = 0; i < N; i++) >> if (a[i] < min_v) >> last = a[i]; >> >> In the integer induction variable case we can just use a MAX reduction >>and >> skip all the code I added in my vectorized condition reduction patch - >>the >> additional induction variables in vectorizable_reduction () and the >> additional checks in vect_create_epilog_for_reduction (). From the patch >> diff only, it's not immediately obvious that those parts will be skipped >> as there is no code changes in those areas. >> >> The initial value of the induction variable is force set to zero, as any >> other value could effect the result of the induction. At the end of the >> loop, if the result is zero, then we restore the original initial value. > >+static bool >+is_integer_induction (gimple *stmt, struct loop *loop) > >is_nonwrapping_integer_induction? > >+ tree lhs_max = TYPE_MAX_VALUE (TREE_TYPE (gimple_phi_result (stmt))); > >don't use TYPE_MAX_VALUE. > >+ /* Check that the induction increments. */ >+ if (tree_int_cst_compare (step, size_zero_node) <= 0) >+return false; > >tree_int_cst_sgn (step) == -1 > >+ /* Check that the max size of the loop will not wrap. */ >+ >+ if (! 
max_loop_iterations (loop, &ni)) >+return false; >+ /* Convert backedges to iterations. */ >+ ni += 1; > >just use max_stmt_executions (loop, &ni) which properly checks for >overflow >of the +1. > >+ max_loop_value = wi::add (wi::to_widest (base), >+ wi::mul (wi::to_widest (step), ni)); >+ >+ if (wi::gtu_p (max_loop_value, wi::to_widest (lhs_max))) >+return false; > >you miss a check for the wi::add / wi::mul to overflow. You can use >extra args to determine this. > >Instead of TYPE_MAX_VALUE use wi::max_value (precision, sign). > >I wonder if you want to skip all the overflow checks for >TYPE_OVERFLOW_UNDEFINED >IV types? > Ok with all the above. Tried using max_value () but this gave me a wide_int instead of a widest_int. Instead I've replaced with min_precision and GET_MODE_BITSIZE. Added an extra check for when the type is TYPE_OVERFLOW_UNDEFINED. Alan. optimizeConditionReductions2.patch Description: Binary data
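The overflow-safe shape Richard asks for can be sketched in plain C. This is an illustrative model with invented names, not the widest_int-based GCC code; `__builtin_mul_overflow`/`__builtin_add_overflow` stand in for the overflow flags of the wide-int arithmetic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the is_nonwrapping_integer_induction test:
   an induction variable starting at BASE, incremented by STEP for at
   most NITER iterations, must never exceed the maximum value of its
   type (unsigned 32-bit for this sketch).  */
bool
iv_never_wraps (uint32_t base, uint32_t step, uint64_t niter)
{
  uint64_t span, max_loop_value;

  /* step * niter, with the overflow check Richard requested.  */
  if (__builtin_mul_overflow ((uint64_t) step, niter, &span))
    return false;
  /* base + step * niter, again checking for overflow.  */
  if (__builtin_add_overflow ((uint64_t) base, span, &max_loop_value))
    return false;

  return max_loop_value <= UINT32_MAX;
}
```

An IV of step 1 over 1024 iterations passes; one whose final value exceeds the type's maximum, or whose intermediate arithmetic overflows, is rejected.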
[Patch] Optimize condition reductions where the result is an integer induction variable
Hi, I hoped to post this in time for Monday’s cut-off date, but circumstances delayed me until today. Hoping, if possible, this patch will still be able to go in. This patch builds upon the change for PR65947, and reduces the amount of code produced in a vectorized condition reduction where operand 2 of the COND_EXPR is an assignment of an increasing integer induction variable that won't wrap. For example (assuming all types are ints), this is a match: last = 5; for (i = 0; i < N; i++) if (a[i] < min_v) last = i; Whereas, this is not a match because the result is based off a memory access: last = 5; for (i = 0; i < N; i++) if (a[i] < min_v) last = a[i]; In the integer induction variable case we can just use a MAX reduction and skip all the code I added in my vectorized condition reduction patch - the additional induction variables in vectorizable_reduction () and the additional checks in vect_create_epilog_for_reduction (). From the patch diff only, it's not immediately obvious that those parts will be skipped, as there are no code changes in those areas. The initial value of the induction variable is forced to zero, as any other value could affect the result of the induction. At the end of the loop, if the result is zero, then we restore the original initial value. Cheers, Alan. optimizeConditionReductions.patch Description: Binary data
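As a rough scalar model of the transform described above (illustrative only, with an invented name; in particular it glosses over how the real patch disambiguates a match at index 0 from "no match"):

```c
#include <assert.h>

/* Scalar sketch of the integer-induction condition reduction: the
   reduction variable starts at zero, the loop body becomes a MAX
   reduction over the matching indexes, and the epilogue restores the
   user's initial value when nothing (past index 0) matched.  */
int
cond_reduc_iv_model (const int *a, int n, int min_v, int init)
{
  int last = 0;                      /* initial value forced to zero */
  for (int i = 0; i < n; i++)
    if (a[i] < min_v)
      last = last > i ? last : i;    /* MAX reduction instead of a select */
  return last == 0 ? init : last;    /* zero => restore original value */
}
```

With a = {9, 9, 3, 9, 2} and min_v = 5, the last match is at i = 4, so the MAX reduction returns 4; with no matches at all, the forced-zero result is replaced by the original initial value.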
Re: [Patch] Avoid is_simple_use bug in vectorizable_live_operation
On 27/10/2015 11:36, "Richard Biener" wrote: >On Mon, Oct 26, 2015 at 6:15 PM, Alan Hayward >wrote: >> >> >> On 26/10/2015 13:35, "Richard Biener" >>wrote: >> >>>On Mon, Oct 26, 2015 at 1:33 PM, Alan Hayward >>>wrote: >>>> There is a potential bug in vectorizable_live_operation. >>>> >>>> Consider the case where the first op for stmt is valid, but the second >>>>is >>>> null. >>>> The first time through the for () loop, it will call out to >>>> vect_is_simple_use () which will set dt. >>>> The second time, because op is null, vect_is_simple_use () will not be >>>> called. >>>> However, dt is still set to a valid value, therefore the loop will >>>> continue. >>>> >>>> Note this is different from the case where the first op is null, which >>>> will cause the loop not call vect_is_simple_use () and then return >>>>false. >>>> >>>> It is possible that this was intentional, however that is not clear >>>>from >>>> the code. >>>> >>>> The fix is to simply ensure dt is initialized to a default value on >>>>each >>>> iteration. >>> >>>I think the patch is a strict improvement, thus OK. Still a NULL >>>operand >>>is not possible in GIMPLE so the op && check is not necessary. The way >>>it iterates over all stmt uses is a bit scary anyway. As it is ok with >>>all invariants it should probably simply do sth like >>> >>> FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE) >>> if (!vect_is_simple_use (op, )) >>> >>>and be done with that. Unvisited uses can only be constants (ok). >>> >>>Care to rework the funtion like that if you are here? >>> >> >> Ok, I’ve updated as requested. > >Ok. Please remove > > code = gimple_assign_rhs_code (stmt); > op_type = TREE_CODE_LENGTH (code); > rhs_class = get_gimple_rhs_class (code); > gcc_assert (rhs_class != GIMPLE_UNARY_RHS || op_type == unary_op); > gcc_assert (rhs_class != GIMPLE_BINARY_RHS || op_type == binary_op); > >and associated variables as I belive otherwise a --disable-checking build >will fail with set-but-not-used warnings. 
And the asserts are quite >pointless >given the now sanitized loop. > I did consider removing those asserts, but didn’t want to remove any important checks. Didn’t think about the disable-checking build. Anyway, new patch attached. Cheers, Alan. avoid_issimple3.patch Description: Binary data
Re: [Patch] Avoid is_simple_use bug in vectorizable_live_operation
On 26/10/2015 13:35, "Richard Biener" wrote: >On Mon, Oct 26, 2015 at 1:33 PM, Alan Hayward >wrote: >> There is a potential bug in vectorizable_live_operation. >> >> Consider the case where the first op for stmt is valid, but the second >>is >> null. >> The first time through the for () loop, it will call out to >> vect_is_simple_use () which will set dt. >> The second time, because op is null, vect_is_simple_use () will not be >> called. >> However, dt is still set to a valid value, therefore the loop will >> continue. >> >> Note this is different from the case where the first op is null, which >> will cause the loop not call vect_is_simple_use () and then return >>false. >> >> It is possible that this was intentional, however that is not clear from >> the code. >> >> The fix is to simply ensure dt is initialized to a default value on each >> iteration. > >I think the patch is a strict improvement, thus OK. Still a NULL operand >is not possible in GIMPLE so the op && check is not necessary. The way >it iterates over all stmt uses is a bit scary anyway. As it is ok with >all invariants it should probably simply do sth like > > FOR_EACH_SSA_TREE_OPERAND (op, stmt, iter, SSA_OP_USE) > if (!vect_is_simple_use (op, )) > >and be done with that. Unvisited uses can only be constants (ok). > >Care to rework the funtion like that if you are here? > Ok, I’ve updated as requested. Cheers, Alan. avoid_issimple2.patch Description: Binary data
[Patch] Avoid is_simple_use bug in vectorizable_live_operation
There is a potential bug in vectorizable_live_operation. Consider the case where the first op for stmt is valid, but the second is null. The first time through the for () loop, it will call out to vect_is_simple_use () which will set dt. The second time, because op is null, vect_is_simple_use () will not be called. However, dt is still set to a valid value, therefore the loop will continue. Note this is different from the case where the first op is null, which will cause the loop not to call vect_is_simple_use () and then return false. It is possible that this was intentional; however, that is not clear from the code. The fix is to simply ensure dt is initialized to a default value on each iteration. 2015-09-07 Alan Hayward alan.hayw...@arm.com * tree-vect-loop.c (vectorizable_live_operation): Localize variable. Cheers, Alan avoid_issimple.patch Description: Binary data
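The failure mode is easier to see in a minimal stand-alone sketch (hypothetical names; `classify` stands in for vect_is_simple_use):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum def_type { DT_UNKNOWN, DT_SIMPLE };

/* Stand-in for vect_is_simple_use: sets *DT on success, leaves it
   untouched on failure, and is never called on a NULL operand.  */
static bool
classify (const char *op, enum def_type *dt)
{
  if (op == NULL)
    return false;
  *dt = DT_SIMPLE;
  return true;
}

/* Buggy shape: DT lives outside the loop, so after a valid first
   operand a NULL second operand inherits the stale classification
   and the loop keeps going.  */
bool
all_ops_simple_buggy (const char **ops, int nops)
{
  enum def_type dt = DT_UNKNOWN;
  for (int i = 0; i < nops; i++)
    {
      if (ops[i] && !classify (ops[i], &dt))
        return false;
      if (dt == DT_UNKNOWN)       /* only catches a NULL *first* op */
        return false;
    }
  return true;
}

/* Fixed shape, as in the patch: DT is re-initialized on each
   iteration, so a NULL operand in any position is rejected.  */
bool
all_ops_simple_fixed (const char **ops, int nops)
{
  for (int i = 0; i < nops; i++)
    {
      enum def_type dt = DT_UNKNOWN;
      if (ops[i] && !classify (ops[i], &dt))
        return false;
      if (dt == DT_UNKNOWN)
        return false;
    }
  return true;
}
```

With operands { "x_1", NULL }, the buggy shape reports all operands simple while the fixed shape correctly rejects the NULL one.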
[PATCH] Fix VEC_COND_EXPR types when vectorizing conditional expressions
VEC_COND_EXPRs have been changed to use boolean vectors for the first argument. This changes the vectorization of conditional expressions to use the new format, fixing the compiler failures in the 65947-*.c tests. 2015-10-26 Alan Hayward gcc/ PR tree-optimization/65947 * tree-vect-loop.c (vect_create_epilog_for_reduction): Fix VEC_COND_EXPR types. Cheers, Alan. fix_veccondexprtypes.patch Description: Binary data
[COMMITTED] Add myself to MAINTAINERS (Write After Approval)
Hi All, I've just added myself to Write After Approval maintainers. Committed r229216 Cheers, Alan. Index: ChangeLog === --- ChangeLog (revision 229215) +++ ChangeLog (revision 229216) @@ -1,3 +1,7 @@ +2015-10-23 Alan Hayward + + * MAINTAINERS (Write After Approval): Add myself. + 2015-10-20 H.J. Lu Sync with binutils-gdb: Index: MAINTAINERS === --- MAINTAINERS (revision 229215) +++ MAINTAINERS (revision 229216) @@ -419,6 +419,7 @@ Stuart Hastings Michael Haubenwallner Pat Haugen +Alan Hayward Mark Heffernan George Helffrich Daniel Hellstrom
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
On 22/10/2015 15:15, "Alan Lawrence" wrote: >Just one very small point... > >On 19/10/15 09:17, Alan Hayward wrote: > > > - if (check_reduction > > - && (!commutative_tree_code (code) || !associative_tree_code >(code))) > > + if (check_reduction) > > { > > - if (dump_enabled_p ()) > > -report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > > - "reduction: not commutative/associative: "); > > - return NULL; > > + if (code != COND_EXPR > > + && (!commutative_tree_code (code) || !associative_tree_code >(code))) > > + { > > + if (dump_enabled_p ()) > > + report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > > + "reduction: not commutative/associative: "); > > + return NULL; > > + } > > + > > + if (code == COND_EXPR) > > + *v_reduc_type = COND_REDUCTION; > >Wouldn't this be easier written as > >if (code == COND_EXPR) > *v_reduc_type = COND_REDUCTION; >else if (!commutative_tree_code (code) || !associative_tree_code (code)) > {...} > >? Your call! > >Cheers, Alan Good spot! I suspect that’s slipped through when I’ve rewritten bits. I’ll add this in with Richard’s suggestion of the change around the call to vectorizable_condition. Alan.
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
On 30/09/2015 13:45, "Richard Biener" wrote: >On Wed, Sep 23, 2015 at 5:51 PM, Alan Hayward >wrote: >> >> >> On 18/09/2015 14:53, "Alan Hayward" wrote: >> >>> >>> >>>On 18/09/2015 14:26, "Alan Lawrence" wrote: >>> >>>>On 18/09/15 13:17, Richard Biener wrote: >>>>> >>>>> Ok, I see. >>>>> >>>>> That this case is already vectorized is because it implements >>>>>MAX_EXPR, >>>>> modifying it slightly to >>>>> >>>>> int foo (int *a) >>>>> { >>>>>int val = 0; >>>>>for (int i = 0; i < 1024; ++i) >>>>> if (a[i] > val) >>>>>val = a[i] + 1; >>>>>return val; >>>>> } >>>>> >>>>> makes it no longer handled by current code. >>>>> >>>> >>>>Yes. I believe the idea for the patch is to handle arbitrary >>>>expressions >>>>like >>>> >>>>int foo (int *a) >>>>{ >>>>int val = 0; >>>>for (int i = 0; i < 1024; ++i) >>>> if (some_expression (i)) >>>>val = another_expression (i); >>>>return val; >>>>} >>> >>>Yes, that’s correct. Hopefully my new test cases should cover >>>everything. >>> >> >> Attached is a new version of the patch containing all the changes >> requested by Richard. > >+ /* Compare the max index vector to the vector of found indexes to >find >+the postion of the max value. This will result in either a >single >+match or all of the values. */ >+ tree vec_compare = make_ssa_name (index_vec_type_signed); >+ gimple vec_compare_stmt = gimple_build_assign (vec_compare, >EQ_EXPR, >+induction_index, >+max_index_vec); > >I'm not sure all targets can handle this. If I deciper the code >correctly then we do > > mask = induction_index == max_index_vec; > vec_and = mask & vec_data; > >plus some casts. So this is basically > > vec_and = induction_index == max_index_vec ? vec_data : {0, 0, ... }; > >without the need to relate the induction index vector type to the data >vector type. >I believe this is also the form all targets support. Ok, I’ll replace this. 
> >I am missing a comment before all this code-generation that shows the >transform >result with the variable names used in the code-gen. I have a hard >time connecting >things here. Ok, I’ll add some comments. > >+ tree matched_data_reduc_cast = build1 (VIEW_CONVERT_EXPR, >scalar_type, >+matched_data_reduc); >+ epilog_stmt = gimple_build_assign (new_scalar_dest, >+matched_data_reduc_cast); >+ new_temp = make_ssa_name (new_scalar_dest, epilog_stmt); >+ gimple_assign_set_lhs (epilog_stmt, new_temp); > >this will leave the stmt unsimplified. scalar sign-changes should use >NOP_EXPR, >not VIEW_CONVERT_EXPR. The easiest fix is to use fold_convert instead. >Also just do like before - first make_ssa_name and then directly use it >in the >gimple_build_assign. We need the VIEW_CONVERT_EXPR for the cases where we have float data values. The index is always integer. > >The patch is somewhat hard to parse with all the indentation changes. A >context >diff would be much easier to read in those contexts. Ok, I’ll make the next patch like that > >+ if (v_reduc_type == COND_REDUCTION) >+{ >+ widest_int ni; >+ >+ if (! max_loop_iterations (loop, &ni)) >+ { >+ if (dump_enabled_p ()) >+ dump_printf_loc (MSG_NOTE, vect_location, >+"loop count not known, cannot create cond " >+"reduction.\n"); > >ugh. That's bad. > >+ /* The additional index will be the same type as the condition. >Check >+that the loop can fit into this less one (because we'll use up >the >+zero slot for when there are no matches). */ >+ tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type); >+ if (wi::geu_p (ni, wi::to_widest (max_index))) >+ { >+ if (dump_enabled_p ()) >+ dump_printf_loc (MSG_NOTE, vect_location, >+"loop size is greater than data size.\n"); >+
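The two code-generation forms under discussion can be checked against each other lane-by-lane in scalar C (an illustrative model with invented names; the lane loop stands in for a single vector instruction):

```c
#include <assert.h>

#define VF 4  /* vector lanes, for this sketch */

/* Form in the posted patch: a lanewise EQ producing an all-ones/zero
   mask, then a lanewise AND with the data vector.  This ties the
   index element width to the data element width.  */
void
select_by_mask_and (const unsigned idx[VF], unsigned max,
                    const unsigned data[VF], unsigned out[VF])
{
  for (int i = 0; i < VF; i++)
    {
      unsigned mask = idx[i] == max ? ~0u : 0u;  /* EQ on vectors  */
      out[i] = mask & data[i];                   /* AND with data  */
    }
}

/* Richard's suggested form: a vector conditional, which any target
   with vcond support can expand and which places no such constraint
   on the element widths.  */
void
select_by_vec_cond (const unsigned idx[VF], unsigned max,
                    const unsigned data[VF], unsigned out[VF])
{
  for (int i = 0; i < VF; i++)
    out[i] = idx[i] == max ? data[i] : 0;        /* VEC_COND_EXPR */
}
```

Both forms keep only the data element in the lane holding the maximum index and zero out the rest, which is why the substitution is behavior-preserving.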
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
On 18/09/2015 14:53, "Alan Hayward" wrote: > > >On 18/09/2015 14:26, "Alan Lawrence" wrote: > >>On 18/09/15 13:17, Richard Biener wrote: >>> >>> Ok, I see. >>> >>> That this case is already vectorized is because it implements MAX_EXPR, >>> modifying it slightly to >>> >>> int foo (int *a) >>> { >>>int val = 0; >>>for (int i = 0; i < 1024; ++i) >>> if (a[i] > val) >>>val = a[i] + 1; >>>return val; >>> } >>> >>> makes it no longer handled by current code. >>> >> >>Yes. I believe the idea for the patch is to handle arbitrary expressions >>like >> >>int foo (int *a) >>{ >>int val = 0; >>for (int i = 0; i < 1024; ++i) >> if (some_expression (i)) >>val = another_expression (i); >>return val; >>} > >Yes, that’s correct. Hopefully my new test cases should cover everything. > Attached is a new version of the patch containing all the changes requested by Richard. Thanks, Alan. 0001-Support-for-vectorizing-conditional-expressions.patch Description: Binary data
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
On 18/09/2015 14:26, "Alan Lawrence" wrote: >On 18/09/15 13:17, Richard Biener wrote: >> >> Ok, I see. >> >> That this case is already vectorized is because it implements MAX_EXPR, >> modifying it slightly to >> >> int foo (int *a) >> { >>int val = 0; >>for (int i = 0; i < 1024; ++i) >> if (a[i] > val) >>val = a[i] + 1; >>return val; >> } >> >> makes it no longer handled by current code. >> > >Yes. I believe the idea for the patch is to handle arbitrary expressions >like > >int foo (int *a) >{ >int val = 0; >for (int i = 0; i < 1024; ++i) > if (some_expression (i)) >val = another_expression (i); >return val; >} Yes, that’s correct. Hopefully my new test cases should cover everything. Alan.
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
On 15/09/2015 13:09, "Richard Biener" wrote: > >Now comments on the patch itself. > >+ if (code == COND_EXPR) >+ *v_reduc_type = COND_REDUCTION; > >so why not simply use COND_EXPR instead of the new v_reduc_type? v_reduc_type is also dependent on check_reduction (which comes from !nested_cycle in vectorizable_reduction). It seemed messy to keep on checking for both those things throughout. In my patch to catch simpler condition reductions, I’ll be adding another value to this enum too. v_reduc_type will be set to this new value based on the same properties for COND_REDUCTION plus some additional constraints. > >+ if (check_reduction && code != COND_EXPR && >+ vect_is_slp_reduction (loop_info, phi, def_stmt)) > >&&s go to the next line Ok. > >+ /* Reduction of the max index and a reduction of the found >+values. */ >+ epilogue_cost += add_stmt_cost (target_cost_data, 1, >+ vec_to_scalar, stmt_info, 0, >+ vect_epilogue); > >vec_to_scalar once isn't what the comment suggests. Instead the >comment suggests twice what a regular reduction would do >but I guess we can "hide" the vec_to_scalar cost and "merge" it >with the broadcast. Thus make the above two vector_stmt costs? > >+ /* A broadcast of the max value. */ >+ epilogue_cost += add_stmt_cost (target_cost_data, 2, >+ scalar_to_vec, stmt_info, 0, >+ vect_epilogue); > >comment suggests a single broadcast. I’ve made a copy/paste error here. Just need to swap the 1 and the 2. > >@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi) > the final vector of induction results: */ > exit_phi = NULL; > FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg) >-{ >+ { > gimple use_stmt = USE_STMT (use_p); > if (is_gimple_debug (use_stmt)) >continue; > >please avoid unrelated whitespace changes. Ok. I was changing “8 spaces” to a tab, but happy to revert. > >+ case COND_EXPR: >+ if (v_reduc_type == COND_REDUCTION) >+ { >... >+ /* Fall through. 
*/ >+ > case MIN_EXPR: > case MAX_EXPR: >- case COND_EXPR: > >aww, so we could already handle COND_EXPR reductions? How do they >differ from what you add? Did you see if that path is even exercised >today? Today, COND_EXPRs are only supported when they are nested inside a loop. See the vect-cond-*.c tests. For example: for (j = 0; j < M; j++) { x = x_in[j]; curr_a = a[0]; for (i = 0; i < N; i++) { next_a = a[i+1]; curr_a = x > c[i] ? curr_a : next_a; } x_out[j] = curr_a; } In that case, only the outer loop is vectorised. > >+ /* Create a vector of {init_value, 0, 0, 0...}. */ >+ vec *v; >+ vec_alloc (v, nunits); >+ CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val); >+ if (SCALAR_FLOAT_TYPE_P (scalar_type)) >+ for (i = 1; i < nunits; ++i) >+ CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, >+ build_real (scalar_type, >dconst0)); >+ else >+ for (i = 1; i < nunits; ++i) >+ CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, >+ build_int_cst (scalar_type, 0)); >+ init_def = build_constructor (vectype, v); > >you can unify the float/int case by using build_zero_cst (scalar_type). >Note that you should build a vector constant instead of a constructor >if init_val is a constant. The convenient way is to build the vector >elements into a tree[] array and use build_vector_stat in that case. Ok, will switch to build_zero_cst. Also, I will switch my vector to {init_value, init_value, init_value…}. I had {init_value, 0, 0, 0…} because I was going to have the option of ADD_REDUC_EXPR, but that got removed along the way. > >+ /* Find maximum value from the vector of found indexes. */ >+ tree max_index = make_temp_ssa_name (index_scalar_type, NULL, ""); > >just use make_ssa_name (index_scalar_type); Ok > >+ /* Convert the vector of data to the same type as the EQ. */ >+ tree vec_data_cast; >+ if ( TYPE_UNSIGNED (index_vec_type)) >+ { > >How come it never happens the element >sizes do not match? (int index type and double data type?) This was a little unclear. 
The induction index is originally created as an unsigned version of the type of the data vector. (See the definition of cr_index_vector_type in vectorizable_reduction(), which is then used to create cond_name.) I will remove the if and replace it with a gcc_checking_assert(TYPE_UNSIGNED (index_vec_type)) > >+ /* Where the max index occured, use the value from the data >vector. */ >+ tree vec_and = make_temp_ssa_name (index_vec_type_signed, NUL
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
Hi Bill, I’d be a bit worried about asking the backend for the cost of a COND_REDUCTION, as that will rely on the backend understanding the implementation the vectorizer is using - every time the vectorizer changed, the backends would need to be updated too. I’m hoping soon to get together a patch to reduce the stmts produced in the simpler cases, which would require a different set of costings. I can also imagine further improvements being added for other special cases over time. Having the backends understand every variation would be a little cumbersome. As it stands today, we correctly exit the optimisation if max reduction isn’t supported in hardware, which is what the cost model is expecting. If power wanted to use this implementation, then I think it’d probably need some code in tree-vect-generic.c to implement an emulated max reduction, which would then require updates to the cost modelling of anything that uses max reduction (not just cond reduction). All of that is outside the scope of this patch. Thanks, Alan. On 10/09/2015 23:14, "Bill Schmidt" wrote: >Hi Alan, > >The cost modeling of the epilogue code seems pretty target-specific ("An >EQ stmt and an AND stmt, reduction of the max index and a reduction of >the found values, a broadcast of the max value," resulting in two >vector_stmts, one vec_to_scalar, and two scalar_to_vecs). On powerpc, >this will not represent the cost accurately, and the cost will indeed be >quite different depending on the mode (logarithmic in the number of >elements). I think that you need to create a new entry in >vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow >each target to calculate the cost appropriately. > >(Powerpc doesn't have a max-reduction hardware instruction, but because >the reduction will be only in the epilogue code, it may still be >profitable for us to generate the somewhat expensive reduction sequence >in order to vectorize the loop. 
But we definitely want to model it as >costly in and of itself. Also, the sequence will produce the maximum >value in all positions without a separate broadcast.) > >Thanks, >Bill > >On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote: >> Hi, >> This patch (attached) adds support for vectorizing conditional >>expressions >> (PR 65947), for example: >> >> int condition_reduction (int *a, int min_v) >> { >> int last = 0; >> for (int i = 0; i < N; i++) >> if (a[i] < min_v) >> last = a[i]; >> return last; >> } >> >> To do this the loop is vectorised to create a vector of data results (ie >> of matching a[i] values). Using an induction variable, an additional >> vector is added containing the indexes where the matches occured. In the >> function epilogue this is reduced to a single max value and then used to >> index into the vector of data results. >> When no values are matched in the loop, the indexes vector will contain >> all zeroes, eventually matching the first entry in the data results >>vector. >> >> To vectorize sucessfully, support is required for REDUC_MAX_EXPR. This >>is >> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that >> REDUC_MAX_EXPR is not supported for the required modes, failing the >> vectorization. On mips it complains that the required vcond expression >>is >> not supported. It is suggested the relevant backend experts add the >> required backend support. >> >> Using a simple testcase based around a large number of N and run on an >> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of >> it's original time. >> >> This patch caused binary differences in three spec2006 binaries on >>aarch64 >> - 4.16.gamess, 435.gromacs and 456.hmmer. Running them on a juno board >> showed no improvement or degregation in runtime. 
>> >> >> In the near future I hope to submit a further patch (as PR 66558) which >> optimises the case where the result is simply the index of the loop, for >> example: >> int condition_reduction (int *a, int min_v) >> { >> int last = 0; >> for (int i = 0; i < N; i++) >> if (a[i] < min_v) >> last = i; >> return last; >> } >> In this case a lot of the new code can be optimized away. >> >> I have run check for aarch64, arm and x86 and have seen no regressions. >> >> >> Changelog: >> >> 2015-08-28 Alan Hayward >> >> PR tree-optimization/65947 >> * tree-vect-loop.c >> (vect_is_simple_reduction_1): Find condition reductions. >> (vec
[PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
Hi, This patch (attached) adds support for vectorizing conditional expressions (PR 65947), for example: int condition_reduction (int *a, int min_v) { int last = 0; for (int i = 0; i < N; i++) if (a[i] < min_v) last = a[i]; return last; } To do this the loop is vectorised to create a vector of data results (i.e. of matching a[i] values). Using an induction variable, an additional vector is added containing the indexes where the matches occurred. In the function epilogue this is reduced to a single max value and then used to index into the vector of data results. When no values are matched in the loop, the indexes vector will contain all zeroes, eventually matching the first entry in the data results vector. To vectorize successfully, support is required for REDUC_MAX_EXPR. This is supported by aarch64 and arm. On X86 and powerpc, gcc will complain that REDUC_MAX_EXPR is not supported for the required modes, failing the vectorization. On mips it complains that the required vcond expression is not supported. It is suggested the relevant backend experts add the required backend support. Using a simple testcase based around a large value of N and run on an aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of its original time. This patch caused binary differences in three spec2006 binaries on aarch64 - 416.gamess, 435.gromacs and 456.hmmer. Running them on a juno board showed no improvement or degradation in runtime. In the near future I hope to submit a further patch (as PR 66558) which optimises the case where the result is simply the index of the loop, for example: int condition_reduction (int *a, int min_v) { int last = 0; for (int i = 0; i < N; i++) if (a[i] < min_v) last = i; return last; } In this case a lot of the new code can be optimized away. I have run check for aarch64, arm and x86 and have seen no regressions. 
Changelog: 2015-08-28 Alan Hayward PR tree-optimization/65947 * tree-vect-loop.c (vect_is_simple_reduction_1): Find condition reductions. (vect_model_reduction_cost): Add condition reduction costs. (get_initial_def_for_reduction): Add condition reduction initial var. (vect_create_epilog_for_reduction): Add condition reduction epilog. (vectorizable_reduction): Condition reduction support. * tree-vect-stmts.c (vectorizable_condition): Add vect reduction arg * doc/sourcebuild.texi (Vector-specific attributes): Document vect_max_reduc testsuite/Changelog: PR tree-optimization/65947 * lib/target-supports.exp (check_effective_target_vect_max_reduc): Add. * gcc.dg/vect/pr65947-1.c: New test. * gcc.dg/vect/pr65947-2.c: New test. * gcc.dg/vect/pr65947-3.c: New test. * gcc.dg/vect/pr65947-4.c: New test. * gcc.dg/vect/pr65947-5.c: New test. * gcc.dg/vect/pr65947-6.c: New test. * gcc.dg/vect/pr65947-7.c: New test. * gcc.dg/vect/pr65947-8.c: New test. * gcc.dg/vect/pr65947-9.c: New test. * gcc.dg/vect/pr65947-10.c: New test. * gcc.dg/vect/pr65947-11.c: New test. Thanks, Alan 0001-Support-for-vectorizing-conditional-expressions.patch Description: Binary data
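The scheme described in the cover letter can be walked through in scalar C. This is an illustrative model only: the 1-based index and the `i % NLANES` lane layout are simplifications, and the name is invented; the real implementation builds these vectors with gimple statements and a REDUC_MAX_EXPR epilogue.

```c
#include <assert.h>

#define NLANES 4  /* vector factor for this sketch */

/* Each lane keeps the last data value it matched and a 1-based index
   of where that match happened; the epilogue max-reduces the index
   vector and returns the winning lane's data value, with index 0
   meaning "no lane ever matched".  */
int
condition_reduction_model (const int *a, int n, int min_v)
{
  int data[NLANES] = {0};       /* lanewise last-matched values */
  unsigned idx[NLANES] = {0};   /* lanewise last-match indexes  */

  for (int i = 0; i < n; i++)
    if (a[i] < min_v)
      {
        data[i % NLANES] = a[i];
        idx[i % NLANES] = i + 1;  /* 1-based so 0 can mean "no match" */
      }

  /* Epilogue: REDUC_MAX over the index vector ...  */
  unsigned max_idx = 0;
  for (int l = 0; l < NLANES; l++)
    if (idx[l] > max_idx)
      max_idx = idx[l];

  /* ... then pick out the data value from the lane that held it.  */
  if (max_idx != 0)
    for (int l = 0; l < NLANES; l++)
      if (idx[l] == max_idx)
        return data[l];
  return 0;                     /* no match: 'last' keeps its initial 0 */
}
```

For a = {5,3,9,1, 8,2,7,0, 9,9,9,9, 9,9,9,4} with min_v = 5, the last match is a[15] = 4, so lane 3 holds the largest index and the model returns 4, matching the scalar loop.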
[PATCH][rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1
[Cleaning this thread up to submit patch again, with better explanation] This patch causes subreg_get_info() to exit early in the simple cases where we are extracting a whole register from a multi-register value. In aarch64 for Big Endian we were producing a subreg of an OImode (256 bits) from a CImode (384 bits). This would hit the following assert in subreg_get_info: gcc_assert ((GET_MODE_SIZE (xmode) % GET_MODE_SIZE (ymode)) == 0); This is a rule we should be able to relax a little - if the subreg we want fits into a whole register then this is a valid result and can be easily detected earlier in the function. This has the bonus that we should be slightly reducing the execution time for more common cases, for example a subreg of 64 bits from 256 bits. This patch is required for the second part of the patch, which is aarch64 specific, and fixes up aarch64 Big Endian movoi/ci/xi. This second part has already been approved. This patch will apply cleanly by itself and no regressions were seen when testing aarch64 and x86_64 on make check. Cheers, Alan Changelog: 2014-11-14 Alan Hayward * rtlanal.c (subreg_get_info): Exit early for simple and common cases. --- gcc/rtlanal.c | 13 + 1 file changed, 13 insertions(+) diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c index c9bf69c..a3f7b78 100644 --- a/gcc/rtlanal.c +++ b/gcc/rtlanal.c @@ -3561,6 +3561,19 @@ subreg_get_info (unsigned int xregno, machine_mode xmode, info->offset = offset / regsize_xmode; return; } + /* Quick exit for the simple and common case of extracting whole + subregisters from a multiregister value. */ + if (!rknown + && WORDS_BIG_ENDIAN == REG_WORDS_BIG_ENDIAN + && regsize_xmode == regsize_ymode + && (offset % regsize_ymode) == 0) + { + info->representable_p = true; + info->nregs = nregs_ymode; + info->offset = offset / regsize_ymode; + gcc_assert (info->offset + info->nregs <= nregs_xmode); + return; + } } /* Lowpart subregs are otherwise valid. 
*/ -- 1.9.1 0001-BE-fix-load-stores.-Common-code.patch Description: Binary data
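The arithmetic in the new early exit can be modelled with plain byte sizes (a hypothetical stand-alone model; the real function works on machine modes and hard-register counts). For the aarch64 BE motivating case: xmode = CImode (48 bytes), ymode = OImode (32 bytes), with 16-byte vector registers:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the early-exit case: extracting a whole-register-aligned
   YSIZE-byte subreg from an XSIZE-byte multi-register value, where
   both modes use REGSIZE-byte registers.  On success, *NREGS is the
   number of registers in the subreg and *FIRST_REG the register
   offset, mirroring info->nregs and info->offset in the patch.  */
bool
whole_reg_subreg (unsigned xsize, unsigned ysize, unsigned regsize,
                  unsigned byte_offset,
                  unsigned *nregs, unsigned *first_reg)
{
  unsigned nregs_xmode = xsize / regsize;
  unsigned nregs_ymode = ysize / regsize;

  if (byte_offset % regsize != 0)
    return false;                    /* not on a register boundary */

  *nregs = nregs_ymode;
  *first_reg = byte_offset / regsize;
  /* Mirrors the gcc_assert added at the end of the new block.  */
  return *first_reg + *nregs <= nregs_xmode;
}
```

Note that 48 % 32 != 0, so this case previously tripped the `GET_MODE_SIZE (xmode) % GET_MODE_SIZE (ymode)` assert even though the extraction is perfectly register-aligned.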
Re: [PATCH][rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1
On 02/12/2014 12:36, "Alan Hayward" wrote: > >On 21/11/2014 14:08, "Alan Hayward" wrote: > >> >>On 14/11/2014 16:48, "Alan Hayward" wrote: >> >>>This is a new version of my BE patch from a few weeks ago. >>>This is part 1 and covers rtlanal.c. The second part will be aarch64 >>>specific. >>> >>>When combined with the second patch, It fixes up movoi/ci/xi for Big >>>Endian, so that we end up with the lab of a big-endian integer to be in >>>the low byte of the highest-numbered register. >>> >>>This will apply cleanly by itself and no regressions were seen when >>>testing aarch64 and x86_64 on make check. >>> >>> >>>Changelog: >>> >>>2014-11-14 Alan Hayward >>> >>>* rtlanal.c >>>(subreg_get_info): Exit early for simple and common cases >>> >>> >>>Alan. >> >>Hi, >> >>The second part to this patch (aarch64 specific) has been approved. >> >> >>Could someone review this one please. >> >> >>Thanks, >>Alan. > > >Ping. > > >Thanks, >Alan. Ping ping. Thanks, Alan. 0001-BE-fix-load-stores.-Common-code.patch Description: Binary data
Re: [rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1
On 21/11/2014 14:08, "Alan Hayward" wrote: > >On 14/11/2014 16:48, "Alan Hayward" wrote: > >>This is a new version of my BE patch from a few weeks ago. >>This is part 1 and covers rtlanal.c. The second part will be aarch64 >>specific. >> >>When combined with the second patch, It fixes up movoi/ci/xi for Big >>Endian, so that we end up with the lab of a big-endian integer to be in >>the low byte of the highest-numbered register. >> >>This will apply cleanly by itself and no regressions were seen when >>testing aarch64 and x86_64 on make check. >> >> >>Changelog: >> >>2014-11-14 Alan Hayward >> >>* rtlanal.c >>(subreg_get_info): Exit early for simple and common cases >> >> >>Alan. > >Hi, > >The second part to this patch (aarch64 specific) has been approved. > > >Could someone review this one please. > > >Thanks, >Alan. Ping. Thanks, Alan. 0001-BE-fix-load-stores.-Common-code.patch Description: Binary data
Re: [rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1
On 14/11/2014 16:48, "Alan Hayward" wrote: >This is a new version of my BE patch from a few weeks ago. >This is part 1 and covers rtlanal.c. The second part will be aarch64 >specific. > >When combined with the second patch, It fixes up movoi/ci/xi for Big >Endian, so that we end up with the lab of a big-endian integer to be in >the low byte of the highest-numbered register. > >This will apply cleanly by itself and no regressions were seen when >testing aarch64 and x86_64 on make check. > > >Changelog: > >2014-11-14 Alan Hayward > >* rtlanal.c >(subreg_get_info): Exit early for simple and common cases > > >Alan. Hi, The second part to this patch (aarch64 specific) has been approved. Could someone review this one please. Thanks, Alan.
FW: [Aarch64][BE][2/2] Fix vector load/stores to not use ld1/st1
On 20/11/2014 18:13, "Marcus Shawcroft" wrote:
>On 14 November 2014 16:48, Alan Hayward wrote:
>>This is a new version of my BE patch from a few weeks ago.
>>This is part 2 and covers all the aarch64 changes.
>>
>>When combined with the first patch, it fixes up movoi/ci/xi for Big
>>Endian, so that the lsb of a big-endian integer ends up in the low
>>byte of the highest-numbered register.
>>
>>This patch requires part 1 and David Sherwood’s patch:
>>  [AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.
>>
>>When tested with David’s patch and [1/2] of this patch, no regressions
>>were seen when testing aarch64 and x86_64 on make check.
>>
>>Changelog:
>>2014-11-14  Alan Hayward
>>
>>	* config/aarch64/aarch64.c
>>	(aarch64_classify_address): Allow extra addressing modes for BE.
>>	(aarch64_print_operand): New operand for printing a q
>>	register+1.
>
>
>Just a bunch of ChangeLog nits.
>
>+void aarch64_simd_emit_reg_reg_move (rtx *operands, enum machine_mode
>mode,
>+                                     unsigned int count);
>
>Drop the formal argument names.
>
>
>Can you respin with these changes please.
>
>/Marcus

New version. Identical to the previous version of the patch except for:
* removal of parameter names in aarch64-protos.h
* new changelog

2014-11-21  Alan Hayward

	PR 57233
	PR 59810
	* config/aarch64/aarch64.c
	(aarch64_classify_address): Allow extra addressing modes for BE.
	(aarch64_print_operand): New operand for printing a q register+1.
	(aarch64_simd_emit_reg_reg_move): Define.
	(aarch64_simd_disambiguate_copy): Remove.
	* config/aarch64/aarch64-protos.h
	(aarch64_simd_emit_reg_reg_move): Define.
	(aarch64_simd_disambiguate_copy): Remove.
	* config/aarch64/aarch64-simd.md
	(define_split): Use aarch64_simd_emit_reg_reg_move.
	(define_expand "mov<mode>"): Less restrictive predicates.
	(define_insn "*aarch64_mov<mode>"): Simplify and only allow for LE.
	(define_insn "*aarch64_be_movoi"): Define.
	(define_insn "*aarch64_be_movci"): Define.
	(define_insn "*aarch64_be_movxi"): Define.
	(define_split): OI mov.  Use aarch64_simd_emit_reg_reg_move.
	(define_split): CI mov.  Use aarch64_simd_emit_reg_reg_move.
	(define_split): XI mov.  Use aarch64_simd_emit_reg_reg_move.

Alan.

0001-BE-fix-load-stores.-Aarch64-code.-v2.patch
Description: Binary data
[Aarch64][BE][2/2] Fix vector load/stores to not use ld1/st1
This is a new version of my BE patch from a few weeks ago.
This is part 2 and covers all the aarch64 changes.

When combined with the first patch, it fixes up movoi/ci/xi for Big
Endian, so that the lsb of a big-endian integer ends up in the low
byte of the highest-numbered register.

This patch requires part 1 and David Sherwood’s patch:
  [AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.

When tested with David’s patch and [1/2] of this patch, no regressions
were seen when testing aarch64 and x86_64 on make check.

Changelog:
2014-11-14  Alan Hayward

	* config/aarch64/aarch64.c
	(aarch64_classify_address): Allow extra addressing modes for BE.
	(aarch64_print_operand): New operand for printing a q register+1.
	(aarch64_simd_emit_reg_reg_move): Replacement for
	aarch64_simd_disambiguate_copy that plants the required mov.
	* config/aarch64/aarch64-protos.h
	(aarch64_simd_emit_reg_reg_move): Replacement for
	aarch64_simd_disambiguate_copy.
	* config/aarch64/aarch64-simd.md
	(define_split): Use new aarch64_simd_emit_reg_reg_move.
	(define_expand "mov<mode>"): Less restrictive predicates.
	(define_insn "*aarch64_mov<mode>"): Simplify and only allow for LE.
	(define_insn "*aarch64_be_movoi"): New.  BE only.  Plant ldp or stp.
	(define_insn "*aarch64_be_movci"): New.  BE only.  No instructions.
	(define_insn "*aarch64_be_movxi"): New.  BE only.  No instructions.
	(define_split): OI mov.  Use new aarch64_simd_emit_reg_reg_move.
	(define_split): CI mov.  Use new aarch64_simd_emit_reg_reg_move.
	On BE plant movs for reg to/from mem case.
	(define_split): XI mov.  Use new aarch64_simd_emit_reg_reg_move.
	On BE plant movs for reg to/from mem case.

Alan.

0001-BE-fix-load-stores.-Aarch64-code.patch
Description: Binary data
[rtlanal.c][BE][1/2] Fix vector load/stores to not use ld1/st1
This is a new version of my BE patch from a few weeks ago.
This is part 1 and covers rtlanal.c. The second part will be aarch64
specific.

When combined with the second patch, it fixes up movoi/ci/xi for Big
Endian, so that the lsb of a big-endian integer ends up in the low
byte of the highest-numbered register.

This will apply cleanly by itself and no regressions were seen when
testing aarch64 and x86_64 on make check.

Changelog:

2014-11-14  Alan Hayward

	* rtlanal.c (subreg_get_info): Exit early for simple and common
	cases.

Alan.

0001-BE-fix-load-stores.-Common-code.patch
Description: Binary data
Re: [AArch64] [BE] Fix vector load/stores to not use ld1/st1
On 06/11/2014 09:09, "Yangfei (Felix)" wrote:
>> On 10 October 2014 16:19, Alan Hayward wrote:
>> > This patch is dependent on "[AArch64] [BE] [1/2] Make large opaque
>> > integer modes endianness-safe.”
>> >
>> > It fixes up movoi/ci/xi for Big Endian, so that we end up with the lsb
>> > of a big-endian integer in the low byte of the highest-numbered
>> > register.
>> >
>> > movoi uses stp/ldp.
>> > movxi needs a split version (which is shared with register->register
>> > movxi), which just splits to two new movs; movci can then split to
>> > three movs.  A future patch will instead split to an movoi and a movti.
>> >
>> > There are no changes for LE.
>> >
>> > Ran whole of check with both parts of "Make large opaque integer modes
>> > endianness-safe”.  No regressions.
>> >
>> > ChangeLog:
>> >
>> > gcc/:
>> > 2014-10-10  Alan Hayward
>> >	* config/aarch64/aarch64.c
>> >	(aarch64_classify_address): Allow extra addressing modes for BE.
>> >	(aarch64_print_operand): New operand for printing a q register+1.
>> >	(aarch64_simd_emit_reg_reg_move): Replacement for
>> >	aarch64_simd_disambiguate_copy that plants the required mov.
>> >	* config/aarch64/aarch64-protos.h
>> >	(aarch64_simd_emit_reg_reg_move): Replacement for
>> >	aarch64_simd_disambiguate_copy.
>> >	* config/aarch64/aarch64-simd.md
>> >	(define_split): Use new aarch64_simd_emit_reg_reg_move.
>> >	(define_expand "mov<mode>"): Less restrictive predicates.
>> >	(define_insn "*aarch64_mov<mode>"): Simplify and only allow for LE.
>> >	(define_insn "*aarch64_be_movoi"): New.  BE only.  Plant ldp or stp.
>> >	(define_insn "*aarch64_be_movci"): New.  BE only.  No instructions.
>> >	(define_insn "*aarch64_be_movxi"): New.  BE only.  No instructions.
>> >	(define_split): OI mov.  Use new aarch64_simd_emit_reg_reg_move.
>> >	(define_split): CI mov.  Use new aarch64_simd_emit_reg_reg_move.
>> >	On BE plant movs for reg to/from mem case.
>> >	(define_split): XI mov.  Use new aarch64_simd_emit_reg_reg_move.
>> >	On BE plant movs for reg to/from mem case.
>> >
>>
>> OK /Marcus
>
>
>Hello,
>
>  It seems that this patch breaks the compile of some testcases under
>big-endian.
>  One example: testsuite/gcc.target/aarch64/advsimd-intrinsics/vldX.c
>  Anything I missed?  Please confirm.  Thanks.
>
>$ aarch64_be-linux-gnu-gcc vldX.c
>
>vldX.c: In function 'exec_vldX':
>vldX.c:686:1: internal compiler error: in change_address_1, at
>emit-rtl.c:2096
> }
> ^
>0x73b18c change_address_1
>	/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/emit-rtl.c:2096
>0xdbd278 gen_split_2991(rtx_insn*, rtx_def**)
>	/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/config/aarch64/aarch64-simd.md:4436
>0x743aa1 try_split(rtx_def*, rtx_def*, int)
>	/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/emit-rtl.c:3639
>0x9b8c93 split_insn
>	/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/recog.c:2901
>0x9b8ee7 split_all_insns_noflow()
>	/home/jjj/p660/p660_build_dir_be/src/trunk/gcc/recog.c:3042
>Please submit a full bug report,
>with preprocessed source if appropriate.
>Please include the complete backtrace with any bug report.
>See <http://gcc.gnu.org/bugs.html> for instructions.
>

Did you try my patch with or without its dependency?
"[AArch64] [BE] [1/2] Make large opaque integer modes endianness-safe.”
(David Sherwood)

Without that patch I would expect some tests to fail.

In addition, it looks like David has had some problems merging that
patch to the latest head - I’d suggest not using this patch until David
has sorted his.

Alan.
[AArch64] [BE] Fix vector load/stores to not use ld1/st1
This patch is dependent on "[AArch64] [BE] [1/2] Make large opaque
integer modes endianness-safe.”

It fixes up movoi/ci/xi for Big Endian, so that we end up with the lsb
of a big-endian integer in the low byte of the highest-numbered
register.

movoi uses stp/ldp.
movxi needs a split version (which is shared with register->register
movxi), which just splits to two new movs; movci can then split to
three movs.  A future patch will instead split to an movoi and a movti.

There are no changes for LE.

Ran the whole of make check with both parts of "Make large opaque
integer modes endianness-safe”.  No regressions.

ChangeLog:

gcc/:
2014-10-10  Alan Hayward

	* config/aarch64/aarch64.c
	(aarch64_classify_address): Allow extra addressing modes for BE.
	(aarch64_print_operand): New operand for printing a q register+1.
	(aarch64_simd_emit_reg_reg_move): Replacement for
	aarch64_simd_disambiguate_copy that plants the required mov.
	* config/aarch64/aarch64-protos.h
	(aarch64_simd_emit_reg_reg_move): Replacement for
	aarch64_simd_disambiguate_copy.
	* config/aarch64/aarch64-simd.md
	(define_split): Use new aarch64_simd_emit_reg_reg_move.
	(define_expand "mov<mode>"): Less restrictive predicates.
	(define_insn "*aarch64_mov<mode>"): Simplify and only allow for LE.
	(define_insn "*aarch64_be_movoi"): New.  BE only.  Plant ldp or stp.
	(define_insn "*aarch64_be_movci"): New.  BE only.  No instructions.
	(define_insn "*aarch64_be_movxi"): New.  BE only.  No instructions.
	(define_split): OI mov.  Use new aarch64_simd_emit_reg_reg_move.
	(define_split): CI mov.  Use new aarch64_simd_emit_reg_reg_move.
	On BE plant movs for reg to/from mem case.
	(define_split): XI mov.  Use new aarch64_simd_emit_reg_reg_move.
	On BE plant movs for reg to/from mem case.

0001-be-mov.patch
Description: Binary data