[Bug c++/52315] New: [C++11] constexpr object of nested class
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52315 Bug #: 52315 Summary: [C++11] constexpr object of nested class Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de I'm not sure whether this is valid/invalid C++11, but it seems a little weird. The following works as expected: struct B { constexpr B (unsigned v) noexcept : val (v) { } constexpr unsigned value (void) const noexcept { return val; } unsigned val; }; struct A { int stuff[B (16).value ()]; }; But moving struct B inside of struct A ... struct A { struct B { constexpr B (unsigned v) noexcept : val (v) { } constexpr unsigned value (void) const noexcept { return val; } unsigned val; }; int stuff[ B (16).value () ]; }; ... results in error: size of array 'stuff' is not an integral constant-expression Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20120220 (experimental) (GCC)
[Bug target/52049] New: SH Target: Inefficient constant address access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52049 Bug #: 52049 Summary: SH Target: Inefficient constant address access Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* static volatile int* const g_0 = (volatile int*)0x1240; static volatile int* const g_1 = (volatile int*)0x1244; static volatile int* const g_2 = (volatile int*)0x1248; static volatile int* const g_3 = (volatile int*)0x124C; int test_24 (void) { return *g_0 + *g_1 + *g_2 + *g_3; } Compiled with -O1: mov.w.L31,r1 mov.l@r1+,r0 mov.l@r1+,r3 mov.l@r1+,r2 mov.l@r1,r1 addr3,r0 addr2,r0 rts addr1,r0 .align 1 .L31: .short4672 Compiled with -O2, -Os, -O3: mov.w.L31,r1 mov.l@r1,r0 mov.l@(4,r1),r3 mov.l@(8,r1),r2 add#12,r1 ! why not mov.l @(12,r1),r1 ?? mov.l@r1,r1 addr3,r0 addr2,r0 rts addr1,r0 .align 1 .L31: .short4672 This happens always for the last memory access, if the number of contiguous accesses is > 2. When using non-volatile variables: static int* const h_0 = (int*)0x1240; static int* const h_1 = (int*)0x1244; static int* const h_2 = (int*)0x1248; static int* const h_3 = (int*)0x124C; int test_25 (void) { return *h_0 + *h_1 + *h_2 + *h_3; } Compiled with -O1,-O2,-Os,-O3: mov.w.L33,r1 mov.l@r1+,r0 mov.l@r1,r1 addr1,r0 mov.w.L34,r1 mov.l@r1,r1 addr1,r0 mov.w.L35,r1 mov.l@r1,r1 rts addr1,r0 .align 1 .L33: .short4672 .L34: .short4680 .L35: .short4684 Better: mov.w.L31,r1 mov.l@r1+,r0 mov.l@r1+,r3 mov.l@r1+,r2 addr3,r0 mov.l@r1,r1 addr2,r0 rts addr1,r0 .align 1 .L31: .short4672 I'm not sure whether this is actually a problem of the SH back-end or of some middle-end passes. It happens for all sub-targets and regardless of the endianess. Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20120129 (experimental) (GCC)
[Bug target/31640] cache block alignment is too aggressive on sh-elf
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31640 --- Comment #3 from Oleg Endo 2011-12-31 17:24:47 UTC --- Created attachment 26208 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26208 Proposed patch (In reply to comment #0) > The sh4 port aligns blocks that have no fallthrus and that are either > frequently executed (JUMP_ALIGN) or preceeded a barrier > (LABEL_ALIGN_AFTER_BARRIER) on a cache line. > > While in theory this help to avoid cache misses if the block slits over 2 > cache > lines, in practise this reduces cache locality and lenghten distance between > blocks. > The number of issued instructions are also impacted. For example the relative > indirect address in jump tables needs a byte zero extend instruction if the > distance occupies 8 bits instead of 7 bits. > > I ran some experiments and benchmarked (eembc) with 2 strategies > 1) -falign-jumps=1 > 2) Align the block if the size is bigger than a given threshold. (empirically > set to 16 bytes, half of the cache line size). See illustrating attached > patch. > > My conclusion is that in -O3 the performance never degrades (option 2 is a > little bit better, even improving dhrystone by 3%) when removing this padding. > And the text size improves by ~15%. Because of this I would like to propose the following alignment strategies (unless they are changed by the user with -falign-??? options). -Os: Align everything to 2 byte to get compact code -O2,-O3: Align functions to 4 bytes. Align labels and jumps to 2 bytes (to avoid potential code bloat). Align loops to 4 bytes. The attached patch should do that, although not fully tested yet.
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #8 from Oleg Endo 2011-12-30 21:21:14 UTC --- (In reply to comment #7) > (In reply to comment #3) > > I haven't ran all tests on it yet, but CSiBE shows average code size > > reduction > > of approx. -0.1% for -m4* with some code size increases in some files. > > Would something like that be OK for stage 3? > > Looks good, though not appropriate for stage 3, I think. The patch passed the testsuite without new failures. I'll queue it up for stage 1.
[Bug target/50749] SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 --- Comment #10 from Oleg Endo 2011-12-30 02:14:00 UTC --- (In reply to comment #9) > (In reply to comment #8) > > Specifying -fno-tree-forwprop doesn't seem to have any effect on these > > cases. > > For that function, -fdump-tree-all shows that the tree loop ivopts > optimization does it. Try -fno-tree-forwprop -fno-ivopts. This makes even worse code. Anyway, I think this issue should be addressed by the auto_inc_dec pass. If OK, I'd like to change it from target PR to middle-end PR.
[Bug target/51708] New: SH Target: SHAD / SHLD constant not CSE-ed
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51708 Bug #: 51708 Summary: SH Target: SHAD / SHLD constant not CSE-ed Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de Target: sh*-*-* If the shift amount is a constant for SHAD / SHLD insns it should be CSE-ed. This could be done in a similar way as in the movnegt insn. Example function: int foo (int* a, int x, int n) { int i; int count; for (i = 0; i < n; i ++) count += (*(a + i) << 12); return count; } compiled with -m4 -O3: cmp/pl r6 bf .L11 shll2 r6 add #-4,r6 shlr2 r6 add #1,r6 .L9: mov.l @r4+,r1 mov #12,r2! load constant in loop dt r6 shldr2,r1 ! use constant in loop bf/s.L9 add r1,r0 .L11: rts nop better: cmp/pl r6 bf .L11 shll2 r6 add #-4,r6 shlr2 r6 add #1,r6 mov #12,r2! load constant before loop .L9: mov.l @r4+,r1 dt r6 shldr2,r1 ! use constant in loop bf/s.L9 add r1,r0 .L11: rts nop
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #15 from Oleg Endo 2011-12-29 00:34:53 UTC --- (In reply to comment #14) > With trunk rev 181517 I have observed the following problem, which happens > when > compiling for -m2*, -m3*, -m4* and -Os: > This is still present as of rev 182713 and seems to be a different issue. I've created PR51697 for it.
[Bug target/51697] New: SH Target: Inefficient DImode comparisons for -Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51697 Bug #: 51697 Summary: SH Target: Inefficient DImode comparisons for -Os Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* For -Os and everything but -m1 DImode comparisons are not optimized properly which results in redundant SImode comparisons, producing code worse than for -O1. A reduced example: int test_0 (long long* x) { return *x & 0x ? -20 : -40; } -Os -m2/-m3/-m4: mov#0,r2 ! 55movsi_ie/3[length = 2] tstr2,r2 ! 57cmpeqsi_t/1[length = 2] bf/s.L12! 58branch_false[length = 2] mov.l@(4,r4),r3! 12movsi_ie/7[length = 2] tstr3,r3 ! 59cmpeqsi_t/1[length = 2] .L12: bt/s.L11! 14branch_true[length = 2] mov#-40,r0 ! 5movsi_ie/3[length = 2] mov#-20,r0 ! 4movsi_ie/3[length = 2] .L11: rts nop ! 65*return_i[length = 4] -Os -m1: -O2 -m4: mov.l @(4,r4),r1 ! 10movsi_i/5[length = 2] mov #-40,r0 ! 5movsi_i/3[length = 2] tst r1,r1 ! 15cmpeqsi_t/1[length = 2] bt .L7 ! 16branch_true[length = 2] mov #-20,r0 ! 4movsi_i/3[length = 2] .L7: rts nop ! 61*return_i[length = 4] -O1 -m4: mov.l @(4,r4),r1 ! 10movsi_ie/7[length = 2] tst r1,r1 ! 17cmpeqsi_t/1[length = 2] bt/s.L6 ! 18branch_true[length = 2] mov #-40,r0 ! 5movsi_ie/3[length = 2] mov #-20,r0 ! 4movsi_ie/3[length = 2] .L6: rts nop ! 62*return_i[length = 4] Another example would be: int test_2 (unsigned long long x) { return x >= 0x1LL ? -20 : -40; } -Os -m2/-m3/-m4: mov #0,r2 ! 48movsi_ie/3[length = 2] mov #-1,r3 ! 49movsi_ie/3[length = 2] cmp/eq r2,r4 ! 9cmpgtudi_t[length = 8] bf/s.Ldi67 cmp/hi r2,r4 cmp/hi r3,r5 .Ldi67: bf/s.L16! 10branch_false[length = 2] mov #-40,r0 ! 5movsi_ie/3[length = 2] mov #-20,r0 ! 4movsi_ie/3[length = 2] .L16: rts nop ! 52*return_i[length = 4] -Os -m1: tst r4,r4 ! 9cmpeqsi_t/1[length = 2] mov #-20,r0 ! 4movsi_i/3[length = 2] bf .L12! 10branch_false[length = 2] mov #-40,r0 ! 5movsi_i/3[length = 2] .L12: rts nop ! 56*return_i[length = 4] The problem does not appear for -m1, only for -Os and -m2*, -m3*, -m4*.
[Bug target/51340] SH Target: Make -mfused-madd enabled by default
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51340 --- Comment #4 from Oleg Endo 2011-12-29 00:02:40 UTC --- (In reply to comment #3) > (In reply to comment #2) > > Uhm, yes... > > The title should have been "Enable -mfused-madd by -ffast-math" > > Do you mean something like this? > > --- ORIG/trunk/gcc/config/sh/sh.c2011-12-03 10:03:41.0 +0900 > +++ trunk/gcc/config/sh/sh.c2011-12-27 08:33:23.0 +0900 > @@ -838,6 +838,11 @@ sh_option_override (void) > align_functions = min_align; > } > > + /* Default to use fmac insn when -ffast-math. See PR target/29100. */ > + if (global_options_set.x_TARGET_FMAC == 0 > + && fast_math_flags_set_p (&global_options) > +TARGET_FMAC = 1; > + >if (sh_fixed_range_str) > sh_fix_range (sh_fixed_range_str); > Yes, something like that. Or maybe check flag_unsafe_math_optimizations, as it is done for FSCA and FSRRA insns in sh.md. > > I don't know the exact semantics for the new patterns. All I know is that > > rounding is supposed to be done only once after the two operations. This is > > the case for the SH fmac insn. Not sure whether this is enough though. > > It seems that we can use the fma pattern, though it would be an another issue. Maybe when trunk is back to stage 1.
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #6 from Oleg Endo 2011-12-28 15:59:35 UTC --- (In reply to comment #3) > Created attachment 26191 [details] > Proposed patch to improve some of the issues. > > The attached patch removes the useless sequence and still allows the -1 > constant to be CSE-ed for such cases as the example function above. > > I haven't ran all tests on it yet, but CSiBE shows average code size reduction > of approx. -0.1% for -m4* with some code size increases in some files. Some of the code size increases are caused by the ifcvt.c pass which tries to transform sequences like: int test_func_6 (int a, int b, int c) { if (a == 16) c = 0; return b + c; } into branch-free code like: mov r4,r0 ! 45movsi_ie/2[length = 2] cmp/eq #16,r0 ! 9 cmpeqsi_t/2[length = 2] mov #-1,r0 ! 34movsi_ie/3[length = 2] negcr0,r0 ! 38*negc[length = 2] neg r0,r0 ! 36negsi2[length = 2] and r6,r0 ! 37*andsi3_compact/2[length = 2] rts ! 48*return_i[length = 2] add r5,r0 ! 14*addsi3_compact[length = 2] instead of the more compact (and on SH4 most likely better): movr4,r0 ! 41movsi_ie/2[length = 2] cmp/eq#16,r0 ! 9cmpeqsi_t/2[length = 2] bf0f ! 34*movsicc_t_true/2[length = 4] mov#0,r6 0: addr5,r6 ! 14*addsi3_compact[length = 2] rts ! 44*return_i[length = 2] movr6,r0 ! 19movsi_ie/2[length = 2] This particular case is handled in noce_try_store_flag_mask, which does the transformation if BRANCH_COST >= 2, which is true for -m4. I guess before the patch ifcvt didn't realize that this transformation can be applied. I've tried setting BRANCH_COST to 1, which avoids this transformation but increases overall code size a bit.
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #5 from Oleg Endo 2011-12-28 02:44:05 UTC --- (In reply to comment #2) > (In reply to comment #1) > > > > BTW, OT, (a != b || a != c) ? b : c could be reduced to b, I think. > > > > Yes, very much so. > It is reduced to "return b" for -m2, -m2e, -m2a, -m3, -m3e > but not for -m1 and -m4*. This seems to be due to the following in sh.h: #define BRANCH_COST(speed_p, predictable_p) \ (TARGET_SH5 ? 1 : ! TARGET_SH2 || TARGET_HARD_SH4 ? 2 : 1)
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #4 from Oleg Endo 2011-12-27 23:17:03 UTC --- (In reply to comment #1) > > > return a >= 0 && b >= 0 ? c : d; > > x >= 0 is expanded to the sequence like > > ra = not x > rb = -31 > rc = ra >> (neg rb) > T = (rc == 0) > conditional jump > > and combine tries to simplify it. combine simplifies b >= 0 > successfully into shll and bt but fails to simplify a >= 0. > It seems that combine doesn't do constant propagation well and > misses the constant -31. Another simpler fail: int test_func_22_NG (int a, int b, int c, int d) { return a >= 0; } becomes: not r4,r0 ! 9one_cmplsi2[length = 2] mov #-31,r1 ! 12movsi_ie/3[length = 2] rts ! 31*return_i[length = 2] shldr1,r0 ! 13lshrsi3_d[length = 2] which could be: cmp/pzr4 rts movtr0 >From what I could observe, this is caused by the various shift insns which leads combine to this result. For example, the shll, branch sequence that is used instead of cmp/pz, branch is caused by the ashlsi_c insn, which defines a lt:SI comparison. Although that is correct, using cmp/pz could be better, since it does not modify the reg, and on SH4 it is an MT group insn. The ashlsi_c insn / lt:SI picking can be avoided by adjusting the rtx costs, for instance (just tried it out briefly). I think a peephole in this case could fix some of the symptoms but not the actual cause. I'll see if I can come up with something that works without a peephole, even though all the shift stuff looks a bit suspicious ;)
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #3 from Oleg Endo 2011-12-27 22:43:11 UTC --- Created attachment 26191 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26191 Proposed patch to improve some of the issues. (In reply to comment #1) > > [...] > > mov #-1,rn > negc rn,rm > tst #255,rm > > which is essentially T_reg = T_reg. Usually combine catches such > situation, but negc might be too complex for combine. > For this case, replacing current movnegt expander by insn, splitter > and peephole something like > > [...] > > the above useless sequence could be removed, though we will miss > the chance that the -1 can be CSE-ed when the cstore value is > used. This will cause a bit worse code for the loop like > > int > foo (int *a, int x, int n) > { > int i; > int count; > > for (i = 0; i < n; i++) > count += (*(a + i) != x); > > return count; > } > Thanks for your ideas and comments. It was really useful. The attached patch removes the useless sequence and still allows the -1 constant to be CSE-ed for such cases as the example function above. I haven't ran all tests on it yet, but CSiBE shows average code size reduction of approx. -0.1% for -m4* with some code size increases in some files. Would something like that be OK for stage 3?
[Bug target/51244] SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 --- Comment #2 from Oleg Endo 2011-12-27 21:26:33 UTC --- (In reply to comment #1) > > BTW, OT, (a != b || a != c) ? b : c could be reduced to b, I think. > Yes, very much so. It is reduced to "return b" for -m2, -m2e, -m2a, -m3, -m3e but not for -m1 and -m4*. The correct test function should be rather: int test_func_0_NG (int a, int b, int c, int d) { return (a != b || a != d) ? b : c; } which is actually OK for all variants except -m1 and -m4*: cmp/eqr5,r4! 11cmpeqsi_t/3[length = 2] bf.s.L6! 12branch_false[length = 2] cmp/eqr7,r5! 14cmpeqsi_t/3[length = 2] bf.L6! 15branch_false[length = 2] movr6,r5! 8movsi_i/2[length = 2] .L6: rts! 42*return_i[length = 2] movr5,r0! 23movsi_i/2[length = 2]
[Bug target/51340] SH Target: Make -mfused-madd enabled by default
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51340 --- Comment #2 from Oleg Endo 2011-12-19 01:29:12 UTC --- (In reply to comment #1) > (In reply to comment #0) > > Is there any particular reason why this should not be enabled by > > default for SH targets that support the FMAC insn? > > PR29100? Uhm, yes... The title should have been "Enable -mfused-madd by -ffast-math" > > BTW, if SH fmac satisfies the semantics for fused multiplication and > add operation, the fmaf4 instruction pattern would be better now. I don't know the exact semantics for the new patterns. All I know is that rounding is supposed to be done only once after the two operations. This is the case for the SH fmac insn. Not sure whether this is enough though.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #20 from Oleg Endo 2011-12-12 02:11:12 UTC --- (In reply to comment #19) > The results look way better now. I've tested your latest patch for > sh4-unknown-linux-gnu and found no new regressions for gcc testsuite. > CSiBE with "-O2 -fpic" on that target shows that 144 improvements and > 28 dis-improvements for size on 896 files. The worst case is > -4.34783 net/ipv4/ip_forward 704 736 > which looks the case of the high r0 register pressure. The best one is > 25.7426 arch/testplatform/kernel/traps 10160 8080 > which looks to be very impressive. That looks nice! Thanks for checking it out! I haven't ran CSiBE with "-O2 -fpic", only with "-Os -mpretend-cmove -mfused-madd -freg-struct-return". I will use your params and try to see what is happening in those cases where it gets worse. Maybe it can be "fixed" with a peephole. > > > /* We want to enable the use of SUBREGs as a means to > > VEC_SELECT a single element of a vector. */ > >+ > >+ /* This effectively disallows using GENERAL_REGS for SFmode vector > >subregs. > >+ This can be problematic when SFmode vector subregs need to be accessed > >+ on the stack with displacement addressing, as it happens with -O0. > >+ Thus we allow the mode change for -O0. */ > > if (to == SFmode && VECTOR_MODE_P (from) && GET_MODE_INNER (from) == > > SFmode) > >-return (reg_classes_intersect_p (GENERAL_REGS, rclass)); > >+return optimize ? (reg_classes_intersect_p (GENERAL_REGS, rclass)) : > >false; > >+ Thus we allow the mode change for -O0. */ .. should be _disallow_ of course... Another note that should have went into the comment: As far as I could observe it, this is mainly triggered by the following in sh_legitimate_index_p: + if (mode == QImode && (unsigned) INTVAL (op) < 16) +return true; .. probably because this makes the generic "m" constraint match QImode displacement addressing and then it tries using it. > Rather than that, I guess that the QI/HImode disp addressing would > be an optimization unneeded for -O0 in the first place. Perhaps > something like -mpreferdisp option and TARGET_PREFER_DISP macro > which are enable by default but disable at -O0 might be help. Yeah, could also be an option. > It'll also help some unfortunate anormallies for which those optimizations > will generate worse codes. You mean, by giving the user the option to turn off displacement addressing for e.g. some specific files / modules by specifying -mno-preferdisp or something like that? By anomalies do you mean code that gets worse because of too much pressure on R0 and all the reloads around it, or do you have any other bad use cases? BTW, the vector mode handling seems a bit unfinished (see also PR13423). I was planning to address that at a later point... > Maybe. Implementing it with predicates and constraints would be > smarter if possible but may be difficult because the register > allocator handles the "m" constraint specially. Yes, the "m" constraint is an obstacle in this case. What I've tried out is splitting it into a memory constraint that allows displacement addressing and another memory constraint that disallows it ("Snd", "Sdd" - they are added by the patch) and use those in the move / sign extend insns instead of the generic "m" constraint. For example something like that: (define_insn "*extendqisi2" [(set (match_operand:SI 0 "arith_reg_dest" "=z,r,r") (sign_extend:SI (match_operand:QI 1 "general_movsrc_operand" Sdd,Snd,r")))] "TARGET_SH1" "@ mov.b%1,%0 mov.b%1,%0 exts.b%1,%0" [(set_attr "type" "load,load,arith")]) This basically seems to work. But when there are consecutive loads, reload would use displacement addressing for the first load, but not for the following loads because R0 will be already allocated at that point. Ideally, reload should take into account that "reloading around R0" is in most cases more efficient than other strategies, especially on SH4. However, I'm not sure whether changing reload for this issue is a good idea ;) Another thing I could try out is to have load/store insns that allow arbitrary operands in displacement addressing like on SH2A, and split them into two insns of one load/store and one reg-reg move after reload. But that would probably require the R0 clobber in the expander which could make worse code in cases where displacement addressing is not used, I guess. Do you think this approach could make sense? > I think so, though we are in stage 3 and have to wait the trunk returns > to stage 1 or 2 for committing such changes. I was afraid you might say something like this :T > You have the time for implementing HImode support. Yep, sure. I've noticed that the latest version of the patch seems to fix some more testsuite failures. I will investigate which hunk is responsible for the fixes so that could be pulled out from the patch. OK? > BTW, th
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 Oleg Endo changed: What|Removed |Added Attachment #25932|0 |1 is obsolete|| --- Comment #18 from Oleg Endo 2011-12-11 00:24:17 UTC --- Created attachment 26046 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26046 Proposed patch to add QImode displacement addressing This version of the patch was tested against rev 182090 and did not introduce new failures. It allows the generation of QImode displacement addressing move insns, but only if the addresses can be fixed before reload. If reload needs to load/store e.g. pseudos on the stack where displacement addressing would be required the generated code is not as good unfortunately. One good example for this kind of scenario is the gcc.dg/compat/struct-by-value-11 test from the testsuite, which shows that there is still some room for improvement. Still, on the CSiBE I could observe a code size decrease of -1.2% avg, where mpeg2dec goes down to -5.65% and libmpeg2/motion_comp down to -20% (-m4 -ml taken as example). Due to SH2A's 4 byte QImode displacement insn the benefit on SH2A code is not that big (-0.2% avg). There are probably smarter ways of doing what the patch does. I have also tried out implementing it with predicates and constraints, few load/store insns and lots of alternatives in the insns. However, reload would refuse to select the displacement addressing due to pressure on R0 in many cases. Would something like the attached patch be acceptable (after some cleanups)? If so, I'd also start adding HImode displacement addressing support.
[Bug target/50749] SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 --- Comment #8 from Oleg Endo 2011-11-28 22:31:44 UTC --- (In reply to comment #7) > The problem is that SH target can't do those simple array accesses > well at QI/HImode because of the lack of displacement addressing > for those modes. In these particular cases, this is true. Even if there was a working QI/HImode displacement addressing, it would most likely result in worse code compared to post-inc / pre-dec addressing opportunities, because of register pressure on R0. Apart from that there is no displacement addressing for FPU loads/stores, which also misses some opportunities: float test_func_5 (float* p, int c) { float r = 0; do { r += *p++; r += *p++; r += *p++; } while (--c); return r; } Compiled with -Os -m4-single: fldi0fr0 .L11: movr4,r1 fmov.s@r1+,fr1 dtr5 faddfr1,fr0 fmov.s@r1,fr1 movr4,r1 add#8,r1 add#12,r4 faddfr1,fr0 fmov.s@r1,fr1 bf/s.L11 faddfr1,fr0 rts nop ..could be: fldi0fr0 .L11: fmov.s@r4+,fr1 dtr5 faddfr1,fr0 fmov.s@r4+,fr1 faddfr1,fr0 fmov.s@r4+,fr1 bf/s.L11 faddfr1,fr0 rts nop Specifying -fno-tree-forwprop doesn't seem to have any effect on these cases.
[Bug target/51340] New: SH Target: Make -mfused-madd enabled by default
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51340 Bug #: 51340 Summary: SH Target: Make -mfused-madd enabled by default Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Currently the FMAC insn has to be enabled by specifying the -mfused-madd option. Is there any particular reason why this should not be enabled by default for SH targets that support the FMAC insn?
[Bug target/51337] New: SH Target: Various testsuite ICEs for -m2a -O0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51337 Bug #: 51337 Summary: SH Target: Various testsuite ICEs for -m2a -O0 Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh2a-*-* There are various testsuite ICEs for -m2a -mb -O0, which look similar to the following: gcc.c-torture/compile/2923-1.c: ...: error: insn does not satisfy its constraints: (insn 142 34 35 (set (mem/c:SI (plus:SI (reg/f:SI 14 r14) (const_int 36 [0x24])) [0 %sfp+-16 S4 A32]) (reg:SI 150 fpul)) ... {movsi_ie} (nil)) ...: internal compiler error: in extract_constrain_insn_cached, at recog.c:2052 The problem seems to be the movsi_ie insn, for which reload thinks it is possible to load the FPUL register from a stack allocated reg using SImode displacement addressing, which is actually not possible. The following patch seems to help the issues... --- gcc/config/sh/sh.c.orig2011-11-28 10:03:04.0 +0900 +++ gcc/config/sh/sh.c2011-11-28 15:09:01.0 +0900 @@ -12432,6 +12432,10 @@ sh_secondary_reload (bool in_p, rtx x, r if (rclass != GENERAL_REGS && REG_P (x) && TARGET_REGISTER_P (REGNO (x))) return GENERAL_REGS; + /* If here fall back to loading FPUL register through general regs. + Happens when FPUL has to be loaded from a reg allocated on the stack. */ + if (rclass == FPUL_REGS && !REG_P (x)) +return GENERAL_REGS; return NO_REGS; } However, I think the following should be a better way to check the condition in question... + if (rclass == FPUL_REGS && true_regnum (x) == -1) +return GENERAL_REGS; I'll submit a patch once testing is completed.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 Oleg Endo changed: What|Removed |Added Attachment #25848|0 |1 is obsolete|| --- Comment #17 from Oleg Endo 2011-11-28 12:26:42 UTC --- Created attachment 25932 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25932 Proposed patch to add QImode displacement addressing An updated version of the patch. Tested against rev 181404, resolves some previous failures for SH2A but also introduces the following libstdc++ related failures: FAIL: 22_locale/ctype/is/wchar_t/11740.cc (test for excess errors) FAIL: 22_locale/money_get/get/char/5.cc execution test FAIL: 22_locale/money_put/put/char/12971.cc execution test FAIL: 22_locale/money_put/put/char/39168.cc execution test FAIL: 22_locale/money_put/put/char/4.cc execution test FAIL: 22_locale/money_put/put/char/5.cc execution test FAIL: 22_locale/money_put/put/char/6.cc execution test FAIL: 22_locale/money_put/put/wchar_t/12971.cc execution test FAIL: 22_locale/money_put/put/wchar_t/39168.cc execution test FAIL: 22_locale/money_put/put/wchar_t/4.cc execution test FAIL: 22_locale/money_put/put/wchar_t/5.cc execution test FAIL: 22_locale/money_put/put/wchar_t/6.cc execution test Still investigating...
[Bug target/50814] SH Target: SHAD / SHLD instructions not used on SH2A
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50814 --- Comment #2 from Oleg Endo 2011-11-27 22:38:34 UTC --- (In reply to comment #1) > Will be slightly different because sh2a's shad&shld are 4-byte > insns. Perhaps something like below will work, though I don't > test it at all. > According to the SW manual document rej09b0051_sh2a.pdf the SHAD and SHLD insns have the same 2-byte format as on SH3: SHAD Rm, Rn: 01001100 SHLD Rm, Rn: 01001101 Am I missing something there?
[Bug middle-end/50325] [4.7 Regression] 76 new fails with rev. 177691
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50325 Oleg Endo changed: What|Removed |Added CC||oleg.e...@t-online.de --- Comment #19 from Oleg Endo 2011-11-21 00:48:16 UTC --- Hello, the patch from 181405 also breaks things for the SH target when compiling for big endian. For instance when compiling the CSiBE set it bails out with: /home/code/CSiBE/./src/./teem-1.6.0-src/src/air/754.c: In function 'airFPGen_d': /home/code/CSiBE/./src/./teem-1.6.0-src/src/air/754.c:283:45: internal compiler error: in int_mode_for_mode, at stor-layout.c:424 I haven't checked the testsuite results, but I guess they are similar to the other reports.
[Bug target/51244] New: SH Target: Inefficient conditional branch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51244 Bug #: 51244 Summary: SH Target: Inefficient conditional branch Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Created attachment 25869 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25869 Examples It seems that the condition inversion sometimes gets confused, resulting in unnecessary code bloat. The attached examples were compiled with -Os but the same problem happens also at -O2 and -O3. sh-elf-gcc -v Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 2019 (experimental) (GCC)
[Bug target/51241] New: SH Target: Unnecessary sign/zero extensions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51241 Bug #: 51241 Summary: SH Target: Unnecessary sign/zero extensions Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Created attachment 25868 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25868 Examples Under some circumstances unnecessary sign/zero extension insns are generated as shown in the attached example. sh-elf-gcc -v Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 2019 (experimental) (GCC)
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #14 from Oleg Endo 2011-11-20 13:04:39 UTC --- With trunk rev 181517 I have observed the following problem, which happens when compiling for -m2*, -m3*, -m4* and -Os: The function compiled with -m2 -Os int test_int64_lowword (long long* x) { return *x & 0x ? -20 : -40; } results in the following code: mov#0,r2! 42movsi_i/3[length = 2] tstr2,r2! 44cmpeqsi_t/1[length = 2] bf.s.L4! 45branch_false[length = 2] mov.l@(4,r4),r3! 12movsi_i/5[length = 2] tstr3,r3! 46cmpeqsi_t/1[length = 2] .L4: bt.L3! 14branch_true[length = 2] mov#-20,r0! 4movsi_i/3[length = 2] rts nop! 52*return_i[length = 4] .L3: rts! 54*return_i[length = 2] mov#-40,r0! 5movsi_i/3[length = 2] When compiled with -O1 or -O2 the high word test is completely eliminated correctly: mov.l@(4,r4),r1! 10movsi_i/5[length = 2] tstr1,r1! 17cmpeqsi_t/1[length = 2] bt.L4! 18branch_true[length = 2] mov#-20,r0! 4movsi_i/3[length = 2] rts nop! 50*return_i[length = 4] .L4: rts! 52*return_i[length = 2] mov#-40,r0! 5movsi_i/3[length = 2] I'm not sure whether this is actually related to this PR, but have noticed it with the test cases of this PR. It seems only to happen for comparison-like insns. If I remember correctly, this problem did not exist when I started working on this PR.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 Oleg Endo changed: What|Removed |Added Attachment #25684|0 |1 is obsolete|| --- Comment #16 from Oleg Endo 2011-11-17 23:58:11 UTC --- Created attachment 25848 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25848 Proposed patch to add QImode displacement addressing Just wanted to share some progress on this issue... The FPUL case from before seems to be solved, but another issue is left which is related to vector modes. For example the 20050113-1.c target test case compiled with -O0 causes: 20050113-1.c:16:1: internal compiler error: in change_address_1, at emit-rtl.c:2001 Please submit a full bug report, with preprocessed source if appropriate. This is because of the following in function sh_cannot_change_mode_class: if (to == SFmode && VECTOR_MODE_P (from) && GET_MODE_INNER (from) == SFmode) return (reg_classes_intersect_p (GENERAL_REGS, rclass)); IRA's last words on that are: Reloads for insn # 52 Reload 0: FPUL_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1), can't combine, secondary_reload_p reload_reg_rtx: (reg:SI 150 fpul) Reload 1: reload_in (SI) = (reg:SI 65 fr1 [orig:171 ] [171]) GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1) reload_in_reg: (reg:SI 65 fr1 [orig:171 ] [171]) reload_reg_rtx: (reg/i:SI 0 r0) secondary_in_reload = 0 deleting insn with uid = 26.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #15 from Oleg Endo 2011-11-02 01:38:26 UTC --- (In reply to comment #14) > .ira dump would be your friend, though I suspect that your patch > triggered off some other reload problem like PR48596. Could you > try the change in #5 of that PR? I just tried it out, but it seems it has no effect in this case.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #13 from Oleg Endo 2011-11-02 00:15:44 UTC --- Created attachment 25684 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25684 Proposed patch to add QImode displacement addressing This should be better now. I have also put back support for the SH2A displacement addressing insns. The CSiBE set compiles fine for -m2a, -m4-single, -m4a-single, both big and little endian. However, it introduces new failures in the testsuite. I have examined a few of them and one common problem seems to be of this nature: udivmod4.c:59:1: error: insn does not satisfy its constraints: (insn 313 312 194 (set (reg:SI 150 fpul) (mem/c:SI (plus:SI (reg/f:SI 15 r15) (const_int 8 [0x8])) [0 %sfp+-72 S4 A32])) udivmod4.c:54 192 {movsi_ie} (nil)) udivmod4.c:59:1: internal compiler error: in extract_constrain_insn_cached, at recog.c:2052 .. when compiled with -O0 -m4-single -mb. This is mainly caused by the following added lines in sh_legitimate_index_p: + if ((GET_MODE_SIZE (mode) == 1) && (unsigned) INTVAL (op) < 16) +return true; Apparently this makes something believe that loading the FPUL register from a displacement address is possible, which is of course not the case. However, I can't see any connection there...
[Bug target/50749] SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 --- Comment #6 from Oleg Endo 2011-10-30 13:53:51 UTC --- (In reply to comment #1) > GCC makes usual mem accesses into those with post_inc/pre_dec at > auto_inc_dec pass. I guess that auto_inc_dec pass can't find > post_inc insns well in that case because other tree/rtl optimizers > tweak the code already. If this is the case, the problem would be > not target specific. Looks like so. For the simple test case... int test (char* p, int c) { int r = 0; r += *p++; r += *p++; r += *p++; return r; } ...the load insns are initially expanded as: (insn 2 5 3 2 (set (reg/v/f:SI 169 [ p ]) (reg:SI 4 r4 [ p ])) (nil)) (insn 7 6 8 3 (set (reg:SI 171) (sign_extend:SI (mem:QI (reg/v/f:SI 169 [ p ]) [0 *p_2(D)+0 S1 A8]))) (nil)) (insn 8 7 9 3 (set (reg:SI 172) (reg/v/f:SI 169 [ p ])) (nil)) (insn 9 8 10 3 (set (reg/f:SI 173) (plus:SI (reg/v/f:SI 169 [ p ]) (const_int 1 [0x1]))) (nil)) (insn 10 9 11 3 (set (reg:SI 174) (sign_extend:SI (mem:QI (reg/f:SI 173) [0 MEM[(char *)p_2(D) + 1B]+0 S1 A8]))) (nil)) (insn 13 12 14 3 (set (reg/f:SI 177) (plus:SI (reg/v/f:SI 169 [ p ]) (const_int 2 [0x2]))) (nil)) (insn 14 13 15 3 (set (reg:SI 178) (sign_extend:SI (mem:QI (reg/f:SI 177) [0 MEM[(char *)p_2(D) + 2B]+0 S1 A8]))) (nil)) The auto_inc_dec pass then fails to realize that (reg:SI 177) = (reg:SI 169) + 2 = (reg:SI 173) + 1. I wonder whether there might be something in the target code that suggests the early optimizers to do that? I've tried playing with the TARGET_ADDRESS_COST hook but it didn't have any effect in this case.
[Bug target/50694] SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 --- Comment #9 from Oleg Endo 2011-10-30 12:45:30 UTC --- (In reply to comment #8) > (In reply to comment #7) > > This problem doesn't require the theoretical/mathematical > completeness. There are many inappropriate combinations > of options which don't get any warning when running compiler > and configurations. The important thing is to warn very > confusing ones from the user's point of view. So your patch > in #6 or even one liner in #2 would be OK and enough for > this PR, I think. OK. I think the SH1 check should be left out completely in this case. If it is included, the default configuration (-m1 -mb) will fail to build e.g. newlib. I'd propose catching the SH2A case only, considering the user's default -mb / -ml configuration. If somebody configures the default to be -m2a -ml then it will fail to build, but I guess only very few people will try to do that anyway.
[Bug target/50749] SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 --- Comment #5 from Oleg Endo 2011-10-30 12:36:51 UTC --- (In reply to comment #4) > > I'm a bit curious to see what happens if they are changed > to non-zero for SI/DImode. ..not much actually. I've tried defining both as TARGET_SH1 and compared before/after CSiBE results. No code size changes at all. Also the micro test cases do not seem to be affected by that. Of course I might have missed some border case...
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #11 from Oleg Endo 2011-10-27 21:54:17 UTC --- Created attachment 25639 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25639 Stripped reload test case from jpeg-6b
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #10 from Oleg Endo 2011-10-27 21:10:47 UTC --- @@ -12430,6 +12453,10 @@ sh_secondary_reload (bool in_p, rtx x, r if (rclass != GENERAL_REGS && REG_P (x) && TARGET_REGISTER_P (REGNO (x))) return GENERAL_REGS; + if (rclass == GENERAL_REGS && mode == QImode + && MEM_P (x) && GET_CODE (XEXP (x, 0)) == PLUS + && CONST_INT_P (XEXP (XEXP (x, 0), 1)) + && INTVAL (XEXP (XEXP (x, 0), 1)) < 16) +return R0_REGS; return NO_REGS; } Makes the stripped down bszip2 test pass, but the same problem pops up in other CSiBE sources.
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #9 from Oleg Endo 2011-10-27 09:23:47 UTC --- Created attachment 25625 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25625 Stripped reload test case from bzip2 (In reply to comment #8) > > and RA chooses r1 and r0 as the registers to where memories will > be loaded. The problem is we have no direct way to load buf1[4] > to r1. In such situation, a secondary reload is needed. See > the description of TARGET_SECONDARY_RELOAD in the gcc manual. > Here is a trial: Thanks a lot! > > The ICE for your testcase went away with it, though I've got > [...] > spill_failure, at reload1.c:2118 > > when bootstrapping. So this passes the ball to the mov.b (R0,Rn),Rm insn... The same happens with bzip2 in CSiBE. I have attached a stripped down snippet that reproduces the problem. IRA's last words on that one are: Using reg 2 for reload 0 Spilling for insn 365. Using reg 2 for reload 0 Spilling for insn 371. Using reg 3 for reload 0 Spilling for insn 374. Using reg 2 for reload 0 Spilling for insn 377. Using reg 0 for reload 1 reload failure for reload 2 Reloads for insn # 377 Reload 0: reload_in (SI) = (plus:SI (reg/f:SI 15 r15) (const_int 16 [0x10])) GENERAL_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 2), can't combine reload_in_reg: (plus:SI (reg/f:SI 15 r15) (const_int 16 [0x10])) Reload 1: reload_in (SI) = (reg:SI 3 r3 [693]) R0_REGS, RELOAD_FOR_INPUT (opnum = 0) reload_in_reg: (reg:SI 3 r3 [693]) Reload 2: R0_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 2), can't combine, secondary_reload_p Reload 3: reload_in (QI) = (mem/c:QI (plus:SI (reg/f:SI 15 r15) (const_int 16 [0x10])) [6 %sfp+-60 S1 A32]) GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 2), can't combine reload_in_reg: (subreg:QI (reg/v:SI 177 [ curr ]) 0) secondary_in_reload = 2
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #7 from Oleg Endo 2011-10-26 23:07:08 UTC --- Created attachment 25622 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25622 asmcons and ira pass log for the reload failure of "z" insn constraint (In reply to comment #5) > It seems that clobbering R0 in that expander is simply papering > over the real problem. Yes, it is very much. > Although the reload issue beyonds me, > .ira dump file about that impossible insn which doesn't satisfy > the "z" constraint would be a starting point. I've been trying to make some sense out of it since, but I'm a bit clueless at the moment here. The following function is a reduced failure example: int fail (char* buf0, char* buf1) { char a,b,c; a = buf0[0] + buf1[1]; b = buf0[5] + buf1[4]; c = buf0[15] + buf1[14]; return a+b-c; } In the attached log the problematic pseudo reg is reg:QI 191, where it is reloaded into r1... Reloads for insn # 13 Reload 0: reload_in (QI) = (reg:QI 191 [ MEM[(char *)buf1_4(D) + 4B] ]) reload_out (SI) = (reg:SI 1 r1 [193]) GENERAL_REGS, RELOAD_OTHER (opnum = 0) Strange thing is that the following variant does not cause the error: int fail (char* buf0, char* buf1, char* out) { char a,b,c; a = buf0[0] + buf1[1]; b = buf0[5] + buf1[4]; c = buf0[15] + buf1[14]; *out = a+b-c; return *out; }
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #6 from Oleg Endo 2011-10-26 22:36:31 UTC --- Created attachment 25621 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25621 Experimental not working patch for mov.b with displacement addressing
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #4 from Oleg Endo 2011-10-23 21:56:56 UTC --- Created attachment 25582 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25582 Experimental patch for mov.b with displacement addressing > (In reply to comment #2) > > Welcome to the spill-failure-for-class-'R0_REGS' club :-) The attached patch is an experimental example only. Please ignore wrong formatting, comments, and the fact that it cripples some of the SH2A support :} It adds support for 'mov.b @(disp, Rm), R0' and 'mov.b R0, @(disp, Rn)' instructions. I haven't tested it fully, only building the CSiBE set for -m4-single -ml. Although it results in suboptimal code like ... ... mov.l@(4,r15),r2 movr1,r0 mov.br0,@(5,r2) rts movr1,r0! redundant .. or like ... mov.b@(1,r4),r0 movr0,r4 mov.b@(2,r5),r0 addr0,r4! better: add r4, r0 movr4,r0! not needed if add operands are swapped mov.br0,@(5,r6) rts movr4,r0! redundant .. it already shows some code size improvements: avg: -563.22 / -0.942376 % max: compiler22804 -> 22928 +124 / +0.543764 % min: OpenTCP-1.0.4 27069 -> 25989-1080 / -3.989804 % top 5 files mpeg2dec-0.3.1 libmpeg2/motion_comp 6044 -> 4796 -1248 / -20.648577 % libpng-1.2.5 pngrtran 19668 -> 18904 -764 / -3.884482 % linux-2.4.23-pre3-testplatform arch/testplatform/kernel/traps 6192 -> 5532 -660 / -10.658915 % lwip-0.5.3.preproc src/core/tcp_input 5424 -> 5040 -384 / -7.079646 % libmspack test/cabextract_md5 21780 -> 21424 -356 / -1.634527 % The R0 clobber in the movqi expander and the explicit usage of R0 in the splits effectively disable some optimizations, but this is the only thing I could get to work so far. I've left the straight forward but non-working patterns as comments in the patch as a reference. Basically, without the R0 clobber in the movqi expander it eventually ends up like that... error: insn does not satisfy its constraints: (insn 737 40 42 4 (set (reg:QI 10 r10) (mem/c:QI (plus:SI (reg:SI 1 r1 [386]) (const_int 1 [0x1])) [0 *D.4946_20+0 S1 A8])) {*movqi_m_reg_disp_load} (nil)) internal compiler error: in reload_cse_simplify_operands, at postreload.c:403 I'm puzzled why the register allocator ignores the constraint "z" when it starts to run out of registers. In the error case above it tries to produce something like 'mov.b @(1,r1),r10' which of course is impossible. Any hints are highly appreciated.
[Bug target/50814] New: SH Target: SHAD / SHLD instructions not used on SH2A
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50814 Bug #: 50814 Summary: SH Target: SHAD / SHLD instructions not used on SH2A Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh2a-*-* Although there are some insns (e.g. ashlsi3_sh2a) that are supposed to handle dynamic shifts on SH2A, somehow the dynamic shift instructions SHAD and SHLD are never generated, no matter what the shift amount is. int x_shad_right (int y) { return y >> 15; } mov.l.L6,r1 sts.lpr,@-r15 jsr@r1 nop movr4,r0 lds.l @r15+,pr rts/n .align 2 .L6: .long___ashiftrt_r4_15 int x_shad_left (int y) { return y << 15; } movr4,r0 shll8r0 shlrr0 rts shll8r0 Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20111020 (experimental) (GCC) It is also not clear to me why SH2A seems to require different handling for dynamic shifts than SH3 or SH4...
[Bug target/50694] SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 --- Comment #7 from Oleg Endo 2011-10-20 18:44:32 UTC --- (In reply to comment #6) > (In reply to comment #5) > > I'll send in a patch with a couple of other cosmetic changes later, OK? > > Please go for it. ..or maybe just leave it as it is :T The change I was suggesting opens up another problem with multilib and endian config / selection. I think instead of adding / implementing the endian restrictions it would be more useful to expand little endian support in binutils and drop the endian restrictions in gcc altogether once binutils fully support it. What do you think? Would that make more sense in the end?
[Bug target/50749] SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 --- Comment #3 from Oleg Endo 2011-10-19 00:00:01 UTC --- Kaz, do you happen to know why the following is defined in sh.h? #define USE_LOAD_POST_INCREMENT(mode)((mode == SImode || mode == DImode) \ ? 0 : TARGET_SH1) #define USE_STORE_PRE_DECREMENT(mode)((mode == SImode || mode == DImode) \ ? 0 : TARGET_SH1) Is there any (historical) reason to disable post-inc / pre-dec addressing for SImode / DImode? Thanks, Oleg
[Bug target/50694] SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 --- Comment #5 from Oleg Endo 2011-10-18 22:37:13 UTC --- (In reply to comment #4) > There are no real uses of SH1/SH2/SH2E/SH3E cores anymore, I think. True. But the SH7020 (SH1) is still listed on digikey for an amazing collector's price ;) > I agree that taking care of -m2e is not worth. Perhaps same for > -m1. Anyway, your change looks plausible to me. I'll send in a patch with a couple of other cosmetic changes later, OK?
[Bug target/50694] SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 --- Comment #3 from Oleg Endo 2011-10-18 21:33:06 UTC --- (In reply to comment #2) > Ah. One liner > > -#define DRIVER_SELF_SPECS "%{m2a:%{ml:%eSH2a does not support > little-endian}}" > +#define DRIVER_SELF_SPECS "%{m2a*:%{ml:%eSH2a does not support > little-endian}}" > > should work. Yep, does work. However, I've noticed that it will stop working when the compiler's default SH CPU and endian configuration is not -m1 -mb. The code below instead gets the job done but I'm not sure whether it could be written in a better way. I've also noticed, that SH1 also supports big endian only, and -m1 -ml is explicitly excluded in the multilib config in t-sh. #if TARGET_ENDIAN_DEFAULT == MASK_BIG_ENDIAN #define IS_LITTLE_ENDIAN_OPTION "%{ml:" #else #define IS_LITTLE_ENDIAN_OPTION "%{!mb:" #endif #if TARGET_CPU_DEFAULT == SELECT_SH1 #define UNSUPPORTED_SH1 IS_LITTLE_ENDIAN_OPTION \ "%{m1|!m2*:%{!m3*:%{!m4*:%{!m5*:%eSH1 does not support little-endian}" #else #define UNSUPPORTED_SH1 IS_LITTLE_ENDIAN_OPTION \ "%{m1:%eSH1 does not support little-endian}}" #endif #if TARGET_CPU_DEFAULT & MASK_HARD_SH2A #define UNSUPPORTED_SH2A IS_LITTLE_ENDIAN_OPTION \ "%{m2a*|!m1:%{!m2*:%{!m3*:%{!m4*:{!m5*:%eSH2a does not support little-endian}}" #else #define UNSUPPORTED_SH2A IS_LITTLE_ENDIAN_OPTION \ "%{m2a*:%eSH2a does not support little-endian}}" #endif #define DRIVER_SELF_SPECS UNSUPPORTED_SH1, UNSUPPORTED_SH2A Actually, SH2E also supports big endian only as per public specification. But I don't think that it would cause any trouble for anyone. Of course it could be excluded as well. :)
[Bug target/50751] SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 --- Comment #2 from Oleg Endo 2011-10-17 00:37:42 UTC --- (In reply to comment #1) > This is a known issue. See the comment just before > sh.c:sh_legitimate_index_p. > Unfortunately, I guess this PR might be marked as WONTFIX. Yeah, I know this has been around for a while. I'd like to take my chances anyway :) Cheers, Oleg
[Bug target/50751] New: SH Target: Displacement addressing does not work for QImode and HImode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50751 Bug #: 50751 Summary: SH Target: Displacement addressing does not work for QImode and HImode Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Displacement addressing is only used for SImode but not for QImode nor HImode. The following example summarizes the problem: struct X { chara, b, c, d; short e, f; int g, h; }; int test_func_4 (X* x) { return x->b + x->e + x->g; } compiled with: -Os -m4-single -ml -S movr4,r1 add#1,r1 mov.b@r1,r0 add#3,r1 mov.w@r1,r1 addr1,r0 mov.l@(8,r4),r1 rts addr1,r0 would be better as: mov.b@(1,r4),r0 movr0,r1 mov.w@(4,r4),r0 addr0,r1 mov.l@(8,r4),r0 addr1,r0 rts nop Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20111016 (experimental) (GCC)
[Bug target/50750] New: SH Target: Pre-decrement addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50750 Bug #: 50750 Summary: SH Target: Pre-decrement addressing used only for first memory access Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Pre-decrement addressing is generated only for the first memory access. Any subsequent memory access does not use pre-decrement addressing. The following is a reduced example. The problem exists with any number of memory accesses > 1 and at any optimization level. int test_func_2_1 (int* p, int c) { int r = 0; do { r += *--p; r += *--p; r += *--p; } while (--c); return r; } compiled with -fno-tree-loop-optimize -Os -m4-single -ml: mov#0,r0 .L11: movr4,r1 add#-64,r1 mov.l@(60,r1),r2 dtr5 add#-12,r4 addr2,r0 mov.l@(56,r1),r2 mov.l@(52,r1),r1 addr2,r0 bf/s.L11 addr1,r0 rts nop would be better as: mov#0,r0 .L11: mov.l@-r4,r1 dtr5 mov.l@-r4,r2 addr1,r0 mov.l@-r4,r3 addr2,r0 bf/s.L11 addr3,r0 rts nop Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20111016 (experimental) (GCC)
[Bug target/50749] New: SH Target: Post-increment addressing used only for first memory access
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 Bug #: 50749 Summary: SH Target: Post-increment addressing used only for first memory access Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de CC: kkoj...@gcc.gnu.org Target: sh*-*-* Post-increment addressing is generated only for the first memory access. Any subsequent memory access does not use post-increment addressing. The following two functions are reduced examples and result in the same code being generated. The problem exists with any number of memory accesses > 1 and at any optimization level. int test_0 (char* p, int c) { int r = 0; r += *p++; r += *p++; r += *p++; return r; } int test_1 (char* p, int c) { int r = 0; r += p[0]; r += p[1]; r += p[2]; return r; } compiled with -fno-ivopts -Os -m4-single -ml ... movr4,r1 mov.b@r1+,r0 add#2,r4 mov.b@r1,r1 addr1,r0 mov.b@r4,r1 rts addr1,r0 This could be done better ... mov.b@r4+,r0 mov.b@r4+,r1 addr1,r0 mov.b@r4+,r1 rts addr1,r0 Another example with a loop: int test_func_1 (char* p, int c) { int r = 0; do { r += *p++; r += *p++; } while (--c); return r; } compiled with -fno-ivopts -Os -m4-single -ml ... mov#0,r0 .L5: movr4,r1 mov.b@r1+,r2 dtr5 mov.b@r1,r1 addr2,r0 add#2,r4 bf/s.L5 addr1,r0 rts nop would be better as: mov#0, r0 .L5: mov.b@r4+, r1 dtr5 mov.b@r4+, r2 addr1, r0 bf/s.L5 addr2, r0 rts nop Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20111016 (experimental) (GCC)
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 Oleg Endo changed: What|Removed |Added Attachment #24412|0 |1 is obsolete|| --- Comment #11 from Oleg Endo 2011-10-13 22:54:54 UTC --- Created attachment 25491 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25491 Proposed patch including test case > > 3) only zero_extract special cases > > looks to be dominant. Yes. I've briefly looked through the test sources. A popular use case for bit test instructions seem to be single bit tests, which the patch is basically adding. > I see. I also expect that the experts have some idea for > this issue. Hm .. http://gcc.gnu.org/ml/gcc/2011-10/msg00189.html Eric pointed me to the i386 back-end. Unfortunately, what I found there is where I originally started... ;; Combine likes to form bit extractions for some tests. Humor it. I.e. it is also coded against the behavior of the combine pass with a bunch of pattern variations. I guess that's the way it's supposed to be done then :T > I don't think that it's too much. Those numbers can be easily > collected for CSiBE. If your patterns are named, you could > simply add "-dap -save-temps" to the compiler option which is > specified when ruining CSiBE's create-config and then get > the occurrences of testsi_6, for example, with something like > grep "testsi_6" `find . -name "*.s" -print` | wc -l > after running the CSiBE size test. Ah, right! With the attached latest patch applied to trunk rev 179778 the numbers for "-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return" look something like that: tstsi_t: 1391 tsthi_t: 4 tstqi_t: 23 tstqi_t_zero: 667 tstsi_t_and_not: 598 tstsi_t_zero_extract_eq: 70 tstsi_t_zero_extract_xor: 923 Notice that the split contributes to the tstsi_t number. Also, the 3 patterns tstsi_t_zero_extract_xor tstsi_t_zero_extract_subreg_xor_little tstsi_t_zero_extract_subreg_xor_big are basically one and the same. On SH4A the subreg variants are required, because tstsi_t_zero_extract_xor will never match. I've also added a special case to sh_rtx_costs to detect at least the tstsi_t pattern. However, the other patterns are not really covered by that and the combine pass calculates the cost as a sum of all the operations of the pattern. I guess the selection of the test instruction could be stimulated a bit more by a more accurate costs calculation, but my feeling is that it won't do a lot. Cheers, Oleg
[Bug target/50694] SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 --- Comment #1 from Oleg Endo 2011-10-13 19:54:31 UTC --- As it turns out, this is already handled by the line #define DRIVER_SELF_SPECS "%{m2a:%{ml:%eSH2a does not support little-endian}}" in sh.h. However, it doesn't catch -m2a-nofpu, -m2a-single, -m2a-single-only options.
[Bug target/50694] New: SH Target: SH2A little endian does not actually work
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50694 Bug #: 50694 Summary: SH Target: SH2A little endian does not actually work Classification: Unclassified Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de While testing a patch I have noticed some failures for SH2A and -ml. The problem is that GCC happily generates SH2A code in little endian, but when the output is assembled with GAS and it contains a "movi20" instruction, GAS will abort at some "is big endian?" check. Maybe for SH2A the -ml option should either output a warning and do big endian, or just abort with an error message? As far as I know, the SH2A does big endian only anyway... Cheers, Oleg
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #9 from Oleg Endo 2011-10-10 23:48:17 UTC --- Created attachment 25461 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25461 CSiBE comparisons (In reply to comment #8) > > Another combine pass to reduce size less than 0.3% on one target > would be not acceptable, I guess. I'm sorry, I forgot to mention that it was just a proof of concept hack of mine, just to see whether it has any chance to work at all. I think it would be better to change/fix the behavior of the combine pass in this regard, so that it tries matching combined patterns without sophisticated transformations. I will try asking on the gcc list about that. > ~10 new patterns would be > overkill for that result, though I'm still expecting that a few > patterns of them were dominant. Yep, even if it turned out to be actually only 8 patterns in total, but still.. I haven't looked at the issue with SH4A but most likely it would add another one or two patterns... so basically ~10 :) > Could you get numbers which pattern > was used in the former option? I think it would be a bit too much checking out each individual pattern. Instead I grouped them into what they are effectively doing. While I was at it, I also added your peephole idea, and a top 10 listing of the individual files.
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #7 from Oleg Endo 2011-10-09 23:34:45 UTC --- (In reply to comment #6) > Yep, maintenance burden but I don't mean ack/nak for anything. > If it's enough fruitful, we should take that route. When it > gives 5% improvement in the usual working set like as CSiBE, > hundreds lines would be OK, but if it's ~0.5% or less, it doesn't > look worth to add many patterns for that. > > > Isn't there a way to tell the combine pass not to do so, but instead first > > look > > deeper at what is in the MD? > > I don't know how to do it cleanly. > I've tried out a couple of things and got some CSiBE numbers based on trunk rev 179430. Unfortunately only code size comparisons, no run time performance numbers. The tests were compiled with -ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return Option 1) Use many (~10) patterns in the MD and some cost calculation tuning. The last patch required some adaptation, because the combine pass started trying to match things slightly differently. I've also noticed that it requires a special case for one pattern on SH4A... size of all modules: 2916390 -> 2909026-7364 / -0.252504 % avg difference over all modules: -409.11 / -0.273887 % max: compiler 22808 -> 22812 +4 / +0.017538 % min: libpng-1.2.5 99120 -> 97804-1316 / -1.327684 % Option 2) I've added another combine pass which has the make_compound_operation function turned off. The make_compound_operation function is used to produce zero_extract patterns. If the resulting "simplified" pattern does not match anything in the MD, combine reverts the transformation and proceeds with the next insn. That way, it never tries to match the tst #imm pattern in the MD. With this option only ~5 patterns seem to be required and a small extension of the costs calculation. size of all modules: 2916390 -> 2909170-7220 / -0.247566 % avg difference over all modules: -401.11 / -0.254423 % max: compiler 22808 -> 22812 +4 / +0.017538 % min: libpng-1.2.5 99120 -> 97836-1284 / -1.295400 % Not so spectacular on average. It highly depends on the type of SW being compiled, but it hits quite a lot of files in CSiBE. Option 2 seems more robust even if it seems less effective, what do you think?
[Bug target/49468] SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 Oleg Endo changed: What|Removed |Added Attachment #24625|0 |1 is obsolete|| --- Comment #7 from Oleg Endo 2011-09-25 12:48:24 UTC --- Created attachment 25360 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25360 Proposed patch The last version of the patch fails the test gcc.c-torture/execute/arith-rand-ll.c for -m2a-single -mb and multiple optimization levels with the following error: internal compiler error: in change_address_1, at emit-rtl.c:1994 The attached version fixes some of the failures but still fails the test above with -m2a-single -mb -O2. Other optimization levels work fine. The problem is caused by the define_insn_and_split "*abssi2". It even fails if the "*abssi2" splits into nothing but a simple register copy (movdi) or comparison insn. I'm now testing the patch without the DI abs parts and will submit it if passes without new failures. Cheers, Oleg
[Bug target/49468] SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 Oleg Endo changed: What|Removed |Added Attachment #24603|0 |1 is obsolete|| --- Comment #6 from Oleg Endo 2011-06-29 01:36:46 UTC --- Created attachment 24625 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24625 Proposed patch Added missing constraints to define_insn_and_split. Fixed the formatting (hopefully ... there seems to be a problem with the tabs. I'm using 4 spaces wide tabs)
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #5 from Oleg Endo 2011-06-26 22:30:05 UTC --- > Yes, that peephole doesn't catch all the patterns which could > make tst #imm8,r0 use. Perhaps it would be a good idea to get > numbers for the test like CSiBE test with the vanilla and new > insns/peepholes patched compilers. Something covers 80% of > the possible cases in the usual working set, it would be enough > successful for such a micro-optimization, I guess. I'd also like to see some numbers on those micro-improvements. I'll have a look at CSiBE. Anyway, why not just add all the currently known-to-work cases? What are your concerns regarding that? I can imagine that it is a maintenance burden to keep all those definitions and special cases in the MD up-to-date (bit rot etc). Do you have anything other than that in mind? It seems that others have been trying to solve the same problem in a very similar way: http://patchwork.ozlabs.org/patch/58832/ ;) I've figured that the following pattern works quite well for this particular case. It seems to catch all the bit patterns. (sh_tst_const_ok and sh_zero_extract_val are added functions in sh.c) (define_insn_and_split "*tstsi3" [(set (reg:SI T_REG) (zero_extract:SI (match_operand:SI 0 "arith_reg_operand" "") (match_operand:SI 1 "const_int_operand") (match_operand:SI 2 "const_int_operand")))] "TARGET_SH1 && sh_tst_const_ok (sh_zero_extract_val (operands[1], operands[2]))" "#" "&& 1" [(const_int 0)] "{ int tstval = sh_zero_extract_val (operands[1], operands[2]); emit_insn (gen_tstsi (operands[0], GEN_INT (tstval))); DONE; }" ) ...however, the problem with that one is that the T bit is inverted afterwards, and thus the following branch logic (or whatever) gets inverted as well. One option could be to emit some T bit inversion insn after gen_tstsi and then optimize it away later on with e.g. a peephole (?), but I haven't tried that yet. The actual "problem" is that the combine pass first sees the andsi, eq, ... insns and correctly matches them to those in the MD. But instead of continuing to match patterns from the MD it expands the and insn into something like zero_extract, which in turn will never be combined back into the tst insn. Isn't there a way to tell the combine pass not to do so, but instead first look deeper at what is in the MD? Regarding the peephole: > (satisfies_constraint_I08 (operands[1]) > || satisfies_constraint_K08 (operands[1]) I guess this might generate wrong code for e.g. "if (x & -2)". When x has any bits[31:1] set this must return true. The code after the peephole optimization will look only at the lower 8 bits and would possibly return false for x = 0xFF00, which is wrong. So it should be satisfies_constraint_K08 only, shouldn't it? > > Cost patch looks fine to me. Could you propose it as a separate > patch on gcc-patches list with an appropriate ChangeLog entry? > When proposing it, please refer how you've tested it. Also > the numbers got with the patch are highly welcome. > > BTW, do you have FSF copyright assignment for your GCC work? > Although the cost patch itself is essentially several lines which > doesn't require copyright assignment, the other changes you've > proposed clearly require the paper work, I think. I'll contact you directly regarding that.
[Bug target/49468] SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 Oleg Endo changed: What|Removed |Added Attachment #24560|0 |1 is obsolete|| --- Comment #4 from Oleg Endo 2011-06-26 20:48:19 UTC --- Created attachment 24603 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24603 Proposed patch Thanks for checking. I missed to tell the expanders that the T bit is clobbered by insns such as negdi2 and abssi2 / absdi2. The negdi2 expander that I have changed caused the libstdc++ tests to fail when formatting a -1LL. Another (reduced) example: long long x (long long i, int j, long long k) { if (j & 5) return -i; return -k; } ..ended up as (the "j & 5" got lost)... mov.l@(4,r15),r1 clrt mov.l@r15,r2 negcr1,r1 negcr2,r0 bt/s.L8 clrt negcr5,r1 negcr4,r0 .L8: rts nop
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #3 from Oleg Endo 2011-06-19 16:42:01 UTC --- Thanks for having a look at it Kaz. Yes, the fiddling with the combine pass is fragile. I've tried out your peephole idea. It works in most cases but not all the time. E.g. it doesn't catch the following example: int test_imm (int i) { return (i & 3) ? 20 : 40; } mov#3,r1 andr1,r4 tstr4,r4 bt/s.L5 mov#40,r0 mov#20,r0 .L5: rts nop Also the following... int test_imm (int* i) { return (*i & 255) ? 20 : 40; } becomes on little endian (OK): mov.b@r4,r1 tstr1,r1 bt/s.L10 mov#40,r0 mov#20,r0 .L10: rts nop but on big endian (NG): mov.l@r4,r1 extu.br1,r1 tstr1,r1 bt/s.L10 mov#40,r0 mov#20,r0 .L10: rts nop I'll give the thing another try. Regarding the change to the andcosts function, the following should be better: --- sh.c(revision 175188) +++ sh.c(working copy) @@ -243,7 +243,7 @@ static int flow_dependent_p (rtx, rtx); static void flow_dependent_p_1 (rtx, const_rtx, void *); static int shiftcosts (rtx); -static int andcosts (rtx); +static int and_xor_ior_costs (rtx, int code); static int addsubcosts (rtx); static int multcosts (rtx); static bool unspec_caller_rtx_p (rtx); @@ -2841,14 +2841,14 @@ return shift_insns[value]; } -/* Return the cost of an AND operation. */ +/* Return the cost of an AND/XOR/IOR operation. */ static inline int -andcosts (rtx x) +and_xor_ior_costs (rtx x, int code) { int i; - /* Anding with a register is a single cycle and instruction. */ + /* register x register is a single cycle instruction. */ if (!CONST_INT_P (XEXP (x, 1))) return 1; @@ -2864,17 +2864,17 @@ } /* These constants are single cycle extu.[bw] instructions. */ - if (i == 0xff || i == 0x) + if ((i == 0xff || i == 0x) && code == AND) return 1; - /* Constants that can be used in an and immediate instruction in a single + /* Constants that can be used in an instruction with an immediate are a single cycle, but this requires r0, so make it a little more expensive. */ if (CONST_OK_FOR_K08 (i)) return 2; - /* Constants that can be loaded with a mov immediate and an and. + /* Constants that can be loaded with a mov immediate need one more cycle. This case is probably unnecessary. */ if (CONST_OK_FOR_I08 (i)) return 2; - /* Any other constants requires a 2 cycle pc-relative load plus an and. + /* Any other constant requires an additional 2 cycle pc-relative load. This case is probably unnecessary. */ return 3; } @@ -3043,7 +3043,9 @@ return true; case AND: - *total = COSTS_N_INSNS (andcosts (x)); +case XOR: +case IOR: + *total = COSTS_N_INSNS (and_xor_ior_costs (x, code)); return true; case MULT:
[Bug target/49468] SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 --- Comment #2 from Oleg Endo 2011-06-19 15:48:41 UTC --- Created attachment 24561 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24561 Before/After Examples
[Bug target/49468] SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 --- Comment #1 from Oleg Endo 2011-06-19 15:29:45 UTC --- Created attachment 24560 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24560 Proposed patch The patch adds explicit handling of abs:SI and abs:DI to the machine description instead of relying on the default abs handling. On SH4 the zero-offset branch way is definitely faster. It reduces pressure on EX group instructions and improves parallel instruction execution. On other than SH4 zero-offset branches are not as fast, but the resulting code should still be faster than the default branch-free abs code, at least it is more compact. The patch also handles the case of neg (abs (...)) by simply inverting the branch condition. Feedback appreciated.
[Bug target/49468] New: SH Target: inefficient integer abs code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49468 Summary: SH Target: inefficient integer abs code Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de The generated code for abs:SI and abs:DI is a bit inefficient: int abs (int i) { return (i < 0) ? -i : i; } movr4,r1 shllr1 subcr1,r1 movr1,r0 xorr4,r0 rts subr1,r0 long long abs (long long i) { return (i < 0) ? -i : i; } movr4,r3 shllr3 subcr3,r3 movr5,r1 xorr3,r1 movr3,r0 clrt xorr4,r0 subcr3,r1 rts subcr3,r0 There is a define_split in sh.md which is supposed to handle the special case for SH4 but it is not doing anything. The problem has been around since a couple of GCC 4.x versions. sh-elf-gcc -v Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib : (reconfigured) ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib : (reconfigured) ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib : (reconfigured) ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib target_alias=sh-elf CFLAGS=-Os CXXFLAGS=-Os --enable-languages=c,c++,lto --no-create --no-recursion Thread model: single gcc version 4.7.0 20110619 (experimental) (GCC)
[Bug target/49305] New: SH Target: internal compiler error: in reload_cse_simplify_operands, at postreload.c:403
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49305 Summary: SH Target: internal compiler error: in reload_cse_simplify_operands, at postreload.c:403 Product: gcc Version: 4.7.0 Status: UNCONFIRMED Severity: major Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de When building newlib 1.19.0 it fails to build the SH2A lib: sh-elf-gcc -B/home/y/code/gcc/newlib-1.19.0-build-sh-elf/sh-elf/m2a/newlib/ -isystem /home/y/code/gcc/newlib-1.19.0-build-sh-elf/sh-elf/m2a/newlib/targ-include -isystem /home/y/code/gcc/newlib-1.19.0/newlib/libc/include -B/home/y/code/gcc/newlib-1.19.0-build-sh-elf/sh-elf/m2a/libgloss/sh -L/home/y/code/gcc/newlib-1.19.0-build-sh-elf/sh-elf/m2a/libgloss/libnosys -L/home/y/code/gcc/newlib-1.19.0/libgloss/sh -m2a -DPACKAGE_NAME=\"newlib\" -DPACKAGE_TARNAME=\"newlib\" -DPACKAGE_VERSION=\"1.19.0\" -DPACKAGE_STRING=\"newlib\ 1.19.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -I. -I../../../../../../newlib-1.19.0/newlib/libc/stdio -fno-builtin -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return -m2a -c ../../../../../../newlib-1.19.0/newlib/libc/stdio/vfwscanf.c -o lib_a-vfwscanf.o ../../../../../../newlib-1.19.0/newlib/libc/stdio/vfwscanf.c: In function '__svfwscanf_r': ../../../../../../newlib-1.19.0/newlib/libc/stdio/vfwscanf.c:1454:1: error: insn does not satisfy its constraints: (insn 2358 4549 2359 471 (set (reg:SI 0 r0 [1023]) (sign_extend:SI (mem/s:HI (plus:SI (reg:SI 2 r2) (const_int 12 [0xc])) [5 fp_167(D)->_flags+0 S2 A32]))) ../../../../../../newlib-1.19.0/newlib/libc/stdio/vfwscanf.c:1447 164 {*extendqisi2_compact} (nil)) ../../../../../../newlib-1.19.0/newlib/libc/stdio/vfwscanf.c:1454:1: internal compiler error: in reload_cse_simplify_operands, at postreload.c:403 sh-elf-gcc -v Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper Target: sh-elf Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.7.0 20110606 (experimental) (GCC)
[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 --- Comment #1 from Oleg Endo 2011-06-01 20:42:00 UTC --- Created attachment 24412 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24412 Proposed Patch Although the patch gets the job done, programmer's sense tells me it is fishy, or at least pretty much brute forced cure of the symptoms, not the cause. It's my first GCC patch, so any feedback is highly appreciated. What I did was looking at the RTL, in particular the combine pass, identifying patterns it failed to find a "shortcut" (tst insn) for and adding them to the insn descriptions. I also had to expand the costs calculation of the AND instruction to cover AND, OR and XOR (on SH they are the same anyways), or else the cost of a matched replacement insn would result in a rejection in the combine pass.
[Bug target/49263] New: SH Target: underutilized "TST #imm, R0" instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263 Summary: SH Target: underutilized "TST #imm, R0" instruction Product: gcc Version: 4.6.1 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: oleg.e...@t-online.de Created attachment 24411 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24411 test for various integer types and constant values 0...255 The "TST #imm, R0" instruction is a bit underutilized on SH targets. For some bit patterns of the immediate constant it tries to extract the bits in question by various means and test the result against zero/non-zero and misses the straight forward instruction. In particular: * one single bit * n contiguous bits starting at bit 0 When testing a byte against 0x80 it uses "CMP/PZ", which is as good as a "TST" instruction (in terms of costs), so this is OK. I've spotted this in version around 4.4. It still happens with 4.6 and the following config: Using built-in specs. COLLECT_GCC=sh-elf-gcc COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.6.1/lto-wrapper Target: sh-elf Configured with: ../gcc-4.6-20110527/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib Thread model: single gcc version 4.6.1 20110527 (prerelease) (GCC)