Hi Maciej, I'm glad to hear from you again!
> I think that should be a GAS warning really (similarly to macros that > expand to multiple instructions in a delay slot) as people ought to be > allowed to do what they wish, and then `-Werror' can be used for code > quality enforcement (and possibly disabled on a case-by-case basis). How can one reasonably prevent the bug when compiling a whole Linux distribution with thousands of packages if GAS merely warns and proceeds in many cases? > > [ In theory, GAS could actually insert NOPs prior to the noreorder > > directive, > > to make the loop longer that six instructions, but GAS does not have that > > kind of capability. Another option that GCC prevents is to place a NOP in > > the delay slot. ] > > Well, GAS does have that capability, although of course it is not enabled > for `noreorder' code. Hmm? I'm unable to make sense of that, since such NOPs are unaffected by noreorder. This is what I meant: loop: addiu $5,$5,1 addiu $4,$4,1 lb $2,-1($5) nop /* These two NOPs would prevent the */ nop /* bug while preserving noreorder. */ .set noreorder .set nomacro bne $2,$3,loop sb $2,-1($4) .set macro .set reorder > For generated code I think however that usually it > will be cheaper performance-wise if a non-trivial delay-slot instruction > is never produced in the affected cases (i.e. a dummy NOP is always used). I suppose so too. One observation is that R5900 BogoMIPS went down from 584 to 195 when switching from the generic kernel variant .set noreorder 1: bnez a0,1b addiu a0,a0,-1 .set reorder that is undefined for R5900 in arch/mips/lib/delay.c, to a special variant beqz a0,2f 1: addiu a0,a0,-1 bnez a0,1b 2: for the R5900 where GAS will place a NOP in the branch delay slot. With loop unrolling BogoMIPS could be inflated again, of course. In general, unrolling is a bit better for the R5900 than general MIPS. Perhaps GCC could be informed about that, possibly via profile-guided optimisation. > > A reasonable fix for GCC is perhaps to update gcc/config/mips/mips.md to not > > make explicit use of the branch delay slot, as suggested by the patch below? > > Then GCC will emit > > > > loop: addiu $5,$5,1 > > addiu $4,$4,1 > > lb $2,-1($5) > > sb $2,-1($4) > > bne $2,$3,loop > > > > that GAS will adjust in the ELF object to > > > > 4: 24a50001 addiu a1,a1,1 > > 8: 24840001 addiu a0,a0,1 > > c: 80a2ffff lb v0,-1(a1) > > 10: a082ffff sb v0,-1(a0) > > 14: 1443fffb bne v0,v1,4 > > 18: 00000000 nop > > > > where a NOP is placed in the delay slot to avoid the bug. For longer loops, > > this relies on GAS making appropriate use of the delay slot. I'm not sure if > > GAS is good at that, though? > > I'm sort-of surprised that GCC has produced `reorder' code here, making > it rely on GAS for delay slot scheduling. Have you used an unusual set of > options that prevents GCC from making `noreorder' code, which as I recall > is the usual default (not for the MIPS16 mode IIRC, as well as some > obscure corner cases)? I used the suggested patch, and recompiled the kernel with it, and verified with the machine code tool that there were no undefined loops. Wouldn't that be enough? > > diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md > > index e17b1d522f0..acd31a8960c 100644 > > --- a/gcc/config/mips/mips.md > > +++ b/gcc/config/mips/mips.md > > @@ -1091,6 +1091,7 @@ > > > > (define_delay (and (eq_attr "type" "branch") > > (not (match_test "TARGET_MIPS16")) > > + (not (match_test "TARGET_FIX_R5900")) > > (eq_attr "branch_likely" "yes")) > > [(eq_attr "can_delay" "yes") > > (nil) > > @@ -1100,6 +1101,7 @@ > > ;; not annul on false. > > (define_delay (and (eq_attr "type" "branch") > > (not (match_test "TARGET_MIPS16")) > > + (not (match_test "TARGET_FIX_R5900")) > > (ior (match_test "TARGET_CB_NEVER") > > (and (eq_attr "compact_form" "maybe") > > (not (match_test "TARGET_CB_ALWAYS"))) > > > > I think you need to modify the default `can_delay' attribute definition > instead (in the same way). I tried to limit the case to branch delays only, which is a rough approximation. Call and jump delay slots are acceptable. Are you referring to this piece? ;; Can the instruction be put into a delay slot? (define_attr "can_delay" "no,yes" (if_then_else (and (eq_attr "type" "!branch,call,jump") (eq_attr "hazard" "none") (match_test "get_attr_insn_count (insn) == 1")) (const_string "yes") (const_string "no"))) > An improved future version might determine the > exact conditions as noted in your proposed commit description, however I'd > suggest making this simple change first. Learning the exact conditions, in a hardware sense, would be interesting indeed, as some short loops actually do appear to work despite being labeled as undefined. A fairly deep understanding of the R5900 implementation is essential for such an undertaking. :) Fredrik