Re: MIPS RDHWR instruction reordering
On Mon, 24 Jul 2006 19:08:41 +0100, Richard Sandiford [EMAIL PROTECTED] wrote: You can set up DejaGNU's unix.exp to use rsh and rcp. It's usually a case of creating a board file foo.exp like this: ... and then running with make -k check RUNTESTFLAGS='--target_board foo'. FWIW, I can run a gcc test run for you on mips64-linux-gnu (and perhaps glibc too, depending on the time). I should have the results by the weekend. Good advice, thanks! I'll setup the environment. It might take some time since I'm not familiar with dejagnu, so your results will be appreciated. +; Since rdhwr always generate a trap for now, it should not be be put +; on delay slot. It it was on delay slot, the emulation will be +; slower. Andreas has already commented on this, but the tense sounds wrong too. Maybe it would be better to have something like: ; Putting rdhwr in a delay slot would make the kernel's emulation ; of it much slower. right above the can_delay line. Whatever you feel is best though. Thank you. Here is a updated patch with your suggestions. 2006-07-25 Atsushi Nemoto [EMAIL PROTECTED] partial PR target/28126 * config/mips/mips.md (tls_get_tp_mode): Set can_delay to no. Index: gcc/config/mips/mips.md === --- gcc/config/mips/mips.md (revision 115735) +++ gcc/config/mips/mips.md (working copy) @@ -5458,6 +5458,9 @@ HAVE_AS_TLS !TARGET_MIPS16 .set\tpush\;.set\tmips32r2\t\;rdhwr\t%0,$29\;.set\tpop [(set_attr type unknown) +; Since rdhwr always generates a trap for now, putting it in a delay +; slot would make the kernel's emulation of it much slower. + (set_attr can_delay no) (set_attr mode MODE)]) ; The MIPS Paired-Single Floating Point and MIPS-3D Instructions.
Re: MIPS RDHWR instruction reordering
On 22 Jul 2006 20:58:16 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: OK, patch is approved, with a ChangeLog entry, and assuming it passes the testsuite. Thanks, here is a patch with a ChangeLog entry. I can cross-build gcc and glibc successfully, but unfortunately I can not build and run the testsuite natively (in reasonable time) due to limited CPU/memory resources on my target platform. Is there good way to run testsuite on cross environment? 2006-07-24 Atsushi Nemoto [EMAIL PROTECTED] partial PR target/28126 * config/mips/mips.md (tls_get_tp_mode): Set can_delay to no. Index: gcc/config/mips/mips.md === --- gcc/config/mips/mips.md (revision 115370) +++ gcc/config/mips/mips.md (working copy) @@ -5450,6 +5450,9 @@ ; MIPS 32r2 specification, but we use it on any architecture because ; we expect it to be emulated. Use .set to force the assembler to ; accept it. +; Since rdhwr always generate a trap for now, it should not be be put +; on delay slot. It it was on delay slot, the emulation will be +; slower. (define_insn tls_get_tp_mode [(set (match_operand:P 0 register_operand =v) @@ -5458,6 +5461,7 @@ HAVE_AS_TLS !TARGET_MIPS16 .set\tpush\;.set\tmips32r2\t\;rdhwr\t%0,$29\;.set\tpop [(set_attr type unknown) + (set_attr can_delay no) (set_attr mode MODE)]) ; The MIPS Paired-Single Floating Point and MIPS-3D Instructions.
Re: MIPS RDHWR instruction reordering
Atsushi Nemoto [EMAIL PROTECTED] writes: Index: gcc/config/mips/mips.md === --- gcc/config/mips/mips.md (revision 115370) +++ gcc/config/mips/mips.md (working copy) @@ -5450,6 +5450,9 @@ ; MIPS 32r2 specification, but we use it on any architecture because ; we expect it to be emulated. Use .set to force the assembler to ; accept it. +; Since rdhwr always generate a trap for now, it should not be be put s/generate/s/ +; on delay slot. It it was on delay slot, the emulation will be on a (or in a?), s/It/If/ Andreas. -- Andreas Schwab, SuSE Labs, [EMAIL PROTECTED] SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 And now for something completely different.
Re: MIPS RDHWR instruction reordering
Thanks for tackling this. Atsushi Nemoto [EMAIL PROTECTED] writes: On 22 Jul 2006 20:58:16 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: OK, patch is approved, with a ChangeLog entry, and assuming it passes the testsuite. Thanks, here is a patch with a ChangeLog entry. I can cross-build gcc and glibc successfully, but unfortunately I can not build and run the testsuite natively (in reasonable time) due to limited CPU/memory resources on my target platform. Is there good way to run testsuite on cross environment? You can set up DejaGNU's unix.exp to use rsh and rcp. It's usually a case of creating a board file foo.exp like this: load_generic_config unix process_multilib_options set_board_info hostname You can use ssh and scp instead by adding: set_board_info rsh_prog ssh set_board_info rcp_prog rcp and then running with make -k check RUNTESTFLAGS='--target_board foo'. FWIW, I can run a gcc test run for you on mips64-linux-gnu (and perhaps glibc too, depending on the time). I should have the results by the weekend. +; Since rdhwr always generate a trap for now, it should not be be put +; on delay slot. It it was on delay slot, the emulation will be +; slower. Andreas has already commented on this, but the tense sounds wrong too. Maybe it would be better to have something like: ; Putting rdhwr in a delay slot would make the kernel's emulation ; of it much slower. right above the can_delay line. Whatever you feel is best though. Richard
Re: MIPS RDHWR instruction reordering
On 21 Jul 2006 10:06:34 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: I also don't see why revision 108713 would affect this. But I do note that this version is still bad. The rdhwr instruction is in the branch delay slot, and is therefore always executed. Yes, and I think rdhwr should not be in delay slot anyway. Just avoiding delay slot is quite easy. Here is a proposal patch. With this patch, it seems gcc 4.2 generate desired code for now, though there might be somewhere to fix. Index: gcc/config/mips/mips.md === --- gcc/config/mips/mips.md (revision 115370) +++ gcc/config/mips/mips.md (working copy) @@ -5450,6 +5450,9 @@ ; MIPS 32r2 specification, but we use it on any architecture because ; we expect it to be emulated. Use .set to force the assembler to ; accept it. +; Since rdhwr always generate a trap for now, it should not be be put +; on delay slot. It it was on delay slot, the emulation will be +; slower. (define_insn tls_get_tp_mode [(set (match_operand:P 0 register_operand =v) @@ -5458,6 +5461,7 @@ HAVE_AS_TLS !TARGET_MIPS16 .set\tpush\;.set\tmips32r2\t\;rdhwr\t%0,$29\;.set\tpop [(set_attr type unknown) + (set_attr can_delay no) (set_attr mode MODE)]) ; The MIPS Paired-Single Floating Point and MIPS-3D Instructions.
Re: MIPS RDHWR instruction reordering
Atsushi Nemoto [EMAIL PROTECTED] writes: On 21 Jul 2006 10:06:34 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: I also don't see why revision 108713 would affect this. But I do note that this version is still bad. The rdhwr instruction is in the branch delay slot, and is therefore always executed. Yes, and I think rdhwr should not be in delay slot anyway. Just avoiding delay slot is quite easy. Here is a proposal patch. With this patch, it seems gcc 4.2 generate desired code for now, though there might be somewhere to fix. This may be right, but I'm not sure that it is. If it is OK to unconditionally execute rdhwr, then it should be OK to put it in a delay slot. Unless that will break something when rdhwr is emulated. Or will the emulation code run slower when rdhwr is in a delay slot? That is, slower than it would if rdhwr were emulated without being in a delay slot? Ian
Re: MIPS RDHWR instruction reordering
On Sat, Jul 22, 2006 at 09:52:44AM -0700, Ian Lance Taylor wrote: This may be right, but I'm not sure that it is. If it is OK to unconditionally execute rdhwr, then it should be OK to put it in a delay slot. Unless that will break something when rdhwr is emulated. Or will the emulation code run slower when rdhwr is in a delay slot? That is, slower than it would if rdhwr were emulated without being in a delay slot? Yes. Much, much slower. There's a fast path support for rdhwr (I'm not sure if it is committed yet but it's definitely floating around) which only handles the non-branch-delay case. It will still work in a delay slot, but it's a much heavier-weight operation. So, until and unless there is a revision of the MIPS architecture on which this instruction is not guaranteed to trap, I think we should not put it in a delay slot. -- Daniel Jacobowitz CodeSourcery
Re: MIPS RDHWR instruction reordering
Daniel Jacobowitz [EMAIL PROTECTED] writes: On Sat, Jul 22, 2006 at 09:52:44AM -0700, Ian Lance Taylor wrote: This may be right, but I'm not sure that it is. If it is OK to unconditionally execute rdhwr, then it should be OK to put it in a delay slot. Unless that will break something when rdhwr is emulated. Or will the emulation code run slower when rdhwr is in a delay slot? That is, slower than it would if rdhwr were emulated without being in a delay slot? Yes. Much, much slower. There's a fast path support for rdhwr (I'm not sure if it is committed yet but it's definitely floating around) which only handles the non-branch-delay case. It will still work in a delay slot, but it's a much heavier-weight operation. So, until and unless there is a revision of the MIPS architecture on which this instruction is not guaranteed to trap, I think we should not put it in a delay slot. OK, patch is approved, with a ChangeLog entry, and assuming it passes the testsuite. Thanks. Ian
Re: MIPS RDHWR instruction reordering
On 19 Jun 2006 16:45:43 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: I'm not sure, because I'm not sure what is hoisting the instruction. I tried recreating this, but I couldn't. I get this: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setreorder .setnoreorder .setnomacro beq $4,$0,$L7 .setpush .setmips32r2 rdhwr $3,$29 .setpop .setmacro .setreorder FYI, I found that the difference between your result (gcc 4.2) and mine (gcc 4.1.1) is come from r108713 commit. With r108712 I got: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro lw $2,%gottprel(x)($28) .setpush .setmips32r2 rdhwr $3,$29 .setpop addu$2,$2,$3 beq $4,$0,$L4 move$3,$0 lw $3,0($2) $L4: j $31 move$2,$3 And with r108713 I got: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro beq $4,$0,$L7 .setpush .setmips32r2 rdhwr $3,$29 .setpop lw $2,%gottprel(x)($28) nop addu$2,$2,$3 lw $2,0($2) j $31 nop $L7: j $31 move$2,$0 And I can not see why the commit make such a difference... --- Atsushi Nemoto
Re: MIPS RDHWR instruction reordering
Atsushi Nemoto [EMAIL PROTECTED] writes: And with r108713 I got: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro beq $4,$0,$L7 .setpush .setmips32r2 rdhwr $3,$29 .setpop lw $2,%gottprel(x)($28) nop addu$2,$2,$3 lw $2,0($2) j $31 nop $L7: j $31 move$2,$0 And I can not see why the commit make such a difference... I also don't see why revision 108713 would affect this. But I do note that this version is still bad. The rdhwr instruction is in the branch delay slot, and is therefore always executed. Ian
Re: MIPS RDHWR instruction reordering
On 19 Jun 2006 16:45:43 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: I tried recreating this, but I couldn't. I get this: ... This of course is not ideal, since it unconditionally executes the rdhwr instruction. But it is not the same as what the OP reported. I used stock gcc 4.1.1 to get the result: mips-linux-gcc -O2 -S foo.c If I used -O instead of -O2, I got similer (but not same) result: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro bne $4,$0,$L2 .setpush .setmips32r2 rdhwr $3,$29 .setpop j $31 move$2,$0 $L2: lw $2,%gottprel(x)($28) nop addu$2,$2,$3 lw $2,0($2) j $31 nop JFYI. Should I file a bug report? --- Atsushi Nemoto
Re: MIPS RDHWR instruction reordering
Atsushi Nemoto [EMAIL PROTECTED] writes: On 19 Jun 2006 16:45:43 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: I tried recreating this, but I couldn't. I get this: ... This of course is not ideal, since it unconditionally executes the rdhwr instruction. But it is not the same as what the OP reported. I used stock gcc 4.1.1 to get the result: mips-linux-gcc -O2 -S foo.c I was using current mainline. Should I file a bug report? Yes, please. Thanks. Ian
Re: MIPS RDHWR instruction reordering
On Fri, Jun 16, 2006 at 02:12:29PM -0700, Ian Lance Taylor wrote: The computation of the address of x was moved outside the conditional--that is, both the rdhwr and the addu moved. You'll have to figure out why. gcc shouldn't move instructions outside of a conditional unless they are cheap and don't trap. This instruction doesn't trap, but it's not cheap. What metric gets used for this - rtx_cost? -- Daniel Jacobowitz CodeSourcery
Re: MIPS RDHWR instruction reordering
Daniel Jacobowitz [EMAIL PROTECTED] writes: On Fri, Jun 16, 2006 at 02:12:29PM -0700, Ian Lance Taylor wrote: The computation of the address of x was moved outside the conditional--that is, both the rdhwr and the addu moved. You'll have to figure out why. gcc shouldn't move instructions outside of a conditional unless they are cheap and don't trap. This instruction doesn't trap, but it's not cheap. What metric gets used for this - rtx_cost? I'm not sure, because I'm not sure what is hoisting the instruction. I tried recreating this, but I couldn't. I get this: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setreorder .setnoreorder .setnomacro beq $4,$0,$L7 .setpush .setmips32r2 rdhwr $3,$29 .setpop .setmacro .setreorder lw $2,%gottprel(x)($28) addu$2,$2,$3 lw $2,0($2) j $31 $L7: .setnoreorder .setnomacro j $31 move$2,$0 .setmacro .setreorder This of course is not ideal, since it unconditionally executes the rdhwr instruction. But it is not the same as what the OP reported. This case happens because reorg.c ignores the cost of the instruction in fill_slots_from_thread. I believe that reorg.c should not move an expensive instruction which is only conditionally executed into a delay slot. That is probably a bug. We can see a similar case with this: int foo(int arg, int x) { if (arg) return x * x; return 0; } which yields this: foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setreorder .setnoreorder .setnomacro beq $4,$0,$L7 mult$5,$5 .setmacro .setreorder mflo$2 j $31 $L7: .setnoreorder .setnomacro j $31 move$2,$0 .setmacro .setreorder which executes the mult instruction unconditionally which is probably not desirable since it will tie up the multiplication pipeline. Ian
Re: MIPS RDHWR instruction reordering
On 16 Jun 2006 14:12:29 -0700, Ian Lance Taylor [EMAIL PROTECTED] wrote: The RDHWR is executed _before_ evaluating the arg value. For arg == 0 case, the RDHWR has no point but just a overhead. Without -O2, the RDHWR is executed _after_ the evaluation, so gcc's optimizer reorder the RDHWR instruction. The computation of the address of x was moved outside the conditional--that is, both the rdhwr and the addu moved. You'll have to figure out why. gcc shouldn't move instructions outside of a conditional unless they are cheap and don't trap. This instruction doesn't trap, but it's not cheap. AFAIK on all MIPS CPU the rdhwr generates an exception (trap). I tried changing unspec to unspec_volatile in this definition: (define_insn tls_get_tp_mode [(set (match_operand:P 0 register_operand =v) (unspec:P [(const_int 0)] UNSPEC_TLS_GET_TP))] HAVE_AS_TLS !TARGET_MIPS16 .set\tpush\;.set\tmips32r2\t\;rdhwr\t%0,$29\;.set\tpop [(set_attr type unknown) (set_attr mode MODE)]) With unspec_volatile, gcc do not move the rdhwr before the branch. But this change has bad side effects. For example, if I incremented a thread local variable, rdhwr is used twice (for load and store). So I suppose we should tell gcc that rdhwr is not cheap. But I do not know how to describe such information in .md file... --- Atsushi Nemoto
MIPS RDHWR instruction reordering
The RDHWR instruction is used to support TLS on Linux/MIPS. For now it is always emulated by kernel (on Reserved Instruction exception handler), the instruction will be quite expensive. If I compile this code with gcc 4.1.1 (-O2), extern __thread int x; int foo(int arg) { if (arg) return x; return 0; } I got this output. foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro lw $2,%gottprel(x)($28) .setpush .setmips32r2 rdhwr $3,$29 .setpop addu$2,$2,$3 beq $4,$0,$L4 move$3,$0 lw $3,0($2) $L4: j $31 move$2,$3 The RDHWR is executed _before_ evaluating the arg value. For arg == 0 case, the RDHWR has no point but just a overhead. Without -O2, the RDHWR is executed _after_ the evaluation, so gcc's optimizer reorder the RDHWR instruction. I want to make the RDHWR called only if really required. Is it possible to prevent such a reording for RDHWR ? This is very common case in libc. On returning from system call, libc checks the result and write to the errno which is thread-local. On successful syscall, we do not need to access errno at all, but gcc sometimes put RDHWR _before_ checking the result of the syscall. This add an additional exception for the syscall for usual case. I do not know much about gcc's internal. Any help are welcome. Thank you. --- Atsushi Nemoto
Re: MIPS RDHWR instruction reordering
Atsushi Nemoto [EMAIL PROTECTED] writes: If I compile this code with gcc 4.1.1 (-O2), extern __thread int x; int foo(int arg) { if (arg) return x; return 0; } I got this output. foo: .frame $sp,0,$31 # vars= 0, regs= 0/0, args= 0, gp= 0 .mask 0x,0 .fmask 0x,0 .setnoreorder .cpload $25 .setnomacro lw $2,%gottprel(x)($28) .setpush .setmips32r2 rdhwr $3,$29 .setpop addu$2,$2,$3 beq $4,$0,$L4 move$3,$0 lw $3,0($2) $L4: j $31 move$2,$3 The RDHWR is executed _before_ evaluating the arg value. For arg == 0 case, the RDHWR has no point but just a overhead. Without -O2, the RDHWR is executed _after_ the evaluation, so gcc's optimizer reorder the RDHWR instruction. The computation of the address of x was moved outside the conditional--that is, both the rdhwr and the addu moved. You'll have to figure out why. gcc shouldn't move instructions outside of a conditional unless they are cheap and don't trap. This instruction doesn't trap, but it's not cheap. Ian