Re: [MIPS] MADD issue
Nigel Stephens [EMAIL PROTECTED] writes: OK, so maybe as the person who removed adddi3 from the MIPS backend, and the main proponent of the new fused opcodes, you get to volunteer to implement this? :) Hey, I was pretty happy with the status quo ;) Richard
Re: [MIPS] MADD issue
Richard Sandiford wrote: Nigel Stephens [EMAIL PROTECTED] writes: I notice that at least the 32-bit rs6000, i386, sparc, frv, sh, cris, mcore, score, arm pa backends still implement adddi3 as either a define_insn which outputs two instructions or an explicit define_expand followed define_split and associated sub patterns. Are we setting the bar too high for MIPS? :) I don't think that follows. The main reason that ports like rs6000, i386, arm, sparc and pa define adddi3 is because those architectures provide special add-with-carry type instructions, or similar architecture-specific optimisations. Right, good point. MIPS has nothing like that. Actually the MIPS DSP ASE does have addsc and addwc, which could be used for this purpose. Sadly not subsc and subwc, though. The old MIPS patterns just re-implemented the standard optabs version (and often did so less efficiently, as I said before). Whilst I'm sure that your proposal is the right one going forward, it still feels like it could be significant amount of work to implement. And the simplified optab/expand support would only work for multiply-add sequences expressed within a single expression, and wouldn't be able to optimise disjoint multiply, then add expressions. Or have I missed something. I don't think that follows either. Out-of-ssa should coalesce them if they are in the same basic block. And if they aren't in the same basic block, that's presumably because the tree optimisers think they shouldn't be. Combine wouldn't look across basic block boundaries either AFAIK. E.g. compiling: typedef unsigned long long ull; typedef unsigned int ui; ull foo (ui x, ui y, ull z) { ull tmp = (ull) x * y; return tmp + z; } with a recent snapshot and -O2 -fdump-tree-all shows that the final_cleanup dump does indeed have the combined form: ;; Function foo (foo) foo (x, y, z) { bb 2: return (long long unsigned int) y * (long long unsigned int) x + z; } Ah I see. Fair enough. I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. in the short term we really do need to reenable madd/msub for MIPS32 targets in GCC 4. We could do that with a local patch to put back adddi3, but it would be better if we could coordinate this with you. If removing the patterns had been purely a clean-up, I would be more open to the idea of putting the patterns back. But given that removing the patterns had an optimisation benefit of its own, I'm less open to the idea of adding them back, especially when (as far as I'm concerned) there's a clean way of getting the best of both worlds. OK, so maybe as the person who removed adddi3 from the MIPS backend, and the main proponent of the new fused opcodes, you get to volunteer to implement this? :) In the meantime the performance gain from being able to use a widening madd is more important to us than the benefit of improved optimisation of 64-bit addition, so we'll probably have to put adddi3 back in as a local patch. Nigel
Re: [MIPS] MADD issue
Richard Sandiford wrote: Nigel Stephens [EMAIL PROTECTED] writes: While I agree with you philosophically, it feels like (b) might be quite a major task. A number of optimisation passes which currently recognise and MUL and PLUS separately (e.g. loop strength reduction) would now need to be extended to handle the fused MULPLUS and MULSUB operators. And although the reduction in instruction count due to your previous change is good, what is it as a percentage of the total? After all it only helps code which uses 64-bit integer types with a 32-bit ABI, which is probably quite a small proportion of most real-life applications -- whereas for some algorithms the ability to use MADD is absolutely critical to performance, and for them losing the ability to generate MADD is a significant backward step for the compiler. How about, as a workaround until (b) sees the light of day, we reimplement adddi3 and subdi3 only (not the other di mode patterns), qualified by ISA_HAS_MADD_MSUB. Perhaps they could also be implemented more cleanly nowadays, using define_insn_and_split and/or a # template, to avoid generating multi-instruction assembler sequences. The old patterns had a define_split too. That wasn't really the problem. If you don't want to add a tree code yet, it would still be possible to add the optab and expand support, recognising mult-add sequences in a similar way to how we recognise widening multiplies now. I feel at least that's a step in the right direction. Richard I notice that at least the 32-bit rs6000, i386, sparc, frv, sh, cris, mcore, score, arm pa backends still implement adddi3 as either a define_insn which outputs two instructions or an explicit define_expand followed define_split and associated sub patterns. Are we setting the bar too high for MIPS? :) Whilst I'm sure that your proposal is the right one going forward, it still feels like it could be significant amount of work to implement. And the simplified optab/expand support would only work for multiply-add sequences expressed within a single expression, and wouldn't be able to optimise disjoint multiply, then add expressions. Or have I missed something. in the short term we really do need to reenable madd/msub for MIPS32 targets in GCC 4. We could do that with a local patch to put back adddi3, but it would be better if we could coordinate this with you. Nigel
Re: [MIPS] MADD issue
Nigel Stephens [EMAIL PROTECTED] writes: I notice that at least the 32-bit rs6000, i386, sparc, frv, sh, cris, mcore, score, arm pa backends still implement adddi3 as either a define_insn which outputs two instructions or an explicit define_expand followed define_split and associated sub patterns. Are we setting the bar too high for MIPS? :) I don't think that follows. The main reason that ports like rs6000, i386, arm, sparc and pa define adddi3 is because those architectures provide special add-with-carry type instructions, or similar architecture-specific optimisations. MIPS has nothing like that. The old MIPS patterns just re-implemented the standard optabs version (and often did so less efficiently, as I said before). Whilst I'm sure that your proposal is the right one going forward, it still feels like it could be significant amount of work to implement. And the simplified optab/expand support would only work for multiply-add sequences expressed within a single expression, and wouldn't be able to optimise disjoint multiply, then add expressions. Or have I missed something. I don't think that follows either. Out-of-ssa should coalesce them if they are in the same basic block. And if they aren't in the same basic block, that's presumably because the tree optimisers think they shouldn't be. Combine wouldn't look across basic block boundaries either AFAIK. E.g. compiling: typedef unsigned long long ull; typedef unsigned int ui; ull foo (ui x, ui y, ull z) { ull tmp = (ull) x * y; return tmp + z; } with a recent snapshot and -O2 -fdump-tree-all shows that the final_cleanup dump does indeed have the combined form: ;; Function foo (foo) foo (x, y, z) { bb 2: return (long long unsigned int) y * (long long unsigned int) x + z; } I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. in the short term we really do need to reenable madd/msub for MIPS32 targets in GCC 4. We could do that with a local patch to put back adddi3, but it would be better if we could coordinate this with you. If removing the patterns had been purely a clean-up, I would be more open to the idea of putting the patterns back. But given that removing the patterns had an optimisation benefit of its own, I'm less open to the idea of adding them back, especially when (as far as I'm concerned) there's a clean way of getting the best of both worlds. Richard
Re: [MIPS] MADD issue
Richard Sandiford [EMAIL PROTECTED] writes: I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. The main issue I know of is the RTL level loop optimizers (combine and CSE can mostly work off of REG_EQUAL notes). If you define_expand adddi3, they won't be able to handle loops using long long types. Whether this matters in practice for real code, I don't know. Certainly adddi3 and friends should not be straight define_insns, as they used to be for MIPS. With the lower-subreg pass, they should be either define_expand to individual insns or define_insn_and_split with an unconditional split before reload. Ian
Re: [MIPS] MADD issue
Ian Lance Taylor [EMAIL PROTECTED] writes: Richard Sandiford [EMAIL PROTECTED] writes: I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. The main issue I know of is the RTL level loop optimizers (combine and CSE can mostly work off of REG_EQUAL notes). If you define_expand adddi3, they won't be able to handle loops using long long types. Whether this matters in practice for real code, I don't know. My point was that I thought the interesting multi-word optimisations (including loop optimisations) ought to be done at the tree level instead, and that the main focus of the RTL optimisers ought to be optimising things after machine-specific information has been exposed. In contrast, the MIPS define_insn define_splits existed specifically to avoid exposing machine-specific information to those optimisations. I'm not sure from your reply whether you disagree (although it sounds like you might). Certainly adddi3 and friends should not be straight define_insns, as they used to be for MIPS. With the lower-subreg pass, they should be either define_expand to individual insns or define_insn_and_split with an unconditional split before reload. Well, the old patterns had define_splits too. I don't think that was really the problem. Richard
Re: [MIPS] MADD issue
Richard Sandiford [EMAIL PROTECTED] writes: Ian Lance Taylor [EMAIL PROTECTED] writes: Richard Sandiford [EMAIL PROTECTED] writes: I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. The main issue I know of is the RTL level loop optimizers (combine and CSE can mostly work off of REG_EQUAL notes). If you define_expand adddi3, they won't be able to handle loops using long long types. Whether this matters in practice for real code, I don't know. My point was that I thought the interesting multi-word optimisations (including loop optimisations) ought to be done at the tree level instead, and that the main focus of the RTL optimisers ought to be optimising things after machine-specific information has been exposed. In contrast, the MIPS define_insn define_splits existed specifically to avoid exposing machine-specific information to those optimisations. I'm not sure from your reply whether you disagree (although it sounds like you might). I suppose I neither agree nor disagree. It's a matter for testing. It's clear that with our present scheme there are loop optimization opportunities at the RTL level in the form of hoisting new loop invariants created by expanding the addressing modes. And, of course, some machines have specific loop instructions which currently can only be handled at the RTL level. However, those should be more or less independent of adddi3. Ian
Re: [MIPS] MADD issue
Ian Lance Taylor [EMAIL PROTECTED] writes: Richard Sandiford [EMAIL PROTECTED] writes: Ian Lance Taylor [EMAIL PROTECTED] writes: Richard Sandiford [EMAIL PROTECTED] writes: I realise no-one else has spoken out in support of me, so perhaps I'm in a minority of one here. But it does seem to me that in the Tree-SSA world, it makes less sense to duplicate standard optabs in the backend purely for the reason of keeping DImode arithmetic around as DImode arithmetic for longer. The main issue I know of is the RTL level loop optimizers (combine and CSE can mostly work off of REG_EQUAL notes). If you define_expand adddi3, they won't be able to handle loops using long long types. Whether this matters in practice for real code, I don't know. My point was that I thought the interesting multi-word optimisations (including loop optimisations) ought to be done at the tree level instead, and that the main focus of the RTL optimisers ought to be optimising things after machine-specific information has been exposed. In contrast, the MIPS define_insn define_splits existed specifically to avoid exposing machine-specific information to those optimisations. I'm not sure from your reply whether you disagree (although it sounds like you might). I suppose I neither agree nor disagree. It's a matter for testing. It's clear that with our present scheme there are loop optimization opportunities at the RTL level in the form of hoisting new loop invariants created by expanding the addressing modes. And, of course, some machines have specific loop instructions which currently can only be handled at the RTL level. Right. In case my message might be interpreted as saying that we shouldn't have RTL loop optimisers, that wasn't the intention at all. MIPS definitely benefits from them with %hi accesses, etc. It was more the opposite: splitting the instructions early ought to give the RTL optimisers the opportunity to do more things that the tree optimisers simply couldn't. OTOH, trying to give the RTL optimisers the same sort of arithmetic operations that the tree level had seems like going out of our way to make the RTL optimisers repeat work. However, those should be more or less independent of adddi3. Yeah, I hope so. Richard
Re: [MIPS] MADD issue
(define_insn adddi3_internal_1 [(set (match_operand:DI 0 register_operand =d,d) (plus:DI (match_operand:DI 1 register_operand 0,d) (match_operand:DI 2 register_operand d,d))) (clobber (match_operand:SI 3 register_operand =d,d))] !TARGET_64BIT !TARGET_DEBUG_G_MODE !TARGET_MIPS16 { return (REGNO (operands[0]) == REGNO (operands[1]) REGNO (operands[0]) == REGNO (operands[2])) ? srl\t%3,%L0,31\;sll\t%M0,%M0,1\;sll\t%L0,%L1,1\;addu\t%M0,%M0,%3 : addu\t%L0,%L1,%L2\;sltu\t%3,%L0,%L2\;addu\t%M0,%M1,%M2\;addu\t%M0,%M0,%3; } This should be a post-reload (i.e. predicated on reload_completed) split, I think. Paolo
Re: [MIPS] MADD issue
Fu, Chao-Ying [EMAIL PROTECTED] writes: After tracing GCC 4.x to see why MADD is not generated for MIPS32, I found out the main issue is that the pattern adddi3 is not available for MIPS32. Because the missing of adddi3, GCC 4.x needs to split 64-bit addition to 4 separate RTL insns. This leads to that the combining phase fails to combine RTL insns to a single madd pattern. Could we enable adddi3 for MIPS32 in GCC 4.x? Or is there a better way to generate MADD? Thanks a lot! The problem with: Ex: (mips.md in GCC 3.4) (define_expand adddi3 [(parallel [(set (match_operand:DI 0 register_operand ) (plus:DI (match_operand:DI 1 register_operand ) (match_operand:DI 2 arith_operand ))) (clobber (match_dup 3))])] TARGET_64BIT || (!TARGET_DEBUG_G_MODE !TARGET_MIPS16) { (define_insn adddi3_internal_1 [(set (match_operand:DI 0 register_operand =d,d) (plus:DI (match_operand:DI 1 register_operand 0,d) (match_operand:DI 2 register_operand d,d))) (clobber (match_operand:SI 3 register_operand =d,d))] !TARGET_64BIT !TARGET_DEBUG_G_MODE !TARGET_MIPS16 { return (REGNO (operands[0]) == REGNO (operands[1]) REGNO (operands[0]) == REGNO (operands[2])) ? srl\t%3,%L0,31\;sll\t%M0,%M0,1\;sll\t%L0,%L1,1\;addu\t%M0,%M0,%3 : addu\t%L0,%L1,%L2\;sltu\t%3,%L0,%L2\;addu\t%M0,%M1,%M2\;addu\t%M0,%M0,%3; } [(set_attr type darith) (set_attr mode DI) (set_attr length 16)]) ...this was that it tended to be very poor for the additions themselves. When optabs.c implements the additions instead, the early RTL optimisers get to see the individual instructions, and are often able to handle constant or part-constant operands better. This led to a noticable size improvement when I tested it originally. (I imagine the effects are even better now, thanks to the subreg lowering pass.) See: http://gcc.gnu.org/ml/gcc-patches/2004-05/msg00947.html for the patch that made this change, and some rationale. As far as madd goes, I think it would be better to either (a) get combine to handle this situation or (b) get expand to generate a fused multiply-add from the outset. (b) sounds like it might be useful in its own right. At the moment we treat the generation of floating-point multiply-adds as an optimisation, but in some applications it's critical not to round the intermediate result. (I don't know if there's a bugzilla entry about this.) If we treated fused multiply-add as a primitive operation, we could extend it to integer types too. In this case we'd also need to handle widening multiplications, but we already need to do that for stand-alone multiplications. Just random musings, and probably not the answer you wanted to hear, sorry. Richard
Re: [MIPS] MADD issue
Richard Sandiford wrote: As far as madd goes, I think it would be better to either (a) get combine to handle this situation or (b) get expand to generate a fused multiply-add from the outset. (b) sounds like it might be useful in its own right. At the moment we treat the generation of floating-point multiply-adds as an optimisation, but in some applications it's critical not to round the intermediate result. (I don't know if there's a bugzilla entry about this.) If we treated fused multiply-add as a primitive operation, we could extend it to integer types too. In this case we'd also need to handle widening multiplications, but we already need to do that for stand-alone multiplications. Richard While I agree with you philosophically, it feels like (b) might be quite a major task. A number of optimisation passes which currently recognise and MUL and PLUS separately (e.g. loop strength reduction) would now need to be extended to handle the fused MULPLUS and MULSUB operators. And although the reduction in instruction count due to your previous change is good, what is it as a percentage of the total? After all it only helps code which uses 64-bit integer types with a 32-bit ABI, which is probably quite a small proportion of most real-life applications -- whereas for some algorithms the ability to use MADD is absolutely critical to performance, and for them losing the ability to generate MADD is a significant backward step for the compiler. How about, as a workaround until (b) sees the light of day, we reimplement adddi3 and subdi3 only (not the other di mode patterns), qualified by ISA_HAS_MADD_MSUB. Perhaps they could also be implemented more cleanly nowadays, using define_insn_and_split and/or a # template, to avoid generating multi-instruction assembler sequences. Nigel
Re: [MIPS] MADD issue
Nigel Stephens [EMAIL PROTECTED] writes: While I agree with you philosophically, it feels like (b) might be quite a major task. A number of optimisation passes which currently recognise and MUL and PLUS separately (e.g. loop strength reduction) would now need to be extended to handle the fused MULPLUS and MULSUB operators. And although the reduction in instruction count due to your previous change is good, what is it as a percentage of the total? After all it only helps code which uses 64-bit integer types with a 32-bit ABI, which is probably quite a small proportion of most real-life applications -- whereas for some algorithms the ability to use MADD is absolutely critical to performance, and for them losing the ability to generate MADD is a significant backward step for the compiler. How about, as a workaround until (b) sees the light of day, we reimplement adddi3 and subdi3 only (not the other di mode patterns), qualified by ISA_HAS_MADD_MSUB. Perhaps they could also be implemented more cleanly nowadays, using define_insn_and_split and/or a # template, to avoid generating multi-instruction assembler sequences. The old patterns had a define_split too. That wasn't really the problem. If you don't want to add a tree code yet, it would still be possible to add the optab and expand support, recognising mult-add sequences in a similar way to how we recognise widening multiplies now. I feel at least that's a step in the right direction. Richard
Re: [MIPS] MADD issue
Richard Sandiford wrote: Nigel Stephens [EMAIL PROTECTED] writes: While I agree with you philosophically, it feels like (b) might be quite a major task. A number of optimisation passes which currently recognise and MUL and PLUS separately (e.g. loop strength reduction) would now need to be extended to handle the fused MULPLUS and MULSUB operators. And although the reduction in instruction count due to your previous change is good, what is it as a percentage of the total? After all it only helps code which uses 64-bit integer types with a 32-bit ABI, which is probably quite a small proportion of most real-life applications -- whereas for some algorithms the ability to use MADD is absolutely critical to performance, and for them losing the ability to generate MADD is a significant backward step for the compiler. How about, as a workaround until (b) sees the light of day, we reimplement adddi3 and subdi3 only (not the other di mode patterns), qualified by ISA_HAS_MADD_MSUB. Perhaps they could also be implemented more cleanly nowadays, using define_insn_and_split and/or a # template, to avoid generating multi-instruction assembler sequences. The old patterns had a define_split too. That wasn't really the problem. If you don't want to add a tree code yet, it would still be possible to add the optab and expand support, recognising mult-add sequences in a similar way to how we recognise widening multiplies now. I feel at least that's a step in the right direction. OK, we'll have a think about that. Thanks Nigel
Re: [MIPS] MADD issue
Paolo Bonzini [EMAIL PROTECTED] writes: (define_insn adddi3_internal_1 [(set (match_operand:DI 0 register_operand =d,d) (plus:DI (match_operand:DI 1 register_operand 0,d) (match_operand:DI 2 register_operand d,d))) (clobber (match_operand:SI 3 register_operand =d,d))] !TARGET_64BIT !TARGET_DEBUG_G_MODE !TARGET_MIPS16 { return (REGNO (operands[0]) == REGNO (operands[1]) REGNO (operands[0]) == REGNO (operands[2])) ? srl\t%3,%L0,31\;sll\t%M0,%M0,1\;sll\t%L0,%L1,1\;addu\t%M0,%M0,%3 : addu\t%L0,%L1,%L2\;sltu\t%3,%L0,%L2\;addu\t%M0,%M1,%M2\;addu\t%M0,%M0,%3; } This should be a post-reload (i.e. predicated on reload_completed) split, I think. Actually, with the relatively recent lower-subreg work, it is desirable to split this sort of instruction before reload. That is, do an unconditional split. Ian
Re: [MIPS] MADD issue
This should be a post-reload (i.e. predicated on reload_completed) split, I think. Actually, with the relatively recent lower-subreg work, it is desirable to split this sort of instruction before reload. That is, do an unconditional split. Right. Combine cannot cope with the resulting 4-insn sequence and merge it back with the multiplication, but split is ran after combine and before the second lower-subreg pass. So, making this an unconditional split should be the best of both worlds. Paolo
Re: [MIPS] MADD issue
Paolo Bonzini [EMAIL PROTECTED] writes: This should be a post-reload (i.e. predicated on reload_completed) split, I think. Actually, with the relatively recent lower-subreg work, it is desirable to split this sort of instruction before reload. That is, do an unconditional split. Right. Combine cannot cope with the resulting 4-insn sequence and merge it back with the multiplication, but split is ran after combine and before the second lower-subreg pass. So, making this an unconditional split should be the best of both worlds. The problem is, combine is also one of the passes that was able to optimise the split form so effectively. It's not the best of both worlds from that point of view. Richard
[MIPS] MADD issue
Hi Richard, After tracing GCC 4.x to see why MADD is not generated for MIPS32, I found out the main issue is that the pattern adddi3 is not available for MIPS32. Because the missing of adddi3, GCC 4.x needs to split 64-bit addition to 4 separate RTL insns. This leads to that the combining phase fails to combine RTL insns to a single madd pattern. Could we enable adddi3 for MIPS32 in GCC 4.x? Or is there a better way to generate MADD? Thanks a lot! Ex: (b67.c) long long test (long long a, int b, int c) { return a + (long long) b * (long long) c; } # gcc -S b67.c -O3 -mips32 (b67.s) test: .frame $sp,0,$31 .mask 0x,0 .fmask 0x,0 .setnoreorder .setnomacro mtlo$5 mthi$4 madd$6,$7 mflo$3 j $31 mfhi$2 Regards, Chao-ying - Ex: (mips.md in GCC 3.4) (define_expand adddi3 [(parallel [(set (match_operand:DI 0 register_operand ) (plus:DI (match_operand:DI 1 register_operand ) (match_operand:DI 2 arith_operand ))) (clobber (match_dup 3))])] TARGET_64BIT || (!TARGET_DEBUG_G_MODE !TARGET_MIPS16) { (define_insn adddi3_internal_1 [(set (match_operand:DI 0 register_operand =d,d) (plus:DI (match_operand:DI 1 register_operand 0,d) (match_operand:DI 2 register_operand d,d))) (clobber (match_operand:SI 3 register_operand =d,d))] !TARGET_64BIT !TARGET_DEBUG_G_MODE !TARGET_MIPS16 { return (REGNO (operands[0]) == REGNO (operands[1]) REGNO (operands[0]) == REGNO (operands[2])) ? srl\t%3,%L0,31\;sll\t%M0,%M0,1\;sll\t%L0,%L1,1\;addu\t%M0,%M0,%3 : addu\t%L0,%L1,%L2\;sltu\t%3,%L0,%L2\;addu\t%M0,%M1,%M2\;addu\t%M0,%M0,%3; } [(set_attr type darith) (set_attr mode DI) (set_attr length 16)]) (define_insn *smul_acc_di [(set (match_operand:DI 0 register_operand =x) (plus:DI (mult:DI (sign_extend:DI (match_operand:SI 1 register_operand d)) (sign_extend:DI (match_operand:SI 2 register_operand d))) (match_operand:DI 3 register_operand 0)))] (TARGET_MAD || ISA_HAS_MACC) !TARGET_64BIT { if (TARGET_MAD) return mad\t%1,%2; else if (TARGET_MIPS5500) return madd\t%1,%2; else return macc\t%.,%1,%2; } [(set_attr type imadd) (set_attr mode SI)]) -