Re: [PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)
On Wed, Mar 7, 2012 at 6:40 PM, Ulrich Weigand wrote: > Richard Kenner wrote: > >> > Given the current set of results, since I do not have any way to verify >> > whether my simplify_set changes would actually trigger correctly, I'd >> > rather propose to just remove the SUBREG case in apply_distributive_law >> > (i.e. only apply the first patch below). >> > >> > Thoughts? >> >> I think that's reasonable. But I'd replace it with a comment saying >> what used to be there and why it was removed. > > Now that we're back in stage 1, I'd like to try and move forward with > this again. Here's a patch that implements your suggestion. > > Tested on arm-linux-gnueabi, i386-linux-gnu and s390-linux-gnu. > > OK for mainline? Ok. Thanks, Richard. > Bye, > Ulrich > > 2012-03-07 Ulrich Weigand > > gcc/ > * combine.c (apply_distributive_law): Do not distribute SUBREG. > > === modified file 'gcc/combine.c' > --- gcc/combine.c 2012-02-07 15:48:52 + > +++ gcc/combine.c 2012-02-22 11:57:19 + > @@ -9286,36 +9286,22 @@ > /* This is also a multiply, so it distributes over everything. */ > break; > > - case SUBREG: > - /* Non-paradoxical SUBREGs distributes over all operations, > - provided the inner modes and byte offsets are the same, this > - is an extraction of a low-order part, we don't convert an fp > - operation to int or vice versa, this is not a vector mode, > - and we would not be converting a single-word operation into a > - multi-word operation. The latter test is not required, but > - it prevents generating unneeded multi-word operations. Some > - of the previous tests are redundant given the latter test, > - but are retained because they are required for correctness. > - > - We produce the result slightly differently in this case. */ > - > - if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs)) > - || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs) > - || ! subreg_lowpart_p (lhs) > - || (GET_MODE_CLASS (GET_MODE (lhs)) > - != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs > - || paradoxical_subreg_p (lhs) > - || VECTOR_MODE_P (GET_MODE (lhs)) > - || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD > - /* Result might need to be truncated. Don't change mode if > - explicit truncation is needed. */ > - || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x), > - GET_MODE (SUBREG_REG (lhs > - return x; > - > - tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)), > - SUBREG_REG (lhs), SUBREG_REG (rhs)); > - return gen_lowpart (GET_MODE (x), tem); > + /* This used to handle SUBREG, but this turned out to be counter- > + productive, since (subreg (op ...)) usually is not handled by > + insn patterns, and this "optimization" therefore transformed > + recognizable patterns into unrecognizable ones. Therefore the > + SUBREG case was removed from here. > + > + It is possible that distributing SUBREG over arithmetic operations > + leads to an intermediate result than can then be optimized further, > + e.g. by moving the outer SUBREG to the other side of a SET as done > + in simplify_set. This seems to have been the original intent of > + handling SUBREGs here. > + > + However, with current GCC this does not appear to actually happen, > + at least on major platforms. If some case is found where removing > + the SUBREG case here prevents follow-on optimizations, distributing > + SUBREGs ought to be re-added at that place, e.g. in simplify_set. */ > > default: > return x; > > > > -- > Dr. Ulrich Weigand > GNU Toolchain for Linux on System z and Cell BE > ulrich.weig...@de.ibm.com >
[PATCH] Do not handle SUBREG in apply_distributive_law (Re: RFC: allowing fwprop to propagate subregs)
Richard Kenner wrote: > > Given the current set of results, since I do not have any way to verify > > whether my simplify_set changes would actually trigger correctly, I'd > > rather propose to just remove the SUBREG case in apply_distributive_law > > (i.e. only apply the first patch below). > > > > Thoughts? > > I think that's reasonable. But I'd replace it with a comment saying > what used to be there and why it was removed. Now that we're back in stage 1, I'd like to try and move forward with this again. Here's a patch that implements your suggestion. Tested on arm-linux-gnueabi, i386-linux-gnu and s390-linux-gnu. OK for mainline? Bye, Ulrich 2012-03-07 Ulrich Weigand gcc/ * combine.c (apply_distributive_law): Do not distribute SUBREG. === modified file 'gcc/combine.c' --- gcc/combine.c 2012-02-07 15:48:52 + +++ gcc/combine.c 2012-02-22 11:57:19 + @@ -9286,36 +9286,22 @@ /* This is also a multiply, so it distributes over everything. */ break; -case SUBREG: - /* Non-paradoxical SUBREGs distributes over all operations, -provided the inner modes and byte offsets are the same, this -is an extraction of a low-order part, we don't convert an fp -operation to int or vice versa, this is not a vector mode, -and we would not be converting a single-word operation into a -multi-word operation. The latter test is not required, but -it prevents generating unneeded multi-word operations. Some -of the previous tests are redundant given the latter test, -but are retained because they are required for correctness. - -We produce the result slightly differently in this case. */ - - if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs)) - || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs) - || ! subreg_lowpart_p (lhs) - || (GET_MODE_CLASS (GET_MODE (lhs)) - != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs - || paradoxical_subreg_p (lhs) - || VECTOR_MODE_P (GET_MODE (lhs)) - || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD - /* Result might need to be truncated. Don't change mode if -explicit truncation is needed. */ - || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x), -GET_MODE (SUBREG_REG (lhs - return x; - - tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)), -SUBREG_REG (lhs), SUBREG_REG (rhs)); - return gen_lowpart (GET_MODE (x), tem); +/* This used to handle SUBREG, but this turned out to be counter- + productive, since (subreg (op ...)) usually is not handled by + insn patterns, and this "optimization" therefore transformed + recognizable patterns into unrecognizable ones. Therefore the + SUBREG case was removed from here. + + It is possible that distributing SUBREG over arithmetic operations + leads to an intermediate result than can then be optimized further, + e.g. by moving the outer SUBREG to the other side of a SET as done + in simplify_set. This seems to have been the original intent of + handling SUBREGs here. + + However, with current GCC this does not appear to actually happen, + at least on major platforms. If some case is found where removing + the SUBREG case here prevents follow-on optimizations, distributing + SUBREGs ought to be re-added at that place, e.g. in simplify_set. */ default: return x; -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: RFC: allowing fwprop to propagate subregs
> I tried to implement that suggestion, but interestingly enough I cannot > really test it since I was unable to find any single case where that > SUBREG case in apply_distributive_law actually causes any difference > whatsoever in generated code. > > Do you have any further suggestion of how to find a testcase (some > particular source code and/or architecture)? No. It may well be irrelevant now and was only relevant in cases where we just had SImode and not the narrower modes on some of the older architectures such as a29k. It may also be that the optimizations that this caught are now done at tree level, though I think the former is the more likely. > Given the current set of results, since I do not have any way to verify > whether my simplify_set changes would actually trigger correctly, I'd > rather propose to just remove the SUBREG case in apply_distributive_law > (i.e. only apply the first patch below). > > Thoughts? I think that's reasonable. But I'd replace it with a comment saying what used to be there and why it was removed.
Re: RFC: allowing fwprop to propagate subregs
Richard Kenner wrote: > > Maybe the best solution would be to remove the SUBREG case from the generic > > apply_distributive_law subroutine, and instead add a special check for the > > distributed subreg case right at the above place in simplify_set; i.e. to > > perform the inverse distribution only if it is already guaranteed that we > > will also be able to move the subreg to the LHS ... > > That could indeed work. I tried to implement that suggestion, but interestingly enough I cannot really test it since I was unable to find any single case where that SUBREG case in apply_distributive_law actually causes any difference whatsoever in generated code. As test case I used the whole of libstdc++.so on the following set of platforms: - i686-pc-linux - s390x-ibm-linux - powerpc-ibm-linux - arm-linux-gnueabi and built the compiler and libstdc++.so for each of: - current mainline - current mainline plus the first patch below - current mainline plus both patches below All three resulting object files were identical for every platform. Do you have any further suggestion of how to find a testcase (some particular source code and/or architecture)? Given the current set of results, since I do not have any way to verify whether my simplify_set changes would actually trigger correctly, I'd rather propose to just remove the SUBREG case in apply_distributive_law (i.e. only apply the first patch below). Thoughts? Thanks, Ulrich Patch A: Remove SUBREG case in apply_distributive_law Index: gcc/combine.c === --- gcc/combine.c (revision 183240) +++ gcc/combine.c (working copy) @@ -9238,37 +9269,6 @@ /* This is also a multiply, so it distributes over everything. */ break; -case SUBREG: - /* Non-paradoxical SUBREGs distributes over all operations, -provided the inner modes and byte offsets are the same, this -is an extraction of a low-order part, we don't convert an fp -operation to int or vice versa, this is not a vector mode, -and we would not be converting a single-word operation into a -multi-word operation. The latter test is not required, but -it prevents generating unneeded multi-word operations. Some -of the previous tests are redundant given the latter test, -but are retained because they are required for correctness. - -We produce the result slightly differently in this case. */ - - if (GET_MODE (SUBREG_REG (lhs)) != GET_MODE (SUBREG_REG (rhs)) - || SUBREG_BYTE (lhs) != SUBREG_BYTE (rhs) - || ! subreg_lowpart_p (lhs) - || (GET_MODE_CLASS (GET_MODE (lhs)) - != GET_MODE_CLASS (GET_MODE (SUBREG_REG (lhs - || paradoxical_subreg_p (lhs) - || VECTOR_MODE_P (GET_MODE (lhs)) - || GET_MODE_SIZE (GET_MODE (SUBREG_REG (lhs))) > UNITS_PER_WORD - /* Result might need to be truncated. Don't change mode if -explicit truncation is needed. */ - || !TRULY_NOOP_TRUNCATION_MODES_P (GET_MODE (x), -GET_MODE (SUBREG_REG (lhs - return x; - - tem = simplify_gen_binary (code, GET_MODE (SUBREG_REG (lhs)), -SUBREG_REG (lhs), SUBREG_REG (rhs)); - return gen_lowpart (GET_MODE (x), tem); - default: return x; } Patch B: Re-implement SUBREG case specifically in simplify_set Index: gcc/combine.c === --- gcc/combine.c (revision 183240) +++ gcc/combine.c (working copy) @@ -6299,6 +6299,7 @@ rtx dest = SET_DEST (x); enum machine_mode mode = GET_MODE (src) != VOIDmode ? GET_MODE (src) : GET_MODE (dest); + rtx src_subreg; rtx other_insn; rtx *cc_use; @@ -6496,6 +6497,10 @@ and X being a REG or (subreg (reg)), we may be able to convert this to (set (subreg:m2 x) (op)). + Similarly, if we have (set x (op:m1 (subreg:m2 ...) (subreg:m2 ...))), + we may be able to first distribute the subreg over op, and then apply + the above transformation. + We can always do this if M1 is narrower than M2 because that means that we only care about the low bits of the result. @@ -6504,30 +6509,56 @@ be undefined. On machine where it is defined, this transformation is safe as long as M1 and M2 have the same number of words. */ + src_subreg = NULL_RTX; if (GET_CODE (src) == SUBREG && subreg_lowpart_p (src) - && !OBJECT_P (SUBREG_REG (src)) + && !OBJECT_P (SUBREG_REG (src))) +src_subreg = SUBREG_REG (src); + else if (GET_CODE (src) == IOR || GET_CODE (src) == XOR + || GET_CODE (src) == AND + || GET_CODE (src) == PLUS || GET_CODE (src) == MINUS) +{ + rtx lhs = XEXP (x, 0); + rtx rhs = XEXP (x, 1); + + /* We can distribute non-paradoxical lowpart SUBREGs if the +
Re: RFC: allowing fwprop to propagate subregs
> Maybe the best solution would be to remove the SUBREG case from the generic > apply_distributive_law subroutine, and instead add a special check for the > distributed subreg case right at the above place in simplify_set; i.e. to > perform the inverse distribution only if it is already guaranteed that we > will also be able to move the subreg to the LHS ... That could indeed work.
Re: RFC: allowing fwprop to propagate subregs
Richard Kenner wrote: > > > The problem appears to be that an RTX expression like > > > > > > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) > > > (subreg:QI (reg:SI 66 [ b ]) 0)) > > > > > > seems to be considered non-canonical by combine, and is instead > > > transformed into > > > > > > (subreg:QI (minus:SI (reg:SI 64 [ a ]) > > > (reg:SI 66 [ b ])) 0) > > > On the one hand, this seems odd, as SUBREGs with a non-trivial > > > expression inside will usually not match any insn pattern. On > > > the other hand, I guess this might still be useful behaviour > > > for combine on platforms that support only word arithmetic, > > > when the SUBREG might be considered a "split" point. > > No, I think the idea was that the outer SUBREG would be moved to the LHS > of the assignment to make a "paradoxical SUBREG". Except, as you point > out, there doesn't seem to be an advantage of doing this for arithmetic > unless you only have it in SImode (which it could, of course, check). Right, I was mistaken about the "split" point case; this is done only for SUBREG (MEM). But you're correct that there is a special optimization in simplify_set that will move the outer SUBREG to the LHS: /* If we have (set x (subreg:m1 (op:m2 ...) 0)) with OP being some operation, and X being a REG or (subreg (reg)), we may be able to convert this to (set (subreg:m2 x) (op)). However, it would appear that this particular location is in fact the only place where (subreg (op ...)) is handled; I don't see any other place where this type of pattern would provide any benefit. Maybe the best solution would be to remove the SUBREG case from the generic apply_distributive_law subroutine, and instead add a special check for the distributed subreg case right at the above place in simplify_set; i.e. to perform the inverse distribution only if it is already guaranteed that we will also be able to move the subreg to the LHS ... Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: RFC: allowing fwprop to propagate subregs
> [Richard, combine question at the bottom for you. I've quoted Ulrich's > whole message in order to provide a bit of context.] I don't remember ALL of this, but can perhaps make a few hints. > > The problem appears to be that an RTX expression like > > > > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) > > (subreg:QI (reg:SI 66 [ b ]) 0)) > > > > seems to be considered non-canonical by combine, and is instead > > transformed into > > > > (subreg:QI (minus:SI (reg:SI 64 [ a ]) > > (reg:SI 66 [ b ])) 0) > > On the one hand, this seems odd, as SUBREGs with a non-trivial > > expression inside will usually not match any insn pattern. On > > the other hand, I guess this might still be useful behaviour > > for combine on platforms that support only word arithmetic, > > when the SUBREG might be considered a "split" point. No, I think the idea was that the outer SUBREG would be moved to the LHS of the assignment to make a "paradoxical SUBREG". Except, as you point out, there doesn't seem to be an advantage of doing this for arithmetic unless you only have it in SImode (which it could, of course, check). > (Of course, this doesn't work for the compute-result-and-cc type patterns.) Right. > No immediate suggestions, sorry. It looks like the combine case was > added by this pre-egcs patch: > > Wed Mar 18 05:54:25 1998 Richard Kenner > > * combine.c (gen_binary): Don't make AND that does nothing. > (simplify_comparison, case AND): Commute AND and SUBREG. > > so maybe Richard remembers (or has a record of) what this was designed > to handle? I suspect the two halves related to each other: we commute AND and SUBREG in the hope that this will allow us to decide that the AND does nothing and isn't needed. But I don't see how this is related: this is the AND case in a comparison and we have a MINUS.
Re: RFC: allowing fwprop to propagate subregs
[Richard, combine question at the bottom for you. I've quoted Ulrich's whole message in order to provide a bit of context.] "Ulrich Weigand" writes: > Richard Sandiford wrote: >> At the moment, fwprop will propagate constants and registers >> even if no further rtl simplifications are possible: >> >> if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) >> flags |= PR_CAN_APPEAR; >> >> What do you think about extending this to subregs? > > I've been testing your patch (in its latest version including > the SUBREG test pointed out by H.J.), and ran into issues where > this additional propagation suppresses other optimizations. > > In particular, I'm seeing this testsuite regression on x86_64: > FAIL: gcc.target/i386/pr30315.c scan-assembler-times cmp 4 > > The problem is that the combined "compute result and set > condition code" insn patterns sometimes no longer match. > > The source code in question looks like: > > int c; > > unsigned char > de (unsigned char a, unsigned char b) > { > unsigned char difference = a - b; > if (difference > a) > c--; > return difference; > } > > > Before your patch, we generate insns like: > > (insn 3 4 5 2 (set (reg/v:QI 63 [ a ]) > (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal} > (expr_list:REG_DEAD (reg:SI 64 [ a ]) > (nil))) > > (insn 5 3 6 2 (set (reg/v:QI 65 [ b ]) > (subreg:QI (reg:SI 66 [ b ]) 0)) pr30315.i:5 66 {*movqi_internal} > (expr_list:REG_DEAD (reg:SI 66 [ b ]) > (nil))) > > (insn 9 6 10 2 (parallel [ > (set (reg/v:QI 59 [ difference ]) > (minus:QI (reg/v:QI 63 [ a ]) > (reg/v:QI 65 [ b ]))) > (clobber (reg:CC 17 flags)) > ]) pr30315.i:6 287 {*subqi_1} > (expr_list:REG_DEAD (reg/v:QI 65 [ b ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil > > (insn 10 9 11 2 (set (reg:CC 17 flags) > (compare:CC (reg/v:QI 63 [ a ]) > (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1} > (expr_list:REG_DEAD (reg/v:QI 63 [ a ]) > (nil))) > > which get combined into: > > (insn 4 2 3 2 (set (reg:SI 66 [ b ]) > (reg:SI 4 si [ b ])) pr30315.i:5 64 {*movsi_internal} > (expr_list:REG_DEAD (reg:SI 4 si [ b ]) > (nil))) > > (insn 3 4 5 2 (set (reg/v:QI 63 [ a ]) > (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal} > (expr_list:REG_DEAD (reg:SI 64 [ a ]) > (nil))) > > (insn 10 9 11 2 (parallel [ > (set (reg:CCC 17 flags) > (compare:CCC (minus:QI (reg/v:QI 63 [ a ]) > (subreg:QI (reg:SI 66 [ b ]) 0)) > (reg/v:QI 63 [ a ]))) > (set (reg/v:QI 59 [ difference ]) > (minus:QI (reg/v:QI 63 [ a ]) > (subreg:QI (reg:SI 66 [ b ]) 0))) > ]) pr30315.i:7 322 {*subqi3_cc_overflow} > (expr_list:REG_DEAD (reg:SI 66 [ b ]) > (expr_list:REG_DEAD (reg/v:QI 63 [ a ]) > (nil > > > Now, with your patch we have instead before combine: > > (insn 9 6 10 2 (parallel [ > (set (reg/v:QI 59 [ difference ]) > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) > (subreg:QI (reg:SI 66 [ b ]) 0))) > (clobber (reg:CC 17 flags)) > ]) pr30315.i:6 287 {*subqi_1} > (expr_list:REG_DEAD (reg:SI 66 [ b ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil > > (insn 10 9 11 2 (set (reg:CC 17 flags) > (compare:CC (subreg:QI (reg:SI 64 [ a ]) 0) > (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1} > (expr_list:REG_DEAD (reg:SI 64 [ a ]) > (nil))) > > which combine is unfortunately unable to process: > > Trying 9 -> 10: > Failed to match this instruction: > (parallel [ > (set (reg:CC 17 flags) > (compare:CC (subreg:QI (minus:SI (reg:SI 64 [ a ]) > (reg:SI 66 [ b ])) 0) > (subreg:QI (reg:SI 64 [ a ]) 0))) > (set (reg/v:QI 59 [ difference ]) > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) > (subreg:QI (reg:SI 66 [ b ]) 0))) > ]) > > > The problem appears to be that an RTX expression like > > (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) > (subreg:QI (reg:SI 66 [ b ]) 0)) > > seems to be considered non-canonical by combine, and is instead > transformed into > > (subreg:QI (minus:SI (reg:SI 64 [ a ]) > (reg:SI 66 [ b ])) 0) > > This happens via combine.c:apply_distributive_law. > > > On the one hand, this seems odd, as SUBREGs with a non-trivial > expression inside will usually not match any insn pattern. On > the other hand, I guess this might still be useful behaviour > for combine on platforms that support only word arithmetic, > when the SUBREG might be considered a "split" point. (Of course, > this doesn't work for the compute-result-and-cc type patte
Re: RFC: allowing fwprop to propagate subregs
On 01/11/2012 05:55 PM, Ulrich Weigand wrote: In either case, it is always odd to have RTX in the insn stream that isn't "stable" under common simplication ... Do you have any suggestions on how to fix this? If we add the fwprop patch, should we then disable apply_distributive_law for SUBREGs? This strange canonicalization is also the cause of PR39726, where I also wanted to move subregs inside arithmetic operations. See the attached patch (which is not enough to fix the PR; this is just the combine part). Unfortunately, doing this also caused x86_64 regressions, I think in zero_extract patterns. Paolo change canonicalization Index: gcc/Makefile.in === --- gcc/Makefile.in (branch pr39726) +++ gcc/Makefile.in (working copy) @@ -2844,7 +2844,7 @@ jump.o : jump.c $(CONFIG_H) $(SYSTEM_H) simplify-rtx.o : simplify-rtx.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \ $(RTL_H) $(REGS_H) hard-reg-set.h $(FLAGS_H) $(REAL_H) insn-config.h \ $(RECOG_H) $(EXPR_H) $(TOPLEV_H) output.h $(FUNCTION_H) $(GGC_H) $(TM_P_H) \ - $(TREE_H) $(TARGET_H) + $(TREE_H) $(TARGET_H) $(OPTABS_H) cgraph.o : cgraph.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(TREE_H) \ langhooks.h $(TOPLEV_H) $(FLAGS_H) $(GGC_H) $(TARGET_H) $(CGRAPH_H) \ gt-cgraph.h output.h intl.h $(BASIC_BLOCK_H) debug.h $(HASHTAB_H) \ Index: gcc/simplify-rtx.c === --- gcc/simplify-rtx.c (branch pr39726) +++ gcc/simplify-rtx.c (working copy) @@ -39,6 +39,7 @@ along with GCC; see the file COPYING3. #include "output.h" #include "ggc.h" #include "target.h" +#include "optabs.h" /* Simplification and canonicalization of RTL. */ @@ -5191,6 +5192,8 @@ rtx simplify_subreg (enum machine_mode outermode, rtx op, enum machine_mode innermode, unsigned int byte) { + int is_lowpart; + /* Little bit of sanity checking. */ gcc_assert (innermode != VOIDmode); gcc_assert (outermode != VOIDmode); @@ -5300,11 +5303,13 @@ simplify_subreg (enum machine_mode outer return NULL_RTX; } + is_lowpart = subreg_lowpart_offset (outermode, innermode) == byte; + /* Merge implicit and explicit truncations. */ if (GET_CODE (op) == TRUNCATE && GET_MODE_SIZE (outermode) < GET_MODE_SIZE (innermode) - && subreg_lowpart_offset (outermode, innermode) == byte) + && is_lowpart) return simplify_gen_unary (TRUNCATE, outermode, XEXP (op, 0), GET_MODE (XEXP (op, 0))); @@ -5343,7 +5348,7 @@ simplify_subreg (enum machine_mode outer The information is used only by alias analysis that can not grog partial register anyway. */ - if (subreg_lowpart_offset (outermode, innermode) == byte) + if (is_lowpart) ORIGINAL_REGNO (x) = ORIGINAL_REGNO (op); return x; } @@ -5393,6 +5398,51 @@ simplify_subreg (enum machine_mode outer return NULL_RTX; } + /* Try to move a subreg inside an arithmetic operation. */ + if (is_lowpart && ARITHMETIC_P (op) + && GET_MODE_CLASS (outermode) == MODE_INT + && GET_MODE_CLASS (innermode) == MODE_INT) +{ + enum insn_code ic; + enum machine_mode cnt_mode; + switch (GET_CODE (op)) + { + case ABS: + case NEG: + return simplify_gen_unary (GET_CODE (op), outermode, + rtl_hooks.gen_lowpart_no_emit + (outermode, XEXP (op, 0)), + outermode); + + case ASHIFT: + /* i386 always uses QImode for the shift count. Get the + appropriate mode from the optab. */ + ic = ashl_optab->handlers[outermode].insn_code; + if (ic != CODE_FOR_nothing) + cnt_mode = insn_data[ic].operand[2].mode; + else + cnt_mode = outermode; + return simplify_gen_binary (GET_CODE (op), outermode, + rtl_hooks.gen_lowpart_no_emit + (outermode, XEXP (op, 0)), + rtl_hooks.gen_lowpart_no_emit + (cnt_mode, XEXP (op, 1))); + + case PLUS: + case MINUS: + case AND: + case IOR: + case XOR: + return simplify_gen_binary (GET_CODE (op), outermode, + rtl_hooks.gen_lowpart_no_emit + (outermode, XEXP (op, 0)), + rtl_hooks.gen_lowpart_no_emit + (outermode, XEXP (op, 1))); + default: + break; + } +} + /* Optimize SUBREG truncations of zero and sign extended values. */ if ((GET_CODE (op) == ZERO_EXTEND || GET_CODE (op) == SIGN_EXTEND) Index: gcc/combine.c === --- gcc/combine.c (branch pr39726) +++ gcc/combine.c (working copy) @@ -11155,47 +11155,6 @@ simplify_comparison (enum rtx_code code, continue; } - /* If this is (and:M1 (subreg:M2 X 0) (const_int C1)) where C1 - fits in both M1 and M2 and the SUBREG is either paradoxical - or represents the low part, permute the SUBREG and the AND - and try again. */ - if (GET_CODE (XEXP (op0, 0)) == SUBREG) - { - unsigned HOST_WIDE_INT c1; -
Re: RFC: allowing fwprop to propagate subregs
Richard Sandiford wrote: > At the moment, fwprop will propagate constants and registers > even if no further rtl simplifications are possible: > > if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) > flags |= PR_CAN_APPEAR; > > What do you think about extending this to subregs? I've been testing your patch (in its latest version including the SUBREG test pointed out by H.J.), and ran into issues where this additional propagation suppresses other optimizations. In particular, I'm seeing this testsuite regression on x86_64: FAIL: gcc.target/i386/pr30315.c scan-assembler-times cmp 4 The problem is that the combined "compute result and set condition code" insn patterns sometimes no longer match. The source code in question looks like: int c; unsigned char de (unsigned char a, unsigned char b) { unsigned char difference = a - b; if (difference > a) c--; return difference; } Before your patch, we generate insns like: (insn 3 4 5 2 (set (reg/v:QI 63 [ a ]) (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal} (expr_list:REG_DEAD (reg:SI 64 [ a ]) (nil))) (insn 5 3 6 2 (set (reg/v:QI 65 [ b ]) (subreg:QI (reg:SI 66 [ b ]) 0)) pr30315.i:5 66 {*movqi_internal} (expr_list:REG_DEAD (reg:SI 66 [ b ]) (nil))) (insn 9 6 10 2 (parallel [ (set (reg/v:QI 59 [ difference ]) (minus:QI (reg/v:QI 63 [ a ]) (reg/v:QI 65 [ b ]))) (clobber (reg:CC 17 flags)) ]) pr30315.i:6 287 {*subqi_1} (expr_list:REG_DEAD (reg/v:QI 65 [ b ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 10 9 11 2 (set (reg:CC 17 flags) (compare:CC (reg/v:QI 63 [ a ]) (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1} (expr_list:REG_DEAD (reg/v:QI 63 [ a ]) (nil))) which get combined into: (insn 4 2 3 2 (set (reg:SI 66 [ b ]) (reg:SI 4 si [ b ])) pr30315.i:5 64 {*movsi_internal} (expr_list:REG_DEAD (reg:SI 4 si [ b ]) (nil))) (insn 3 4 5 2 (set (reg/v:QI 63 [ a ]) (subreg:QI (reg:SI 64 [ a ]) 0)) pr30315.i:5 66 {*movqi_internal} (expr_list:REG_DEAD (reg:SI 64 [ a ]) (nil))) (insn 10 9 11 2 (parallel [ (set (reg:CCC 17 flags) (compare:CCC (minus:QI (reg/v:QI 63 [ a ]) (subreg:QI (reg:SI 66 [ b ]) 0)) (reg/v:QI 63 [ a ]))) (set (reg/v:QI 59 [ difference ]) (minus:QI (reg/v:QI 63 [ a ]) (subreg:QI (reg:SI 66 [ b ]) 0))) ]) pr30315.i:7 322 {*subqi3_cc_overflow} (expr_list:REG_DEAD (reg:SI 66 [ b ]) (expr_list:REG_DEAD (reg/v:QI 63 [ a ]) (nil Now, with your patch we have instead before combine: (insn 9 6 10 2 (parallel [ (set (reg/v:QI 59 [ difference ]) (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) (subreg:QI (reg:SI 66 [ b ]) 0))) (clobber (reg:CC 17 flags)) ]) pr30315.i:6 287 {*subqi_1} (expr_list:REG_DEAD (reg:SI 66 [ b ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 10 9 11 2 (set (reg:CC 17 flags) (compare:CC (subreg:QI (reg:SI 64 [ a ]) 0) (reg/v:QI 59 [ difference ]))) pr30315.i:7 4 {*cmpqi_1} (expr_list:REG_DEAD (reg:SI 64 [ a ]) (nil))) which combine is unfortunately unable to process: Trying 9 -> 10: Failed to match this instruction: (parallel [ (set (reg:CC 17 flags) (compare:CC (subreg:QI (minus:SI (reg:SI 64 [ a ]) (reg:SI 66 [ b ])) 0) (subreg:QI (reg:SI 64 [ a ]) 0))) (set (reg/v:QI 59 [ difference ]) (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) (subreg:QI (reg:SI 66 [ b ]) 0))) ]) The problem appears to be that an RTX expression like (minus:QI (subreg:QI (reg:SI 64 [ a ]) 0) (subreg:QI (reg:SI 66 [ b ]) 0)) seems to be considered non-canonical by combine, and is instead transformed into (subreg:QI (minus:SI (reg:SI 64 [ a ]) (reg:SI 66 [ b ])) 0) This happens via combine.c:apply_distributive_law. On the one hand, this seems odd, as SUBREGs with a non-trivial expression inside will usually not match any insn pattern. On the other hand, I guess this might still be useful behaviour for combine on platforms that support only word arithmetic, when the SUBREG might be considered a "split" point. (Of course, this doesn't work for the compute-result-and-cc type patterns.) In either case, it is always odd to have RTX in the insn stream that isn't "stable" under common simplication ... Do you have any suggestions on how to fix this? If we add the fwprop patch, should we then disable apply_distributive_law for SUBREGs? Thanks, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: RFC: allowing fwprop to propagate subregs
On 09/14/2011 05:44 PM, Richard Sandiford wrote: > A SUBREG may not be REG nor CONSTANT. Don't you need > to check REG_P/CONSTANT_P on SUBREG? Yeah, good point. There should be a "&& REG_P (SUBREG_REG (new_rtx))" in there. Probably also worth checking for non-paradoxical subregs. I was going to suggest the former. Paradoxical subregs are already handled elsewhere in fwprop, but it also shouldn't hurt to include them. Paolo
Re: RFC: allowing fwprop to propagate subregs
"H.J. Lu" writes: > On Wed, Sep 14, 2011 at 8:24 AM, Richard Sandiford > wrote: >> At the moment, fwprop will propagate constants and registers >> even if no further rtl simplifications are possible: >> >> if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) >> flags |= PR_CAN_APPEAR; >> >> What do you think about extending this to subregs? The reason for >> asking is that on NEON, vector loads like vld4 are represented as a load >> of a single monolithic register followed by subreg extractions of each >> vector: >> >> (set (reg:OI FULL) (...)) >> (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0)) >> (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16)) >> (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32)) >> (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48)) >> >> Nothing ever propagates these subregs, so the separate moves >> survive until IRA. This has three problems: >> >> - We generally want the registers allocated to V0...V3 to be the same >> as FULL, so that the four subreg moves become nops. And this often >> happens in simple examples. But if register pressure is relatively >> high, these moves can sometimes cause IRA to spill in cases where >> it doesn't if the subregs are used instead of each Vi. >> >> - Perhaps related, register pressure becomes harder to estimate. >> >> - These moves can interfere with pre-reload scheduling. >> >> In combination with the MODES_TIEABLE_P patch that I posted here: >> >> http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html >> >> this patch significantly improves the code generated for several libav >> loops. Unfortunately, I don't have a setup that can do meaningful >> x86_64 performance measurements, but a diff of the before and after >> output for libav showed many cases where the patch removed moves. >> >> What do you think? Alternatives include propagating in lower-subreg, >> or maybe only in the second fwprop pass. >> >> Richard >> >> >> gcc/ >> * fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs. >> >> Index: gcc/fwprop.c >> === >> --- gcc/fwprop.c 2011-08-26 09:58:28.829540497 +0100 >> +++ gcc/fwprop.c 2011-08-26 10:14:03.767707504 +0100 >> @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode >> return NULL_RTX; >> >> flags = 0; >> - if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) >> + if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == >> SUBREG) >> flags |= PR_CAN_APPEAR; >> if (!for_each_rtx (&new_rtx, varying_mem_p, NULL)) >> flags |= PR_HANDLE_MEM; >> > > A SUBREG may not be REG nor CONSTANT. Don't you need > to check REG_P/CONSTANT_P on SUBREG? Yeah, good point. There should be a "&& REG_P (SUBREG_REG (new_rtx))" in there. Probably also worth checking for non-paradoxical subregs. Richard
Re: RFC: allowing fwprop to propagate subregs
On Wed, Sep 14, 2011 at 8:24 AM, Richard Sandiford wrote: > At the moment, fwprop will propagate constants and registers > even if no further rtl simplifications are possible: > > if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) > flags |= PR_CAN_APPEAR; > > What do you think about extending this to subregs? The reason for > asking is that on NEON, vector loads like vld4 are represented as a load > of a single monolithic register followed by subreg extractions of each > vector: > > (set (reg:OI FULL) (...)) > (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0)) > (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16)) > (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32)) > (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48)) > > Nothing ever propagates these subregs, so the separate moves > survive until IRA. This has three problems: > > - We generally want the registers allocated to V0...V3 to be the same > as FULL, so that the four subreg moves become nops. And this often > happens in simple examples. But if register pressure is relatively > high, these moves can sometimes cause IRA to spill in cases where > it doesn't if the subregs are used instead of each Vi. > > - Perhaps related, register pressure becomes harder to estimate. > > - These moves can interfere with pre-reload scheduling. > > In combination with the MODES_TIEABLE_P patch that I posted here: > > http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html > > this patch significantly improves the code generated for several libav > loops. Unfortunately, I don't have a setup that can do meaningful > x86_64 performance measurements, but a diff of the before and after > output for libav showed many cases where the patch removed moves. > > What do you think? Alternatives include propagating in lower-subreg, > or maybe only in the second fwprop pass. > > Richard > > > gcc/ > * fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs. > > Index: gcc/fwprop.c > === > --- gcc/fwprop.c 2011-08-26 09:58:28.829540497 +0100 > +++ gcc/fwprop.c 2011-08-26 10:14:03.767707504 +0100 > @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode > return NULL_RTX; > > flags = 0; > - if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) > + if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == > SUBREG) > flags |= PR_CAN_APPEAR; > if (!for_each_rtx (&new_rtx, varying_mem_p, NULL)) > flags |= PR_HANDLE_MEM; > A SUBREG may not be REG nor CONSTANT. Don't you need to check REG_P/CONSTANT_P on SUBREG? -- H.J.
RFC: allowing fwprop to propagate subregs
At the moment, fwprop will propagate constants and registers even if no further rtl simplifications are possible: if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) flags |= PR_CAN_APPEAR; What do you think about extending this to subregs? The reason for asking is that on NEON, vector loads like vld4 are represented as a load of a single monolithic register followed by subreg extractions of each vector: (set (reg:OI FULL) (...)) (set (reg:V2SI V0) (subreg:V2SI (reg:OI FULL) 0)) (set (reg:V2SI V1) (subreg:V2SI (reg:OI FULL) 16)) (set (reg:V2SI V2) (subreg:V2SI (reg:OI FULL) 32)) (set (reg:V2SI V3) (subreg:V2SI (reg:OI FULL) 48)) Nothing ever propagates these subregs, so the separate moves survive until IRA. This has three problems: - We generally want the registers allocated to V0...V3 to be the same as FULL, so that the four subreg moves become nops. And this often happens in simple examples. But if register pressure is relatively high, these moves can sometimes cause IRA to spill in cases where it doesn't if the subregs are used instead of each Vi. - Perhaps related, register pressure becomes harder to estimate. - These moves can interfere with pre-reload scheduling. In combination with the MODES_TIEABLE_P patch that I posted here: http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00626.html this patch significantly improves the code generated for several libav loops. Unfortunately, I don't have a setup that can do meaningful x86_64 performance measurements, but a diff of the before and after output for libav showed many cases where the patch removed moves. What do you think? Alternatives include propagating in lower-subreg, or maybe only in the second fwprop pass. Richard gcc/ * fwprop.c (propagate_rtx): Also set PR_CAN_APPEAR for subregs. Index: gcc/fwprop.c === --- gcc/fwprop.c2011-08-26 09:58:28.829540497 +0100 +++ gcc/fwprop.c2011-08-26 10:14:03.767707504 +0100 @@ -664,7 +664,7 @@ propagate_rtx (rtx x, enum machine_mode return NULL_RTX; flags = 0; - if (REG_P (new_rtx) || CONSTANT_P (new_rtx)) + if (REG_P (new_rtx) || CONSTANT_P (new_rtx) || GET_CODE (new_rtx) == SUBREG) flags |= PR_CAN_APPEAR; if (!for_each_rtx (&new_rtx, varying_mem_p, NULL)) flags |= PR_HANDLE_MEM;