On 11/29/16 16:06, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>
> > -  "TARGET_32BIT && reload_completed
> > +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
> >     && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>
> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
> already excluding NEON.
Aehm, no.  This would split the addi_neon insn before it is clear whether
the reload pass will assign a VFP register.  With this change the stack
usage with -mfpu=neon increases from 2300 to around 2600 bytes.

> This patch expands ADD and SUB earlier, so shouldn't we do the same obvious
> change for the similar instructions CMP and NEG?

Good question.  I think the cmp and neg patterns are more complicated and
typically have a more complex data flow than the other patterns.

I tried to create a test case which expands the cmpdi and negdi patterns
as follows:

--- pr77308-1.c	2016-11-25 17:53:20.379141465 +0100
+++ pr77308-2.c	2016-11-29 20:46:51.266948631 +0100
@@ -68,10 +68,10 @@
 #define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
 #define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
 #define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
-#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
-#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
-#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7))
-#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39) == (x) ? -(x) : (x))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ROTR((x),41) < (x) ? -(x) : (x))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7) <= (x) ? ~(x) : (x))
+#define sigma1(x) ((long long)(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) < (long long)(x) ? -(x) : (x))
 #define Ch(x,y,z)	(((x) & (y)) ^ ((~(x)) & (z)))
 #define Maj(x,y,z)	(((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))

This expands *arm_negdi2, *arm_cmpdi_unsigned and *arm_cmpdi_insn.
The stack usage is around 1900 bytes with the previous patch, and 2300
bytes without.

I tried to split *arm_negdi2 and *arm_cmpdi_unsigned early too, and that
does indeed give smaller stack sizes in the test case above (~400 bytes less).
But when I make *arm_cmpdi_insn split early, it ICEs:

--- arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ arm.md	2016-11-29 21:51:51.438163078 +0100
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)

On top of the latest patch, with

gcc -S -Os pr77308-2.c -fdump-rtl-all-verbose

I got:

pr77308-2.c: In function 'sha512_block_data_order':
pr77308-2.c:169:1: error: unrecognizable insn:
 }
 ^
(insn 4870 4869 1636 87 (set (scratch:SI)
        (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
                (subreg:SI (reg:DI 473 [ X$14 ]) 4))
            (ltu:SI (reg:CC_C 100 cc) (const_int 0 [0])))) "pr77308-2.c":140 -1
     (nil))
pr77308-2.c:169:1: internal compiler error: in extract_insn, at recog.c:2311
0xaf4cd8 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc-trunk/gcc/rtl-error.c:108
0xaf4d09 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc-trunk/gcc/rtl-error.c:116
0xac74ef extract_insn(rtx_insn*)
	../../gcc-trunk/gcc/recog.c:2311
0x122427a decompose_multiword_subregs
	../../gcc-trunk/gcc/lower-subreg.c:1467
0x122550d execute
	../../gcc-trunk/gcc/lower-subreg.c:1734

So it is certainly possible to improve the stack size even further, but
it is not really simple.  I would prefer to do that in a separate patch.

BTW: there are also negdi2_compare, *negdi_extendsidi,
*negdi_zero_extendsidi and *thumb2_negdi2.  I think it would be a
precondition to have test cases that exercise each of these patterns
before we try to split those instructions as well.

Bernd.