On 11/29/16 16:06, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>
> > -  "TARGET_32BIT && reload_completed
> > +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
> >     && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>
> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
> already excluding NEON.
Aehm, no.  This would split the addi_neon insn before it is clear whether
the reload pass will assign a VFP register.  With this change the stack
usage with -mfpu=neon increases from 2300 to around 2600 bytes.

> This patch expands ADD and SUB earlier, so shouldn't we do the same obvious
> change for the similar instructions CMP and NEG?

Good question.  I think the cmp and neg patterns are more complicated and
typically have a more complex data flow than the other patterns.

I tried to create a test case which expands the cmpdi and negdi patterns
as follows:

--- pr77308-1.c	2016-11-25 17:53:20.379141465 +0100
+++ pr77308-2.c	2016-11-29 20:46:51.266948631 +0100
@@ -68,10 +68,10 @@
 #define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
 #define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
 #define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
-#define Sigma0(x) ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
-#define Sigma1(x) ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
-#define sigma0(x) ~(ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7))
-#define sigma1(x) ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39) == (x) ? -(x) : (x))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ROTR((x),41) < (x) ? -(x) : (x))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ ((x)>>7) <= (x) ? ~(x) : (x))
+#define sigma1(x) ((long long)(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6)) < (long long)(x) ? -(x) : (x))
 #define Ch(x,y,z)	(((x) & (y)) ^ ((~(x)) & (z)))
 #define Maj(x,y,z)	(((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))

This expands *arm_negdi2, *arm_cmpdi_unsigned and *arm_cmpdi_insn.
The stack usage is around 1900 bytes with the previous patch, and 2300
bytes without.

I tried to split *arm_negdi2 and *arm_cmpdi_unsigned early too, and that
does indeed give smaller stack sizes in the test case above (~400 bytes less).
But when I make *arm_cmpdi_insn split early, it ICEs:

--- arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ arm.md	2016-11-29 21:51:51.438163078 +0100
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)

On top of the latest patch, with

gcc -S -Os pr77308-2.c -fdump-rtl-all-verbose

I got:

pr77308-2.c: In function 'sha512_block_data_order':
pr77308-2.c:169:1: error: unrecognizable insn:
 }
 ^
(insn 4870 4869 1636 87 (set (scratch:SI)
        (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
                (subreg:SI (reg:DI 473 [ X$14 ]) 4))
            (ltu:SI (reg:CC_C 100 cc) (const_int 0 [0])))) "pr77308-2.c":140 -1
     (nil))
pr77308-2.c:169:1: internal compiler error: in extract_insn, at recog.c:2311
0xaf4cd8 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc-trunk/gcc/rtl-error.c:108
0xaf4d09 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc-trunk/gcc/rtl-error.c:116
0xac74ef extract_insn(rtx_insn*)
	../../gcc-trunk/gcc/recog.c:2311
0x122427a decompose_multiword_subregs
	../../gcc-trunk/gcc/lower-subreg.c:1467
0x122550d execute
	../../gcc-trunk/gcc/lower-subreg.c:1734

So it is certainly possible to improve the stack size even further, but
it is not really simple.  I would prefer to do that in a separate patch.

BTW: there are also negdi2_compare, *negdi_extendsidi,
*negdi_zero_extendsidi and *thumb2_negdi2.  I think it would be a
precondition to have test cases that exercise each of these patterns
before we try to split those instructions as well.

Bernd.