[Bug rtl-optimization/65932] [5 Regression] Linux-3.10.75 on arm926ej-s does not boot due to wrong code generation

ramana at gcc dot gnu.org Wed, 09 Sep 2015 02:06:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932


Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #21 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to Jim Wilson from comment #18)
> Ultimately, I believe that this is an ARM backend bug.  PROMOTE_MODE and
> TARGET_PROMOTE_FUNCTION_MODE should not behave differently.  It would help
> if the PROMOTE_MODE macro was merged with the TARGET_PROMOTE_FUNCTION_MODE
> hook, to avoid accidents like this.
> 
> I've tested my patch to modify PROMOTE_MODE so that it no longer sets
> UNSIGNEDP for char and short.
> 
> For the SPEC CPU2000 benchmarks, individual benchmarks are within 1% which
> is within the noise range, and the full benchmark results have almost
> identical performance.

This is for armv7-a ? Thumb2 / ARM state ? 

> 
> I get 3 additional failures in the gcc testsuite, for
> gcc.target/arm/wmul-[123].c.  These are testcases to verify generation of
> the smulbb and smlabb instruction.  However, they only work currently
> because of the extra sign-extends emitted by PROMOTE_MODE.  We currently
> emit an unsigned short load, a sign-extend, and a multiply.  The sign-extend
> gets merged into the multiply.  But with the patch, we emit a signed short
> load and a multiply, and hence can't form smulbb.  Unpatched, for wmul-1.c
> we get
>       ldrh    r1, [r4, #2]!
>       ldrh    r6, [r0, #2]!
>       smlabb  r5, r1, r6, r5
>       smlabb  r2, r1, r1, r2
> and patched we get
>       ldrsh   r1, [r4, #2]!
>       ldrsh   r6, [r0, #2]!
>       mla     r5, r1, r6, r5
>       mla     r2, r1, r1, r2

There are 2 considerations here. smlabb, smulbb on some of the older cores are
more efficient than the equivalent mla instructions there can be a 1 cycle
issue delta and a 1 cycle result latency overhead. Further more in Thumb2 -
ldrsh has a smaller immediate range than ldrh and thus might cause more
inefficient register usage compared to the original case.

I think investigating performance on an older core across a range of benchmarks
(not relying on just spec but looking at something like the popular standard
embedded benchmark would also be interesting) as I don't think this exposes
enough 16 bit multiplication operations would be interesting. 



> wmul-2.c is similar.  There is a bigger difference with wmul-3.c.  Unpatched
> is
>       ldrh    r1, [r5, #2]!
>       ldrh    r4, [r0, #2]!
>       smulbb  r4, r1, r4
>       subs    r6, r6, r4
>       smulbb  r1, r1, r1
>       subs    r2, r2, r1
> whereas patched is
>       ldrsh   r1, [r4, #2]!
>       ldrsh   r6, [r0, #2]!
>       mls     r5, r1, r6, r5
>       mls     r2, r1, r1, r2
> The patched code is better or equivalent to the unpatched code in all cases,
> but these testcases no longer serve their purpose.  I can fix wmul-1.c by
> changing types to int and casting to signed short.  This doesn't work for
> wmul-2.c because the scalar sign-extend is moved out of the loop, and no
> longer available to merge with the multiple.  This also doesn't work for
> wmul-3.c, but only because it is cse'd differently.  I get unpatched
>       smulbb  r4, r1, r4
>       subs    r6, r6, r4
> and patched
>       sxth    r4, r4
>       mls     r6, r1, r4, r6

For e.g. on the Cortex-A9, this is now turning a 4 cycle sequence into a 6
cycle sequence.

On older cores, this is worse again as there is no sxth instruction, we've just
introduced a couple of additional shifts and that would cost in terms of code
size.

I do think that before this is submitted back into mainline we need to check
across a range of architecture levels the effects of this change.

1 It would be good to look at this for something like libavcodec or fft for
some scalar arithmetic code especially to do with short integers.

2. Comparing code generation in terms of code size and / or examples to satisfy
that the PROMOTE_MODE change is not sub-optimal would be good.

> 
> At the moment the only option I have to make wmul-2.c and wmul-3.c work is
> to remove them

[Bug rtl-optimization/65932] [5 Regression] Linux-3.10.75 on arm926ej-s does not boot due to wrong code generation

Reply via email to