https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ramana at gcc dot gnu.org --- Comment #21 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> --- (In reply to Jim Wilson from comment #18) > Ultimately, I believe that this is an ARM backend bug. PROMOTE_MODE and > TARGET_PROMOTE_FUNCTION_MODE should not behave differently. It would help > if the PROMOTE_MODE macro was merged with the TARGET_PROMOTE_FUNCTION_MODE > hook, to avoid accidents like this. > > I've tested my patch to modify PROMOTE_MODE so that it no longer sets > UNSIGNEDP for char and short. > > For the SPEC CPU2000 benchmarks, individual benchmarks are within 1% which > is within the noise range, and the full benchmark results have almost > identical performance. This is for armv7-a ? Thumb2 / ARM state ? > > I get 3 additional failures in the gcc testsuite, for > gcc.target/arm/wmul-[123].c. These are testcases to verify generation of > the smulbb and smlabb instruction. However, they only work currently > because of the extra sign-extends emitted by PROMOTE_MODE. We currently > emit an unsigned short load, a sign-extend, and a multiply. The sign-extend > gets merged into the multiply. But with the patch, we emit a signed short > load and a multiply, and hence can't form smulbb. Unpatched, for wmul-1.c > we get > ldrh r1, [r4, #2]! > ldrh r6, [r0, #2]! > smlabb r5, r1, r6, r5 > smlabb r2, r1, r1, r2 > and patched we get > ldrsh r1, [r4, #2]! > ldrsh r6, [r0, #2]! > mla r5, r1, r6, r5 > mla r2, r1, r1, r2 There are 2 considerations here. smlabb, smulbb on some of the older cores are more efficient than the equivalent mla instructions there can be a 1 cycle issue delta and a 1 cycle result latency overhead. Further more in Thumb2 - ldrsh has a smaller immediate range than ldrh and thus might cause more inefficient register usage compared to the original case. I think investigating performance on an older core across a range of benchmarks (not relying on just spec but looking at something like the popular standard embedded benchmark would also be interesting) as I don't think this exposes enough 16 bit multiplication operations would be interesting. > wmul-2.c is similar. There is a bigger difference with wmul-3.c. Unpatched > is > ldrh r1, [r5, #2]! > ldrh r4, [r0, #2]! > smulbb r4, r1, r4 > subs r6, r6, r4 > smulbb r1, r1, r1 > subs r2, r2, r1 > whereas patched is > ldrsh r1, [r4, #2]! > ldrsh r6, [r0, #2]! > mls r5, r1, r6, r5 > mls r2, r1, r1, r2 > The patched code is better or equivalent to the unpatched code in all cases, > but these testcases no longer serve their purpose. I can fix wmul-1.c by > changing types to int and casting to signed short. This doesn't work for > wmul-2.c because the scalar sign-extend is moved out of the loop, and no > longer available to merge with the multiple. This also doesn't work for > wmul-3.c, but only because it is cse'd differently. I get unpatched > smulbb r4, r1, r4 > subs r6, r6, r4 > and patched > sxth r4, r4 > mls r6, r1, r4, r6 For e.g. on the Cortex-A9, this is now turning a 4 cycle sequence into a 6 cycle sequence. On older cores, this is worse again as there is no sxth instruction, we've just introduced a couple of additional shifts and that would cost in terms of code size. I do think that before this is submitted back into mainline we need to check across a range of architecture levels the effects of this change. 1 It would be good to look at this for something like libavcodec or fft for some scalar arithmetic code especially to do with short integers. 2. Comparing code generation in terms of code size and / or examples to satisfy that the PROMOTE_MODE change is not sub-optimal would be good. > > At the moment the only option I have to make wmul-2.c and wmul-3.c work is > to remove them