https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70089

            Bug ID: 70089
           Summary: ARM/THUMB unnecessarily typecasts some rvalues on
                    memory store
           Product: gcc
           Version: 5.2.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: aik at aol dot com.au
  Target Milestone: ---

Created attachment 37874
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37874&action=edit
Test cases demonstrating sign-extending behaviour

When a multiplication is part of an rvalue being stored to memory, gcc
truncates values narrower than the native register width (32 bits) to the type
the destination pointer dereferences to, then extends them back to 32 bits (via
LSL+LSR, LSL+ASR, or AND), even if the rvalue is used for nothing other than
the store. Since a store narrower than the native register width ignores the
upper bits of the source register, this is unnecessary and bloats hot code
paths.
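
As a minimal sketch of the claim above (not taken from the attachment; the
helper names store_product, store_product_masked and products_match are
hypothetical), the following C checks that masking a product to 16 bits before
a halfword store does not change what ends up in memory, which is why the
extra extension should be removable:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store the product directly. The narrowing
   conversion to uint16_t keeps only the low 16 bits, like STRH. */
static void store_product(uint16_t *dst, uint16_t a, uint16_t b)
{
    *dst = (uint16_t)((uint32_t)a * b);
}

/* Hypothetical helper: mask to 16 bits first, mirroring the redundant
   zero-extension (LSL+LSR or AND) described in this report. */
static void store_product_masked(uint16_t *dst, uint16_t a, uint16_t b)
{
    uint32_t wide = (uint32_t)a * b;
    wide &= 0xFFFFu;
    *dst = (uint16_t)wide;
}

/* Returns 1 if both variants store the same halfword for every
   sampled pair of inputs, 0 otherwise. */
static int products_match(void)
{
    for (uint32_t a = 0; a <= 0xFFFFu; a += 257) {
        for (uint32_t b = 0; b <= 0xFFFFu; b += 263) {
            uint16_t p, q;
            store_product(&p, (uint16_t)a, (uint16_t)b);
            store_product_masked(&q, (uint16_t)a, (uint16_t)b);
            if (p != q)
                return 0;
        }
    }
    return 1;
}
```

Comparing the output of gcc -S for the two helpers on an ARM target is one way
to see whether the extension is emitted for a given compiler version.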

Additionally, there is a strange behaviour where if one bitwise-ORs a variable
narrower than the native register width with a constant whose top bit is set
(so it would be negative if the type were signed), gcc will emit code that
sign-extends the constant (even in unsigned cases), making the code
inefficient.

For example:

  // typeof(m) = unsigned short *
  // typeof(x) = unsigned short
  *m++ = x|0x4000U;
  *m++ = x|0x8000U;

This is trivially translated to ARM assembly as:

  ; r0: m
  ; r1: x
  ORR  r2, r1, #0x4000 ; t1 = x|0x4000U
  ORR  r1, r1, #0x8000 ; t2 = x|0x8000U
  STRH r2, [r0], #2    ; *m++ = t1
  STRH r1, [r0], #2    ; *m++ = t2

However, gcc is generating the following instead:

  ; r0: m
  ; r1: x
  MVN  r3, r1, lsl #17 ; t2 = x|0xFFFF8000
  MVN  r3, r3, lsr #17
  ORR  r1, r1, #0x4000 ; t1 = x|0x4000U
  STRH r1, [r0, #0]    ; m[0] = t1
  STRH r3, [r0, #2]    ; m[1] = t2

In that instance, it's not too awful (just one extra instruction). However,
when these sign-extended values become impossible to generate in two
instructions, gcc will resort to using a literal pool to fetch the OR constant.
The C code:

  *m++ = x|0x4100U;
  *m++ = x|0x8100U;

The trivial interpretation:

  ORR  r2, r1, #0x4100 ; t1 = x|0x4100U
  ORR  r1, r1, #0x8100 ; t2 = x|0x8100U
  STRH r2, [r0], #2    ; *m++ = t1
  STRH r1, [r0], #2    ; *m++ = t2

The generated assembly (instructions sorted for readability):

  ORR  ip, r1, #0x4100 ; t1 = x|0x4100U
  LDR  r3, =0xFFFF8100 ; t2 = x|0xFFFF8100
  ORR  r3, r1, r3
  STRH ip, [r0, #4]    ; m[2] = t1
  STRH r3, [r0, #6]    ; m[3] = t2

Not only is this slower (due to the extra instruction and the memory access),
but it also takes up more memory (and the more constants you have that require
a memory load for sign-extension, the worse it gets).
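
To make the equivalence concrete, here is a minimal C sketch (the helper names
mvn_pair, strh and stores_match are hypothetical, and the emulation assumes x
occupies only the low 16 bits of the register, as an unsigned short argument
does). It checks that the MVN pair computes x|0xFFFF8000, and that the
sign-extended constants store the same halfword as the plain 16-bit ones:

```c
#include <assert.h>
#include <stdint.h>

/* Emulates: MVN r3, r1, lsl #17 ; MVN r3, r3, lsr #17 */
static uint32_t mvn_pair(uint32_t x)
{
    uint32_t r3 = ~(x << 17);
    return ~(r3 >> 17);
}

/* Emulates STRH: only the low 16 bits of the register reach memory. */
static uint16_t strh(uint32_t reg)
{
    return (uint16_t)reg;
}

/* Returns 1 if, for every 16-bit x, the sign-extended constants give
   the same stored halfword as the 16-bit constants; 0 otherwise. */
static int stores_match(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        if (mvn_pair(x) != (x | 0xFFFF8000u))
            return 0;
        if (strh(x | 0xFFFF8000u) != strh(x | 0x8000u))
            return 0;
        if (strh(x | 0xFFFF8100u) != strh(x | 0x8100u))
            return 0;
    }
    return 1;
}
```

Since the stored halfwords are identical, the upper 16 bits produced by the
MVN pair or the literal-pool load never affect memory.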

Of particular interest, however, is that using LTO actually removes all
instances of the typecast/mask-before-store behaviour, even with 'negative'
values.
