[Bug target/95962] Inefficient code for simple arm_neon.h iota operation

tnfchris at gcc dot gnu.org via Gcc-bugs Thu, 12 Aug 2021 01:01:55 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962


Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-08-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We generate the correct code at -O3 but not -O2.

At -O3 we generate

foo:
        adrp    x0, .LC0
        sub     sp, sp, #16
        ldr     q0, [x0, #:lo12:.LC0]
        add     sp, sp, 16
        ret

where the problem seems to be that at -O2 store merging has broken up the
construction of `array` into two separate memory accesses:

  MEM <unsigned long> [(int *)&array] = 4294967296;
  MEM <unsigned long> [(int *)&array + 8B] = 12884901890;

whereas at -O3 we still have a single assignment:

  MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };
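
For readers wondering where the two constants in the -O2 dump come from: they look like adjacent pairs of the 32-bit initializers { 0, 1 } and { 2, 3 } packed into 64-bit stores. The sketch below checks that arithmetic. It assumes little-endian lane order (the lower-addressed element lands in the low 32 bits), and `pack_le` and `check_merged_constants` are hypothetical helper names used only for illustration, not anything in GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combine two adjacent 32-bit elements into the
   64-bit value a single merged store would write, assuming a
   little-endian layout (first element in the low half).  */
static uint64_t
pack_le (uint32_t first, uint32_t second)
{
  return ((uint64_t) second << 32) | first;
}

/* Check that the packed pairs match the constants in the -O2 dump.  */
static void
check_merged_constants (void)
{
  assert (pack_le (0, 1) == UINT64_C (4294967296));   /* array[0..1] */
  assert (pack_le (2, 3) == UINT64_C (12884901890));  /* array[2..3] */
}
```

Under that assumption, the two 8-byte MEMs together carry the same data as the single { 0, 1, 2, 3 } vector store; only the store width differs.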

I'm not sure that making these loads gimple level would help, since we'd
still have the explicit MEMs created by store merging.

Perhaps we should just make store-merging allow TImode merges and split them in
the backend if needed.
