https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438

Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mkuvyrkov at gcc dot gnu.org

--- Comment #9 from Maxim Kuvyrkov <mkuvyrkov at gcc dot gnu.org> ---
I've looked into another case where the inability to handle stores with gaps
generates sub-optimal code.  I'm interested in spending some time on fixing
this, provided I can get some guidance on the vectorizer side.

Is it substantially more difficult to handle stores with gaps compared to loads
with gaps?
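
For concreteness, a minimal sketch of the two access patterns I mean, reusing
the struct definitions from the testcase below (the function names here are
made up purely for illustration):

  /* Load with a gap: each 16-byte node is touched, but only the 8-byte
     'state' field is read; the 'gap' field is skipped.  */
  unsigned long long
  sum_states (struct reg_struct *reg)
  {
    unsigned long long acc = 0;
    int i;

    for(i=0; i<reg->size; i++)
      acc += reg->node[i].state;
    return acc;
  }

  /* Store with a gap: only the 'state' field is written, the 'gap' field
     is skipped.  This is the store pattern in the loop below.  */
  void
  set_states (struct reg_struct *reg, unsigned long long val)
  {
    int i;

    for(i=0; i<reg->size; i++)
      reg->node[i].state = val;
  }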

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
which is the #2 function in the 462.libquantum profile.  This loop accounts for
about 25% of total 462.libquantum time.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into
  <bb 5>:
  # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[(long long unsigned int *)ivtmp_50] = _51;

which for aarch64 then becomes:
.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     w2, w2, 1
        cmp     w2, w7
        eor     v0.16b, v2.16b, v0.16b
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        str     x4, [x1, -16]
        bcc     .L4
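
For comparison, a hand-written sketch (reusing the definitions from the
testcase above; 'func_rmw' and 'g' are made-up names, and it assumes the
vectorizer could prove that writing the unchanged 'gap' field back is safe)
of the read-modify-write form that filling the gap in the store group
amounts to:

  void
  func_rmw (int target, struct reg_struct *reg)
  {
    int i;

    for(i=0; i<reg->size; i++)
      {
        /* Store the untouched 'gap' field back with the value just read,
           so the per-node stores become contiguous.  */
        float _Complex g = reg->node[i].gap;

        reg->node[i].state ^= ((unsigned long long) 1 << target);
        reg->node[i].gap = g;
      }
  }

With the stores made contiguous, the store side could mirror the ld2 on the
load side (STORE_LANES / st2) instead of the umov/st1/str sequence above.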
