https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Note the SLP discovery opportunity is from the "reduction" PHI to the
> return which merges control flow to a zero/one flag.

Right, so I get what you mean here, so in

  <bb 5> [local count: 308696474]:
  _52 = t2x_61 < 0.0;
  _53 = t2y_63 < 0.0;
  _54 = _52 | _53;
  _66 = t2z_65 < 0.0;
  _67 = _54 | _66;
  if (_67 != 0)
    goto <bb 15>; [51.40%]
  else
    goto <bb 6>; [48.60%]

  <bb 15> [local count: 158662579]:
  goto <bb 8>; [100.00%]

  <bb 6> [local count: 150033894]:
  _55 = isec_58(D)->dist;
  _68 = _55 < t1y_62;
  _69 = _55 < t1x_60;
  _70 = _68 | _69;
  _71 = _55 < t1z_64;
  _72 = _70 | _71;
  _73 = ~_72;
  _74 = (int) _73;

  <bb 7> [local count: 1073741824]:
  # _56 = PHI <0(8), _74(6)>
  return _56;

we start at _56 and follow the preds up.  The interesting bit here though is
that the values being compared aren't sequential in memory.

So:

  if (t1x > isec->dist || t1y > isec->dist || t1z > isec->dist) return 0;

  float t1x = (bb[isec->bv_index[0]] - isec->start[0]) * isec->idot_axis[0];
  float t1y = (bb[isec->bv_index[2]] - isec->start[1]) * isec->idot_axis[1];
  float t1z = (bb[isec->bv_index[4]] - isec->start[2]) * isec->idot_axis[2];

but then in:

  if (t1x > t2y || t2x < t1y || t1x > t2z ||
      t2x < t1z || t1y > t2z || t2y < t1z) return 0;

we need a replicated t1x and {t2x, t2x, t2y}.
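
To make the lane layout concrete, here is a minimal sketch in plain C (no
intrinsics) that evaluates the six comparisons as two three-lane compares.
The function names and the `gt_*`/`lt_*` arrays are made up for illustration
and are not what either compiler emits; the point is only that every operand
vector is a permutation with repeats rather than a contiguous load.

```c
#include <stdbool.h>

/* Scalar form of the exit, as written in the source.
   Returns true when the exit fires (i.e. the function would return 0). */
static bool slabs_disjoint_scalar(const float t1[3], const float t2[3])
{
    float t1x = t1[0], t1y = t1[1], t1z = t1[2];
    float t2x = t2[0], t2y = t2[1], t2z = t2[2];
    return t1x > t2y || t2x < t1y || t1x > t2z ||
           t2x < t1z || t1y > t2z || t2y < t1z;
}

/* The same exit written as two 3-lane compares (illustrative only). */
static bool slabs_disjoint_lanes(const float t1[3], const float t2[3])
{
    /* ">" lanes: t1x > t2y, t1x > t2z, t1y > t2z */
    const float gt_lhs[3] = { t1[0], t1[0], t1[1] };
    const float gt_rhs[3] = { t2[1], t2[2], t2[2] };
    /* "<" lanes: t2x < t1y, t2x < t1z, t2y < t1z */
    const float lt_lhs[3] = { t2[0], t2[0], t2[1] };
    const float lt_rhs[3] = { t1[1], t1[2], t1[2] };

    bool any = false;
    for (int i = 0; i < 3; i++)
        any |= (gt_lhs[i] > gt_rhs[i]) | (lt_lhs[i] < lt_rhs[i]);
    return any;
}
```

The `lt_lhs` vector is the {t2x, t2x, t2y} shuffle mentioned above; the other
three operand vectors need a similar rebuild.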

It looks like the ICX code does indeed rebuild/shuffle the vector at every
exit.  ICX does a better job than OACC here: the nice trick is that it also
reorders the exits based on the complexity of the shuffle.

        movsxd  rax, dword ptr [rdi + 56]
        vmovsd  xmm1, qword ptr [rdi]           # xmm1 = mem[0],zero
        vmovsd  xmm2, qword ptr [rdi + 76]      # xmm2 = mem[0],zero
        movsxd  rcx, dword ptr [rdi + 64]
        vmovss  xmm0, dword ptr [rsi + 4*rax]   # xmm0 = mem[0],zero,zero,zero
        vinsertps       xmm0, xmm0, dword ptr [rsi + 4*rcx], 16 # xmm0 = xmm0[0],mem[0],xmm0[2,3]
        vsubps  xmm0, xmm0, xmm1
        vmulps  xmm0, xmm0, xmm2
        vxorps  xmm3, xmm3, xmm3
        vcmpltps        xmm3, xmm0, xmm3

i.e. the exit:

  if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f) return 0;

was made the first exit so it doesn't perform the complicated shuffles if it
doesn't need to.
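
As a rough source-level illustration of that reordering, the sketch below
writes the three exits in two orders.  The `Isect` struct here is a
hypothetical, cut-down stand-in with only the fields used in the snippets
above, and the t2 indexing (bv_index[1], [3], [5]) is an assumption, since
only the t1 expressions are quoted.  All three exits return 0 and have no
side effects, so the result does not depend on the order they are tested in.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the ray structure in the testcase. */
typedef struct {
    float start[3];
    float idot_axis[3];
    float dist;
    int   bv_index[6];
} Isect;

/* Near (t1) and far (t2) distances per axis.  The t1 indexing follows the
   source; using bv_index[1], [3], [5] for t2 is an assumption. */
static void slab_distances(const Isect *isec, const float *bb,
                           float t1[3], float t2[3])
{
    for (int a = 0; a < 3; a++) {
        t1[a] = (bb[isec->bv_index[2 * a]] - isec->start[a]) * isec->idot_axis[a];
        t2[a] = (bb[isec->bv_index[2 * a + 1]] - isec->start[a]) * isec->idot_axis[a];
    }
}

/* Exit needing shuffled operands. */
static bool exit_overlap(const float t1[3], const float t2[3])
{
    return t1[0] > t2[1] || t2[0] < t1[1] || t1[0] > t2[2] ||
           t2[0] < t1[2] || t1[1] > t2[2] || t2[1] < t1[2];
}

/* Exit comparing {t2x, t2y, t2z} against zero: no shuffle required. */
static bool exit_behind(const float t2[3])
{
    return t2[0] < 0.0f || t2[1] < 0.0f || t2[2] < 0.0f;
}

/* Exit comparing {t1x, t1y, t1z} against a broadcast isec->dist. */
static bool exit_too_far(const float t1[3], float dist)
{
    return t1[0] > dist || t1[1] > dist || t1[2] > dist;
}

/* Order assumed from the GIMPLE dump: shuffle-heavy exit first. */
static int hit_overlap_first(const Isect *isec, const float *bb)
{
    float t1[3], t2[3];
    slab_distances(isec, bb, t1, t2);
    if (exit_overlap(t1, t2)) return 0;
    if (exit_behind(t2)) return 0;
    if (exit_too_far(t1, isec->dist)) return 0;
    return 1;
}

/* Order described for ICX: the cheap sign test is tried first. */
static int hit_sign_first(const Isect *isec, const float *bb)
{
    float t1[3], t2[3];
    slab_distances(isec, bb, t1, t2);
    if (exit_behind(t2)) return 0;
    if (exit_overlap(t1, t2)) return 0;
    if (exit_too_far(t1, isec->dist)) return 0;
    return 1;
}
```

In the ICX output the t1 values are presumably only computed after the first
exit falls through; the sketch computes everything up front for brevity.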

So it looks like SLP scheduling should take the complexity of each exit into
account?  This will become interesting with costing as well.
