https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Note the SLP discovery opportunity is from the "reduction" PHI to the
> return which merges control flow to a zero/one flag.

Right, so I get what you mean here, so in

  <bb 5> [local count: 308696474]:
  _52 = t2x_61 < 0.0;
  _53 = t2y_63 < 0.0;
  _54 = _52 | _53;
  _66 = t2z_65 < 0.0;
  _67 = _54 | _66;
  if (_67 != 0)
    goto <bb 15>; [51.40%]
  else
    goto <bb 6>; [48.60%]

  <bb 15> [local count: 158662579]:
  goto <bb 8>; [100.00%]

  <bb 6> [local count: 150033894]:
  _55 = isec_58(D)->dist;
  _68 = _55 < t1y_62;
  _69 = _55 < t1x_60;
  _70 = _68 | _69;
  _71 = _55 < t1z_64;
  _72 = _70 | _71;
  _73 = ~_72;
  _74 = (int) _73;

  <bb 7> [local count: 1073741824]:
  # _56 = PHI <0(8), _74(6)>
  return _56;

we start at _56 and follow the preds up.

The interesting bit here though is that the values being compared aren't
sequential in memory. So:

  if (t1x > isec->dist || t1y > isec->dist || t1z > isec->dist) return 0;

  float t1x = (bb[isec->bv_index[0]] - isec->start[0]) * isec->idot_axis[0];
  float t1y = (bb[isec->bv_index[2]] - isec->start[1]) * isec->idot_axis[1];
  float t1z = (bb[isec->bv_index[4]] - isec->start[2]) * isec->idot_axis[2];

but then in:

  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z) return 0;

we need a replicated t1x and {t2x, t2x, t2y}.

It looks like the ICX code does indeed rebuild/shuffle the vector at every exit.
ICX does a better job than OACC here, it does a nice trick, the key is that it
also re-ordered the exits based on the complexity of the shuffle.

        movsxd  rax, dword ptr [rdi + 56]
        vmovsd  xmm1, qword ptr [rdi]                   # xmm1 = mem[0],zero
        vmovsd  xmm2, qword ptr [rdi + 76]              # xmm2 = mem[0],zero
        movsxd  rcx, dword ptr [rdi + 64]
        vmovss  xmm0, dword ptr [rsi + 4*rax]           # xmm0 = mem[0],zero,zero,zero
        vinsertps       xmm0, xmm0, dword ptr [rsi + 4*rcx], 16 # xmm0 = xmm0[0],mem[0],xmm0[2,3]
        vsubps  xmm0, xmm0, xmm1
        vmulps  xmm0, xmm0, xmm2
        vxorps  xmm3, xmm3, xmm3
        vcmpltps        xmm3, xmm0, xmm3

i.e. the exit:

  if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f) return 0;

was made the first exit, so it doesn't perform the complicated shuffles if it
doesn't need to.

So it looks like SLP scheduling should take the complexity of each exit into
account? This will become interesting with costing as well.
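
For illustration, here is a rough scalar sketch of why the exits can be
reordered at all: each exit condition is a side-effect-free comparison on
values that are already computed, so any order gives the same 0/1 result. The
struct layout, the helper names and the odd bv_index entries used for
t2x/t2y/t2z are assumptions made for this sketch, they are not taken from the
testcase, and which exit comes first in the original source is not shown
above.

```c
#include <assert.h>

/* Hypothetical stand-in for the ray structure.  Field names follow the
   snippet quoted above; the layout is assumed and does not match the
   offsets seen in the ICX assembly.  */
struct isect {
    float start[3];
    float idot_axis[3];
    int bv_index[6];
    float dist;
};

struct slabs {
    float t1[3]; /* t1x, t1y, t1z */
    float t2[3]; /* t2x, t2y, t2z */
};

static struct slabs compute_slabs(const struct isect *isec, const float *bb)
{
    struct slabs s;
    for (int i = 0; i < 3; i++) {
        /* Even indices for t1, as in the quoted source.  */
        s.t1[i] = (bb[isec->bv_index[2 * i]] - isec->start[i]) * isec->idot_axis[i];
        /* Assumption: t2 uses the odd indices.  */
        s.t2[i] = (bb[isec->bv_index[2 * i + 1]] - isec->start[i]) * isec->idot_axis[i];
    }
    return s;
}

/* Needs a replicated t1x and mixed t2 lanes if vectorized.  */
static int exit_overlap(const struct slabs *s)
{
    return s->t1[0] > s->t2[1] || s->t2[0] < s->t1[1] || s->t1[0] > s->t2[2]
        || s->t2[0] < s->t1[2] || s->t1[1] > s->t2[2] || s->t2[1] < s->t1[2];
}

/* Lanes compared against a constant, no cross-lane shuffle needed.  */
static int exit_sign(const struct slabs *s)
{
    return s->t2[0] < 0.0f || s->t2[1] < 0.0f || s->t2[2] < 0.0f;
}

static int exit_dist(const struct slabs *s, float dist)
{
    return s->t1[0] > dist || s->t1[1] > dist || s->t1[2] > dist;
}

/* One possible order: the shuffle-heavy overlap test first.  */
static int test_overlap_first(const struct isect *isec, const float *bb)
{
    struct slabs s = compute_slabs(isec, bb);
    if (exit_overlap(&s)) return 0;
    if (exit_sign(&s)) return 0;
    if (exit_dist(&s, isec->dist)) return 0;
    return 1;
}

/* Order resembling what ICX emitted: the cheap sign test first.  */
static int test_sign_first(const struct isect *isec, const float *bb)
{
    struct slabs s = compute_slabs(isec, bb);
    if (exit_sign(&s)) return 0;
    if (exit_overlap(&s)) return 0;
    if (exit_dist(&s, isec->dist)) return 0;
    return 1;
}

/* Checks that both orders agree on a few hand-picked rays.  */
static void self_check(void)
{
    const float bb[6] = { -1.0f, 1.0f, -1.0f, 1.0f, -1.0f, 1.0f };
    const struct isect rays[4] = {
        { { 0.0f, 0.0f, 0.0f },    { 1.0f, 1.0f, 1.0f }, { 0, 1, 2, 3, 4, 5 }, 10.0f },
        { { 0.0f, 0.0f, -5.0f },   { 1.0f, 1.0f, 1.0f }, { 0, 1, 2, 3, 4, 5 }, 10.0f },
        { { 5.0f, 5.0f, 5.0f },    { 1.0f, 1.0f, 1.0f }, { 0, 1, 2, 3, 4, 5 }, 10.0f },
        { { -5.0f, -5.0f, -5.0f }, { 1.0f, 1.0f, 1.0f }, { 0, 1, 2, 3, 4, 5 }, 2.0f },
    };
    for (int i = 0; i < 4; i++)
        assert(test_overlap_first(&rays[i], bb) == test_sign_first(&rays[i], bb));
}
```

Calling self_check() from a small driver exercises both orders. The point of
the sketch is only that the order of the exits is free to choose, so a cost
model could prefer the exit with the cheapest vector setup.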