https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #2)
> See the comment above rgroup_controls in tree-vectorizer.h for the
> current assumptions around loop predication.  If AVX512 wants something
> different then some extensions will be needed :-)

Coming back to this now.  Crucially

     If only the first three lanes are active, the masks we need are:

       f rgroup: 1 1 | 1 1 | 1 1 | 0 0
       d rgroup:  1  |  1  |  1  |  0

     Here we can use a mask calculated for f's rgroup for d's, but not
     vice versa.

that seems to assume that the space in the mask vector for the "bools"
in the d rgroup is twice as large as in that for the f rgroup.  For AVX512
there's a single bit for each lane, independent of the width of the
actual data.  So instead of

     Thus for each value of nV, it is enough to provide nV masks, with the
     mask being calculated based on the highest nL (or, equivalently, based
     on the highest nS) required by any rgroup with that nV.  We therefore
     represent the entire collection of masks as a two-level table, with the
     first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
     the second being indexed by the mask index 0 <= i < nV.

we need a set of nV masks for each value of nS, and we can pick the
smallest nV for each nS and generate the corresponding larger nV masks
by a series of shifts.  In fact we can re-use the first vector (excess
bits are OK).  So for the example in tree-vectorizer.h

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i * 2 + 0] += 1.0f;
         f[i * 2 + 1] += 2.0f;
         d[i] += 3.0;
       }

we'd need to perform two WHILE_ULT.
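
To make the two-WHILE_ULT case concrete, here is a minimal sketch in plain C
that models one-bit-per-lane masks as integers.  It is only an illustration
under stated assumptions: 512-bit vectors (16 float lanes, 8 double lanes), a
vectorization factor of 8, and the helper names while_ult, f_mask_interleaved
and d_mask are hypothetical, not GCC or intrinsic APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of WHILE_ULT for an AVX512-style mask register:
   bit k is set iff START + k < END.  There is one bit per lane,
   whatever the width of the data elements.  */
uint16_t
while_ult (unsigned start, unsigned end, unsigned nlanes)
{
  assert (nlanes <= 16);
  uint16_t mask = 0;
  for (unsigned k = 0; k < nlanes; ++k)
    if (start + k < end)
      mask |= (uint16_t) (1u << k);
  return mask;
}

/* f rgroup of the first loop: two floats per scalar iteration, so one
   16-lane vector covers 8 scalar iterations (assumed VF of 8).  */
uint16_t
f_mask_interleaved (unsigned i, unsigned n)
{
  return while_ult (2 * i, 2 * n, 16);
}

/* d rgroup of the first loop: one double per scalar iteration, so one
   8-lane vector covers the same 8 scalar iterations.  */
uint8_t
d_mask (unsigned i, unsigned n)
{
  return (uint8_t) while_ult (i, n, 8);
}
```

With three active iterations, f_mask_interleaved (0, 3) has six bits set and
d_mask (0, 3) has three.  Neither is a prefix or a shift of the other that
would do as a substitute, which is why each rgroup gets its own WHILE_ULT in
this model.
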
For

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i] += 1.0f;
         d[i] += 3.0;
       }

we'd compute the mask for the f rgroup with a WHILE_ULT and we'll have
nV_d = 2 * nV_f and the first mask vector from f can be reused for d
(but not the other way around).  The second mask vector for d can be
obtained by kshiftr.

There's no other way to do N bit to two N/2 bit hi/lo (un)packing
(there's a 2x N/2 bit -> N bit operation, for whatever reason).  There's
also no way to transform the d rgroup mask into the f rgroup mask for
the first example, i.e. duplicate bits in place,
{ b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
nor the reverse.

So in reality it seems we need a set of mask vectors for the full set of
nS * nV combinations with AVX512.  Doing full masking with AVX2 style
vectors would work with the existing rgroup control scheme.

Currently the "key" to the AVX512 behavior is the use of scalar modes for
the mask vectors, but then that's also what GCN uses.  Do GCN mask bits
really map to bytes to allow the currently used rgroup scheme?
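
The mask reuse described for the second loop can be sketched in plain C, again
modelling mask registers as integers.  This assumes 512-bit vectors and a
vectorization factor of 16 (one 16-lane float vector, two 8-lane double
vectors).  The names while_ult16, kshiftr16, d_masks_from_f and duplicate_bits
are hypothetical; kshiftr16 only imitates a 16-bit mask right shift, and
duplicate_bits is a scalar loop standing in for the bit-duplication step that
the comment says has no direct mask operation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 16-lane WHILE_ULT: bit k set iff START + k < END.  */
uint16_t
while_ult16 (unsigned start, unsigned end)
{
  uint16_t mask = 0;
  for (unsigned k = 0; k < 16; ++k)
    if (start + k < end)
      mask |= (uint16_t) (1u << k);
  return mask;
}

/* Model of a right shift on a 16-bit mask register; counts of 16 or
   more yield zero.  */
uint16_t
kshiftr16 (uint16_t mask, unsigned count)
{
  return count < 16 ? (uint16_t) (mask >> count) : 0;
}

/* Derive both d rgroup masks (8 lanes each) from the single f rgroup
   mask.  The first is the f mask itself; its excess upper bits are
   dropped here, and an 8-lane operation would simply ignore them.  */
void
d_masks_from_f (uint16_t f_mask, uint8_t d_mask[2])
{
  assert (d_mask != 0);
  d_mask[0] = (uint8_t) f_mask;
  d_mask[1] = (uint8_t) kshiftr16 (f_mask, 8);
}

/* The transform needed to turn a d mask into an f mask for the
   interleaved loop: { b0, b1, ... } -> { b0, b0, b1, b1, ... }.
   Written as a scalar loop because this is the step without a
   single mask instruction.  */
uint16_t
duplicate_bits (uint8_t mask)
{
  uint16_t out = 0;
  for (unsigned k = 0; k < 8; ++k)
    if (mask & (1u << k))
      out |= (uint16_t) (3u << (2 * k));
  return out;
}
```

For n = 11 the f mask has its low eleven bits set, so the first d mask is all
ones and the second has three bits set.  That costs one WHILE_ULT plus one
shift, rather than a second WHILE_ULT.
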