On 15/06/2023 12:06, Richard Biener wrote:
On Thu, 15 Jun 2023, Andrew Stubbs wrote:
On 15/06/2023 10:58, Richard Biener wrote:
On Thu, 15 Jun 2023, Andrew Stubbs wrote:
On 14/06/2023 15:29, Richard Biener wrote:
Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <a...@codesourcery.com>:
On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
This implements fully masked vectorization or a masked epilogue for
AVX512-style masks, which single themselves out by representing
each lane with a single bit and by using integer modes for the mask
(both much like GCN).
AVX512 is also special in that it doesn't have any instruction
to compute the mask from a scalar IV like SVE has with while_ult.
Instead, the masks are produced by vector compares and the loop
control retains the scalar IV (mainly to avoid dependences on
mask generation; a suitable mask test instruction is available).
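As a purely illustrative sketch (hand-written intrinsics, not what the
vectorizer emits; the function and variable names are invented), the shape
is roughly:

  #include <immintrin.h>

  /* Illustration only: each lane is one bit of an __mmask16, the mask is
     produced by an ordinary vector compare, and the loop is controlled by
     the scalar IV 'i' rather than by the mask.  */
  void add_masked (int n, float *a, const float *b)
  {
    const __m512i lane = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    for (int i = 0; i < n; i += 16)
      {
        /* Lane l is active iff i + l < n.  */
        __mmask16 k
          = _mm512_cmplt_epi32_mask (lane, _mm512_set1_epi32 (n - i));
        __m512 va = _mm512_maskz_loadu_ps (k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps (k, b + i);
        _mm512_mask_storeu_ps (a + i, k, _mm512_add_ps (va, vb));
      }
  }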
This also sounds like GCN. We currently use WHILE_ULT in the middle end,
which expands to a vector compare against a vector of stepped values.
This requires an additional instruction to prepare the comparison vector
(compared to SVE), but the "while_ultv64sidi" pattern (for example)
returns the DImode bitmask, so it works reasonably well.
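For reference, a rough C model of what that pattern computes (purely
illustrative; the real thing is an RTL expansion on GCN):

  #include <stdint.h>

  /* Bit L of the DImode result is set iff BASE + L < LIMIT, i.e. a vector
     compare of BASE plus the stepped vector {0, 1, ..., 63} against LIMIT,
     folded into a 64-bit mask with one bit per lane.  */
  uint64_t
  while_ult_v64si (uint32_t base, uint32_t limit)
  {
    uint64_t mask = 0;
    for (int lane = 0; lane < 64; lane++)
      if (base + (uint32_t) lane < limit)
        mask |= UINT64_C (1) << lane;
    return mask;
  }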
Like RVV, code generation prefers a decrementing IV, though IVOPTs
messes things up in some cases, removing that IV and eliminating it
in favor of an incrementing one used for address generation.
One of the motivating testcases is from PR108410, which in turn
is extracted from x264, where large-size vectorization shows
issues with small-trip-count loops. Execution time there improves
compared to classic AVX512 with AVX2 epilogues for the cases
of fewer than 32 iterations.
size  scalar    128    256    512   512e   512f
   1    9.42  11.32   9.35  11.17  15.13  16.89
   2    5.72   6.53   6.66   6.66   7.62   8.56
   3    4.49   5.10   5.10   5.74   5.08   5.73
   4    4.10   4.33   4.29   5.21   3.79   4.25
   6    3.78   3.85   3.86   4.76   2.54   2.85
   8    3.64   1.89   3.76   4.50   1.92   2.16
  12    3.56   2.21   3.75   4.26   1.26   1.42
  16    3.36   0.83   1.06   4.16   0.95   1.07
  20    3.39   1.42   1.33   4.07   0.75   0.85
  24    3.23   0.66   1.72   4.22   0.62   0.70
  28    3.18   1.09   2.04   4.20   0.54   0.61
  32    3.16   0.47   0.41   0.41   0.47   0.53
  34    3.16   0.67   0.61   0.56   0.44   0.50
  38    3.19   0.95   0.95   0.82   0.40   0.45
  42    3.09   0.58   1.21   1.13   0.36   0.40
'size' specifies the number of actual iterations; 512e is for
a masked epilogue and 512f for the fully masked loop. From
4 scalar iterations on, the AVX512 masked-epilogue code is clearly
the winner; the fully masked variant is clearly worse and
its size benefit is also tiny.
Let me check I understand correctly. In the fully masked case, there is a
single loop in which a new mask is generated at the start of each
iteration. In the masked epilogue case, the main loop uses no masking
whatsoever, thus avoiding the need for generating a mask, carrying the
mask, inserting vec_merge operations, etc., and then the epilogue looks
much like the fully masked case, but unlike smaller-mode epilogues there
is no loop, because the epilogue vector size is the same. Is that right?
Yes.
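In rough C terms, the two shapes look like this (hand-written sketch,
16 lanes picked arbitrarily, bodies elided):

  /* (a) Fully masked: a single loop, with a fresh mask computed every
     iteration and used for all loads/stores.  */
  void fully_masked (int n)
  {
    for (int i = 0; i < n; i += 16)
      {
        /* mask = lanes with i + lane < n; masked vector body.  */
      }
  }

  /* (b) Masked epilogue: the main loop is unmasked; the tail is a single
     masked block, with no epilogue loop since the vector size is the same
     as in the main loop.  */
  void masked_epilogue (int n)
  {
    int i = 0;
    for (; i + 16 <= n; i += 16)
      {
        /* unmasked vector body.  */
      }
    if (i < n)
      {
        /* mask = lanes with i + lane < n; masked vector body, run once.  */
      }
  }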
This scheme seems like it might also benefit GCN, insofar as it
simplifies the hot code path.
GCN does not actually have smaller vector sizes, so there's no analogue
to AVX2 (we pretend we have some smaller sizes, but that's because the
middle end can't do masking everywhere yet, and it helps make some
vector constants smaller, perhaps).
This patch does not enable using fully masked loops or
masked epilogues by default. More work on cost modeling
and vectorization kind selection on x86_64 is necessary
for this.
Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE,
which could be exploited further to unify some of the flags
we have right now, but there didn't seem to be many easy things
to merge, so I'm leaving this for followups.
Mask requirements as registered by vect_record_loop_mask are kept in
their original form and recorded in a hash_set now, instead of being
processed into a vector of rgroup_controls. Instead, that's now
left to the final analysis phase, which tries forming the rgroup_controls
vector using while_ult and, if that fails, now tries the AVX512 style,
which needs a different organization and instead fills a hash_map
with the relevant info. vect_get_loop_mask now has two implementations,
one for each of the two mask styles we then have.
I have decided against interweaving vect_set_loop_condition_partial_vectors
with conditions to do AVX512 style masking and instead opted to
"duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
I was torn about making 'vec_loop_masks' a class with methods, possibly
merging the _len stuff into a single registry as well, but it
seemed to be too many changes for the purpose of getting AVX512
working. I'm going to play wait-and-see on what happens with RISC-V
here, since they are going to get both masks and lengths registered,
I think.
The vect_prepare_for_masked_peels hunk might run into issues with
SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
looked odd.
Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
the testsuite with --param vect-partial-vector-usage=2 with and
without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
and one latent wrong-code (PR110237).
There's followup work to be done to try enabling masked epilogues
for x86-64 by default (when AVX512 is enabled, possibly only when
-mprefer-vector-width=512). Getting the cost modeling and the decision
right is going to be challenging.
Any comments?
OK?
Btw, testing on GCN would be welcome - the _avx512 paths could
work for it, so in case the while_ult path fails (not sure if
it ever does) it could get _avx512-style masking. Likewise
testing on ARM, just to see I didn't break anything here.
I don't have SVE hardware, so my own testing is probably meaningless.
I can set some tests going. Is vect.exp enough?
Well, only you know (from experience), but sure that's a nice start.
I tested vect.exp for both gcc and gfortran and there were no regressions.
I have another run going with the other param settings.
(Side note: vect.exp used to be a nice quick test for use during
development, but the tsvc tests are now really slow, at least when run
on a single GPU thread.)
I tried some small examples with --param vect-partial-vector-usage=1 (IIUC
this prevents masked loops, but not masked epilogues, right?)
Yes. That should also work with the while_ult style btw.
and the results look good. I plan to do some benchmarking shortly.
One comment: building a vector constant {0, 1, 2, 3, ..., 63} results
in a very large entry in the constant pool and an unnecessary memory
load (it literally has to use this sequence to generate the addresses
to load the constant!). Generating the sequence via VEC_SERIES would
be a no-op for GCN, because we have an ABI-mandated register that
already holds that value. (Perhaps I have another piece missing here,
IDK?)
I failed to special-case the {0, 1, 2, 3, ... } constant because I
couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
It might be that the target needs to pattern match these constants
at RTL expansion time?
Btw, did you disable your while_ult pattern for the experiment?
I tried it both ways; both appear to work, and the while_ult case does avoid
the constant vector. I also don't seem to need while_ult for the fully masked
case any more (is that new?).
Yes, while_ult always compares to {0, 1, 2, 3, ...}, which seems
conveniently available, but it has to multiply the IV by the number
of scalars per iteration, which has overflow issues it has to
compensate for by choosing a wider IV. I'm avoiding that issue
(except for alignment peeling) by instead altering the constant
vector to compare against. On x86 the constant vector is always a
load, but the multiplication would add to the latency of mask
production, which already isn't too great.
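As a rough model of the difference (hand-written; a 16-lane group with a
scalars-per-iteration factor of 2 is assumed just for illustration, and the
patch actually picks compare types and constants per rgroup), both routines
below compute the same mask and differ only in where the factor is applied:

  #include <stdint.h>

  /* while_ult style: the IV is scaled by the scalars-per-iteration factor,
     so a wider IV/compare type is needed to dodge overflow.  */
  uint16_t
  mask_while_ult (uint64_t iv, uint64_t niters)
  {
    uint16_t mask = 0;
    for (int lane = 0; lane < 16; lane++)
      if (iv * 2 + lane < niters * 2)
        mask |= (uint16_t) 1 << lane;
    return mask;
  }

  /* Alternate scheme: keep the scalar IV unscaled and fold the factor into
     the constant vector {0, 0, 1, 1, ...} that vec_cmpu compares against
     the broadcast number of remaining scalar iterations.  */
  uint16_t
  mask_vec_cmpu (uint32_t iv, uint32_t niters)
  {
    uint16_t mask = 0;
    for (int lane = 0; lane < 16; lane++)
      if ((uint32_t) (lane / 2) < niters - iv)
        mask |= (uint16_t) 1 << lane;
    return mask;
  }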
Is the multiplication not usually a shift?
And yes, the alternate scheme doesn't rely on while_ult but instead
on vec_cmpu to produce the masks.
You might be able to produce the {0, 0, 1, 1, ... } constant
by interleaving v1 with itself? Any non-power-of-two duplication
looks more difficult though.
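(A toy model of that interleave, just to show the intent; the function name
is made up: zipping the step vector with itself makes lane i take element
i / 2, i.e. {0, 1, 2, ...} becomes {0, 0, 1, 1, 2, 2, ...}.)

  void
  zip_step_with_self (const int *step, int *out, int nlanes)
  {
    for (int i = 0; i < nlanes; i++)
      out[i] = step[i / 2];
  }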
I think that would need to use a full permutation, which is probably
faster than a cold load, but in all these cases the vector that defines
the permutation looks exactly like the result, so ......
I've been playing with this stuff some more, and I find that even though
GCN supports fully masked loops and uses them when I test without
offload, it has actually been running in
param_vect_partial_vector_usage==0 mode for offload, because i386.cc has
that hardcoded and the offload compiler inherits param settings from the
host.
I tried running the Babelstream benchmark with the various settings and
it's a wash for most of the measurements (memory limited, most likely),
but the "Dot" benchmark is considerably slower when fully masked (about
50%). This probably explains why adding the additional "fake" smaller
vector sizes was so good for our numbers, but confirms that the partial
epilogue is a good option.
Andrew