Re: [PATCH 3/3] AVX512 fully masked vectorization
From: Richard Biener via Gcc-patches
Date: Thu, 15 Jun 2023 09:17:18 -0700

> On 15.06.2023, at 16:04, Andrew Stubbs <a...@codesourcery.com> wrote:
>
> On 15/06/2023 15:00, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>> On 15/06/2023 14:34, Richard Biener wrote:
>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>
>>>>> On 15/06/2023 12:06, Richard Biener wrote:
>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>
>>>>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>>>
>>>>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>>>>
>>>>>>>>>>> On 14.06.2023, at 16:27, Andrew Stubbs <a...@codesourcery.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>>>>> This implements fully masked vectorization or a masked epilog for
>>>>>>>>>>>> AVX512-style masks, which single themselves out by representing
>>>>>>>>>>>> each lane with a single bit and by using integer modes for the
>>>>>>>>>>>> mask (both much like GCN).
>>>>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>>>>>>>>> Instead the masks are produced by vector compares and the loop
>>>>>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>>>>>>>>> mask generation; a suitable mask test instruction is available).
>>>>>>>>>>>
>>>>>>>>>>> This also sounds like GCN. We currently use WHILE_ULT in the
>>>>>>>>>>> middle end, which expands to a vector compare against a vector of
>>>>>>>>>>> stepped values. This requires an additional instruction to prepare
>>>>>>>>>>> the comparison vector (compared to SVE), but the "while_ultv64sidi"
>>>>>>>>>>> pattern (for example) returns the DImode bitmask, so it works
>>>>>>>>>>> reasonably well.
>>>>>>>>>>>
>>>>>>>>>>>> Like RVV, code generation prefers a decrementing IV, though IVOPTs
>>>>>>>>>>>> messes things up in some cases, removing that IV to eliminate it
>>>>>>>>>>>> with an incrementing one used for address generation.
>>>>>>>>>>>> One of the motivating testcases is from PR108410, which in turn
>>>>>>>>>>>> is extracted from x264, where large-size vectorization shows
>>>>>>>>>>>> issues with small-trip loops. Execution time there improves
>>>>>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases of
>>>>>>>>>>>> less than 32 iterations.
>>>>>>>>>>>>
>>>>>>>>>>>>   size  scalar    128    256    512   512e   512f
>>>>>>>>>>>>      1    9.42  11.32   9.35  11.17  15.13  16.89
>>>>>>>>>>>>      2    5.72   6.53   6.66   6.66   7.62   8.56
>>>>>>>>>>>>      3    4.49   5.10   5.10   5.74   5.08   5.73
>>>>>>>>>>>>      4    4.10   4.33   4.29   5.21   3.79   4.25
>>>>>>>>>>>>      6    3.78   3.85   3.86   4.76   2.54   2.85
>>>>>>>>>>>>      8    3.64   1.89   3.76   4.50   1.92   2.16
>>>>>>>>>>>>     12    3.56   2.21   3.75   4.26   1.26   1.42
>>>>>>>>>>>>     16    3.36   0.83   1.06   4.16   0.95   1.07
>>>>>>>>>>>>     20    3.39   1.42   1.33   4.07   0.75   0.85
>>>>>>>>>>>>     24    3.23   0.66   1.72   4.22   0.62   0.70
>>>>>>>>>>>>     28    3.18   1.09   2.04   4.20   0.54   0.61
>>>>>>>>>>>>     32    3.16   0.47   0.41   0.41   0.47   0.53
>>>>>>>>>>>>     34    3.16   0.67   0.61   0.56   0.44   0.50
>>>>>>>>>>>>     38    3.19   0.95   0.95   0.82   0.40   0.45
>>>>>>>>>>>>     42    3.09   0.58   1.21   1.13   0.36   0.40
>>>>>>>>>>>>
>>>>>>>>>>>> 'size' specifies the number of actual iterations; 512e is for
>>>>>>>>>>>> a masked epilog and 512f for the fully masked loop. From 4 scalar
>>>>>>>>>>>> iterations on, the AVX512 masked-epilog code is clearly the
>>>>>>>>>>>> winner; the fully masked variant is clearly worse and its size
>>>>>>>>>>>> benefit is also tiny.
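[A minimal sketch of the masked-epilogue ("512e") scheme described above,
written with AVX-512 intrinsics purely for illustration -- the function and
all names here are invented, not the patch's generated code. The main loop
runs unmasked, and a single masked step handles the remainder, with the
mask produced by an unsigned vector compare of the constant lane indices
against the remaining count rather than by a while_ult-style instruction:]

#include <immintrin.h>
#include <stddef.h>

/* Copy n ints; requires -mavx512f.  */
void
copy_i32 (int *dst, const int *src, size_t n)
{
  size_t i = 0;

  /* Main loop: full vectors, no masking on the hot path.  */
  for (; i + 16 <= n; i += 16)
    _mm512_storeu_si512 (dst + i, _mm512_loadu_si512 (src + i));

  /* Masked epilogue: a single masked step, no loop, because the
     epilogue uses the same vector size as the main loop.  The mask
     is an unsigned compare of the constant lane indices
     { 0, 1, ..., 15 } against the remainder count.  */
  if (i < n)
    {
      __m512i lanes = _mm512_set_epi32 (15, 14, 13, 12, 11, 10, 9, 8,
					7, 6, 5, 4, 3, 2, 1, 0);
      __mmask16 m
	= _mm512_cmplt_epu32_mask (lanes,
				   _mm512_set1_epi32 ((int) (n - i)));
      _mm512_mask_storeu_epi32 (dst + i, m,
				_mm512_maskz_loadu_epi32 (m, src + i));
    }
}

[In the fully masked ("512f") variant, by contrast, the masked step above
would be the loop body itself, regenerating the mask every iteration.]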
>>>>>>>>>>>
>>>>>>>>>>> Let me check I understand correctly. In the fully masked case,
>>>>>>>>>>> there is a single loop in which a new mask is generated at the
>>>>>>>>>>> start of each iteration. In the masked epilogue case, the main
>>>>>>>>>>> loop uses no masking whatsoever, thus avoiding the need for
>>>>>>>>>>> generating a mask, carrying the mask, inserting vec_merge
>>>>>>>>>>> operations, etc., and then the epilogue looks much like the fully
>>>>>>>>>>> masked case, but unlike smaller-mode epilogues there is no loop
>>>>>>>>>>> because the epilogue vector size is the same. Is that right?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> This scheme seems like it might also benefit GCN, inasmuch as it
>>>>>>>>>>> simplifies the hot code path.
>>>>>>>>>>>
>>>>>>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>>>>>>> analogue to AVX2 (we pretend we have some smaller sizes, but
>>>>>>>>>>> that's because the middle end can't do masking everywhere yet,
>>>>>>>>>>> and it helps make some vector constants smaller, perhaps).
>>>>>>>>>>>
>>>>>>>>>>>> This patch does not enable using fully masked loops or masked
>>>>>>>>>>>> epilogues by default. More work on cost modeling and
>>>>>>>>>>>> vectorization kind selection on x86_64 is necessary for this.
>>>>>>>>>>>> Implementation-wise this introduces
>>>>>>>>>>>> LOOP_VINFO_PARTIAL_VECTORS_STYLE, which could be exploited
>>>>>>>>>>>> further to unify some of the flags we have right now, but there
>>>>>>>>>>>> didn't seem to be many easy things to merge, so I'm leaving this
>>>>>>>>>>>> for followups.
>>>>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are now
>>>>>>>>>>>> kept in their original form and recorded in a hash_set instead
>>>>>>>>>>>> of being processed into a vector of rgroup_controls. That is now
>>>>>>>>>>>> left to the final analysis phase, which tries forming the
>>>>>>>>>>>> rgroup_controls vector using while_ult and, if that fails, tries
>>>>>>>>>>>> the AVX512 style, which needs a different organization and
>>>>>>>>>>>> instead fills a hash_map with the relevant info.
>>>>>>>>>>>> vect_get_loop_mask now has two implementations, one for each of
>>>>>>>>>>>> the two mask styles we then have.
>>>>>>>>>>>> I have decided against interweaving
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors with conditions to do
>>>>>>>>>>>> AVX512-style masking and instead opted to "duplicate" this as
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors_avx512. Likewise for
>>>>>>>>>>>> vect_verify_full_masking vs. vect_verify_full_masking_avx512.
>>>>>>>>>>>> I was split between making 'vec_loop_masks' a class with
>>>>>>>>>>>> methods, possibly merging the _len stuff into a single registry.
>>>>>>>>>>>> It seemed to be too many changes for the purpose of getting
>>>>>>>>>>>> AVX512 working. I'm going to wait and see what happens with
>>>>>>>>>>>> RISC-V here, since they are going to get both masks and lengths
>>>>>>>>>>>> registered, I think.
>>>>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues
>>>>>>>>>>>> with SVE; I didn't check yet, but using
>>>>>>>>>>>> LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd.
>>>>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>>>>>>>>>>> I've run the testsuite with --param vect-partial-vector-usage=2
>>>>>>>>>>>> with and without -fno-vect-cost-model and filed two bugs, one
>>>>>>>>>>>> ICE (PR110221) and one latent wrong-code (PR110237).
>>>>>>>>>>>> There's followup work to be done to try enabling masked
>>>>>>>>>>>> epilogues for x86-64 by default (when AVX512 is enabled,
>>>>>>>>>>>> possibly only when -mprefer-vector-width=512). Getting the cost
>>>>>>>>>>>> modeling and the decision right is going to be challenging.
>>>>>>>>>>>> Any comments?
>>>>>>>>>>>> OK?
>>>>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>>>>>>>>> work for it, so in case the while_ult path fails (not sure if it
>>>>>>>>>>>> ever does) it could get _avx512-style masking. Likewise testing
>>>>>>>>>>>> on ARM, just to see I didn't break anything here. I don't have
>>>>>>>>>>>> SVE hardware, so my testing is probably meaningless.
>>>>>>>>>>>
>>>>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>>>>
>>>>>>>>>> Well, only you know (from experience), but sure, that's a nice
>>>>>>>>>> start.
>>>>>>>>>
>>>>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>>>>> regressions. I have another run going with the other param
>>>>>>>>> settings.
>>>>>>>>>
>>>>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>>>>> development, but the tsvc tests are now really slow, at least when
>>>>>>>>> run on a single GPU thread.)
>>>>>>>>>
>>>>>>>>> I tried some small examples with --param vect-partial-vector-usage=1
>>>>>>>>> (IIUC this prevents masked loops, but not masked epilogues, right?)
>>>>>>>>
>>>>>>>> Yes. That should also work with the while_ult style, btw.
>>>>>>>>
>>>>>>>>> and the results look good. I plan to do some benchmarking shortly.
>>>>>>>>> One comment: building a vector constant {0, 1, 2, 3, ..., 63}
>>>>>>>>> results in a very large entry in the constant pool and an
>>>>>>>>> unnecessary memory load (it literally has to use this sequence to
>>>>>>>>> generate the addresses to load the constant!). Generating the
>>>>>>>>> sequence via VEC_SERIES would be a no-op for GCN, because we have
>>>>>>>>> an ABI-mandated register that already holds that value. (Perhaps I
>>>>>>>>> have another piece missing here, IDK?)
>>>>>>>>
>>>>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
>>>>>>>> couldn't see how to do a series that creates
>>>>>>>> { 0, 0, 1, 1, 2, 2, ... }. It might be that the target needs to
>>>>>>>> pattern-match these constants at RTL expansion time?
>>>>>>>>
>>>>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>>>>
>>>>>>> I tried it both ways; both appear to work, and the while_ult case
>>>>>>> does avoid the constant vector. I also don't seem to need while_ult
>>>>>>> for the fully masked case any more (is that new?).
>>>>>>
>>>>>> Yes; while_ult always compares to {0, 1, 2, 3, ...}, which seems
>>>>>> conveniently available, but it has to multiply the IV by the number
>>>>>> of scalars per iteration, which has overflow issues it has to
>>>>>> compensate for by choosing a wider IV. I'm avoiding that issue
>>>>>> (except for alignment peeling) by instead altering the constant
>>>>>> vector to compare against. On x86 the constant vector is always a
>>>>>> load, but the multiplication would add to the latency of mask
>>>>>> production, which already isn't too great.
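[To make the trade-off Richard describes concrete, here is a scalar model
of the two mask computations -- an illustration only, with invented names,
not code from the patch:]

#include <stdint.h>

/* while_ult style: lane k of the mask tests iv * vf + k < n, where
   'iv' counts vector iterations and 'vf' is the number of lanes.
   The multiply can overflow the IV type, which is why a wider IV may
   have to be chosen; uint64_t stands in for that wider type here.  */
static void
mask_while_ult (uint8_t *mask, unsigned vf, uint64_t iv, uint64_t n)
{
  for (unsigned k = 0; k < vf; k++)
    mask[k] = iv * vf + k < n;
}

/* AVX512 style in the patch: a decrementing IV tracks the scalars
   still to do, so lane k only needs an unsigned compare of the
   constant lane index against the remaining count -- no multiply in
   mask production.  */
static void
mask_avx512_style (uint8_t *mask, unsigned vf, uint64_t remaining)
{
  for (unsigned k = 0; k < vf; k++)
    mask[k] = k < remaining;
}

[The second variant trades the multiply for a constant lane-index vector
operand, which is why the constant-pool entries discussed above become
visible in the generated code.]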
>>>>>
>>>>> Is the multiplication not usually a shift?
>>>>>
>>>>>> And yes, the alternate scheme doesn't rely on while_ult but instead
>>>>>> on vec_cmpu to produce the masks.
>>>>>>
>>>>>> You might be able to produce the { 0, 0, 1, 1, ... } constant by
>>>>>> interleaving v1 with itself? Any non-power-of-two duplication looks
>>>>>> more difficult though.
>>>>>
>>>>> I think that would need to use a full permutation, which is probably
>>>>> faster than a cold load, but in all these cases the vector that
>>>>> defines the permutation looks exactly like the result, so ...
>>>>>
>>>>> I've been playing with this stuff some more and I find that even
>>>>> though GCN supports fully masked loops and uses them when I test
>>>>> without offload, it's actually been running in
>>>>> param_vect_partial_vector_usage==0 mode for offload, because i386.cc
>>>>> has that hardcoded and the offload compiler inherits param settings
>>>>> from the host.
>>>>
>>>> Doesn't that mean it will have a scalar epilog and a very large VF for
>>>> the main loop due to the large vector size?
>>>>
>>>>> I tried running the Babelstream benchmark with the various settings
>>>>> and it's a wash for most of the measurements (memory limited, most
>>>>> likely), but the "Dot" benchmark is considerably slower when fully
>>>>> masked (about 50%). This probably explains why adding the additional
>>>>> "fake" smaller vector sizes was so good for our numbers, but it
>>>>> confirms that the partial epilogue is a good option.
>>>>
>>>> Ah, the "fake" smaller vector sizes probably then made up for this
>>>> with "fixed"-size epilogue vectorization? But yes, I think a
>>>> vectorized epilog with partial vectors that then does not iterate
>>>> would get you the best of both worlds.
>>>
>>> Yes, it uses V32 for the epilogue, which won't fit every case, but
>>> it's better than nothing.
>>>
>>>> So param_vect_partial_vector_usage == 1.
>>>
>>> Unfortunately, there doesn't seem to be a way to set this *only* for
>>> the offload compiler. If you could fix it for x86_64 soon then that
>>> would be awesome. :)
>>
>> So some opts tweaking in the GCN option_override hook doesn't work?
>
> I didn't try that, but
> -foffload-options=--param=vect-partial-vector-usage=1 had no effect. I
> guess the flag is streamed as a function-specific optimization…
>
>>>> Whether with a while_ult optab or the vec_cmpu scheme should then
>>>> depend on generated code quality.
>>>
>>> Which looks like it depends on these constants alone.
>>
>> Understood; compared to while_ult these now appear explicitly. As
>> Richard said, you likely have to tweak the backend to make use of 'r1'.
>
> Yes, I think I understood what he meant. Should be doable.
>
> Andrew
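[For reference, a sketch of what Richard's suggested "opts tweaking in the
GCN option_override hook" could look like -- an untested assumption about
one possible shape, not part of the patch; SET_OPTION_IF_UNSET and
param_vect_partial_vector_usage are the existing GCC option machinery this
would hook into:]

/* In gcc/config/gcn/gcn.cc, inside the existing gcn_option_override
   hook (sketch only).  */
static void
gcn_option_override (void)
{
  /* ... existing option handling ... */

  /* Prefer masked epilogues (but not fully masked loops) unless the
     user passed --param vect-partial-vector-usage= explicitly; the
     intent is to override the value the offload compiler otherwise
     inherits from the i386.cc host default.  */
  SET_OPTION_IF_UNSET (&global_options, &global_options_set,
		       param_vect_partial_vector_usage, 1);
}

[Whether this would survive the param being streamed in as a
function-specific optimization, as Andrew suspects above, is exactly the
open question in the thread.]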