On Wed, Oct 30, 2019 at 10:43 AM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> The series posted so far now shows how the hook would be used in practice.
> Just wanted to follow up on some points here:
>
> Richard Sandiford <richard.sandif...@arm.com> writes:
> > Richard Biener <richard.guent...@gmail.com> writes:
> >> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> >> <richard.sandif...@arm.com> wrote:
> >>>
> >>> Richard Biener <richard.guent...@gmail.com> writes:
> >>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
> >>> > <richard.sandif...@arm.com> wrote:
> >>> >>
> >>> >> Richard Biener <richard.guent...@gmail.com> writes:
> >>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >>> >> > <richard.sandif...@arm.com> wrote:
> >>> >> >>
> >>> >> >> This patch is the first of a series that tries to remove two
> >>> >> >> assumptions:
> >>> >> >>
> >>> >> >> (1) that all vectors involved in vectorisation must be the same size
> >>> >> >>
> >>> >> >> (2) that there is only one vector mode for a given element mode and
> >>> >> >> number of elements
> >>> >> >>
> >>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >>> >> >> that require the number of elements to stay the same.  E.g. if we're
> >>> >> >> vectorising code that operates on narrow and wide elements, and the
> >>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >>> >> >> for the wide elements.
> >>> >> >>
> >>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
> >>> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >>> >> >> vectors to work with -msve-vector-bits=256.
> >>> >> >>
> >>> >> >> The patch adds a new hook that targets can use to control how we
> >>> >> >> move from one vector mode to another.
> >>> >> >> The hook takes a starting vector
> >>> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >>> >> >> The flexibility needed for (1) comes in when the number of elements
> >>> >> >> isn't specified.
> >>> >> >>
> >>> >> >> All callers in this patch specify the number of elements, but a later
> >>> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >>> >> >> for a few days, hence the RFC/A tag.
> >>> >> >>
> >>> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >>> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >>> >> >> look OK?
> >>> >> >
> >>> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >>> >> > how it works for the same-size case but if you consider x86
> >>> >> > where we have SSE, AVX256 and AVX512 what would it return
> >>> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >>> >> > kind of query not intended (where the component modes match
> >>> >> > but nunits is zero)?
> >>> >>
> >>> >> In that case we'd normally get V4SImode back.  It's an allowed
> >>> >> combination, but not very useful.
> >>> >>
> >>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >>> >> > it just used to stay in the same register set for different component
> >>> >> > modes?
> >>> >>
> >>> >> Yeah, the idea is to use the original vector mode as essentially
> >>> >> a base architecture.
> >>> >>
> >>> >> The follow-on patches replace vec_info::vector_size with
> >>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >>> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >>> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>> >>
> >>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
> >>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
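To make the nunits==0 behaviour concrete, here is a rough model of the lane-count rule being described (illustrative Python only, not the GCC implementation; VECTOR_BYTES and the mode-name helper are invented for the sketch):

```python
# Rough model of related_mode's nunits==0 behaviour for a fixed
# 128-bit vector architecture (illustrative only, not GCC code).
# The rule sketched in the discussion: keep the base mode's lane
# count, capped by how many new-size elements fit in a full vector.

VECTOR_BYTES = 16  # one 128-bit Advanced SIMD register

def related_mode(base_nunits, new_elem_bytes, nunits=0):
    if nunits == 0:
        nunits = min(base_nunits, VECTOR_BYTES // new_elem_bytes)
    suffix = {1: "QI", 2: "HI", 4: "SI", 8: "DI"}[new_elem_bytes]
    return f"V{nunits}{suffix}"

# V4HImode with SImode elements: keep 4 lanes, giving V4SImode
# rather than V2SImode.
print(related_mode(4, 4))  # V4SI
```

Passing an explicit nunits (e.g. `related_mode(4, 4, 2)` for V2SI) models the callers that do specify the number of elements.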
> >>> >> I think things would work in a similar way for the x86_64 vector archs.
> >>> >>
> >>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
> >>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
> >>> >> happen to be the same size for 128-bit SVE.  We can then compare
> >>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
> >>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
> >>> >> for each attempt.
> >>> >>
> >>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
> >>> >>
> >>> >> - VNx16QImode (full vector)
> >>> >> - VNx8QImode (half vector)
> >>> >> - VNx4QImode (quarter vector)
> >>> >> - VNx2QImode (eighth vector)
> >>> >>
> >>> >> and then pick the one with the lowest cost.  related_mode would
> >>> >> keep the number of units the same for nunits==0, within the limit
> >>> >> of the vector size.  E.g.:
> >>> >>
> >>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
> >>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
> >>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
> >>> >>
> >>> >> and:
> >>> >>
> >>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
> >>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
> >>> >>
> >>> >> So when operating on multiple element sizes, the tradeoff is between
> >>> >> trying to make full use of the vector size (higher base nunits) vs.
> >>> >> trying to remove packs and unpacks between multiple vector copies
> >>> >> (lower base nunits).
> >>> >> The latter is useful because extending within
> >>> >> a vector is an in-lane rather than cross-lane operation and truncating
> >>> >> within a vector is a no-op.
> >>> >>
> >>> >> With a couple of tweaks, we seem to do a good job of guessing which
> >>> >> version has the lowest cost, at least for the simple cases I've tried
> >>> >> so far.
> >>> >>
> >>> >> Obviously there's going to be a bit of a compile-time cost
> >>> >> for SVE targets, but I think it's worth paying for.
> >>> >
> >>> > I would guess that immediate benefit could be seen with
> >>> > basic-block vectorization which simply fails when conversions
> >>> > are involved.  x86_64 should now always support V4SImode
> >>> > and V2SImode so eventually a testcase can be crafted for that
> >>> > as well.
> >>>
> >>> I'd hoped so too, but the problem is that if the cost saving is good
> >>> enough, BB vectorisation simply stops at the conversion frontiers and
> >>> vectorises the rest, rather than considering other vector mode
> >>> combinations that might be able to do more.
> >>
> >> Sure, but when SLP build fails because it thinks it needs to unroll
> >> it could ask for a vector type with the same number of lanes
> >
> > Do you mean for loop or BB vectorisation?  For loop vectorisation
> > the outer loop iterating over vector sizes/modes works fine: if we
> > need to unroll beyond the maximum VF, we'll just retry vectorisation
> > with a different vector size/mode combination, with the narrowest
> > element having smaller vectors.
> >
> > But yeah, I guess if get_vectype_for_scalar_type returns a vector
> > with too many units for BB vectorisation, we could try asking for
> > a different type with the "right" number of units, rather than
> > failing and falling back to the next iteration of the outer loop.
> > I'll give that a go.
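The BB-vectorisation fallback being discussed can be sketched roughly as follows (illustrative Python only, not the actual patch; the supported-mode set and helper names are invented stand-ins for real target queries):

```python
# Sketch of the fallback discussed above: if the natural vector type
# for a scalar type has more lanes than the SLP group provides,
# retry with exactly the group's lane count instead of failing.
# Illustrative only; SUPPORTED stands in for target capability checks.

NATURAL_NUNITS = {"QI": 16, "HI": 8, "SI": 4}   # 128-bit defaults
SUPPORTED = {("QI", 16), ("QI", 8), ("HI", 8), ("HI", 4),
             ("SI", 4), ("SI", 2)}

def vectype_for(scalar, nunits=0):
    if nunits == 0:
        nunits = NATURAL_NUNITS[scalar]
    return f"V{nunits}{scalar}" if (scalar, nunits) in SUPPORTED else None

def bb_slp_vectype(scalar, group_lanes):
    if NATURAL_NUNITS[scalar] > group_lanes:
        # The natural type would force unrolling of the block;
        # ask for the "right" number of units instead.
        return vectype_for(scalar, group_lanes)
    return vectype_for(scalar)

# An SLP group of 4 shorts: V8HI would need unrolling, so use V4HI.
print(bb_slp_vectype("HI", 4))  # V4HI
```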
>
> This is 16/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg02063.html
> It was easier than I'd expected and cleans up some nasty codegen for
> mixed-size SLP on AArch64, so thanks :-)
>
> On iterating over VFs:
>
> >> (that's probably what we should do from the start - get same number
> >> of lane vector types in BB vectorization).
> >
> > It's still useful to have different numbers of lanes for both loop
> > and BB vectorisation.  E.g. if we're applying SLP to an operation on
> > 8 ints and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
> > vector archs.
> >
> >> It's when you introduce multiple sizes then the outer loop over all
> >> sizes comparing costs becomes somewhat obsolete... this outer
> >> loop should then instead compare different VFs (which also means
> >> possible extra unrolling beyond the vector size need?).
> >
> > This is effectively what we're doing for the SVE arrangement above.
> > But replacing targetm.vectorize.autovectorize_vector_sizes with
> > targetm.vectorize.autovectorize_vector_modes means that we can
> > also try multiple vector archs with the same VF (e.g. 128-bit SVE
> > vs. 128-bit Advanced SIMD).
>
> See 12/n: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01831.html
> for how this iteration works in practice.  The AArch64 port specifies
> four ways of picking vector types, depending on which elements (if any)
> use 64 bits instead of 128 bits.  Each of these approaches then determines
> a VF, based on the element types that are actually used.  Sometimes we
> can end up trying the same VF and vector types multiple times, but that's
> easy to fix -- will post a patch soon as preparation for picking the
> "best" size.
>
> I think iterating on these approaches is better than iterating directly
> on candidate VFs for the reason mentioned above: full-vector 128-bit SVE
> and full-vector 128-bit Advanced SIMD both provide the same VF, but we
> want to try them both.
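The iteration strategy described here, looping over candidate base modes rather than over sizes or VFs and letting each attempt determine its own VF, can be modelled as a small sketch (illustrative Python; the mode names follow the thread, the cost numbers are invented):

```python
# Model of iterating over autovectorize_vector_modes: each base mode
# determines its own VF, two modes may share a VF (e.g. 128-bit SVE
# vs. 128-bit Advanced SIMD), and costs are compared on a common
# basis (here, cost per scalar iteration).  Costs are invented.

def best_attempt(candidate_modes, vf_and_cost):
    attempts = []
    for mode in candidate_modes:
        vf, body_cost = vf_and_cost[mode]
        attempts.append((mode, vf, body_cost / vf))
    # Keep the attempt with the lowest per-scalar-iteration cost.
    return min(attempts, key=lambda a: a[2])

costs = {"VNx16QI": (16, 10.0),   # full SVE vector
         "V16QI":   (16, 12.0),   # Advanced SIMD, same VF
         "VNx8QI":  (8, 7.0)}     # half SVE vector, smaller VF
print(best_attempt(["VNx16QI", "V16QI", "VNx8QI"], costs))
```

Note that VNx16QI and V16QI give the same VF here but are still tried separately, which is the point of iterating over modes rather than VFs.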
>
> And like you say, if base vector mode M1 gives a VF of X and base vector
> mode M2 gives a VF of X/2, and if 2*M2 appears cheaper than M1, we could
> in future have the option of doubling the VF for M2 artificially as a
> way of getting interleave-based unrolling.  So in that case we'd have
> two different approaches for vectorising at VF X even if we stick to the
> same vector architecture.
>
> So I don't think starting with the VF and calculating other things from
> that is intrinsically better.  We'd then have to have a subloop that
> iterates over Advanced SIMD vs. SVE, and potentially a further subloop
> that iterates over unroll factors.
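The artificial VF-doubling idea can be sketched like this (illustrative Python; M1 and M2 are the abstract modes from the paragraph above, and the per-scalar costs are invented):

```python
# Sketch of interleave-based unrolling as a second approach at the
# same VF: if M2's natural VF divides the target VF, unroll M2 to
# reach it, giving a second candidate alongside M1.  Costs are
# per scalar iteration and invented for illustration.

def candidates_at_vf(target_vf, modes):
    out = []
    for mode, (natural_vf, cost_per_iter) in modes.items():
        if target_vf % natural_vf == 0:
            unroll = target_vf // natural_vf
            # Unrolling repeats each vector statement `unroll` times
            # but leaves the per-scalar-iteration cost unchanged.
            out.append((mode, unroll, cost_per_iter))
    return out

modes = {"M1": (8, 1.0), "M2": (4, 0.8)}
for mode, unroll, cost in candidates_at_vf(8, modes):
    print(mode, "unroll", unroll, "cost/iter", cost)
```

In this toy example M2 unrolled twice would win on cost, which is the "2*M2 appears cheaper than M1" case from the paragraph.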
OK, fair enough.  The patch is OK.

Thanks,
Richard.

> Thanks,
> Richard