On Wed, 18 Aug 2021, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 18 Aug 2021, Hongtao Liu wrote:
> >
> >> On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >> >
> >> > On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >> > >
> >> > > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguent...@suse.de> wrote:
> >> > > >
> >> > > > On Wed, 18 Aug 2021, Richard Biener wrote:
> >> > > >
> >> > > > >
> >> > > > > So in the end I seem to be able to combine AVX & AVX512, arriving
> >> > > > > at the following, which passes basic testing.  I will now see about
> >> > > > > teaching the vectorizer the required "promotion" to handle
> >> > > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> >> > > > >
> >> > > > > Meanwhile, do you see any hole in the below?  If not I'll
> >> > > > > do mask_scatter_store accordingly (that will be simpler since
> >> > > > > AVX doesn't have scatter).
> >> > > >
> >> > > > There seems to be one more complication ... we have
> >> > > >
> >> > > > (define_expand "avx2_gatherdi<mode>"
> >> > > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> >> > > >                    (unspec:VEC_GATHER_MODE
> >> > > >                      [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand")
> >> > > >                       (mem:<ssescalarmode>
> >> > > >                         (match_par_dup 6
> >> > > >                           [(match_operand 2 "vsib_address_operand")
> >> > > >                            (match_operand:<VEC_GATHER_IDXDI>
> >> > > >                               3 "register_operand")
> >> > > >
> >> > > > but VEC_GATHER_IDXDI is
> >> > > >
> >> > > > (define_mode_attr VEC_GATHER_IDXDI
> >> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> >> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> >> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> >> > > >
> >> > > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> >> > > VEX.128 version: For dword indices, the instruction will gather four
> >> > > single-precision floating-point values. For
> >> > > qword indices, the instruction will gather two values and zero the
> >> > > upper 64 bits of the destination.
> >> > > VEX.256 version: For dword indices, the instruction will gather eight
> >> > > single-precision floating-point values. For
> >> > > qword indices, the instruction will gather four values and zero the
> >> > > upper 128 bits of the destination.
> >> > >
> >> > > So, for the expander name, it should be v2sfv2di and v4sfv4di for
> >> > > IDXDI under avx2, and v8sfv8di under avx512.
> >> > >
> >> > > ----cut pattern----
> >> > > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
> >> > >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >> > >         (unspec:VEC_GATHER_MODE
> >> > >           [(pc)
> >> > >            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
> >> > >              [(unspec:P
> >> > >                 [(match_operand:P 2 "vsib_address_operand" "Tv")
> >> > >                  (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
> >> > >                  (match_operand:SI 5 "const1248_operand" "n")]
> >> > >                 UNSPEC_VSIBADDR)])
> >> > >            (mem:BLK (scratch))
> >> > >            (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
> >> > >           UNSPEC_GATHER))
> >> > >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >> > >   "TARGET_AVX2"
> >> > > {
> >> > >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
> >> > >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6, %x0|%x0, %6, %4}";
> >> > > ----cut end---------------
> >> > > We are using the operand modifier %x0 as a trick to force printing an xmm register.
> >> > (define_mode_attr VEC_GATHER_SRCDI
> >> >       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >> >        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
> >> >        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >> >        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >> >
> >> > (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
> >> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >> >         (unspec:VEC_GATHER_MODE
> >> >           [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
> >> >            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >> >              [(unspec:P
> >> >                 [(match_operand:P 3 "vsib_address_operand" "Tv")
> >> >                  (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
> >> >                  (match_operand:SI 6 "const1248_operand" "n")]
> >> >                 UNSPEC_VSIBADDR)])
> >> >            (mem:BLK (scratch))
> >> >            (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
> >> >           UNSPEC_GATHER))
> >> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >> >   "TARGET_AVX2"
> >> >   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> >> >
> >> > Or only print operands[1], which has mode VEC_GATHER_SRCDI, as the index.
> >> 
> >> Typo: it should be operands[2] in the one below.
> >> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
> >>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> >>         (unspec:VEC_GATHER_MODE
> >>           [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
> >>            (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >>              [(unspec:P
> >>                 [(match_operand:P 3 "vsib_address_operand" "Tv")
> >>                  (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
> >>                  (match_operand:SI 6 "const1248_operand" "n")]
> >>                 UNSPEC_VSIBADDR)])
> >>            (mem:BLK (scratch))
> >>            (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
> >>           UNSPEC_GATHER))
> >>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >>   "TARGET_AVX2"
> >>   "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"
> >
> > It's somewhat of a mess.  When we expose
> >
> >   mask_gather_loadv8sfv8di
> >
> > for example, then with -mavx512f the vectorizer will generate a
> > vector mask for v8sf, which will be an AVX2-style mask.  But in
> > the end this will use gen_avx512f_gatherdiv16sf and require an
> > AVX512-style mask.
> 
> I think it would be OK/sensible to use the larger of the index or
> result vectors to determine the mask, if that helps.  There just
> wasn't any need to make a distinction for SVE, since there the
> mask type is determined entirely by the number of elements.

I don't think that would help - for a V4SFmode gather with V4DImode
indices the x86 ISA expects the AVX2 mask to be V4SImode, that is,
the mask corresponds to the data mode, not to the index mode.
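
To make that concrete, here is a scalar model (a sketch only; the function
name and the simplified merge semantics are mine, following the ISA text
quoted earlier in the thread) of a V4SF gather with V4DI indices.  Note the
mask has one dword element per *data* lane, i.e. it is effectively V4SImode:

```c
#include <stdint.h>

/* Scalar model of a V4SFmode gather with V4DImode indices: four qword
   indices select four float elements.  The mask matches the data mode
   (four dwords, V4SImode), not the index mode (V4DImode).  The sign
   bit of each mask dword selects the lane; unselected lanes keep
   their previous contents (merge).  */
static void
gather_v4sf_v4di (float dst[4], const float *base,
                  const int64_t idx[4], const int32_t mask[4])
{
  for (int i = 0; i < 4; i++)
    if (mask[i] < 0)            /* sign bit set: lane is active */
      dst[i] = base[idx[i]];    /* gather from base + scaled index */
}
```
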

The real issue is that we're asking for a mask mode using just
the data mode but not the instruction.  So with a V8SFmode data
mode and -mavx512f -mno-avx512vl it appears we want an AVX2
mask mode, but in reality the gather instruction we can use
has EVEX.512 encoding when the index mode is V8DImode
and is thus available with -mavx512f.

So it appears that we'd need to pass another mode to the
get_mask_mode target hook that determines the instruction
encoding ...

I'm also not sure whether this wouldn't open up a can of worms
with the mask computation stmts, which would not know which
context (maybe there are two) they are used in.  We're
fundamentally using two different vector sizes and ISA subsets
here :/  Maybe vinfo->vector_mode should fully determine whether
we're using EVEX or VEX encoding in the loop, and thus we should
reject attempts to do VEX-encoded vectorization in EVEX mode?
I suppose this is how you avoid getting a mix of SVE and ADVSIMD
vectorization in one loop?

But since VEX/EVEX use the same modes (but SVE/ADVSIMD do not)
this will get interesting.

Oh, and -mavx512bw would also be prone to using (different size again)
AVX2 vectors at the same time.  Maybe the "solution" is to simply
not expose the EVEX modes to the vectorizer when there's such
a possibility, thus only when all of F, VL and BW are available?
At least when thinking about relaxing the vector sizes we can
deal with.

I've opened this can of worms with

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 97745a830a2..db938f79c9c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -3749,13 +3749,15 @@ vect_gather_scatter_fn_p (vec_info *vinfo, bool read_p, bool masked_p,
 
   for (;;)
     {
-      tree offset_vectype = get_vectype_for_scalar_type (vinfo, offset_type);
-      if (!offset_vectype)
-       return false;
-
+      tree offset_vectype
+       = get_related_vectype_for_scalar_type (vinfo->vector_mode, offset_type,
+                                              TYPE_VECTOR_SUBPARTS (vectype));

to even get the mixed-size optabs to be considered.  If we were to
expose the real instructions instead we'd have
mask_gather_loadv16sfv8di, but as said, the semantics of such an
optab entry would need to be defined.  That's the way the current
builtin_gather target hook code works, so it seems that would be
a working approach, side-stepping the mixed-size problems.
Here we'd define that excess elements on either the data or the
index operand are ignored (on the output operand they are zeroed).
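
That proposed "excess elements ignored, excess output zeroed" semantics
could be modeled in scalar code like this (a sketch only; the function
name is hypothetical and the mask is modeled as a plain 8-bit integer,
one bit per active lane):

```c
#include <stdint.h>

/* Scalar model of a hypothetical mask_gather_loadv16sfv8di optab:
   the output has 16 float lanes but there are only 8 qword indices,
   so only the low 8 lanes are gathered.  Masked-off lanes and the
   excess high output lanes are zeroed.  */
static void
mask_gather_v16sf_v8di (float dst[16], const float *base,
                        const int64_t idx[8], uint8_t mask)
{
  for (int i = 0; i < 8; i++)
    dst[i] = ((mask >> i) & 1) ? base[idx[i]] : 0.0f;
  for (int i = 8; i < 16; i++)
    dst[i] = 0.0f;              /* excess output elements are zeroed */
}
```
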

Richard.
