Re: [RFC] Masking vectorized loops with bound not aligned to VF.

Ilya Enkovich Thu, 17 Sep 2015 08:07:41 -0700

2015-09-16 15:30 GMT+03:00 Richard Biener <rguent...@suse.de>:
> On Mon, 14 Sep 2015, Kirill Yukhin wrote:
>
>> Hello,
>> I'd like to initiate discussion on vectorization of loops which
>> boundaries are not aligned to VF. Main target for this optimization
>> right now is x86's AVX-512, which features per-element embedded masking
>> for all instructions. The main goal for this mail is to agree on overall
>> design of the feature.
>>
>> This approach was presented @ GNU Cauldron 2015 by Ilya Enkovich [1].
>>
>> Here's a sketch of the algorithm:
>>   1. Add check on basic stmts for masking: possibility to introduce index 
>> vector and
>>      corresponding mask
>>   2. At the check if statements are vectorizable we additionally check if 
>> stmts
>>      need and can be masked and compute masking cost. Result is stored in 
>> `stmt_vinfo`.
>>      We are going  to mask only mem. accesses, reductions and modify mask 
>> for already
>>      masked stmts (mask load, mask store and vect. condition)
>
> I think you also need to mask divisions (for integer divide by zero) and
> want to mask FP ops which may result in NaNs or denormals (because that's
> generally to slow down execution a lot in my experience).
>
> Why not simply mask all stmts?


Hi,

Statement masking may be not free. Especially if we need to transform
mask somehow to do it. It also may be unsupported on a platform (e.g.
for AVX-512 not all instructions support masking) but still not be a
problem to mask a loop. BTW for AVX-512 masking doesn't boost
performance even if we have some special cases like NaNs. We don't
consider exceptions in vector code (and it seems to be a case now?)
otherwise we would need to mask them also.

>
>>   3. Make a decision about masking: take computed costs and est. iterations 
>> count
>>      into consideration
>>   4. Modify prologue/epilogue generation according decision made at 
>> analysis. Three
>>      options available:
>>     a. Use scalar remainder
>>     b. Use masked remainder. Won't be supported in first version
>>     c. Mask main loop
>>   5.Support vectorized loop masking:
>>     - Create stmts for mask generation
>>     - Support generation of masked vector code (create generic vector code 
>> then
>>       patch it w/ masks)
>>       -  Mask loads/stores/vconds/reductions only
>>
>>  In first version (targeted v6) we're not going to support 4.b and loop
>> mask pack/unpack. No `pack/unpack` means that masking will be supported
>> only for types w/ the same size as index variable
>
> This means that if ncopies for any stmt is > 1 masking won't be supported,
> right?  (you'd need two or more different masks)

We don't think it is a very important feature to have in initial
version. It can be added later and shouldn't affect overall
implementation design much. BTW currently masked loads and stores
don't support masks of other sizes and don't do masks pack/unpack.

>
>>
>> [1] - 
>> https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf
>>
>> What do you think?
>
> There was the idea some time ago to use single-iteration vector
> variants for prologues/epilogues by simply overlapping them with
> the vector loop (and either making sure to mask out the overlap
> area or make sure the result stays the same).  This kind-of is
> similar to 4b and thus IMHO it's better to get 4b implemented
> rather than trying 4c.  So for example
>
>  int a[];
>  for (i=0; i < 13; ++i)
>    a[i] = i;
>
> would be vectorized (with v4si) as
>
>  for (i=0; i < 13 / 4; ++i)
>    ((v4si *)a)[i] = { ... };
>  *(v4si *)(&a[9]) = { ... };
>
> where the epilogue store of course would be unaligned.  The masked
> variant can avoid the data pointer adjustment and instead use a masked
> store.
>
> OTOH it might be that the unaligned scheme is as efficient as the
> masked version.  Only the masked version is more trivially correct,
> data dependences can make the above idea not work without masking
> out stores like for
>
>  for (i=0; i < 13; ++i)
>    a[i] = a[i+1];
>
> obviously the overlapping iterations in the epilogue would
> compute bogus values.  To avoid this we can merge the result
> with the previously stored values (using properly computed masks)
> before storing it.
>
> Basically both 4b and the above idea need to peel a vector
> iteration and "modify" it.  The same trick can be applied to
> prologue loops of course.
>
> Any chance you can try working on 4b instead?  It also feels
> like it would need less hacks throughout the vectorizer
> (basically post-processing the generated vector loop).
>
> If 4b is implemented I don't think 4c is worth doing.

I agree 4b is a more universal approach and should cover more cases.
But we consider 4c is the first step to have 4b. I think significant
difference of 4b from a replay you described is that in 4b masked
remainder may be used to execute short trip count loops which mean we
can execute masked remainder without actually executing main
vectorized loop. It causes various implications and make peeling more
complex due to multiple remainder entries.

We see the first goal as a 4c implementation with following re-usage
of all its parts later (GCC 7) for 4b. I don't see how 4c may require
more hacking in a vectorizer. Basically 4c consists of two parts:

1. Masks applicability analysis for all statements (require/not
require [mem ref, reduction, exceptions etc.], need/don't need [due to
perf considerations], maskable/not maskable, masking cost). This is
integrated into current statement analysis and is used by both 4b and
4c to make a decision about a remainder.
2. Loop transformation (masking). We are not sure yet what is the best
way to implement it. We see two options here:
  a) transform already vectorized loop as a separate post-processing
  b) initially generate masked statements on-the-fly in vect_transform_stmt

2b looks simpler to implement because we have all info required for
masking in place. 2a seems as a more universal approach and can be
re-used for 4b for peeled iteration. The problem for 2a is that we may
miss required vec_info for peeled statements. E.g. if we have several
ncopies of some statement then we have to have a way to identify which
part of the mask corresponds to which statement.

Do you have an idea how "masking" is better be organized to be usable
for both 4b and 4c?

Thanks,
Ilya


>
> Thanks,
> Richard.

Re: [RFC] Masking vectorized loops with bound not aligned to VF.

Reply via email to