On 05/05/2021 13:34, Richard Biener wrote:
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

I tried to see what IVOPTs would make of this and it is able to analyze the
IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
could be used as the base for the other (loop 2). I don't know if this is
where you'd want such optimizations to be made, on one side I think it would
be great as it would also help with non-vectorized loops as you alluded to.
Hmm, OK.  So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop final value of the
IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
And for a followup transform to catch this it would need to compute
the final value of the IV and then match this up with the initial
value computation.  I suppose FRE could be taught to do this, at
least for very simple cases.
I will admit I am not at all familiar with how FRE works. I know it exists, as running it often breaks my vector patches :P But that's about all I know. I will have a look and see if it makes sense from my perspective to address it there, because ...

Anyway, I digress. Back to the main question of this patch. How do you suggest
I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
if one ends where the other begins?
I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.

The problem I found with my approach is that it only tackles the vectorized epilogues, and that leads to regressions. I don't have the example at hand, but what I saw was that increased register pressure led to a spill in the hot path. I believe this was caused by the vectorized epilogue loop using the updated pointers as the base for its DRs (in this case there were three DRs: two loads and one store) while the scalar epilogue still used the original base + niters, since this data_reference approach only changes the vectorized epilogues.


  I'll note there's also
the issue of epilogue vectorization and reductions where we seem
to not re-use partially reduced reduction vectors but instead
reduce to a scalar in each step.  That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the
main vector loop to the epilogue loops.  Like for

double foo (double *a, int n)
{
   double sum = 0.;
   for (int i = 0; i < n; ++i)
     sum += a[i];
   return sum;
}

with AVX512 we get three reductions to scalars instead of
a partial reduction from zmm to ymm before the first vectorized
epilogue followed by a reduction from ymm to xmm before the second
(the jump-arounds for the epilogues need to jump to the further
reduction piece obviously).
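
A scalar emulation of the partial-reduction scheme described above may make the intended shape clearer. This is only a sketch: the function name is hypothetical, and the lane counts 8, 4 and 2 are assumed stand-ins for doubles in zmm, ymm and xmm registers. The accumulator is folded in half before each narrower epilogue, and only one reduction to a scalar happens at the very end.

```c
#include <stddef.h>

/* Hypothetical illustration, not vectorizer output.  acc[] emulates one
   vector accumulator; 8/4/2 lanes stand in for zmm/ymm/xmm doubles.  */
double sum_partial_reduce (const double *a, size_t n)
{
  double acc[8] = { 0 };
  size_t i = 0;

  /* Main vector loop: 8 lanes per iteration.  */
  for (; i + 8 <= n; i += 8)
    for (int l = 0; l < 8; ++l)
      acc[l] += a[i + l];

  /* Partial reduction 8 -> 4 lanes: add the upper half onto the lower.  */
  for (int l = 0; l < 4; ++l)
    acc[l] += acc[l + 4];

  /* First vectorized epilogue: 4 lanes, continues the same accumulator.  */
  for (; i + 4 <= n; i += 4)
    for (int l = 0; l < 4; ++l)
      acc[l] += a[i + l];

  /* Partial reduction 4 -> 2 lanes.  */
  for (int l = 0; l < 2; ++l)
    acc[l] += acc[l + 2];

  /* Second vectorized epilogue: 2 lanes.  */
  for (; i + 2 <= n; i += 2)
    for (int l = 0; l < 2; ++l)
      acc[l] += a[i + l];

  /* The only reduction to a scalar.  */
  double sum = acc[0] + acc[1];

  /* Scalar tail for the last element, if any.  */
  for (; i < n; ++i)
    sum += a[i];
  return sum;
}
```

Note that when the main loop is skipped the folds still run on a zero accumulator, which is harmless here; in generated code the jump-arounds would have to target the matching fold instead. The summation order also differs from the scalar loop, so results can differ in the last bits for general floating-point input.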

So I think we want to record IVs we generate (the reduction IVs
are already nicely associated with the stmt-infos), one might
consider to refer to them from the dr_vec_info for example.

It's just going to be "interesting" to wire everything up
correctly with all the jump-arounds we have ...
I have a downstream hack for the reductions, but it only worked for partial-vector-usage, as there you have the guarantee it's the same vector-mode, so you don't need to faff around with half and full vectors. Obviously what you are suggesting has much wider applications and, not surprisingly, I think Richard Sandiford also pointed out to me that these are somewhat related and we might be able to reuse the IV-creation to manage it all. But I feel like I am currently light years away from that.

I had started to look at removing the data_reference updating we have now and dealing with this in the 'create_iv' calls from 'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I thought it would be good to discuss it with you first. This will require keeping track of the 'end-value' of the IV, which, for loops where we can skip the previous loop, means we will need to construct a phi-node containing the updated pointer and the initial base. But I'm not entirely sure where to keep track of all this. Also, I don't know if I can replace the base address of the data_reference right there at the 'create_iv' call: can a data_reference be used multiple times in the same loop?
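
To make the phi-node point concrete, here is a source-level sketch of the merge. All names are hypothetical and the factor of 4 is an assumed vectorization factor; the conditional assignment to p_epi plays the role of the phi, selecting the initial base on the skip edge and the main loop's end-value otherwise.

```c
#include <stddef.h>

/* Hypothetical illustration of merging the pointer IV across a skip edge.  */
void negate_with_skip (int *a, size_t n)
{
  int *p_main = a;            /* pointer IV of the main loop */
  int skipped = n < 4;        /* skip edge taken: main loop never runs */

  if (!skipped)
    for (size_t i = 0; i + 4 <= n; i += 4, p_main += 4)
      for (int l = 0; l < 4; ++l)
        p_main[l] = -p_main[l];

  /* Conceptually p_epi = PHI <a (skip edge), p_main (fall-through)>.  */
  int *p_epi = skipped ? a : p_main;

  /* Epilogue starts from the merged pointer, not from a + niters.  */
  for (int *end = a + n; p_epi < end; ++p_epi)
    *p_epi = -*p_epi;
}
```

In this toy form the two phi arguments happen to be equal on the skip path, which is exactly why a compiler could fold the merge away; in the vectorizer the two values come from different statements and the phi has to be built explicitly.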

I'll go do a bit more nosing around this idea and the ivmap you mentioned before. Let me know if you have any ideas on what this should all look like, even if it's an 'in an ideal world' version.

Andre

On 04/05/2021 10:56, Richard Biener wrote:
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this leads to
better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map of iv_updates, each of which always
contains an updated pointer that is captured in vectorizable_{load,store}. An
iv_update may also contain a skip_edge in case we decide the vectorization can
be skipped in 'vect_do_peeling'. During the epilogue update this map of
iv_updates is then checked to see if it contains an entry for a data_reference.
If it does, the entry is used accordingly; if not, it reverts to the old
behavior of using the niters to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param
vect-partial-vector-usage=1` for SVE. We found that one of the main problems
for the codegen here was coming from unnecessary conversions caused by the way
we update the data_references in the epilogue.

This patch passes regression tests in aarch64-linux-gnu, but the codegen is
still not optimal in some cases. Specifically those where we have a scalar
epilogue, as this does not use the data_references and will rely on the
gimple scalar code, thus again constructing a memory access using the niters.
This is a limitation for which I haven't quite worked out a solution yet, and
it does cause some minor regressions due to unfortunate spills.

Let me know what you think and if you have ideas of how we can better achieve
this.
Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).
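
One possible shape for such a testcase is sketched below. It is an assumption about what would be useful, not a testcase from this thread: the function names are hypothetical and the manual unroll by 4 merely stands in for a vectorized main loop. Variant A recomputes the second loop's start from base + count, as the current epilogue update does with niters; variant B carries the first loop's final pointer over.

```c
#include <stddef.h>

/* Hypothetical testcase; names and the unroll factor are illustrative.  */

/* Variant A: second loop's pointer is rebuilt from the base and the
   iteration count, mimicking the niters-based epilogue update.  */
void add_one_niters (int *a, size_t n)
{
  size_t main_iters = n / 4;
  int *p = a;
  for (size_t i = 0; i < main_iters; ++i, p += 4)
    {
      p[0] += 1; p[1] += 1; p[2] += 1; p[3] += 1;
    }
  int *q = a + main_iters * 4;   /* recomputed, ignores p */
  for (size_t i = main_iters * 4; i < n; ++i, ++q)
    *q += 1;
}

/* Variant B: second loop continues from the first loop's final pointer.  */
void add_one_carried (int *a, size_t n)
{
  size_t main_iters = n / 4;
  int *p = a;
  for (size_t i = 0; i < main_iters; ++i, p += 4)
    {
      p[0] += 1; p[1] += 1; p[2] += 1; p[3] += 1;
    }
  for (int *end = a + n; p < end; ++p)   /* p carried over */
    *p += 1;
}
```

Comparing the optimized dumps or assembly of the two variants would show whether FRE and IVOPTs manage to turn A into B on their own.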

That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as
initial value.  For each DR group we'd create a single IV (we can
even do better in case we have load + store of the "same" group).

We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the appropriate
initial value & adjusted update).

Richard.

Kind regards,
Andre Vieira



