Re: [RFC] ldist: Recognize rawmemchr loop patterns

Richard Biener via Gcc-patches Thu, 20 May 2021 02:25:18 -0700

On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
<stefa...@linux.ibm.com> wrote:
>
> On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > <stefa...@linux.ibm.com> wrote:
> > >
> > > [snip]
> > >
> > > Please find attached a new version of the patch.  A major change compared 
> > > to
> > > the previous patch is that I created a separate pass which hopefully makes
> > > reviewing also easier since it is almost self-contained.  After realizing 
> > > that
> > > detecting loops which mimic the behavior of rawmemchr/strlen functions 
> > > does not
> > > really fit into the topic of loop distribution, I created a separate pass.
> >
> > It's true that these reduction-like patterns are more difficult than
> > the existing
> > memcpy/memset cases.
> >
> > >  Due
> > > to this I was also able to play around a bit and schedule the pass at 
> > > different
> > > times.  Currently it is scheduled right before loop distribution where 
> > > loop
> > > header copying already took place which leads to the following effect.
> >
> > In fact I'd schedule it after loop distribution so there's the chance that 
> > loop
> > distribution can expose a loop that fits the new pattern.
> >
> > >  Running
> > > this setup over
> > >
> > > char *t (char *p)
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > the new pass transforms
> > >
> > > char * t (char * p)
> > > {
> > >   char _1;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> > >   p_6 = p_8 + 1;
> > >   _1 = *p_6;
> > >   if (_1 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # p_2 = PHI <p_6(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> > >   return p_9;
> > >
> > > }
> > >
> > > into
> > >
> > > char * t (char * p)
> > > {
> > >   char * _5;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 4>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >   _5 = p_3(D) + 1;
> > >   p_10 = .RAWMEMCHR (_5, 0);
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> > >   return p_9;
> > >
> > > }
> > >
> > > which is fine so far.  However, I haven't made up my mind so far whether 
> > > it is
> > > worthwhile to spend more time in order to also eliminate the "first 
> > > unrolling"
> > > of the loop.
> >
> > Might be a phiopt transform ;)  Might apply to quite some set of
> > builtins.  I wonder how the strlen case looks like though.
> >
> > > I gave it a shot by scheduling the pass prior pass copy header
> > > and ended up with:
> > >
> > > char * t (char * p)
> > > {
> > >   <bb 2> [local count: 118111600]:
> > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > >   return p_5;
> > >
> > > }
> > >
> > > which seems optimal to me.  The downside of this is that I have to 
> > > initialize
> > > scalar evolution analysis which might be undesired that early.
> > >
> > > All this brings me to the question where do you see this peace of code 
> > > running?
> > > If in a separate pass when would you schedule it?  If in an existing pass,
> > > which one would you choose?
> >
> > I think it still fits loop distribution.  If you manage to detect it
> > with your pass
> > standalone then you should be able to detect it in loop distribution.
>
> If a loop is distributed only because one of the partitions matches a
> rawmemchr/strlen-like loop pattern, then we have at least two partitions
> which walk over the same memory region.  Since a rawmemchr/strlen-like
> loop has no body (neglecting expression-3 of a for-loop where just an
> increment happens) it is governed by the memory accesses in the loop
> condition.  Therefore, in such a case loop distribution would result in
> performance degradation.  This is why I think that it does not fit
> conceptually into ldist pass.  However, since I make use of a couple of
> helper functions from ldist pass, it may still fit technically.
>
> Since currently all ldist optimizations operate over loops where niters
> is known and for rawmemchr/strlen-like loops this is not the case, it is
> not possible that those optimizations expose a loop which is suitable
> for rawmemchr/strlen optimization.


True - though that seems to be an unnecessary restriction.

>  Therefore, what do you think about
> scheduling rawmemchr/strlen optimization right between those
> if-statements of function loop_distribution::execute?
>
>    if (nb_generated_loops + nb_generated_calls > 0)
>      {
>        changed = true;
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                           loc, "Loop%s %d distributed: split to %d loops "
>                           "and %d library calls.\n", str, loop->num,
>                           nb_generated_loops, nb_generated_calls);
>
>        break;
>      }
>
>    // rawmemchr/strlen like loops
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

but we won't ever arrive here because of the niters condition.  But
yes, doing the pattern matching in the innermost loop processing code
looks good to me - for the specific case it would be

      /* Don't distribute loop if niters is unknown.  */
      tree niters = number_of_latch_executions (loop);
      if (niters == NULL_TREE || niters == chrec_dont_know)
---> here?
        continue;

> > Can you
> > explain what part is "easier" as standalone pass?
>
> Yea that term is rather misleading.  It was probably easier for me to
> understand the underlying problem and to play around with my code.
> There are no technical reasons for a standalone pass.

And sorry for the late response...

Richard.

> Cheers,
> Stefan
>
> >
> > > Another topic which came up is whether there exists a more elegant 
> > > solution to
> > > my current implementation in order to deal with stores (I'm speaking of 
> > > the `if
> > > (store_dr)` statement inside of function transform_loop_1).  For example,
> > >
> > > extern char *p;
> > > char *t ()
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > ends up as
> > >
> > > char * t ()
> > > {
> > >   char * _1;
> > >   char * _2;
> > >   char _3;
> > >   char * p.1_8;
> > >   char _9;
> > >   char * p.1_10;
> > >   char * p.1_11;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   p.1_8 = p;
> > >   _9 = *p.1_8;
> > >   if (_9 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > >   _1 = p.1_10 + 1;
> > >   p = _1;
> > >   _3 = *_1;
> > >   if (_3 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # _2 = PHI <_1(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > >   return p.1_11;
> > >
> > > }
> > >
> > > where inside the loop a load and store occurs.  For a rawmemchr like loop 
> > > I
> > > have to show that we never load from a memory location to which we write.
> > > Currently I solve this by hard coding those facts which are not generic 
> > > at all.
> > > I gave compute_data_dependences_for_loop a try which failed to determine 
> > > the
> > > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > > there are more generic solutions to express this in contrast to my 
> > > current one?
> >
> > So the example loop is not valid to be trasformed to rawmemchr since it's
> > valid to call it with p = &p; - but sure, once you pass the first *p != 0 
> > check
> > things become fishy but GCC isn't able to turn that into a non-dependence.
> >
> > Why is the case of stores inside the loop important?  In fact that you
> > think it is
> > makes a case for integrating this with loop distribution since loop 
> > distribution
> > would be able to prove (if possible) that the stores can be separated into
> > a different loop.
> >
> > And sorry for the delay in answering ...
> >
> > Thanks,
> > Richard.
> >
> > >
> > > Thanks again for your input so far.  Really appreciated.
> > >
> > > Cheers,
> > > Stefan

Re: [RFC] ldist: Recognize rawmemchr loop patterns

Reply via email to