On Tue, 23 Mar 2021, Xionghu Luo wrote:

> 
> 
> On 2020/12/23 00:53, Richard Biener wrote:
> > On December 21, 2020 10:03:43 AM GMT+01:00, Xiong Hu Luo
> > <luo...@linux.ibm.com> wrote:
> >> Here comes another case that requires running a pass once more.  As
> >> this is not the commonly suggested direction for solving such problems,
> >> I am not quite sure whether it is still a reasonable fix.  The source
> >> code is something like:
> >>
> >> ref = ip + *hslot;
> >> while (ip < in_end - 2) {
> >>   unsigned int len = 2;
> >>   len++;
> >>   for () {
> >>     do len++;
> >>     while (len < maxlen && ref[len] == ip[len]); // sink code here.
> >>     break;
> >>   }
> >>   len -= 2;
> >>   ip++;
> >>   ip += len + 1;
> >>   if (ip >= in_end - 2)
> >>     break;
> >> }
> >>
> >> Before ivopts, the gimple for inner while loop is xxx.c.172t.slp1:
> >>
> >>   <bb 31> [local count: 75120046]:
> >>   # len_160 = PHI <len_161(30), len_189(58)>
> >>   len_189 = len_160 + 1;
> >>   _423 = (sizetype) len_189;
> >>   _424 = ip_229 + _423;
> >>   if (maxlen_186 > len_189)
> >>     goto <bb 32>; [94.50%]
> >>   else
> >>     goto <bb 33>; [5.50%]
> >>
> >>   <bb 32> [local count: 70988443]:
> >>   _84 = *_424;
> >>   _86 = ref_182 + _423;
> >>   _87 = *_86;
> >>   if (_84 == _87)
> >>     goto <bb 58>; [94.50%]
> >>   else
> >>     goto <bb 33>; [5.50%]
> >>
> >>   <bb 58> [local count: 67084079]:
> >>   goto <bb 31>; [100.00%]
> >>
> >>   <bb 33> [local count: 14847855]:
> >>   # len_263 = PHI <len_160(32), len_160(31)>
> >>   # _262 = PHI <_423(32), _423(31)>
> >>   # _264 = PHI <_424(32), _424(31)>
> >>   len_190 = len_263 + 4294967295;
> >>   if (len_190 <= 6)
> >>     goto <bb 34>; [0.00%]
> >>   else
> >>     goto <bb 36>; [100.00%]
> >>
> >> Then in ivopts, instructions are updated to xxx.c.174t.ivopts:
> >>
> >>   <bb 31> [local count: 75120046]:
> >>   # ivtmp.30_29 = PHI <ivtmp.30_32(30), ivtmp.30_31(58)>
> >>   _34 = (unsigned int) ivtmp.30_29;
> >>   len_160 = _34 + 4294967295;
> >>   _423 = ivtmp.30_29;
> >>   _35 = (unsigned long) ip_229;
> >>   _420 = ivtmp.30_29 + _35;
> >>   _419 = (uint8_t *) _420;
> >>   _424 = _419;
> >>   len_418 = (unsigned int) ivtmp.30_29;
> >>   if (maxlen_186 > len_418)
> >>     goto <bb 32>; [94.50%]
> >>   else
> >>     goto <bb 33>; [5.50%]
> >>
> >>   <bb 32> [local count: 70988443]:
> >>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_29 * 1];
> >>   ivtmp.30_31 = ivtmp.30_29 + 1;
> >>   _417 = ref_182 + 18446744073709551615;
> >>   _87 = MEM[(uint8_t *)_417 + ivtmp.30_31 * 1];
> >>   if (_84 == _87)
> >>     goto <bb 58>; [94.50%]
> >>   else
> >>     goto <bb 33>; [5.50%]
> >>
> >>   <bb 58> [local count: 67084079]:
> >>   goto <bb 31>; [100.00%]
> >>
> >>   <bb 33> [local count: 14847855]:
> >>   # len_263 = PHI <len_160(32), len_160(31)>
> >>   # _262 = PHI <_423(32), _423(31)>
> >>   # _264 = PHI <_424(32), _424(31)>
> >>   len_190 = len_263 + 4294967295;
> >>   if (len_190 <= 6)
> >>     goto <bb 34>; [0.00%]
> >>   else
> >>     goto <bb 36>; [100.00%]
> >>
> >> Some instructions in BB 31 are not used in the loop and could be sunk
> >> out of the loop to reduce the computation, but none of the later passes
> >> sinks them.  Running the sink_code pass once more, at least after fre5,
> >> improves the performance of this typical case by 23% because fewer
> >> instructions are executed in the loop.
> >> xxx.c.209t.sink2:
> >>
> >> Sinking _419 = (uint8_t *) _420;
> >> from bb 31 to bb 89
> >> Sinking _420 = ivtmp.30_29 + _35;
> >> from bb 31 to bb 89
> >> Sinking _35 = (unsigned long) ip_229;
> >> from bb 31 to bb 89
> >> Sinking len_160 = _34 + 4294967295;
> >> from bb 31 to bb 33
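
For reference, the effect being described can be shown on a reduced,
self-contained C sketch (function and variable names are invented for
illustration and are not the actual benchmark source): a value that is
recomputed on every iteration but only needed after the loop is moved
past the loop exit.

```c
/* Before sinking: last_match is recomputed on every iteration although
   only its final value is used after the loop.  */
static const char *scan_before(const char *ip, const char *ref, int maxlen)
{
  int len = 0;
  const char *last_match = ip;
  while (len < maxlen && ref[len] == ip[len])
    {
      last_match = ip + len;   /* computed each iteration ...  */
      len++;
    }
  return last_match;           /* ... but only used here.      */
}

/* After sinking: the address computation happens once, past the loop
   exit, so the hot loop body executes fewer instructions.  */
static const char *scan_after(const char *ip, const char *ref, int maxlen)
{
  int len = 0;
  while (len < maxlen && ref[len] == ip[len])
    len++;
  return len > 0 ? ip + len - 1 : ip;   /* sunk out of the loop */
}
```

Both variants compute the same result; the second is what the extra
sink_code run produces in spirit for BB 31 above.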
> >>
> >> I also tested SPEC2017 performance on P8LE: 544.nab_r is improved
> >> by 2.43% and there are no big changes in the other cases; the GEOMEAN
> >> improvement is quite small at 0.25%.
> >>
> >> The reason it should run after fre5 is that fre performs some PHI
> >> optimizations that expose this sinking opportunity.  The patch puts it
> >> after pass_modref based on my guess that some GIMPLE optimizations such
> >> as thread_jumps, dse, and dce could provide more opportunities for
> >> sinking code, though I am not sure this is the correct place.  I also
> >> verified that this issue exists on both x86 and AArch64.
> >> Any comments?  Thanks.
> > 
> > It definitely should be before uncprop (but context stops there). And yes,
> > re-running passes isn't the very, very best thing to do without explaining
> > it cannot be done in other ways. Not for late stage 3 anyway.
> > 
> > Richard.
> > 
> 
> Thanks.  I also tried to implement this in a separate RTL pass; which
> approach would be better?  I guess this would also be a stage1 issue...

Yes, that's also for stage1 obviously.  Can you check how the number
of sink opportunities of your RTL pass changes if you add the
2nd GIMPLE sinking pass?

I'll note that you are not sinking later stmts first (you only
walk insns in reverse, but not BBs).  GIMPLE sinking performs a
domwalk over post dominators (but it has an easier job because
of PHIs).  I guess you'd want to walk starting from loop exit
blocks (from innermost loops as you do) in reverse program order.

I'll also note that you don't have to check whether stmts you
want to sink have their uses set after it - you can emit copies
to a new pseudo at the original insn location and use those
after the loop (that of course comes at some cost).
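
One way to read the copy-to-a-new-pseudo suggestion, sketched in made-up
RTL-style pseudocode (register numbers are purely illustrative):

```
;; Before: to sink the add past the loop exit, the pass would have to
;; verify that no later insn in the loop still uses r5.
loop:
  r5 = r1 + r2          ; candidate for sinking
  ...                   ; r5 may or may not be used here
exit:
  use r5

;; After: the value is materialized into a fresh pseudo r9 for the
;; after-loop uses, so no use-position check is needed; the original
;; insn stays and, if nothing in the loop uses r5, dies as dead code.
loop:
  r5 = r1 + r2
  ...
exit:
  r9 = r1 + r2
  use r9
```

The extra copy is the cost Richard mentions; later DCE cleans up the
in-loop insn when it has no remaining uses.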

Also we already have a sinking pass on RTL which even computes
a proper PRE on the reverse graph - -fgcse-sm aka store-motion.c.
I'm not sure whether this deals with non-stores but the
LCM machinery definitely can handle arbitrary expressions.  I wonder
whether it makes more sense to extend that rather than invent a new
ad-hoc sinking pass.

Richard.

> 
> Xionghu
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
