On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law <l...@redhat.com> wrote:
> On 12/03/2015 07:38 AM, Richard Biener wrote:
>>
>> This pass is now enabled by default with -Os but has no limits on the
>> amount of
>> stmts it copies.
>
> The more statements it copies, the more likely it is that the path spitting
> will turn out to be useful!  It's counter-intuitive.

Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer is enabled
with -fprofile-use (but it is also properly driven to only trace hot paths)
and otherwise not by default at any optimization level.

> The primary benefit AFAICT with path splitting is that it exposes additional
> CSE, DCE, etc opportunities.
>
> IIRC  Ajit posited that it could help with live/conflict analysis, I never
> saw that, and with the changes to push splitting deeper into the pipeline
> I'd further life/conflict analysis since that work also involved preserving
> the single latch property.
>
>
>
>  It also will make all loops with this shape have at least two
>>
>> exits (if the resulting loop will be disambiguated the inner loop will
>> have two exits).
>> Having more than one exit will disable almost all loop optimizations after
>> it.
>
> Hmmm, the updated code keeps the single latch property, but I'm pretty sure
> it won't keep a single exit policy.
>
> To keep a single exit policy would require keeping an additional block
> around.  Each of the split paths would unconditionally transfer to this new
> block.  The new block would then either transfer to the latch block or out
> of the loop.

Don't see how this would work for the CFG pattern it operates on unless you
duplicate the exit condition into that new block creating an even more
obfuscated
CFG.

>
>>
>> The pass itself documents the transform it does but does zero to motivate
>> it.
>>
>> What's the benefit of this pass (apart from disrupting further
>> optimizations)?
>
> It's essentially building superblocks in a special case to enable additional
> CSE, DCE and the like.
>
> Unfortunately what is is missing is heuristics and de-duplication.  The
> former to drive cases when it's not useful and the latter to reduce codesize
> for any statements that did not participate in optimizations when they were
> duplicated.
>
> The de-duplication is the "sink-statements-through-phi" problems, cross
> jumping, tail merging and the like class of problems.
>
> It was only after I approved this code after twiddling it for Ajit that I
> came across Honza's tracer implementation, which may in fact be
> retargettable to these loops and do a better job.  I haven't experimented
> with that.

Well, I originally suggested to merge this with the tracer pass...

>> I can see a _single_ case where duplicating the latch will allow threading
>> one of the paths through the loop header to eliminate the original exit.
>> Then
>> disambiguation may create a nice nested loop out of this.  Of course that
>> is only profitable again if you know the remaining single exit of the
>> inner
>> loop (exiting to the outer one) is executed infrequently (thus the inner
>> loop
>> actually loops).
>
> It wasn't ever about threading.

I see.

>>
>> But no checks other than on the CFG shape exist (oh, it checks it will
>> at _least_ copy two stmts!).
>
> Again, the more statements it copies the more likely it is to be profitable.
> Think superblocks to expose CSE, DCE and the like.

Ok, so similar to tracer (where I think the main benefit is actually increasing
scheduling opportunities for architectures where it matters).

Note that both passes are placed quite late and thus won't see much
of the GIMPLE optimizations (DOM mainly).  I wonder why they were
not placed adjacent to each other.

>>
>> Given the profitability constraints above (well, correct me if I am
>> wrong on these)
>> it looks like the whole transform should be done within the FSM threading
>> code which might be able to compute whether there will be an inner loop
>> with a single exit only.
>
> While it shares some concepts with jump threading, I don't think the
> transformation belongs in jump threading.
>
>>
>> I'm inclined to request the pass to be removed again or at least disabled
>> by
>> default.
>
> I wouldn't lose any sleep if we disabled by default or removed, particularly
> if we can repurpose Honza's code.  In fact, I might strongly support the
> former until we hear back from Ajit on performance data.

See above for what we do with -ftracer.  path-splitting should at _least_
restrict itself to operate on optimize_loop_for_speed_p () loops.

It should also (even if counter-intuitive) limit the amount of stmt copying
it does - after all there is sth like an instruction cache size which exceeeding
for loops will never be a good idea (and even smaller special loop caches on
some archs).

Note that a better heuristic than "at least more than one stmt" would be
to have at least one PHI in the merger block.  Otherwise I don't see how
CSE opportunities could exist we don't see without the duplication.
And yes, more PHIs -> more possible CSE.  I wouldn't say so for
the number of stmts.  So please limit the number of stmt copies!
(after all we do limit the number of stmts we copy during jump threading!)

Richard.

> I also keep coming back to Click's paper on code motion -- in that context,
> copying statements would be a way to break dependencies and give the global
> code motion algorithm more freedom.  The advantage of doing it in a
> framework like Click's is it's got a built-in sinking step.
>
>
>>
>> What closed source benchmark was this transform invented for?
>
> I think it was EEMBC or Coremark.  Ajit should know for sure.  I was
> actually still hoping to see benchmark results from Ajit to confirm the new
> placement in the pipeline didn't negate all the benefits he saw.
>
> jeff
>

Reply via email to