Hi Bernd,

On Fri, Aug 26, 2016 at 03:03:43PM +0200, Bernd Schmidt wrote:
> Not really, I'm afraid. There still seems to be no detailed explanation 
> of the algorithm. Instead, there is a vague outline

Did you read the description of 8/9?  If you think any of that needs to
be in the code, please just say so.

> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>    for this edge.  */

You asked for a comment here, see
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00932.html
("what's the purpose of this", etc.)

> Worse, I'm still not entirely certain what it is intended to achieve: I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere.

"Make faster code", like all other optimisation passes do as well.

I have commented repeatedly on (micro-)benchmark results.

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.
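
To give one concrete (made-up) example of the kind of code that
benefits -- the names and the register assignments are hypothetical,
this is not from the patch:

int work (int);
void cold_path (int, int, int);

int
hot (int x)
{
  int y = work (x);              /* y is live across a call, so one
                                    call-saved register is needed on
                                    every path; full shrink-wrapping
                                    can move nothing here.  */
  if (__builtin_expect (y < 0, 0))
    {
      int a = work (y), b = work (a);
      cold_path (y, a, b);       /* more registers live across calls,
                                    but only on this cold path */
    }
  return y;
}

With separate shrink-wrapping, only the save/restore for the one
register needed everywhere stays at the entry/exit; the saves and
restores for the extra registers (each one a "component") sink into
the cold branch, so the hot path never executes them.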

> My most general question would be why the algorithm 
> for placing individual prologue components would be different from the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or 
> arguments relating to how this new code acts with regard to loops are 
> also particularly needed.

The full-prologue algorithm makes as many blocks as possible run
without the prologue, by duplicating blocks where that helps.  If you
did that for every component you could end up with 2**40 blocks for
just 40 components, not such a hot plan.  The "separate" algorithm
does not duplicate blocks at all (its only code growth is the
inserted prologue/epilogue components themselves, plus the occasional
forwarder block).
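
To make that blowup concrete (a back-of-the-envelope toy in plain C,
not GCC code): duplicating per component means a block can need one
specialised copy per subset of components, i.e. up to 2**N copies for
N components:

#include <stdio.h>

int
main (void)
{
  /* One specialised copy per subset of N components: 2**N.  */
  for (int n = 10; n <= 40; n += 10)
    printf ("%2d components -> up to %llu copies of a block\n",
            n, 1ULL << n);
  return 0;
}

which prints 1024, 1048576, 1073741824, 1099511627776 -- hence no
duplication at all in the separate algorithm.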

The full-prologue algorithm only ever inserts exactly one prologue,
which is far from optimal; but with the semantics we have it *cannot*
duplicate the prologue.  The separate components of course can be
duplicated: they are a new abstraction, and being duplicable is part
of the requirements for that abstraction.

> So, I think improvement is necessary in these points, but I also think 
> that there's room for experimental verification by way of self-tests.

Since the pass runs everywhere, I think that is a big waste of time
(it already gets a lot of testing).  Testing specific problem cases
can of course help, but we have "make check" for that already anyway.
Also, I think "self-tests" that look at the internals are just wrong
(the only sane way to update such tests when changing the code is to
delete them, since they should be written independently; otherwise you
just have two copies of the same code).  And the useless abstractions
are just useless.

The algorithm is procedural; writing it in procedural style is much clearer
than hiding everything.

> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a long 
> way towards showing that at least we don't have to worry too much about 
> miscompilations.

In case you hadn't noticed, the algorithm is constructed in such a
way as to minimise the chance of miscompilation: it first determines
which blocks need which components "active", and only then makes it
so, in a separate (much more trivial) phase.
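
In outline, and with entirely made-up names and a toy CFG (this is a
sketch of the two-phase shape only, not the real pass):

#include <stdbool.h>
#include <stdio.h>

#define N_BLOCKS 4
#define N_COMP   2

/* A diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.  */
static const int edges[][2] = { {0, 1}, {0, 2}, {1, 3}, {2, 3} };

/* active[b][c]: component c must be active while block b runs.  */
static bool active[N_BLOCKS][N_COMP];

/* Phase 1: decide.  Pure analysis, no code is touched.  Here it is
   hard-coded: component 0 is wanted only in block 1 (a cold path),
   component 1 everywhere but the entry block.  */
static void
decide_components (void)
{
  active[1][0] = true;
  active[1][1] = active[2][1] = active[3][1] = true;
}

/* Phase 2: make it so.  Purely mechanical: on every edge, emit a
   component's prologue where it flips off->on and its epilogue
   where it flips on->off.  */
static void
insert_on_edges (void)
{
  int n_edges = sizeof edges / sizeof edges[0];
  for (int e = 0; e < n_edges; e++)
    for (int c = 0; c < N_COMP; c++)
      {
        int src = edges[e][0], dst = edges[e][1];
        if (!active[src][c] && active[dst][c])
          printf ("edge %d->%d: insert prologue of component %d\n",
                  src, dst, c);
        else if (active[src][c] && !active[dst][c])
          printf ("edge %d->%d: insert epilogue of component %d\n",
                  src, dst, c);
      }
}

int
main (void)
{
  decide_components ();
  insert_on_edges ();
  return 0;
}

(A real version would also emit epilogues on the exit edges; the toy
omits that.)  All the interesting decisions happen in phase 1, which
cannot break anything because it only computes bitmaps; phase 2 is
simple enough to audit by eye.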

Wrt your patch...  Making each component needed half of the time is
not very realistic: you'll end up with all components active in most
blocks.

Bools as parameters that do not mean "yes/no" are confusing.

It seems you no longer do the "insert at head" and "insert at tail"
steps before the "insert on edge" step?  That cannot work, afaics.

Some utility functions print dump info; the caller should do that
instead.

You make "components" global (as "m_components").


Segher
