[Bug rtl-optimization/71785] Computed gotos are mostly optimized away

2019-11-21 Thread rndfax at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

--- Comment #20 from Aleksey  ---
(In reply to Segher Boessenkool from comment #19)
> '-freorder-blocks'
>  Reorder basic blocks in the compiled function in order to reduce
>  number of taken branches and improve code locality.
> 
>  Enabled at levels '-O', '-O2', '-O3', '-Os'.
> 
> If you disable this option, you get more taken branches and a less linear
> control flow.  As documented.

If compgoto would have been depended on this option then that would explained
observed behavior. But compgoto does not depend on it. And it says:
/* Duplicate the blocks containing computed gotos.  This basically unfactors
   computed gotos that were factored early on in the compilation process to
   speed up edge based data flow.  We used to not unfactor them again, which
   can seriously pessimize code with many computed jumps in the source code,
   such as interpreters.  See e.g. PR15242.  */
static void
duplicate_computed_gotos (function *fun)

"This basically unfactors computed gotos that were factored early"

And it works fine until one last predecessor. Why until the last predecessor?
That's the point.

> 
> > START - the very first "goto *xxx" - was not optimized, since "bb 5" has
> > only 1 predecessor.
> > But it's optimized in later step "bbro". That's exactly why option
> > "-fno-reorder-blocks" breaks first jump optimization.
> > 
> > Can someone explain why there are such conditions:
> >   if (single_pred_p (bb))
> >   return false;
> > 
> >   if (single_pred_p (bb))
> >   continue;
> > in maybe_duplicate_computed_goto function?
> 
> It duplicates the code, one copy for every predecessor to jump to.

I showed that not for every predecessor.

> It does not move existing blocks elsewhere.

I provided RTL dump, it says that it removes edges. It moves the code. It even
says this:

/* Duplicates basic block BB and redirects edge E to it.
...
basic_block
   
duplicate_block (basic_block bb, edge e, basic_block after,
copy_bb_data *id)

Edge count in bb decreases, so code is moving.

And by answering the question I meant "what are these conditions for".
Why bb with one predecessor is not optimized?
Why not just optimize it right here right now?

[Bug rtl-optimization/71785] Computed gotos are mostly optimized away

2019-11-21 Thread rndfax at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

--- Comment #18 from Aleksey  ---
(In reply to Andrew Pinski from comment #17)
> First off internal documentation is not user documentation.  
> Second internal documentation is not always in sync with the code.  In this
> case it seems like it is not fully.

I understand that docs not always in sync, so I don't trust the documentation,
I trust source code.

> Basically BB reordering does the full unfactoring these days.

I open gcc/bb-reorder.c and see this comment:
/* Duplicate the blocks containing computed gotos.  This basically unfactors
   computed gotos that were factored early on in the compilation process to
   speed up edge based data flow.  We used to not unfactor them again, which
   can seriously pessimize code with many computed jumps in the source code,
   such as interpreters.  See e.g. PR15242.  */
static void
duplicate_computed_gotos (function *fun)

It says "This basically unfactors computed gotos that were factored early". Is
it relevant comment?
According to algorithm in this function it does what it says in the comment.

I already provided:
1) this is compgoto step, it does not depend on option "-fno-reorder-blocks"
2) it seems it does not work in the edge cases

I already provided an example from RTL and stepped through the compgoto steps.
I pointed out that the last possible optimization does not take place: BB stays
with the last predecessor and compgoto finishes because of this. This last
predecessor must be merged too.

I'm just asking what the purpose of these conditions
  if (single_pred_p (bb))
  return false;

  if (single_pred_p (bb))
  continue;
in maybe_duplicate_computed_goto function?
What if these conditions must be removed?

[Bug rtl-optimization/71785] Computed gotos are mostly optimized away

2019-11-21 Thread rndfax at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

--- Comment #16 from Aleksey  ---
> > It would be helpful if you give the explanation how these options affect
> > "un-factoring".
> 
> What options?  -fno-reorder-blocks?  Those doo the same to this code as
> they do anywhere else: the compiler does not run the reorder-blocks
> pass, so you get worse code.

This is not an explanation. Just image someone asks "why with -O0 dead code is
not eliminated?". And you just answer "It's because you are using -O0". Well
thank you, captain :)

"-fno-reorder-blocks" option affects bbro step - this is not an "un-factoring"
process. compgoto is "un-factoring" step, which takes place earlier than bbro.
compgoto promises to "un-factor" computed gotos, but this does not happen.

Anyway, I investigated my patched test case and got the following results and
found the following.

There are two labels in RTL which forms "bb 4" and "bb 5". "bb 5" is located
right after "bb 4". "bb 4" sets registers, and "bb 5" just an indirect jump.
There are 3 explicit jumps to "bb 4" from OP_A, OP_B, OP_C.
There is 1 explicit jump to "bb 5" from START (very first goto *xxx).

"bb 4" has 3 predecessors from jumps.
"bb 5" has 2 predecessors: one from jump and one from "bb 4".

What we see from rtl in compgoto:

  Duplicating computed goto bb 5 into bb 4 (bb 5 edge count: 2)
...
  After duplicating computed goto bb 5 into bb 4 (bb 5 edge count: 1)

After merging, "bb 4" has 4 predecessors. "bb 5" now has only 1 predecessor.
During the duplicate_block the edge from "bb 4" to "bb 5" eliminated. "bb 5"
has only one edge from START. Then we continue to optimize "bb 4" stepping
recursively from "bb 5":

  Duplicating computed goto bb 4 into bb 12 (bb 4 edge count: 4)

"bb 12" is OP_C.

  Duplicating computed goto bb 4 into bb 8 (bb 4 edge count: 3)

"bb 8" is OP_B.

  Duplicating computed goto bb 4 into bb 6 (bb 4 edge count: 2)

"bb 6" is OP_A.

After finishing with "bb 4" we return to "bb 5" and see this:

  Ignoring "Duplicating computed goto bb 5 into bb 2".

"bb 2" is START.

START - the very first "goto *xxx" - was not optimized, since "bb 5" has only 1
predecessor.
But it's optimized in later step "bbro". That's exactly why option
"-fno-reorder-blocks" breaks first jump optimization.

Can someone explain why there are such conditions:
  if (single_pred_p (bb))
  return false;

  if (single_pred_p (bb))
  continue;
in maybe_duplicate_computed_goto function?

Why it's necessary that bb has more than one predecessor?



> If you want to have the machine code be structured in some exact order
> or way, you will have to use assembler or something similar, not an
> optimising compiler.

Labels start blocks ("There are various reasons why control flow may transfer
from one block to another. One possibility is that some instruction, for
example a CODE_LABEL, in a linearized instruction stream just always starts a
new basic block"). One can ask compiler to not reorder them with option
"-fno-reorder-blocks" - that's enough to fulfil the requirements. This option
was used in 2004 in QEMU for similar purposes. Also it is used in gforth:

  ENGINE_FLAGS =  -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv
-fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks
-fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1
-falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -pthread

So GCC has enough options to structure the machine code in C (and C++) language
in desired way, there is no need to go into assembler.

[Bug rtl-optimization/71785] Computed gotos are mostly optimized away

2019-11-20 Thread rndfax at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

--- Comment #14 from Aleksey  ---
(In reply to Segher Boessenkool from comment #13)
> (In reply to Aleksey from comment #12)
> > But adding these two flags "-fno-reorder-blocks-and-partition
> > -fno-reorder-blocks"
> 
> If you say that basic blocks should not be reordered, then they
> are not.  All behaving perfectly as expected.
> 
> These options will cause exactly the same kind of "problems"
> everywhere else as well.
> 
> > Is it possible to fix that?
> 
> Yes, don't use compilers options that degrade performance, if
> you want good performance?

Performance is not the case here, so don't bother with it. Strict order of
labels and using everywhere "jmp reg" instead of "jmp rel + jmp reg" - this is
what is important.

Documentation (https://gcc.gnu.org/onlinedocs/gccint/Edges.html#Edges) says
"During the earlier stages of the compilation process, GCC tries to avoid such
dense flow graphs by factoring computed jumps". Then it states that "the
computed jumps are un-factored in the later passes of the compiler (in the pass
called pass_duplicate_computed_gotos)".

It would be helpful if you give the explanation how these options affect
"un-factoring".

[Bug rtl-optimization/71785] Computed gotos are mostly optimized away

2019-11-20 Thread rndfax at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

Aleksey  changed:

   What|Removed |Added

 CC||rndfax at yandex dot ru

--- Comment #12 from Aleksey  ---
Committed test case gcc/testsuite/gcc.target/powerpc/pr71785.c still fails on
the latest GCC version. It seems some flags affect result.

For example, this command:

gcc -S -O3 gcc/testsuite/gcc.target/powerpc/pr71785.c

produces good result - all computed gotos are of type "jump reg":

$ fgrep jmp pr71785.s 
jmp *%rax
jmp *%rax
jmp *%rax
jmp *%rax
$

But adding these two flags "-fno-reorder-blocks-and-partition
-fno-reorder-blocks" we get command like this:

gcc -S -O3 -fno-reorder-blocks-and-partition -fno-reorder-blocks
gcc/testsuite/gcc.target/powerpc/pr71785.c

and we see that it does not optimize one goto:

$ fgrep jmp pr71785.s 
jmp .L2
jmp *%rax
jmp *%rax
jmp *%rax
jmp *%rax
$ 

I investigated this a little in the source code of GCC. And found that in
gcc/bb-reorder.c one basic block is not optimized
(pass_duplicate_computed_gotos) because of failing this condition
"single_pred_p (bb)" in this for-loop block, in this if statement:

  for (ei = ei_start (bb->preds); (e = ei_safe_edge (ei)); )
{
  basic_block pred = e->src;

  /* Do not duplicate BB into PRED if that is the last predecessor, or if
 we cannot merge a copy of BB with PRED.  */
  if (single_pred_p (bb)
  || !single_succ_p (pred)
  || e->flags & EDGE_COMPLEX
  || pred->index < NUM_FIXED_BLOCKS
  || (JUMP_P (BB_END (pred)) && !simplejump_p (BB_END (pred)))
  || (JUMP_P (BB_END (pred)) && CROSSING_JUMP_P (BB_END (pred
{
  ei_next ();
  continue;
}

If I get it right, happens this: bb, which has "jmp reg", has several
predecessors, which all have "jmp rel" to this bb. After merging this bb into
such a predecessor, bb have one predecessor less. When bb leaves with one
predecessor it falls into this if and stop optimizing.

One may say, that it's not that big deal, that only the first jump is not
optimized. But with additional flag "-mcmodel=large" and this patch on the
committed test case file gcc/testsuite/gcc.target/powerpc/pr71785.c:
```
diff --git a/gcc/testsuite/gcc.target/powerpc/pr71785.c
b/gcc/testsuite/gcc.target/powerpc/pr71785.c
index c667ad8..6c8dde6 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr71785.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr71785.c
@@ -28,6 +28,7 @@ extern void do_stuff_b(int arg);
 extern void do_stuff_c(int arg);

 extern int someglobal;
+extern void *jump;

 void
 eval(op *op)
@@ -43,10 +44,14 @@ eval(op *op)
 CASE_OP_A:
someglobal++;
op++;
+   if (op->opcode == OP_END)
+   goto *jump;
goto *dispatch_table[op->opcode];
 CASE_OP_B:
do_stuff_b(op->arg);
op++;
+   if (op->opcode == OP_END)
+   goto *jump;
goto *dispatch_table[op->opcode];
 CASE_OP_C:
do_stuff_c(op->arg);
```

the results become bad:

gcc -S -O3 -fno-reorder-blocks-and-partition -fno-reorder-blocks -mcmodel=large
gcc/testsuite/gcc.target/powerpc/pr71785_patched.c

$ fgrep jmp pr71785.s 
jmp .L2
jmp .L14
jmp *%rax
jmp *%rax
jmp .L6
jmp *%rax
jmp *%rax
$

More real-world example of mine from project that I'm working on is using these
flags:

gcc -S -O3 -fno-reorder-blocks-and-partition -fno-reorder-blocks -mcmodel=large
-fno-crossjumping -fno-gcse -fno-PIE
gcc/testsuite/gcc.target/powerpc/pr71785_patched.c

which gives this:

$ fgrep jmp pr71785.s 
jmp .L2
jmp .L4
jmp *%rax
jmp *%rax
jmp *%rax
jmp .L6
jmp *%rax
jmp *%rax
jmp *%rax
$

If the first jump is not that crucial - it can be "jmp rel + jmp reg" - all
other jumps must be "jmp reg" only. Moreover, basic blocks are not traversed in
order they appear in source file, so in one day the first jump becomes "jmp rel
+ jmp reg", the other day (depending on the source file and basic block
traversing order) some crucial goto becomes "jmp rel + jmp reg", instead of
just "jmp reg". And this breaks everything.

Is it possible to fix that?