[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-11-04 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Jakub Jelinek  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 AssignedTo|unassigned at gcc dot   |jakub at gcc dot gnu.org
   |gnu.org |

--- Comment #24 from Jakub Jelinek  2011-11-04 
17:18:34 UTC ---
Created attachment 25721
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25721
gcc47-pr50693.patch

Here is that idea implemented.  On the gimple/expansion side it was trivial,
the only thing that complicated it is to avoid -fcompare-debug differences.
NOTE_INSN_DELETED_LABELs use the same numbers as other labels, unique for the
whole CU, so if we start to allocate from that number pool for these labels
that will be only present with -g (since they live in debug stmts and later on
in DEBUG_INSNs), we'll get label number differences, and furthermore for darwin
does some terrible hacks to workaround the mess called MachO it could even
result in different code genration.
So I decided to add NOTE_INSN_DELETED_LABEL variant which will use a different
label namespace.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-11-05 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #25 from Jakub Jelinek  2011-11-05 
19:58:41 UTC ---
Author: jakub
Date: Sat Nov  5 19:58:37 2011
New Revision: 181014

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=181014
Log:
PR tree-optimization/50693
* tree-cfg.c (gimple_can_merge_blocks_p): Allow merging with
non-forced user labels.
(gimple_merge_blocks): Turn non-forced user labels into
debug bind stmt with the label as first operand and reset value.
(gimple_duplicate_bb): Don't duplicate label debug stmts.
* dwarf2out.c (gen_label_die): Handle NOTE_INSN_DELETED_DEBUG_LABEL.
* final.c (final_scan_insn): Likewise.
(rest_of_clean_state): Don't dump NOTE_INSN_DELETED_DEBUG_LABEL.
* var-tracking.c (debug_label_num): New variable.
(delete_debug_insns): Don't delete DEBUG_INSNs for LABEL_DECLs,
instead turn them into NOTE_INSN_DELETED_DEBUG_LABEL notes.
* cfglayout.c (skip_insns_after_block, duplicate_insn_chain): Handle
NOTE_INSN_DELETED_DEBUG_LABEL.
(duplicate_insn_chain): Don't duplicate LABEL_DECL DEBUG_INSNs.
* insn-notes.def (DELETED_DEBUG_LABEL): New note kind.
* print-rtl.c (print_rtx): Handle NOTE_INSN_DELETED_DEBUG_LABEL.
* gengtype.c (adjust_field_rtx_def): Likewise.
* config/i386/i386.c (ix86_output_function_epilogue): For MachO
clear CODE_LABEL_NUMBER of NOTE_INSN_DELETED_DEBUG_LABEL
if their are at the end of function and nop hasn't been emitted.
* config/rs6000/rs6000.c (rs6000_output_function_epilogue): Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/cfglayout.c
trunk/gcc/config/i386/i386.c
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/dwarf2out.c
trunk/gcc/final.c
trunk/gcc/gengtype.c
trunk/gcc/insn-notes.def
trunk/gcc/print-rtl.c
trunk/gcc/tree-cfg.c
trunk/gcc/var-tracking.c


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-11-05 Thread alex.gaynor at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #26 from Alex Gaynor  2011-11-05 
20:08:08 UTC ---
Thank you very much for fixing this!


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2012-03-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Jakub Jelinek  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED

--- Comment #27 from Jakub Jelinek  2012-03-26 
10:23:22 UTC ---
Fixed for 4.7+, won't backport to older branches.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

David Edelsohn  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011-10-11
 Ever Confirmed|0   |1

--- Comment #8 from David Edelsohn  2011-10-11 01:11:47 
UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:

movq%rax, 8(%r15)
movq%rbx, (%r15)
testq   %rbx, %rbx
je  .LBB1_3
# BB#1:
movq%rbx, %rcx
movq%rax, %rdx
.align  16, 0x90
.LBB1_2:# %.lr.ph
# =>This Inner Loop Header: Depth=1
movb%r14b, (%rdx)
incq%rdx
decq%rcx
jne .LBB1_2
.LBB1_3:# %._crit_edge
movb$0, (%rax,%rbx)

Direct pointer arithmetic might not be recommended, but Intel makes do.


For loop1, GCC produces:

testq   %rbx, %rbx
movq%rax, 8(%rbp)
movq%rbx, 0(%rbp)
je  .L3
xorl%edx, %edx
.p2align 4,,10
.p2align 3
.L5:
movb%r12b, (%rax,%rdx)
addq$1, %rdx
movq8(%rbp), %rax
cmpq%rbx, %rdx
jne .L5
.L3:
movb$0, (%rax,%rbx)

For loop2, GCC produces:

xorl%edx, %edx
testq   %rbx, %rbx
movq%rax, 8(%rbp)
movq%rbx, 0(%rbp)
jne .L13
jmp .L9
.p2align 4,,10
.p2align 3
.L11:
movq8(%rbp), %rax
.L8:
.L13:
.L10:
movb%r12b, (%rax,%rdx)
addq$1, %rdx
cmpq%rbx, %rdx
jne .L11
movq8(%rbp), %rax
.L9:
movb$0, (%rax,%rbx)

In both cases GCC unnecessarily re-reads v->chars.

Is loop2 slower because jne .L13 jump into the middle of the loop confuses the
Intel loop branch predictor logic?  Or the loop2 instructions order cracks into
uops badly?  The cause of the performance difference is not obvious.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-10 Thread pinskia at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #9 from Andrew Pinski  2011-10-11 
01:24:00 UTC ---
The vectorization is not being done for the second version of the loop with the
goto.  I have not looked into the cause of it though.  Note -fno-tree-vectorize
shows that the loop is slow for both cases.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-10 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #10 from David Edelsohn  2011-10-11 
01:35:20 UTC ---
Sorry, I was looking at the loop1 and loop2 functions, not the code inlined
into the benchmark for main.

LLVM generates:

movq%r12, %rdi
movl$99, %esi
movq%rbx, %rdx
callq   memset

GCC vectorizes loop1:

.L22:
addq$1, %rdx
movdqa  %xmm0, (%rcx)
addq$16, %rcx
cmpq%rsi, %rdx
jb  .L22

but not loop2:

.L28:
.L29:
movb$99, (%rbx,%rax)
addq$1, %rax
cmpq%rbp, %rax
jne .L28


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread irar at il dot ibm.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Ira Rosen  changed:

   What|Removed |Added

 CC||irar at il dot ibm.com

--- Comment #11 from Ira Rosen  2011-10-11 07:15:15 UTC 
---
The vectorizer doesn't handle control flow in loop, and for the second loop we
have:

:
  goto  (copy_block);

loop_back:
  if (n_3(D) > i_10)
goto ;
  else
goto ;

:
  pretmp.20_6 = v_20->chars;
  goto  (end);

:
  pretmp.20_2 = v_20->chars;

  # i_29 = PHI 
  # prephitmp.21_1 = PHI 
copy_block:
  D.4443_8 = prephitmp.21_1 + i_29;
  *D.4443_8 = c_9(D);
  i_10 = i_29 + 1;
  goto  (loop_back);


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #12 from David Edelsohn  2011-10-11 
14:06:34 UTC ---
Because the vectorizer analysis occurs fairly early, I guess there is not a lot
of opportunity to clean up the control flow.

Should GCC have a memset peephole pass like LLVM?


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #13 from Richard Guenther  2011-10-11 
14:13:11 UTC ---
(In reply to comment #12)
> Because the vectorizer analysis occurs fairly early, I guess there is not a 
> lot
> of opportunity to clean up the control flow.
> 
> Should GCC have a memset peephole pass like LLVM?

It does, ftree-loop-distribute-patterns, enabled by default.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread paolo.carlini at oracle dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #14 from Paolo Carlini  2011-10-11 
14:14:24 UTC ---
A memcmp too?!? (see also the discussion part of libstdc++/50661).


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #15 from Richard Guenther  2011-10-11 
14:34:47 UTC ---
Note that it doesn't handle memset though, and the convoluted loop wouldn't be
easy to detect either.

size_t i = 0;
bool loop_cond = i < n;
while (loop_cond) {
goto copy_block;
loop_back:
loop_cond = i < n;
}
goto end;
copy_block:
v->chars[i] = c;
i++;
goto loop_back;
end:
v->chars[n] = '\0';
return v;

is simply trying to be too clever.  I can't even understand that source ;)

What probably causes this is that we don't merge the blocks

  # i_29 = PHI 
copy_block:
  D.3506_7 = v_20->chars;
  D.3507_8 = D.3506_7 + i_29;
  *D.3507_8 = c_9(D);
  i_10 = i_29 + 1;
  goto  (loop_back);

loop_back:
  loop_cond_11 = i_10 < n_3(D);
  if (loop_cond_11 != 0)
goto  (copy_block);
  else
goto  (end);

even though it's a fallthru edge.  We don't do this to preserve user labels
for debugging (and mind, no code-gen differences between -g0 vs. -g).


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #16 from Richard Guenther  2011-10-11 
14:35:17 UTC ---
(In reply to comment #14)
> A memcmp too?!? (see also the discussion part of libstdc++/50661).

No, only memset with zero.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread dje at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #17 from David Edelsohn  2011-10-11 
14:40:09 UTC ---
LLVM appears to be able to recognize memset of any value, not just zero.  And
apparently performs control flow simplification before attempting to recognize
the idiom, so it can expose the loop created by the convoluted GOTOs.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

Richard Guenther  changed:

   What|Removed |Added

 CC||aoliva at gcc dot gnu.org,
   ||jakub at gcc dot gnu.org

--- Comment #18 from Richard Guenther  2011-10-11 
14:42:46 UTC ---
(In reply to comment #15)
> What probably causes this is that we don't merge the blocks
> 
>   # i_29 = PHI 
> copy_block:
>   D.3506_7 = v_20->chars;
>   D.3507_8 = D.3506_7 + i_29;
>   *D.3507_8 = c_9(D);
>   i_10 = i_29 + 1;
>   goto  (loop_back);
> 
> loop_back:
>   loop_cond_11 = i_10 < n_3(D);
>   if (loop_cond_11 != 0)
> goto  (copy_block);
>   else
> goto  (end);
> 
> even though it's a fallthru edge.  We don't do this to preserve user labels
> for debugging (and mind, no code-gen differences between -g0 vs. -g).

Yes.  With

Index: gcc/tree-cfg.c
===
--- gcc/tree-cfg.c  (revision 179804)
+++ gcc/tree-cfg.c  (working copy)
@@ -1456,7 +1460,7 @@ gimple_can_merge_blocks_p (basic_block a

   /* Do not remove user labels.  */
   if (!DECL_ARTIFICIAL (lab))
-   return false;
+   ;
 }

   /* Protect the loop latches.  */

we merge the blocks and vectorize both loops.

The above patch is not acceptable though, which leaves making the vectorizer
deal with trivial control-flow (or a new GIMPLE_DEBUG kind that would
preserve the label?).


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #19 from Richard Guenther  2011-10-11 
14:45:22 UTC ---
(In reply to comment #17)
> LLVM appears to be able to recognize memset of any value, not just zero.  And
> apparently performs control flow simplification before attempting to recognize
> the idiom, so it can expose the loop created by the convoluted GOTOs.

I suppose you can no longer debug that though (break at the labels by
name), even when disabling the memset pattern detection?


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #20 from Jakub Jelinek  2011-10-11 
14:50:51 UTC ---
(In reply to comment #17)
> LLVM appears to be able to recognize memset of any value, not just zero.  And
> apparently performs control flow simplification before attempting to recognize
> the idiom, so it can expose the loop created by the convoluted GOTOs.

Well, GCC also performs lots of control flow simplifications, just the bb's
aren't merged here because that would mean the user label would be lost,
couldn't be used by the user debugging the code at all.

Vectorization restricts the cfg of the loop.  In successfully vectorized loops
it is unlikely user labels would be very helpful to the user, since multiple
iterations of the loop are performed together.

If we want to handle this obfuscated code, either we'd need to make debugging
experience worse for all loops (say at -O3), no matter if they will be
successfully vectorized or not, or lift up the restrictions in the vectorizer,
so that it would accept multiple basic blocks with only fallthru edges in
between and no phis or something similar, or temporarily merge the block and
split it again after vectorization, readding the user labels.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-11 Thread alex.gaynor at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #21 from Alex Gaynor  2011-10-11 
16:02:56 UTC ---
Given the concern for preserving labels for debugging, perhaps allowing the
merging of basic blocks that eliminate labels could be conditional on either a
new function attribute or command line flag?


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-12 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #22 from Richard Guenther  2011-10-12 
15:19:54 UTC ---
Yeah, maybe we can just throw them away with -O3.  Or decay them (on BB
merging) to

# DEBUG user_label:

that exposes the label to more code motion issues, so its location would be
less precise, but nothing prevents inter-block code-motion for labels at
the start of a fallthru destination either.


[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

2011-10-24 Thread aoliva at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

--- Comment #23 from Alexandre Oliva  2011-10-25 
04:48:24 UTC ---
Yup.  We don't even need a new debug stmt type, methinks.  Say, emit the debug
stmt with the LABEL_DECL, decay that to a debug stmt in cfgexpand, and turn
that into a NOTE_INSN_DELETED_LABEL during var-tracking initial scanning, to
minimize code motion impact.