Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-15 Thread Sebastian Pop
Richard Biener wrote:
 
 
 
  On the llvm test-suite, I have seen one ICE with my fsm jump-thread
 patch.
  This patch fixes the problem:
 
  diff --git a/gcc/tree-ssa-threadupdate.c
 b/gcc/tree-ssa-threadupdate.c
  index 12f83ba..f8c736e 100644
  --- a/gcc/tree-ssa-threadupdate.c
  +++ b/gcc/tree-ssa-threadupdate.c
  @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool
 may_peel_loop_headers)
  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
{
  if (!loop-header
  +|| !loop_latch_edge (loop)
  || !bitmap_bit_p (threaded_blocks, loop-header-index))
  continue;
 
 retval |= thread_through_loop_header (loop,
 may_peel_loop_headers);
 
  Ok to commit after regstrap?
 This seems to be indicating that we have with no edge from the latch 
 block to the header block.  I'd like to know better how we got into
 that 
 state.
 
 It Also returns null for loops with multiple latches. So the patch looks OK 
 for me.

The bug I was seeing has been fixed by the patch for:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64284

Thanks,
Sebastian


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Richard Biener
On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote:
 On Sat, 2014-12-06 at 19:21 +, Sebastian Pop wrote:

  I think it does not make sense to duplicate paths at -Os: I disabled the 
  FSM
  jump-threading when optimizing for size like this.
 
  diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
  index 29b20c8..ce70311 100644
  --- a/gcc/tree-ssa-threadedge.c
  +++ b/gcc/tree-ssa-threadedge.c
  @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e,
return 1;
  }
 
 if (!flag_expensive_optimizations
  + || optimize_function_for_size_p (cfun)
|| TREE_CODE (cond) != SSA_NAME
|| e-dest-loop_father != e-src-loop_father
|| loop_depth (e-dest-loop_father) == 0)
  return 0;
 
  I will regstrap and commit the attached patch.

 Bootstrapped and regression tested on x86_64-linux.
 Committed r218451.

 Sebastian

 Thanks for getting all this checked in Sebastian, I have tested it on
 coremark and I am getting the speed up that I expect.  But I am a little
 confused about turning off jump threading.  I am getting the
 optimization on coremark with -O3 and that is great and if I use '-O3
 -fno-expensive-optimizations' then I don't see this part of the jump
 threading, also great.  But I was surprised that if I just used '-O3
 -fno-thread-jumps' then I still see this optimization.  Is that
 expected?  Should this test also check flag_thread_jumps?  Or should
 that be getting checked somewhere else?

-fthread-jumps is an RTL optimization flag and ignored on GIMPLE.

Richard.

 Steve Ellcey




Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Sebastian Pop
Richard Biener wrote:
 On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote:
  expected?  Should this test also check flag_thread_jumps?  Or should
  that be getting checked somewhere else?
 
 -fthread-jumps is an RTL optimization flag and ignored on GIMPLE.

Does it make sense to add a -f[no-]tree-thread-jumps to enable/disable the tree
jump threading?  I could also add -f[no-]tree-fsm-thread-jumps.  Opinions?

On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch.
This patch fixes the problem:

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 12f83ba..f8c736e 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers)
   FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
 {
   if (!loop-header
+|| !loop_latch_edge (loop)
   || !bitmap_bit_p (threaded_blocks, loop-header-index))
   continue;
 
  retval |= thread_through_loop_header (loop, may_peel_loop_headers);

Ok to commit after regstrap?

Thanks,
Sebastian


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Jeff Law

On 12/09/14 10:38, Sebastian Pop wrote:

Richard Biener wrote:

On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote:

expected?  Should this test also check flag_thread_jumps?  Or should
that be getting checked somewhere else?


-fthread-jumps is an RTL optimization flag and ignored on GIMPLE.


Does it make sense to add a -f[no-]tree-thread-jumps to enable/disable the tree
jump threading?  I could also add -f[no-]tree-fsm-thread-jumps.  Opinions?
Our option handling is a bit of a mess if we look at it from the user 
standpoint.  Given that the vast majority of jump threading happens on 
gimple, ISTM that -f[no-]thread-jumps ought to be controlling the gimple 
implementation.  One could easily argue that the user doesn't really 
care about where in the pipeline the optimization is implemented.


My vote would be to just make -fthread-jumps control both RTL and gimple 
jump threading.





On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch.
This patch fixes the problem:

diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 12f83ba..f8c736e 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers)
FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
  {
if (!loop-header
+|| !loop_latch_edge (loop)
|| !bitmap_bit_p (threaded_blocks, loop-header-index))
continue;

   retval |= thread_through_loop_header (loop, may_peel_loop_headers);

Ok to commit after regstrap?
This seems to be indicating that we have with no edge from the latch 
block to the header block.  I'd like to know better how we got into that 
state.


Also, a test for the GCC testsuite would be good.  I have no idea what 
license covers the LLVM testsuite.  But given a good analysis of the 
problem we may be able to write a suitable test independent of the LLVM 
test.


jeff


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Richard Biener
On December 9, 2014 7:39:48 PM CET, Jeff Law l...@redhat.com wrote:
On 12/09/14 10:38, Sebastian Pop wrote:
 Richard Biener wrote:
 On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com
wrote:
 expected?  Should this test also check flag_thread_jumps?  Or
should
 that be getting checked somewhere else?

 -fthread-jumps is an RTL optimization flag and ignored on GIMPLE.

 Does it make sense to add a -f[no-]tree-thread-jumps to
enable/disable the tree
 jump threading?  I could also add -f[no-]tree-fsm-thread-jumps. 
Opinions?
Our option handling is a bit of a mess if we look at it from the user 
standpoint.  Given that the vast majority of jump threading happens on 
gimple, ISTM that -f[no-]thread-jumps ought to be controlling the
gimple 
implementation.  One could easily argue that the user doesn't really 
care about where in the pipeline the optimization is implemented.

My vote would be to just make -fthread-jumps control both RTL and
gimple 
jump threading.

Works for me.



 On the llvm test-suite, I have seen one ICE with my fsm jump-thread
patch.
 This patch fixes the problem:

 diff --git a/gcc/tree-ssa-threadupdate.c
b/gcc/tree-ssa-threadupdate.c
 index 12f83ba..f8c736e 100644
 --- a/gcc/tree-ssa-threadupdate.c
 +++ b/gcc/tree-ssa-threadupdate.c
 @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool
may_peel_loop_headers)
 FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
   {
 if (!loop-header
 +|| !loop_latch_edge (loop)
 || !bitmap_bit_p (threaded_blocks, loop-header-index))
 continue;

retval |= thread_through_loop_header (loop,
may_peel_loop_headers);

 Ok to commit after regstrap?
This seems to be indicating that we have with no edge from the latch 
block to the header block.  I'd like to know better how we got into
that 
state.

It Also returns null for loops with multiple latches. So the patch looks OK for 
me.

Thanks,
Richard.

Also, a test for the GCC testsuite would be good.  I have no idea what 
license covers the LLVM testsuite.  But given a good analysis of the 
problem we may be able to write a suitable test independent of the LLVM

test.

jeff




Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Jeff Law

On 12/09/14 12:43, Richard Biener wrote:

This seems to be indicating that we have with no edge from the latch
block to the header block.  I'd like to know better how we got into
that
state.


It Also returns null for loops with multiple latches. So the patch looks OK for 
me.

Ah, OK.

Jeff



Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-09 Thread Mike Stump
On Dec 9, 2014, at 10:39 AM, Jeff Law l...@redhat.com wrote:
 Also, a test for the GCC testsuite would be good.  I have no idea what 
 license covers the LLVM testsuite.  But given a good analysis of the problem 
 we may be able to write a suitable test independent of the LLVM test.

So, the usual engineerings rules should work just fine on it.  delta reduce and 
submit.

Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-08 Thread Jeff Law

On 12/06/14 06:47, Sebastian Pop wrote:

Jeff Law wrote:

OK to commit.  Thanks for your patience.

Can you follow-up with a change which throttles this optimization
when -Os is in effect.  You can check optimize_function_for_size_p
(cfun) and simply avoid the backward traversal or you could allow it
in that case if the amount of copying is suitably small.  Your call.


I think it does not make sense to duplicate paths at -Os: I disabled the FSM
jump-threading when optimizing for size like this.

diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 29b20c8..ce70311 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e,
   return 1;
 }

if (!flag_expensive_optimizations
+ || optimize_function_for_size_p (cfun)
   || TREE_CODE (cond) != SSA_NAME
   || e-dest-loop_father != e-src-loop_father
   || loop_depth (e-dest-loop_father) == 0)
 return 0;

I will regstrap and commit the attached patch.

Looks good to me.  Thanks for taking care of it.

Jeff


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-08 Thread Steve Ellcey
On Sat, 2014-12-06 at 19:21 +, Sebastian Pop wrote:

  I think it does not make sense to duplicate paths at -Os: I disabled the FSM
  jump-threading when optimizing for size like this.
  
  diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
  index 29b20c8..ce70311 100644
  --- a/gcc/tree-ssa-threadedge.c
  +++ b/gcc/tree-ssa-threadedge.c
  @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e,
return 1;
  }
   
 if (!flag_expensive_optimizations
  + || optimize_function_for_size_p (cfun)
|| TREE_CODE (cond) != SSA_NAME
|| e-dest-loop_father != e-src-loop_father
|| loop_depth (e-dest-loop_father) == 0)
  return 0;
  
  I will regstrap and commit the attached patch.
 
 Bootstrapped and regression tested on x86_64-linux.
 Committed r218451.
 
 Sebastian

Thanks for getting all this checked in Sebastian, I have tested it on
coremark and I am getting the speed up that I expect.  But I am a little
confused about turning off jump threading.  I am getting the
optimization on coremark with -O3 and that is great and if I use '-O3
-fno-expensive-optimizations' then I don't see this part of the jump
threading, also great.  But I was surprised that if I just used '-O3
-fno-thread-jumps' then I still see this optimization.  Is that
expected?  Should this test also check flag_thread_jumps?  Or should
that be getting checked somewhere else?

Steve Ellcey




Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-06 Thread Sebastian Pop
Jeff Law wrote:
 OK to commit.  Thanks for your patience.
 
 Can you follow-up with a change which throttles this optimization
 when -Os is in effect.  You can check optimize_function_for_size_p
 (cfun) and simply avoid the backward traversal or you could allow it
 in that case if the amount of copying is suitably small.  Your call.

I think it does not make sense to duplicate paths at -Os: I disabled the FSM
jump-threading when optimizing for size like this.

diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 29b20c8..ce70311 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e,
  return 1;
}
 
   if (!flag_expensive_optimizations
+ || optimize_function_for_size_p (cfun)
  || TREE_CODE (cond) != SSA_NAME
  || e-dest-loop_father != e-src-loop_father
  || loop_depth (e-dest-loop_father) == 0)
return 0;

I will regstrap and commit the attached patch.

Sebastian
From 1e2efaa2e3121170a938cd479d979b55c37cc4a4 Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata

	PR tree-optimization/54742
	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* tree-cfg.c (split_edge_bb_loc): Export.
	* tree-cfg.h (split_edge_bb_loc): Declared extern.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(thread_through_normal_block): Call find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_FSM_THREAD.
	(verify_seme): New.
	(duplicate_seme_region): New.
	(thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges
	calling duplicate_seme_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New test.
	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New test.
---
 gcc/ChangeLog|   29 +++
 gcc/doc/invoke.texi  |   12 +
 gcc/params.def   |   15 ++
 gcc/testsuite/ChangeLog  |8 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 ++
 gcc/tree-cfg.c   |2 +-
 gcc/tree-cfg.h   |1 +
 gcc/tree-ssa-threadedge.c|  278 +-
 gcc/tree-ssa-threadupdate.c  |  203 +++-
 gcc/tree-ssa-threadupdate.h  |1 +
 11 files changed, 716 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b340b51..6cfd339 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,32 @@
+2014-12-06  James Greenhalgh  james.greenha...@arm.com
+	Sebastian Pop  s@samsung.com
+	Brian Rzycki  b.rzy...@samsung.com
+
+	PR tree-optimization/54742
+	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
+	max-fsm-thread-paths): New.
+
+	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
+	max-fsm-thread-paths): Documented.
+
+	* tree-cfg.c (split_edge_bb_loc): Export.
+	* tree-cfg.h (split_edge_bb_loc): Declared extern.
+
+	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
+	original value of cond when simplification fails.
+	(fsm_find_thread_path): New.
+	(fsm_find_control_statement_thread_paths): New.
+	(thread_through_normal_block): Call find_control_statement_thread_paths.
+
+	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
+	EDGE_FSM_THREAD.
+	(verify_seme): New.
+	(duplicate_seme_region): New.
+	(thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges
+	calling duplicate_seme_region.
+
+	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD.
+
 2014-12-06  H.J. Lu  hongjiu...@intel.com
 
 	PR target/64200
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 82f0794..70d1336 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10623,6 +10623,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-06 Thread Sebastian Pop
Sebastian Pop wrote:
 Jeff Law wrote:
  OK to commit.  Thanks for your patience.
  
  Can you follow-up with a change which throttles this optimization
  when -Os is in effect.  You can check optimize_function_for_size_p
  (cfun) and simply avoid the backward traversal or you could allow it
  in that case if the amount of copying is suitably small.  Your call.
 
 I think it does not make sense to duplicate paths at -Os: I disabled the FSM
 jump-threading when optimizing for size like this.
 
 diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
 index 29b20c8..ce70311 100644
 --- a/gcc/tree-ssa-threadedge.c
 +++ b/gcc/tree-ssa-threadedge.c
 @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e,
   return 1;
 }
  
if (!flag_expensive_optimizations
 + || optimize_function_for_size_p (cfun)
   || TREE_CODE (cond) != SSA_NAME
   || e-dest-loop_father != e-src-loop_father
   || loop_depth (e-dest-loop_father) == 0)
 return 0;
 
 I will regstrap and commit the attached patch.

Bootstrapped and regression tested on x86_64-linux.
Committed r218451.

Sebastian


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-05 Thread Jeff Law

On 12/04/14 02:14, Sebastian Pop wrote:

Sebastian Pop wrote:

a fail I have not seen in the past:

FAIL: gcc.c-torture/compile/pr27571.c   -Os  (internal compiler error)

I am still investigating why this fails: as far as I can see for now this is
because in copying the FSM path we create an internal loop that is then
discovered by the loop verifier as a natural loop and is not yet in the existing
loop sturctures.  I will try to fix this in duplicate_seme by invalidating the
loop structure after we code generated all the FSM paths.  I will submit an
updated patch when it passes regtest.


We need at least this patch to fix the fail:

@@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers)
   if (duplicate_seme_region (entry, exit, region, len - 1, NULL))
 {
   /* We do not update dominance info.  */
   free_dominance_info (CDI_DOMINATORS);
   bitmap_set_bit (threaded_blocks, entry-src-index);
+ retval = true;
 }

And this will trigger in the end of the code gen function:

  if (retval)
 loops_state_set (LOOPS_NEED_FIXUP);

That will fix the loop structures.  I'm testing this patch on top of the one I
have just sent out.

That looks correct to me.

Jeff



Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-05 Thread Jeff Law

On 12/04/14 07:29, Sebastian Pop wrote:

Sebastian Pop wrote:

Jeff Law wrote:

I'm a bit worried about compile-time impacts of the all the
recursion


I will also restrict the recursion to the loop in which we look for the FSM
thread.


The attached patch includes this change.  It passed bootstrap and regression
test on x86_64-linux.  Ok to commit?

OK to commit.  Thanks for your patience.

Can you follow-up with a change which throttles this optimization when 
-Os is in effect.  You can check optimize_function_for_size_p (cfun) and 
simply avoid the backward traversal or you could allow it in that case 
if the amount of copying is suitably small.  Your call.




jeff


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-04 Thread Sebastian Pop
Jeff Law wrote:
 +@item max-fsm-thread-path-insns
 +Maximum number of instructions to copy when duplicating blocks on a
 +finite state automaton jump thread path.  The default is 100.
 +
 +@item max-fsm-thread-length
 +Maximum number of basic blocks on a finite state automaton jump thread
 +path.  The default is 10.
 +
 +@item max-fsm-thread-paths
 +Maximum number of new jump thread paths to create for a finite state
 +automaton.  The default is 50.
 Has there been any tuning on these defaults.  I don't have any
 strong opinions about what they ought to be, this is more to get any
 such information recorded on the lists for historical purposes.

I have not tuned any of these defaults other than making sure that coremark is
still jump-threaded.  gcc.dg/tree-ssa/ssa-dom-thread-7.c is a test-case that
will check that we always optimize coremark.

 I think it's worth a note in the debug dump anytime you abort
 threading when you hit a limit.

Done.

 I'm a bit worried about compile-time impacts of the all the
 recursion, but I'm willing to wait and see if it turns out to be a
 problem in practice.

Done, as Richi suggested, checking the flag_expensive_optimizations.

 @@ -689,6 +692,12 @@ simplify_control_stmt_condition (edge e,
   pass specific callback to try and simplify it further.  */
 if (cached_lhs  ! is_gimple_min_invariant (cached_lhs))
   cached_lhs = (*simplify) (stmt, stmt);
 +
 +  /* We couldn't find an invariant.  But, callers of this
 + function may be able to do something useful with the
 + unmodified destination.  */
 +  if (!cached_lhs)
 +cached_lhs = original_lhs;
   }
 else
   cached_lhs = NULL;
 Can't you just use COND rather than stuffing its value away into
 ORIGINAL_LHS?CACHED_LHS may be better in some cases if it's an
 SSA_NAME (and it should be), but I doubt it matters in practice.
 
 Or is it the case that you have to have the original condition --
 without any context sensitive equivalences used to simplify the
 condition.

I think we need to start the search for FSM jump-threads with the original non
simplified condition.

 +/* Return true if there is at least one path from START_BB to END_BB.
 +   VISITED_BBS is used to make sure we don't fall into an infinite loop.  */
 +
 +static bool
 +fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
 +  vecbasic_block, va_gc *path,
 +  hash_setbasic_block *visited_bbs, int n_insns)
 Update comment to indicate how PATH is used to return a path from
 START_BB to END_BB.

Done.

 +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
 +   gsi_next (gsi))
 +++n_insns;
 Probably don't want to count labels and GIMPLE_NOPs.  Probably do
 want to count non-virtual PHIs since those may end up as a copies or
 constant initializations.

Done.

 
 +  if (j == 0)
 +kind = EDGE_START_FSM_THREAD;
 +  else if (single_pred_p (e-src))
 +kind = EDGE_NO_COPY_SRC_BLOCK;
 +  else {
 +kind = EDGE_COPY_SRC_JOINER_BLOCK;
 +++joiners;
 +  }
 Presumably the mis-formatting was added when you tracked the #
 joiners.  AFAICT that is a write-only variable and ought to be
 removed.  Along with the braces on the final ELSE which should
 restore proper formatting.

Done.

 
 
 @@ -2343,6 +2493,55 @@ thread_through_all_blocks (bool may_peel_loop_headers)
 threaded_blocks = BITMAP_ALLOC (NULL);
 memset (thread_stats, 0, sizeof (thread_stats));
 
 +  for (i = 0; i  paths.length ();)
 Comment before this loop.  I can see what you're doing, but I'm
 already very familiar with this code.  Basically what are you
 looking for in this loop and what do you do?

Done.

 Overall I think this is very very close and I really like the
 overall direction.  There's a few minor issues noted above and with
 those addressed, I think we should be ready to go.

Thanks for your careful review.  Please let me know if there still are things I
can improve in the attached patch.
The path passes bootstrap on x86_64-linux and powerpc64-linux, and regtest 
except 
a fail I have not seen in the past:

FAIL: gcc.c-torture/compile/pr27571.c   -Os  (internal compiler error)

I am still investigating why this fails: as far as I can see for now this is
because in copying the FSM path we create an internal loop that is then
discovered by the loop verifier as a natural loop and is not yet in the existing
loop sturctures.  I will try to fix this in duplicate_seme by invalidating the
loop structure after we code generated all the FSM paths.  I will submit an
updated patch when it passes regtest.

 Removing most of tree-ssa-threadupdate.c and using SEME duplication
 would be a huge step forward for making this code more
 understandable. I look forward to any work you do in this space in
 the future.

I will clean up the patch 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-04 Thread Sebastian Pop
Sebastian Pop wrote:
 a fail I have not seen in the past:
 
 FAIL: gcc.c-torture/compile/pr27571.c   -Os  (internal compiler error)
 
 I am still investigating why this fails: as far as I can see for now this is
 because in copying the FSM path we create an internal loop that is then
 discovered by the loop verifier as a natural loop and is not yet in the 
 existing
 loop sturctures.  I will try to fix this in duplicate_seme by invalidating the
 loop structure after we code generated all the FSM paths.  I will submit an
 updated patch when it passes regtest.

We need at least this patch to fix the fail:

@@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers)
  if (duplicate_seme_region (entry, exit, region, len - 1, NULL))   

 
{ 
  /* We do not update dominance info.  */
  free_dominance_info (CDI_DOMINATORS);
  bitmap_set_bit (threaded_blocks, entry-src-index);
+ retval = true;
}

And this will trigger in the end of the code gen function:

 if (retval)
loops_state_set (LOOPS_NEED_FIXUP);

That will fix the loop structures.  I'm testing this patch on top of the one I
have just sent out.

Sebastian


Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-04 Thread Sebastian Pop
Sebastian Pop wrote:
 Sebastian Pop wrote:
  a fail I have not seen in the past:
  
  FAIL: gcc.c-torture/compile/pr27571.c   -Os  (internal compiler error)
  
  I am still investigating why this fails: as far as I can see for now this is
  because in copying the FSM path we create an internal loop that is then
  discovered by the loop verifier as a natural loop and is not yet in the 
  existing
  loop sturctures.  I will try to fix this in duplicate_seme by invalidating 
  the
  loop structure after we code generated all the FSM paths.  I will submit an
  updated patch when it passes regtest.
 
 We need at least this patch to fix the fail:
 
 @@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers)
   if (duplicate_seme_region (entry, exit, region, len - 1, NULL)) 
   
  
 { 
   /* We do not update dominance info.  */
   free_dominance_info (CDI_DOMINATORS);
   bitmap_set_bit (threaded_blocks, entry-src-index);
 + retval = true;
 }
 
 And this will trigger in the end of the code gen function:
 
  if (retval)
 loops_state_set (LOOPS_NEED_FIXUP);
 
 That will fix the loop structures.  I'm testing this patch on top of the one I
 have just sent out.

This passed bootstrap and regression test on x86_64-linux.



Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-04 Thread Sebastian Pop
Jeff Law wrote:
 I'm a bit worried about compile-time impacts of the all the
 recursion

I will also restrict the recursion to the loop in which we look for the FSM
thread, like this:

diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index a6fb361..9a153bb 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -959,13 +959,17 @@ thread_around_empty_blocks (edge taken_edge,
 
 /* Return true if the CFG contains at least one path from START_BB to END_BB.
When a path is found, record in PATH the blocks from END_BB to START_BB.
-   VISITED_BBS is used to make sure we don't fall into an infinite loop.  */
+   VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
+   the recursion to basic blocks belonging to LOOP.  */
 
 static bool
 fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
  vecbasic_block, va_gc *path,
- hash_setbasic_block *visited_bbs, int n_insns)
+ hash_setbasic_block *visited_bbs, loop_p loop)
 {
+  if (loop != start_bb-loop_father)
+return false;
+
   if (start_bb == end_bb)
 {
   vec_safe_push (path, start_bb);
@@ -977,7 +981,7 @@ fsm_find_thread_path (basic_block start_bb, basic_block 
end_bb,
   edge e;
   edge_iterator ei;
   FOR_EACH_EDGE (e, ei, start_bb-succs)
-   if (fsm_find_thread_path (e-dest, end_bb, path, visited_bbs, n_insns))
+   if (fsm_find_thread_path (e-dest, end_bb, path, visited_bbs, loop))
  {
vec_safe_push (path, start_bb);
return true;
@@ -1035,7 +1039,8 @@ fsm_find_control_statement_thread_paths (tree expr,
{
  hash_setbasic_block *visited_bbs = new hash_setbasic_block;
 
- if (fsm_find_thread_path (var_bb, e-src, next_path, visited_bbs, 0))
+ if (fsm_find_thread_path (var_bb, e-src, next_path, visited_bbs,
+   e-src-loop_father))
++e_count;
 
  delete visited_bbs;



Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-04 Thread Sebastian Pop
Sebastian Pop wrote:
 Jeff Law wrote:
  I'm a bit worried about compile-time impacts of the all the
  recursion
 
 I will also restrict the recursion to the loop in which we look for the FSM
 thread.

The attached patch includes this change.  It passed bootstrap and regression
test on x86_64-linux.  Ok to commit?

Thanks,
Sebastian
From 0e3312921c07c0e0dd5c1bf5f24050b2336475ef Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* tree-cfg.c (split_edge_bb_loc): Export.
	* tree-cfg.h (split_edge_bb_loc): Declared extern.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(fsm_thread_through_normal_block): Call
	find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_FSM_THREAD.
	(verify_seme): New.
	(duplicate_seme_region): New.
	(thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges
	calling duplicate_seme_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.
	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New.
---
 gcc/doc/invoke.texi  |   12 +
 gcc/params.def   |   15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 ++
 gcc/tree-cfg.c   |2 +-
 gcc/tree-cfg.h   |1 +
 gcc/tree-ssa-threadedge.c|  277 +-
 gcc/tree-ssa-threadupdate.c  |  203 +++-
 gcc/tree-ssa-threadupdate.h  |1 +
 9 files changed, 678 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 89edddb..074183f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item max-fsm-thread-length
+Maximum number of basic blocks on a finite state automaton jump thread
+path.  The default is 10.
+
+@item max-fsm-thread-paths
+Maximum number of new jump thread paths to create for a finite state
+automaton.  The default is 50.
+
 @end table
 @end table
 
diff --git a/gcc/params.def b/gcc/params.def
index 9b21c07..edf3f53 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE,
 	  Maximum number of statements to be included into a single static 
 	  constructor generated by Pointer Bounds Checker,
 	  5000, 100, 0)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS,
+	  max-fsm-thread-path-insns,
+	  Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path,
+	  100, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH,
+	  max-fsm-thread-length,
+	  Maximum number of basic blocks on a finite state automaton jump thread path,
+	  10, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS,
+	  max-fsm-thread-paths,
+	  Maximum number of new jump thread paths to create for a finite state automaton,
+	  50, 1, 99)
 /*
 
 Local variables:
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
new file mode 100644
index 000..bb34a74
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options -O2 -fdump-tree-dom1-details } */
+/* { dg-final { scan-tree-dump-times FSM 6 dom1 } } */
+/* { dg-final { cleanup-tree-dump dom1 } } */
+
+int sum0, sum1, sum2, sum3;
+int foo (char *s, char **ret)
+{
+  int state=0;
+  char c;
+
+  for (; *s  state != 4; s++)
+{
+  c = *s;
+  if (c == '*')
+	{
+	  s++;
+	  break;
+	}
+  switch (state)
+	{
+	case 0:
+	  if (c == '+')
+	state = 1;
+	  else if (c != '-')
+	sum0+=c;
+	  break;
+	case 1:
+	  if (c == '+')
+	state = 2;
+	  else if (c == '-')
+	state = 0;
+	  else
+	sum1+=c;
+	  break;
+	default:
+	  break;
+	}
+
+}
+  *ret = s;
+  

Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-02 Thread Richard Biener
On Mon, Dec 1, 2014 at 10:06 PM, Jeff Law l...@redhat.com wrote:
 On 11/25/14 14:16, Sebastian Pop wrote:

 Sebastian Pop wrote:

 I will bootstrap and regression test this patch on x86_64-linux and
 powerpc64-linux.  I will also run it on our internal benchmarks,
  coremark, and
 the llvm test-suite.
 
 I will also include a longer testcase that makes sure we do not regress
  on
 coremark.

 Done all the above.  Attached is the new patch with a new testcase.  I
 have also
 added verify_seme inspired by the recent patch adding verify_sese.

 Sebastian


 0001-extend-jump-thread-for-finite-state-automata-PR-5474.patch


  From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001
 From: Sebastian Pops@samsung.com
 Date: Fri, 26 Sep 2014 14:54:20 -0500
 Subject: [PATCH] extend jump thread for finite state automata PR 54742

 Adapted from a patch from James Greenhalgh.

 * params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
 max-fsm-thread-paths): New.

 * doc/invoke.texi (max-fsm-thread-path-insns,
 max-fsm-thread-length,
 max-fsm-thread-paths): Documented.

 * tree-cfg.c (split_edge_bb_loc): Export.
 * tree-cfg.h (split_edge_bb_loc): Declared extern.

 * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore
 the
 original value of cond when simplification fails.
 (fsm_find_thread_path): New.
 (fsm_find_control_statement_thread_paths): New.
 (fsm_thread_through_normal_block): Call
 find_control_statement_thread_paths.

 * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
 EDGE_START_FSM_THREAD.
 (verify_seme): New.
 (duplicate_seme_region): New.
 (thread_through_all_blocks): Generate code for
 EDGE_START_FSM_THREAD edges
 calling gimple_duplicate_sese_region.

 * tree-ssa-threadupdate.h (jump_thread_edge_type): Add
 EDGE_START_FSM_THREAD.

 * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.
 * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New.
 ---
   gcc/doc/invoke.texi  |   12 ++
   gcc/params.def   |   15 ++
   gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 +
   gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 +
   gcc/tree-cfg.c   |2 +-
   gcc/tree-cfg.h   |1 +
   gcc/tree-ssa-threadedge.c|  215
 +-
   gcc/tree-ssa-threadupdate.c  |  201
 +++-
   gcc/tree-ssa-threadupdate.h  |1 +
   9 files changed, 614 insertions(+), 3 deletions(-)
   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c

 diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
 index 89edddb..074183f 100644
 --- a/gcc/doc/invoke.texi
 +++ b/gcc/doc/invoke.texi
 @@ -10624,6 +10624,18 @@ large and significantly increase compile time at
 optimization level
   @option{-O1} and higher.  This parameter is a maximum nubmer of
 statements
   in a single generated constructor.  Default value is 5000.

 +@item max-fsm-thread-path-insns
 +Maximum number of instructions to copy when duplicating blocks on a
 +finite state automaton jump thread path.  The default is 100.
 +
 +@item max-fsm-thread-length
 +Maximum number of basic blocks on a finite state automaton jump thread
 +path.  The default is 10.
 +
 +@item max-fsm-thread-paths
 +Maximum number of new jump thread paths to create for a finite state
 +automaton.  The default is 50.

 Has there been any tuning on these defaults.  I don't have any strong
 opinions about what they ought to be, this is more to get any such
 information recorded on the lists for historical purposes.

 I think it's worth a note in the debug dump anytime you abort threading when
 you hit a limit.

 I'm a bit worried about compile-time impacts of the all the recursion, but
 I'm willing to wait and see if it turns out to be a problem in practice.

Please consider restricting it to -fexpensive-optimizations (-O2+).

Richard.


 diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
 index 8b0b7b8..c9fe212 100644
 --- a/gcc/tree-ssa-threadedge.c
 +++ b/gcc/tree-ssa-threadedge.c
 @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
   #include params.h
   #include tree-ssa-threadedge.h
   #include builtins.h
 +#include cfg.h
 +#include cfganal.h

   /* To avoid code explosion due to jump threading, we limit the
  number of statements we are going to copy.  This variable
 @@ -661,6 +663,7 @@ simplify_control_stmt_condition (edge e,
rather than use a relational operator.  These are simpler to
 handle.  */
 if (TREE_CODE (cond) == SSA_NAME)
   {
 +  tree original_lhs = cond;
 cached_lhs = cond;

 /* Get the 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-02 Thread Jeff Law

On 12/02/14 03:15, Richard Biener wrote:


I'm a bit worried about compile-time impacts of the all the recursion, but
I'm willing to wait and see if it turns out to be a problem in practice.


Please consider restricting it to -fexpensive-optimizations (-O2+).

Yea, let's go ahead and do that.

jeff



Re: [Patch] Improving jump-thread pass for PR 54742

2014-12-01 Thread Jeff Law

On 11/25/14 14:16, Sebastian Pop wrote:

Sebastian Pop wrote:

I will bootstrap and regression test this patch on x86_64-linux and
powerpc64-linux.  I will also run it on our internal benchmarks, coremark, and
the llvm test-suite.

I will also include a longer testcase that makes sure we do not regress on
coremark.

Done all the above.  Attached is the new patch with a new testcase.  I have also
added verify_seme inspired by the recent patch adding verify_sese.

Sebastian


0001-extend-jump-thread-for-finite-state-automata-PR-5474.patch


 From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001
From: Sebastian Pops@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
max-fsm-thread-paths): New.

* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
max-fsm-thread-paths): Documented.

* tree-cfg.c (split_edge_bb_loc): Export.
* tree-cfg.h (split_edge_bb_loc): Declared extern.

* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
original value of cond when simplification fails.
(fsm_find_thread_path): New.
(fsm_find_control_statement_thread_paths): New.
(fsm_thread_through_normal_block): Call 
find_control_statement_thread_paths.

* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
EDGE_START_FSM_THREAD.
(verify_seme): New.
(duplicate_seme_region): New.
(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD 
edges
calling gimple_duplicate_sese_region.

* tree-ssa-threadupdate.h (jump_thread_edge_type): Add 
EDGE_START_FSM_THREAD.

* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.
* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New.
---
  gcc/doc/invoke.texi  |   12 ++
  gcc/params.def   |   15 ++
  gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 +
  gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 +
  gcc/tree-cfg.c   |2 +-
  gcc/tree-cfg.h   |1 +
  gcc/tree-ssa-threadedge.c|  215 +-
  gcc/tree-ssa-threadupdate.c  |  201 +++-
  gcc/tree-ssa-threadupdate.h  |1 +
  9 files changed, 614 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 89edddb..074183f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10624,6 +10624,18 @@ large and significantly increase compile time at 
optimization level
  @option{-O1} and higher.  This parameter is a maximum nubmer of statements
  in a single generated constructor.  Default value is 5000.

+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item max-fsm-thread-length
+Maximum number of basic blocks on a finite state automaton jump thread
+path.  The default is 10.
+
+@item max-fsm-thread-paths
+Maximum number of new jump thread paths to create for a finite state
+automaton.  The default is 50.
Has there been any tuning on these defaults.  I don't have any strong 
opinions about what they ought to be, this is more to get any such 
information recorded on the lists for historical purposes.


I think it's worth a note in the debug dump anytime you abort threading 
when you hit a limit.


I'm a bit worried about compile-time impacts of the all the recursion, 
but I'm willing to wait and see if it turns out to be a problem in practice.




diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
index 8b0b7b8..c9fe212 100644
--- a/gcc/tree-ssa-threadedge.c
+++ b/gcc/tree-ssa-threadedge.c
@@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
  #include params.h
  #include tree-ssa-threadedge.h
  #include builtins.h
+#include cfg.h
+#include cfganal.h

  /* To avoid code explosion due to jump threading, we limit the
 number of statements we are going to copy.  This variable
@@ -661,6 +663,7 @@ simplify_control_stmt_condition (edge e,
   rather than use a relational operator.  These are simpler to handle.  */
if (TREE_CODE (cond) == SSA_NAME)
  {
+  tree original_lhs = cond;
cached_lhs = cond;

/* Get the variable's current value from the equivalence chains.
@@ -689,6 +692,12 @@ simplify_control_stmt_condition (edge e,
 pass specific callback to try and simplify it further.  */
if (cached_lhs  ! is_gimple_min_invariant 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-25 Thread Richard Biener
On Mon, Nov 24, 2014 at 11:05 PM, Sebastian Pop seb...@gmail.com wrote:
 Jeff Law wrote:
 On 11/23/14 15:22, Sebastian Pop wrote:
 The second patch attached limits the search for FSM jump threads to loops.  
 With
 that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
 (and 424 jump threads on powerpc64-linux bootstrap.)
 
 Yea, that was one of the things I was going to poke at as well as a
 quick scan of your patch gave me the impression it wasn't limited to
 loops.

 Again, I haven't looked much at the patch, but I got the impression
 you're doing a backwards walk through the predecessors to discover
 the result of the COND_EXPR.  Correct?

 Yes.


 That's something I'd been wanting to do -- basically start with a
 COND_EXPR, then walk the dataflow backwards substituting values into
 the COND_EXPR (possibly creating non-gimple).  Ultimately the goal
 is to substitute and fold, getting to a constant :-)

 The forward exhaustive stuff we do now is, crazy.   The backwards
 approach could be decoupled from DOM  VRP into an independent pass,
 which I think would be wise.

 Using a SEME region copier is also something I really wanted to do
 long term.  In fact, I believe a lot of tree-ssa-threadupdate.c
 ought to be ripped out and replaced with a SEME based copier.

 I did an experiment around these lines over the week-end, and now that you
 mention it, I feel less shy to speak about; well the patch does not yet pass
 bootstrap, and there still are about 20 failing test-cases.  I feel better
 reading the code generation part of jump-threading after this patch ;-)
 Basically I think all the tree-ssa-threadupdate.c can be replaced by
 duplicate_seme_region that generalizes the code generation.

Btw I once thought about doing on-the-fly lattice use/update and folding
during basic-block copying (or even re-generating expressions via
simplifying gimple_build ()).  Or have a substitute-and-fold like
facility that can run on SEME regions and do this.

Richard.

 It appears you've built at least parts of two pieces needed to all
 this as a Bodik style optimizer.  Which is exactly the long term
 direction I think this code ought to take.


 
 One of the reasons I think we see more branches is that in sese region 
 copying we
 do not use the knowledge of the value of the condition for the last branch 
 in a
 jump-thread path: we rely on other propagation passes to remove the branch. 
  The
 last attached patch adds:
 
/* Remove the last branch in the jump thread path.  */
remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], 
  exit-dest);
 That's certainly a possibility.  But I would expect that even with
 this limitation something would be picking up the fact that the
 branch is statically computable (even if it's an RTL optimizer).
 But it's definitely something to look for.

 
 Please let me know if the attached patches are producing better results on 
 gcc.

 For the trunk:
   instructions:1339016494968
   branches :243568982489

 First version of your patch:

   instructions:1339739533291
   branches: 243806615986

 Latest version of your patch:

   instructions:1339749122609
   branches: 243809838262

 I think I got about the same results.

 I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
 valgrind was crashing not recognizing how to decode an instruction.  Then I
 moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
 compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
 there is a bit of noise in all these numbers)

 $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 
 ~/alias.ii

 all 4 patches:

 ==153617== I   refs:  13,914,038,211
 ==153617==
 ==153617== Branches:   1,926,407,760  (1,879,827,481 cond + 46,580,279 
 ind)
 ==153617== Mispredicts:  144,890,904  (  132,094,105 cond + 12,796,799 
 ind)
 ==153617== Mispred rate: 7.5% (  7.0% +   27.4%   
 )

 ==34993== I   refs:  13,915,335,629
 ==34993==
 ==34993== Branches:   1,926,597,919  (1,880,017,558 cond + 46,580,361 ind)
 ==34993== Mispredicts:  144,974,266  (  132,177,440 cond + 12,796,826 ind)
 ==34993== Mispred rate: 7.5% (  7.0% +   27.4%   )

 ==140841== I   refs:  13,915,334,459
 ==140841==
 ==140841== Branches:   1,926,597,819  (1,880,017,458 cond + 46,580,361 
 ind)
 ==140841== Mispredicts:  144,974,296  (  132,177,470 cond + 12,796,826 
 ind)
 ==140841== Mispred rate: 7.5% (  7.0% +   27.4%   
 )

 patch 1:

 ==99902== I   refs:  13,915,069,710
 ==99902==
 ==99902== Branches:   1,926,963,813  (1,880,376,148 cond + 46,587,665 ind)
 ==99902== Mispredicts:  145,501,564  (  132,656,576 cond + 12,844,988 ind)
 ==99902== Mispred rate: 7.5% (  7.0% +   27.5%   )

 ==3907== I   refs:  13,915,082,469
 ==3907==
 ==3907== Branches:   

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-25 Thread Markus Trippelsdorf
On 2014.11.24 at 22:05 +, Sebastian Pop wrote:
 I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
 valgrind was crashing not recognizing how to decode an instruction.  Then I
 moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
 compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
 there is a bit of noise in all these numbers)
 
 $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 
 ~/alias.ii

BTW perf is also available on gcc112:

trippels@gcc2-power8 ~ % perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles   [Hardware event]
  instructions   [Hardware event]
  cache-references   [Hardware event]
  cache-misses   [Hardware event]
  branch-instructions OR branches[Hardware event]
  branch-misses  [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend[Hardware event]
  stalled-cycles-backend OR idle-cycles-backend  [Hardware event]

  cpu-clock  [Software event]
  task-clock [Software event]
  page-faults OR faults  [Software event]
  context-switches OR cs [Software event]
  cpu-migrations OR migrations   [Software event]
  minor-faults   [Software event]
  major-faults   [Software event]
  alignment-faults   [Software event]
  emulation-faults   [Software event]
  dummy  [Software event]

  L1-dcache-loads[Hardware cache event]
  L1-dcache-load-misses  [Hardware cache event]
  L1-dcache-store-misses [Hardware cache event]
  L1-dcache-prefetches   [Hardware cache event]
  L1-icache-loads[Hardware cache event]
  L1-icache-load-misses  [Hardware cache event]
  L1-icache-prefetches   [Hardware cache event]
  LLC-loads  [Hardware cache event]
  LLC-load-misses[Hardware cache event]
  LLC-stores [Hardware cache event]
  LLC-store-misses   [Hardware cache event]
  LLC-prefetches [Hardware cache event]
  dTLB-load-misses   [Hardware cache event]
  iTLB-load-misses   [Hardware cache event]
  branch-loads   [Hardware cache event]
  branch-load-misses [Hardware cache event]

  rNNN   [Raw hardware event 
descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier  [Raw hardware event 
descriptor]
   (see 'man perf-list' on how to encode it)

  mem:addr[:access][Hardware breakpoint]

-- 
Markus


Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-25 Thread Jeff Law

On 11/24/14 21:55, Jeff Law wrote:

On 11/24/14 18:09, Sebastian Pop wrote:

Sebastian Pop wrote:

I removed the return -1 and started a bootstrap on powerpc64-linux.


Bootstrap passed on top of the 4 previous patches on powerpc64-linux.


I will report the valgrind output.


The output from valgrind looks closer to the output of master with no
other
patches: still 1M more instructions executed, and 300K more branches

Just ran my suite where we get ~25k more branches, which definitely puts
us in the noise.  (that's with all 4 patches + fixing the return value
).  I'm going to look at little closer at this stuff tomorrow, but I
think we've resolved the performance issue.  I'll dig deeper into the
implementation tomorrow as well.
I was running without your followup patches (must have used the wrong 
bits from my git stash), so those results are bogus, but in a good way.


After fixing that goof, I'm seeing consistent improvements with your set 
of 4 patches and the fix for the wrong return code.  Across the suite, 
~140M fewer branches, not huge, but definitely not in the noise.


So, time to dig into the implementation :-)

Jeff

ps.  In case you're curious about the noise, it's primarily address hashing.



Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-25 Thread Sebastian Pop
Jeff Law wrote:
 On 11/24/14 21:55, Jeff Law wrote:
 On 11/24/14 18:09, Sebastian Pop wrote:
 Sebastian Pop wrote:
 I removed the return -1 and started a bootstrap on powerpc64-linux.
 
 Bootstrap passed on top of the 4 previous patches on powerpc64-linux.
 
 I will report the valgrind output.
 
 The output from valgrind looks closer to the output of master with no
 other
 patches: still 1M more instructions executed, and 300K more branches
 Just ran my suite where we get ~25k more branches, which definitely puts
 us in the noise.  (that's with all 4 patches + fixing the return value
 ).  I'm going to look at little closer at this stuff tomorrow, but I
 think we've resolved the performance issue.  I'll dig deeper into the
 implementation tomorrow as well.
 I was running without your followup patches (must have used the
 wrong bits from my git stash), so those results are bogus, but in a
 good way.
 
 After fixing that goof, I'm seeing consistent improvements with your
 set of 4 patches and the fix for the wrong return code.  Across the
 suite, ~140M fewer branches, not huge, but definitely not in the
 noise.

Thanks for your testing.

 
 So, time to dig into the implementation :-)
 

To ease the review, I squashed all the patches in a single one.

I will bootstrap and regression test this patch on x86_64-linux and
powerpc64-linux.  I will also run it on our internal benchmarks, coremark, and
the llvm test-suite.

I will also include a longer testcase that makes sure we do not regress on
coremark.

Sebastian
From db0f6817768920b497225484fab24a20e5ddf556 Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.

	* tree-cfg.c (split_edge_bb_loc): Export.
	* tree-cfg.h (split_edge_bb_loc): Declared extern.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(fsm_thread_through_normal_block): Call find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_START_FSM_THREAD.
	(duplicate_seme_region): New.
	(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges
	calling gimple_duplicate_sese_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD.
---
 gcc/doc/invoke.texi  |  12 ++
 gcc/params.def   |  15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |  38 
 gcc/tree-cfg.c   |   2 +-
 gcc/tree-cfg.h   |   1 +
 gcc/tree-ssa-threadedge.c| 215 ++-
 gcc/tree-ssa-threadupdate.c  | 198 -
 gcc/tree-ssa-threadupdate.h  |   1 +
 8 files changed, 479 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 89edddb..074183f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item max-fsm-thread-length
+Maximum number of basic blocks on a finite state automaton jump thread
+path.  The default is 10.
+
+@item max-fsm-thread-paths
+Maximum number of new jump thread paths to create for a finite state
+automaton.  The default is 50.
+
 @end table
 @end table
 
diff --git a/gcc/params.def b/gcc/params.def
index 9b21c07..edf3f53 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE,
 	  Maximum number of statements to be included into a single static 
 	  constructor generated by Pointer Bounds Checker,
 	  5000, 100, 0)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS,
+	  max-fsm-thread-path-insns,
+	  Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path,
+	  100, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH,
+	  max-fsm-thread-length,
+	  Maximum number of basic blocks on a finite state automaton jump thread path,
+	  10, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS,
+	  max-fsm-thread-paths,
+	  Maximum number of new 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-25 Thread Sebastian Pop
Sebastian Pop wrote:
 I will bootstrap and regression test this patch on x86_64-linux and
 powerpc64-linux.  I will also run it on our internal benchmarks, coremark, and
 the llvm test-suite.
 
 I will also include a longer testcase that makes sure we do not regress on
 coremark.

Done all the above.  Attached is the new patch with a new testcase.  I have also
added verify_seme inspired by the recent patch adding verify_sese.

Sebastian
From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* tree-cfg.c (split_edge_bb_loc): Export.
	* tree-cfg.h (split_edge_bb_loc): Declared extern.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(fsm_thread_through_normal_block): Call find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_START_FSM_THREAD.
	(verify_seme): New.
	(duplicate_seme_region): New.
	(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges
	calling gimple_duplicate_sese_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.
	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New.
---
 gcc/doc/invoke.texi  |   12 ++
 gcc/params.def   |   15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 +
 gcc/tree-cfg.c   |2 +-
 gcc/tree-cfg.h   |1 +
 gcc/tree-ssa-threadedge.c|  215 +-
 gcc/tree-ssa-threadupdate.c  |  201 +++-
 gcc/tree-ssa-threadupdate.h  |1 +
 9 files changed, 614 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 89edddb..074183f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item max-fsm-thread-length
+Maximum number of basic blocks on a finite state automaton jump thread
+path.  The default is 10.
+
+@item max-fsm-thread-paths
+Maximum number of new jump thread paths to create for a finite state
+automaton.  The default is 50.
+
 @end table
 @end table
 
diff --git a/gcc/params.def b/gcc/params.def
index 9b21c07..edf3f53 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE,
 	  Maximum number of statements to be included into a single static 
 	  constructor generated by Pointer Bounds Checker,
 	  5000, 100, 0)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS,
+	  max-fsm-thread-path-insns,
+	  Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path,
+	  100, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH,
+	  max-fsm-thread-length,
+	  Maximum number of basic blocks on a finite state automaton jump thread path,
+	  10, 1, 99)
+
+DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS,
+	  max-fsm-thread-paths,
+	  Maximum number of new jump thread paths to create for a finite state automaton,
+	  50, 1, 99)
 /*
 
 Local variables:
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
new file mode 100644
index 000..bb34a74
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options -O2 -fdump-tree-dom1-details } */
+/* { dg-final { scan-tree-dump-times FSM 6 dom1 } } */
+/* { dg-final { cleanup-tree-dump dom1 } } */
+
+int sum0, sum1, sum2, sum3;
+int foo (char *s, char **ret)
+{
+  int state=0;
+  char c;
+
+  for (; *s  state != 4; s++)
+{
+  c = *s;
+  if (c == '*')
+	{
+	  s++;
+	  break;
+	}
+  switch (state)
+	{
+	case 0:
+	  if (c == '+')
+	state = 1;
+	  else if (c != '-')
+	sum0+=c;
+	  break;
+	case 1:
+	  if (c == '+')
+	

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Jeff Law

On 11/23/14 15:22, Sebastian Pop wrote:

The second patch attached limits the search for FSM jump threads to loops.  With
that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
(and 424 jump threads on powerpc64-linux bootstrap.)

Yea, that was one of the things I was going to poke at as well as a 
quick scan of your patch gave me the impression it wasn't limited to loops.


Again, I haven't looked much at the patch, but I got the impression 
you're doing a backwards walk through the predecessors to discover the 
result of the COND_EXPR.  Correct?


That's something I'd been wanting to do -- basically start with a 
COND_EXPR, then walk the dataflow backwards substituting values into the 
COND_EXPR (possibly creating non-gimple).  Ultimately the goal is to 
substitute and fold, getting to a constant :-)


The forward exhaustive stuff we do now is, crazy.   The backwards 
approach could be decoupled from DOM  VRP into an independent pass, 
which I think would be wise.


Using a SEME region copier is also something I really wanted to do long 
term.  In fact, I believe a lot of tree-ssa-threadupdate.c ought to be 
ripped out and replaced with a SEME based copier.


It appears you've built at least parts of two pieces needed to all this 
as a Bodik style optimizer.  Which is exactly the long term direction I 
think this code ought to take.





One of the reasons I think we see more branches is that in sese region copying 
we
do not use the knowledge of the value of the condition for the last branch in a
jump-thread path: we rely on other propagation passes to remove the branch.  The
last attached patch adds:

   /* Remove the last branch in the jump thread path.  */
   remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest);
That's certainly a possibility.  But I would expect that even with this 
limitation something would be picking up the fact that the branch is 
statically computable (even if it's an RTL optimizer).  But it's 
definitely something to look for.




Please let me know if the attached patches are producing better results on gcc.


For the trunk:
  instructions:1339016494968
  branches :243568982489

First version of your patch:

  instructions:1339739533291
  branches: 243806615986

Latest version of your patch:

  instructions:1339749122609
  branches: 243809838262

Which is in the noise for this test.  Which makes me wonder if I botched 
something on the latest run.  It doesn't appear so, but I'm re-running 
just to be sure.  I'm also turning on -g so that I can use cg_annotate 
to poke a bit deeper and perhaps identify one or more concrete examples 
where your patch is making this worse.


Jeff




Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Sebastian Pop
Jeff Law wrote:
 On 11/23/14 15:22, Sebastian Pop wrote:
 The second patch attached limits the search for FSM jump threads to loops.  
 With
 that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
 (and 424 jump threads on powerpc64-linux bootstrap.)
 
 Yea, that was one of the things I was going to poke at as well as a
 quick scan of your patch gave me the impression it wasn't limited to
 loops.
 
 Again, I haven't looked much at the patch, but I got the impression
 you're doing a backwards walk through the predecessors to discover
 the result of the COND_EXPR.  Correct?

Yes.

 
 That's something I'd been wanting to do -- basically start with a
 COND_EXPR, then walk the dataflow backwards substituting values into
 the COND_EXPR (possibly creating non-gimple).  Ultimately the goal
 is to substitute and fold, getting to a constant :-)
 
 The forward exhaustive stuff we do now is, crazy.   The backwards
 approach could be decoupled from DOM  VRP into an independent pass,
 which I think would be wise.
 
 Using a SEME region copier is also something I really wanted to do
 long term.  In fact, I believe a lot of tree-ssa-threadupdate.c
 ought to be ripped out and replaced with a SEME based copier.

I did an experiment around these lines over the week-end, and now that you
mention it, I feel less shy to speak about; well the patch does not yet pass
bootstrap, and there still are about 20 failing test-cases.  I feel better
reading the code generation part of jump-threading after this patch ;-)
Basically I think all the tree-ssa-threadupdate.c can be replaced by
duplicate_seme_region that generalizes the code generation.

 
 It appears you've built at least parts of two pieces needed to all
 this as a Bodik style optimizer.  Which is exactly the long term
 direction I think this code ought to take.
 
 
 
 One of the reasons I think we see more branches is that in sese region 
 copying we
 do not use the knowledge of the value of the condition for the last branch 
 in a
 jump-thread path: we rely on other propagation passes to remove the branch.  
 The
 last attached patch adds:
 
/* Remove the last branch in the jump thread path.  */
remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], 
  exit-dest);
 That's certainly a possibility.  But I would expect that even with
 this limitation something would be picking up the fact that the
 branch is statically computable (even if it's an RTL optimizer).
 But it's definitely something to look for.
 
 
 Please let me know if the attached patches are producing better results on 
 gcc.
 
 For the trunk:
   instructions:1339016494968
   branches :243568982489
 
 First version of your patch:
 
   instructions:1339739533291
   branches: 243806615986
 
 Latest version of your patch:
 
   instructions:1339749122609
   branches: 243809838262

I think I got about the same results.

I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
valgrind was crashing not recognizing how to decode an instruction.  Then I
moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
there is a bit of noise in all these numbers)

$ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 
~/alias.ii

all 4 patches:

==153617== I   refs:  13,914,038,211
==153617== 
==153617== Branches:   1,926,407,760  (1,879,827,481 cond + 46,580,279 ind)
==153617== Mispredicts:  144,890,904  (  132,094,105 cond + 12,796,799 ind)
==153617== Mispred rate: 7.5% (  7.0% +   27.4%   )

==34993== I   refs:  13,915,335,629
==34993== 
==34993== Branches:   1,926,597,919  (1,880,017,558 cond + 46,580,361 ind)
==34993== Mispredicts:  144,974,266  (  132,177,440 cond + 12,796,826 ind)
==34993== Mispred rate: 7.5% (  7.0% +   27.4%   )

==140841== I   refs:  13,915,334,459
==140841== 
==140841== Branches:   1,926,597,819  (1,880,017,458 cond + 46,580,361 ind)
==140841== Mispredicts:  144,974,296  (  132,177,470 cond + 12,796,826 ind)
==140841== Mispred rate: 7.5% (  7.0% +   27.4%   )

patch 1:

==99902== I   refs:  13,915,069,710
==99902== 
==99902== Branches:   1,926,963,813  (1,880,376,148 cond + 46,587,665 ind)
==99902== Mispredicts:  145,501,564  (  132,656,576 cond + 12,844,988 ind)
==99902== Mispred rate: 7.5% (  7.0% +   27.5%   )

==3907== I   refs:  13,915,082,469
==3907== 
==3907== Branches:   1,926,965,218  (1,880,377,471 cond + 46,587,747 ind)
==3907== Mispredicts:  145,501,569  (  132,656,554 cond + 12,845,015 ind)
==3907== Mispred rate: 7.5% (  7.0% +   27.5%   )

==44271== I   refs:  13,915,111,997
==44271== 
==44271== Branches:   1,926,968,863  (1,880,380,952 cond + 46,587,911 ind)
==44271== Mispredicts:  145,501,858 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Sebastian Pop
Sebastian Pop wrote:
  Using a SEME region copier is also something I really wanted to do
  long term.  In fact, I believe a lot of tree-ssa-threadupdate.c
  ought to be ripped out and replaced with a SEME based copier.
 
 I did an experiment around these lines over the week-end, and now that you
 mention it, I feel less shy to speak about; well the patch does not yet pass
 bootstrap, and there still are about 20 failing test-cases.  I feel better
 reading the code generation part of jump-threading after this patch ;-)
 Basically I think all the tree-ssa-threadupdate.c can be replaced by
 duplicate_seme_region that generalizes the code generation.

For reference, here is the patch I am speaking about.

Sebastian
From c7213811e2ec2443df9ffc3ca72b3b15a6c9aaf9 Mon Sep 17 00:00:00 2001
From: Sebastian Pop seb...@gmail.com
Date: Fri, 21 Nov 2014 13:17:12 -0600
Subject: [PATCH 2/3] use duplicate_seme to generate code for jump threading

---
 gcc/tree-cfg.c  |  142 +++
 gcc/tree-cfg.h  |2 +
 gcc/tree-ssa-threadedge.c   |   61 +-
 gcc/tree-ssa-threadupdate.c | 2487 +--
 gcc/tree-ssa-threadupdate.h |   23 +-
 5 files changed, 222 insertions(+), 2493 deletions(-)

diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 6d96c52..d6dc442 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -6120,6 +6120,148 @@ gimple_duplicate_sese_region (edge entry, edge exit,
   return true;
 }
 
+/* Duplicates a REGION (set of N_REGION basic blocks).  The edge ENTRY is
+   redirected to the duplicate of the region.  Dominance and loop information is
+   updated if UPDATE_DOMINANCE is true, but not the SSA web.  If
+   UPDATE_DOMINANCE is false then we assume that the caller will update the
+   dominance information after calling this function.  The new basic blocks are
+   stored to REGION_COPY in the same order as they had in REGION, provided that
+   REGION_COPY is not NULL.
+
+   Returns false if it is unable to copy the region, true otherwise.  */
+
+bool
+gimple_duplicate_seme_region (edge entry,
+			  basic_block *region, unsigned n_region,
+			  basic_block *region_copy,
+			  bool update_dominance)
+{
+  unsigned i;
+  bool free_region_copy = false, copying_header = false;
+  struct loop *loop = entry-dest-loop_father;
+  vecbasic_block doms;
+  edge redirected;
+  int memo_loop_header_no = 0, memo_loop_latch_no = 0;
+  int total_freq = 0, entry_freq = 0;
+  gcov_type total_count = 0, entry_count = 0;
+
+  if (!can_copy_bbs_p (region, n_region))
+return false;
+
+  /* Some sanity checking.  Note that we do not check for all possible
+ missuses of the functions.  I.e. if you ask to copy something weird,
+ it will work, but the state of structures probably will not be
+ correct.  */
+  for (i = 0; i  n_region; i++)
+{
+  /* We do not handle subloops, i.e. all the blocks must belong to the
+	 same loop.  */
+  if (region[i]-loop_father != loop)
+	return false;
+
+  /* If we are copying a region that starts and ends in an arbirary place in
+	 the loop: keep track of which block will become our loop header.  */
+  if (region[i] != entry-dest  region[i] == loop-header)
+	memo_loop_header_no = i;
+
+  /* And which block will become our loop latch.  */
+  if (region[i] != entry-src  region[i] == loop-latch)
+	memo_loop_latch_no = i;
+}
+
+  initialize_original_copy_tables ();
+
+  if (copying_header)
+set_loop_copy (loop, loop_outer (loop));
+  else
+set_loop_copy (loop, loop);
+
+  if (!region_copy)
+{
+  region_copy = XNEWVEC (basic_block, n_region);
+  free_region_copy = true;
+}
+
+  /* Record blocks outside the region that are dominated by something
+ inside.  */
+  if (update_dominance)
+{
+  doms.create (0);
+  doms = get_dominated_by_region (CDI_DOMINATORS, region, n_region);
+}
+
+  if (entry-dest-count)
+{
+  total_count = entry-dest-count;
+  entry_count = entry-count;
+  /* Fix up corner cases, to avoid division by zero or creation of negative
+	 frequencies.  */
+  if (entry_count  total_count)
+	entry_count = total_count;
+}
+  else
+{
+  total_freq = entry-dest-frequency;
+  entry_freq = EDGE_FREQUENCY (entry);
+  /* Fix up corner cases, to avoid division by zero or creation of negative
+	 frequencies.  */
+  if (total_freq == 0)
+	total_freq = 1;
+  else if (entry_freq  total_freq)
+	entry_freq = total_freq;
+}
+
+  copy_bbs (region, n_region, region_copy, NULL, 0, NULL, loop,
+	split_edge_bb_loc (entry), update_dominance);
+  if (total_count)
+{
+  scale_bbs_frequencies_gcov_type (region, n_region,
+   total_count - entry_count,
+   total_count);
+  scale_bbs_frequencies_gcov_type (region_copy, n_region, entry_count,
+   total_count);
+}
+  else
+{
+  scale_bbs_frequencies_int (region, n_region, total_freq - entry_freq,
+ total_freq);
+  

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Jeff Law

On 11/24/14 15:05, Sebastian Pop wrote:


I did an experiment around these lines over the week-end, and now that you
mention it, I feel less shy to speak about; well the patch does not yet pass
bootstrap, and there still are about 20 failing test-cases.  I feel better
reading the code generation part of jump-threading after this patch ;-)
Basically I think all the tree-ssa-threadupdate.c can be replaced by
duplicate_seme_region that generalizes the code generation.
Clearly next stage1 stuff, but definitely the right direction IMHO.  If 
you get the chance look at Bodik's thesis.As far as I know he's the 
only person to really look at how to structure context sensitive 
optimizations in a sane way.






I got my scripts installed on the gcc-farm.  I first used an x86_64 gcc75 and
valgrind was crashing not recognizing how to decode an instruction.  Then I
moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus
compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because
there is a bit of noise in all these numbers)
Yea, glibc  valgrind really need to update in lock-step as glibc gains 
support for new ISAs.  Certain instructions are supposed to be 
interpreted as nops, but valgrind instead raises an illegal instruction.


There's a bit of noise when using valgrind like this, but it has 
definitely proven useful in the past.


I'm looking at bitmap_ior_and_compl right now.  Not sure if cg_annotate 
is sending me on a wild goose chase yet or not.


jeff



Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Jeff Law

On 11/23/14 15:22, Sebastian Pop wrote:

Jeff Law wrote:

PS: I have run some perf analysis with the patch:
- on a bootstrap of GCC I see 3209 FSM jump threads
- libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
   (measured on simulators and reduced data sets)
- coremark gets jump threaded (as expected)
- I'm setting up the llvm test-suite and I will report perf differences

So that's *far* more jump threads than I would expect this to find
in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
expect to find.


The second patch attached limits the search for FSM jump threads to loops.  With
that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
(and 424 jump threads on powerpc64-linux bootstrap.)

[ ... ]

So why are we returning -1 (block should not be duplicated and not 
suitable for a joiner) at the end of thread_through_normal_block?



  /* When COND cannot be simplified, try to find paths from a control
 statement back through the PHI nodes which would affect that 
control

 statement.  */
  vecbasic_block, va_gc *bb_path;
  vec_alloc (bb_path, n_basic_blocks_for_fn (cfun));
  vec_safe_push (bb_path, e-dest);
  hash_setgimple *visited_phis = new hash_setgimple;

  max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
  fsm_find_control_statement_thread_paths (cond, visited_phis, 
bb_path);


  delete visited_phis;
  vec_free (bb_path);

  return -1;

Returning -1 (instead of 0) says stop, there's no possibility to 
threading something on that path.   I think that's suppressing some 
profitable jump threads.  I haven't done  more than verify the bitmap 
code returns to its prior state with that change.


jeff




Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Sebastian Pop
Jeff Law wrote:
 On 11/23/14 15:22, Sebastian Pop wrote:
 Jeff Law wrote:
 PS: I have run some perf analysis with the patch:
 - on a bootstrap of GCC I see 3209 FSM jump threads
 - libpng and libjpeg contain FSM jump threads, the perf increase is in the 
 1%
(measured on simulators and reduced data sets)
 - coremark gets jump threaded (as expected)
 - I'm setting up the llvm test-suite and I will report perf differences
 So that's *far* more jump threads than I would expect this to find
 in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
 expect to find.
 
 The second patch attached limits the search for FSM jump threads to loops.  
 With
 that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
 (and 424 jump threads on powerpc64-linux bootstrap.)
 [ ... ]
 
 So why are we returning -1 (block should not be duplicated and not
 suitable for a joiner) at the end of thread_through_normal_block?
 
 
   /* When COND cannot be simplified, try to find paths from a control
  statement back through the PHI nodes which would affect
 that control
  statement.  */
   vecbasic_block, va_gc *bb_path;
   vec_alloc (bb_path, n_basic_blocks_for_fn (cfun));
   vec_safe_push (bb_path, e-dest);
   hash_setgimple *visited_phis = new hash_setgimple;
 
   max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
   fsm_find_control_statement_thread_paths (cond, visited_phis,
 bb_path);
 
   delete visited_phis;
   vec_free (bb_path);
 
   return -1;
 
 Returning -1 (instead of 0) says stop, there's no possibility to
 threading something on that path.   I think that's suppressing some
 profitable jump threads.  

Thanks for spotting this.

 I haven't done  more than verify the
 bitmap code returns to its prior state with that change.

I removed the return -1 and started a bootstrap on powerpc64-linux.
I will report the valgrind output.

Thanks,
Sebastian


Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Sebastian Pop
Sebastian Pop wrote:
 I removed the return -1 and started a bootstrap on powerpc64-linux.

Bootstrap passed on top of the 4 previous patches on powerpc64-linux.

 I will report the valgrind output.

The output from valgrind looks closer to the output of master with no other
patches: still 1M more instructions executed, and 300K more branches

master no-patch:

==129233== I   refs:  13,910,221,913
==129233==
==129233== Branches:   1,925,715,095  (1,879,277,776 cond + 46,437,319 ind)
==129233== Mispredicts:  144,133,332  (  131,510,534 cond + 12,622,798 ind)
==129233== Mispred rate: 7.4% (  6.9% +   27.1%   )

4 previous patches + patch to return 0:

==149012== I   refs:  13,911,870,743
==149012== 
==149012== Branches:   1,926,092,629  (1,879,657,768 cond + 46,434,861 ind)
==149012== Mispredicts:  145,551,513  (  132,827,091 cond + 12,724,422 ind)
==149012== Mispred rate: 7.5% (  7.0% +   27.4%   )

==4492== I   refs:  13,911,899,691
==4492== 
==4492== Branches:   1,926,096,214  (1,879,661,186 cond + 46,435,028 ind)
==4492== Mispredicts:  145,551,707  (  132,827,231 cond + 12,724,476 ind)
==4492== Mispred rate: 7.5% (  7.0% +   27.4%   )

==19521== I   refs:  13,911,855,711
==19521== 
==19521== Branches:   1,926,090,982  (1,879,656,202 cond + 46,434,780 ind)
==19521== Mispredicts:  145,551,343  (  132,826,948 cond + 12,724,395 ind)
==19521== Mispred rate: 7.5% (  7.0% +   27.4%   )




Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-24 Thread Jeff Law

On 11/24/14 18:09, Sebastian Pop wrote:

Sebastian Pop wrote:

I removed the return -1 and started a bootstrap on powerpc64-linux.


Bootstrap passed on top of the 4 previous patches on powerpc64-linux.


I will report the valgrind output.


The output from valgrind looks closer to the output of master with no other
patches: still 1M more instructions executed, and 300K more branches
Just ran my suite where we get ~25k more branches, which definitely puts 
us in the noise.  (that's with all 4 patches + fixing the return value 
).  I'm going to look at little closer at this stuff tomorrow, but I 
think we've resolved the performance issue.  I'll dig deeper into the 
implementation tomorrow as well.


Cheers,
Jeff



Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-23 Thread Sebastian Pop
Jeff Law wrote:
 PS: I have run some perf analysis with the patch:
 - on a bootstrap of GCC I see 3209 FSM jump threads
 - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
(measured on simulators and reduced data sets)
 - coremark gets jump threaded (as expected)
 - I'm setting up the llvm test-suite and I will report perf differences
 So that's *far* more jump threads than I would expect this to find
 in a bootstrap of GCC -- like 3 orders of magnitude more than I'd
 expect to find.

The second patch attached limits the search for FSM jump threads to loops.  With
that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap
(and 424 jump threads on powerpc64-linux bootstrap.)

 I haven't dug deep, but the first level analysis is not encouraging.
 
 Basically I used the trunk compiler with and without your patch to
 build gcc-4.7.3's cc1 (4.7.3 simply because that's what I last used
 this testing framework).  So that gives me two cc1s that I then use
 to compile a bunch of .i files under valgrind's (cachegrind)
 control.
 
 valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ..
 
 That gives me two hunks of data for each input file I test.
 Specifically I get the dynamic number of instructions and the
 dynamic number of branches executed.
 
 For jump threading those values correspond directly to the effect
 we're looking for -- a reduction in dynamic conditional jumps and a
 reduction in dynamic instructions executed.  Typically the change in
 dynamic instructions executed is 2-3X the change in dynamic
 conditional jumps -- which makes sense as removing the conditional
 jump usually means we remove a comparison and possibly some setup
 code as well.
 
 Consistently with your patch those values get worse.   Across the
 entire set of .i files I get
 
 For the trunk:
 
 instructions:1339016494968
 branches: 243568982489
 
 With your patch:
 
 instructions:1339739533291
 branches: 243806615986
 
 
 So that's 723038323 more instructions and 237633497 more branches
 after installing your patch.  While we're looking a just under .1%
 regression in dynamic branches, that's a terrible result for this
 work.

One of the reasons I think we see more branches is that in sese region copying 
we
do not use the knowledge of the value of the condition for the last branch in a
jump-thread path: we rely on other propagation passes to remove the branch.  The
last attached patch adds:

  /* Remove the last branch in the jump thread path.  */
  remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest);

Please let me know if the attached patches are producing better results on gcc.

I rebased the original patch on trunk and all patches bootstrapped together on
x86_64-linux and powerpc64-linux.

Thanks,
Sebastian
From 120d5490598b1a09a06c04796b4fda46be7fd7db Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH 1/4] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.

	* tree-cfg.c (gimple_duplicate_sese_region): Save and restore loop
	header and latch.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(fsm_thread_through_normal_block): Call find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_START_FSM_THREAD.
	(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges
	calling gimple_duplicate_sese_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD.
---
 gcc/doc/invoke.texi  |  12 ++
 gcc/params.def   |  15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |  38 +
 gcc/tree-cfg.c   |  26 ++-
 gcc/tree-ssa-threadedge.c| 203 ++-
 gcc/tree-ssa-threadupdate.c  |  52 +-
 gcc/tree-ssa-threadupdate.h  |   1 +
 7 files changed, 342 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 89edddb..074183f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-22 Thread Jeff Law

On 11/18/14 15:19, Sebastian Pop wrote:

The regions that we duplicate start inside a loop and stay inside the same loop,
and the jump threading path is not allowed to go in deeper nested loops.

The reason why we need to modify the sese duplication function is that the sese
region that we need to duplicate starts at an arbitrary place inside the loop,
whereas the current user of the sese duplication function tree-ssa-loop-ch.c:245
starts at the edge entering the loop and exits at the latch edge.


I'll leave the rest to Jeff but it looks good to me from an overall
structure.


Thanks for your review.

Sebastian

PS: Patch passed bootstrap and regtest on x86_64-linux.

PS: I have run some perf analysis with the patch:
- on a bootstrap of GCC I see 3209 FSM jump threads
- libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
   (measured on simulators and reduced data sets)
- coremark gets jump threaded (as expected)
- I'm setting up the llvm test-suite and I will report perf differences
So that's *far* more jump threads than I would expect this to find in a 
bootstrap of GCC -- like 3 orders of magnitude more than I'd expect to find.


I haven't dug deep, but the first level analysis is not encouraging.

Basically I used the trunk compiler with and without your patch to build 
gcc-4.7.3's cc1 (4.7.3 simply because that's what I last used this 
testing framework).  So that gives me two cc1s that I then use to 
compile a bunch of .i files under valgrind's (cachegrind) control.


valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ..

That gives me two hunks of data for each input file I test. 
Specifically I get the dynamic number of instructions and the dynamic 
number of branches executed.


For jump threading those values correspond directly to the effect we're 
looking for -- a reduction in dynamic conditional jumps and a reduction 
in dynamic instructions executed.  Typically the change in dynamic 
instructions executed is 2-3X the change in dynamic conditional jumps -- 
which makes sense as removing the conditional jump usually means we 
remove a comparison and possibly some setup code as well.


Consistently with your patch those values get worse.   Across the entire 
set of .i files I get


For the trunk:

instructions:1339016494968
branches: 243568982489

With your patch:

instructions:1339739533291
branches: 243806615986


So that's 723038323 more instructions and 237633497 more branches after 
installing your patch.  While we're looking a just under .1% regression 
in dynamic branches, that's a terrible result for this work.


I'm not sure if the threads you're optimizing are somehow hiding other 
jump threading opportunities or somehow hiding CSE-able jump conditions, 
mucking up a loop structure or something else but something very bad is 
happening here.


If I put Steve's patch through the same testing I get:

instructions:1339006760834
branches: 243565768224

Which you'll note is a *very* slight decrease of 3214265 dynamic 
branches as 9734134 total instructions executed.


So I think we need to dig deeper into why the branching behaviour of GCC 
gets noticeably worse with your patch when it should be as good as or 
better than without your patch.


I know when i was analyzing the last update to this code, I found cases 
where we're much better off taking the shorter jump threading path 
without a joiner rather than preferring the long path with a joiner. 
IIRC, the issue was that if we selected the joiner path, then the 
duplication would create another jump threading opportunity (the 
original, shorter path without a join) that wouldn't be seen until the 
next pass of jump threading.


Jeff




Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-19 Thread Jeff Law

On 10/26/14 15:34, Sebastian Pop wrote:


I have tried to understand why the code generation part ICEs on coremark: the
first problem that I have seen is that tree-ssa-threadupdate.c does not handle
more than a joiner block per path to be threaded, so we would not be able to
jump thread accross the joiners of the if condition and the joiner of the switch
condition: i.e., these paths
Right.  There's nothing I can think of inherently that makes that 
impossible, it's just not something the code currently tries to support. 
 I suspect there's a few places that need light generalization.  I can 
offhand think of one or two.


It's not lost on me that what we're building is a specialized region 
cloner.  I keep wanting to go back and see if there's a representation 
where we have an incoming edge, series of blocks and an outgoing edge. 
The concept of a join block really isn't important.  It's just a block 
that we want to copy for which we don't know its destination.


Anyway, within that representation, I think the way to wire up the edges 
is simple if we build a map at the beginning of the process.   The set 
of blocks in the path all get clones.  There's a single edge into the 
cloned path (the incoming edge) and a single edge out (the edge 
corresponding to the statically computed destination of the path). 
Edges that were interior in the original path are kept interior in the 
clone.  Edges that reached outside the original path go from the clone 
to the original destinations outside the path.  It's a SEME region.





Another problem is that we attach the path to be threaded to the -aux field of
the first edge in the path, such that we would have to cancel some of the paths
because we cannot keep track of all the paths to be threaded.
What I can easily see is cases where you have two paths starting at a 
particular node where one path is a strict subset of another path. 
Assuming the amount of copying we're doing is reasonable, then you'd 
want to keep just the longer path


But I don't think that's the case you're struggling with.  We could (for 
example) have a case where all the successors of a join block become 
threadable, possibly to different destinations.


That would argue that we really want to handle the threads off a 
different data structure than e-aux.



jeff


Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-18 Thread Steve Ellcey
On Mon, 2014-11-17 at 09:24 +, James Greenhalgh wrote:

 For what it is worth, I've bootstrapped and tested this patch on
 aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and
 both targets get the expected speedup in the interesting benchmark.
 I've also thrown some of the larger popular benchmark suites at it, and
 they've compiled and run without any compilation issues or miscompares.
 
 I'm happy to help out with the testing and bug-triaging effort once this
 patch goes in.
 
 Some very shallow comments: you should document the new parameters
 in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh
 on the patch and clean up the coding style issues it highlights.
 
 Thanks,
 James Greenhalgh

I tested the patch on MIPS and things looked good there too.  I got the
desired speedup and did not see any regressions.

Steve Ellcey



Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-18 Thread Jeff Law

On 11/18/14 12:25, Steve Ellcey wrote:

On Mon, 2014-11-17 at 09:24 +, James Greenhalgh wrote:


For what it is worth, I've bootstrapped and tested this patch on
aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and
both targets get the expected speedup in the interesting benchmark.
I've also thrown some of the larger popular benchmark suites at it, and
they've compiled and run without any compilation issues or miscompares.

I'm happy to help out with the testing and bug-triaging effort once this
patch goes in.

Some very shallow comments: you should document the new parameters
in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh
on the patch and clean up the coding style issues it highlights.

Thanks,
James Greenhalgh


I tested the patch on MIPS and things looked good there too.  I got the
desired speedup and did not see any regressions.
It's on my list of things to look at.  The patch clearly was in prior to 
stage1 close and is thus eligible for inclusion in GCC 5.  Slogging 
through stuff as quickly as I can.



jeff


Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-18 Thread Sebastian Pop
Richard Biener wrote:
 On Tue, Nov 11, 2014 at 2:14 AM, Sebastian Pop seb...@gmail.com wrote:
  Hi Jeff,
 
  I have adapted the code generation part from James' patch to current trunk, 
  and
  the resulting patch gets the 30% speedup on coremark and passes bootstrap 
  of GCC.
 
  Ok for trunk?
 
 In addition to missing documentation for the parameters
 
 +  /* If we are copying an abnormally shaped region, keep track of
 +which block will become our loop header.  */
 +  if ((region[i] != entry-dest  region[i] == loop-header)
 + || (region[i] != entry-src  region[i] == loop-latch))
 +   {
 + save_loop_details = true;
 + memo_loop_latch_no = i;
 + memo_loop_header_no = i;
 +   }
 
 this looks bogus as you overwrite latch/header.  

Right, this should be:

if (region[i] != entry-dest  region[i] == loop-header)
  {
save_loop_details = true;
memo_loop_header_no = i;
  }

if (region[i] != entry-src  region[i] == loop-latch)
  {
save_loop_details = true;
memo_loop_latch_no = i;
  }


 I wonder what you
 tried to fix with this as abnormally shaped isn't sth we support
 given the check before (all blocks must belong to the same loop
 and thus entry is always the loop header or there is no loop)?
 

The regions that we duplicate start inside a loop and stay inside the same loop,
and the jump threading path is not allowed to go in deeper nested loops.

The reason why we need to modify the sese duplication function is that the sese
region that we need to duplicate starts at an arbitrary place inside the loop,
whereas the current user of the sese duplication function tree-ssa-loop-ch.c:245
starts at the edge entering the loop and exits at the latch edge.

 I'll leave the rest to Jeff but it looks good to me from an overall
 structure.
 

Thanks for your review.

Sebastian

PS: Patch passed bootstrap and regtest on x86_64-linux.

PS: I have run some perf analysis with the patch:
- on a bootstrap of GCC I see 3209 FSM jump threads
- libpng and libjpeg contain FSM jump threads, the perf increase is in the 1%
  (measured on simulators and reduced data sets)
- coremark gets jump threaded (as expected)
- I'm setting up the llvm test-suite and I will report perf differences
From aee8e01469c05e370b757b20c357a1c9dae57950 Mon Sep 17 00:00:00 2001
From: Sebastian Pop s@samsung.com
Date: Fri, 26 Sep 2014 14:54:20 -0500
Subject: [PATCH] extend jump thread for finite state automata PR 54742

Adapted from a patch from James Greenhalgh.

	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): New.

	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
	max-fsm-thread-paths): Documented.

	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.

	* tree-cfg.c (gimple_duplicate_sese_region): Save and restore loop
	header and latch.

	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
	original value of cond when simplification fails.
	(fsm_find_thread_path): New.
	(fsm_find_control_statement_thread_paths): New.
	(fsm_thread_through_normal_block): Call find_control_statement_thread_paths.

	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
	EDGE_START_FSM_THREAD.
	(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges
	calling gimple_duplicate_sese_region.

	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD.
---
 gcc/doc/invoke.texi  |  12 ++
 gcc/params.def   |  15 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |  38 +
 gcc/tree-cfg.c   |  26 ++-
 gcc/tree-ssa-threadedge.c| 201 ++-
 gcc/tree-ssa-threadupdate.c  |  52 +-
 gcc/tree-ssa-threadupdate.h  |   1 +
 7 files changed, 340 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 13270bc..613edbb 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10586,6 +10586,18 @@ large and significantly increase compile time at optimization level
 @option{-O1} and higher.  This parameter is a maximum nubmer of statements
 in a single generated constructor.  Default value is 5000.
 
+@item max-fsm-thread-path-insns
+Maximum number of instructions to copy when duplicating blocks on a
+finite state automaton jump thread path.  The default is 100.
+
+@item max-fsm-thread-length
+Maximum number of basic blocks on a finite state automaton jump thread
+path.  The default is 10.
+
+@item max-fsm-thread-paths
+Maximum number of new jump thread paths to create for a finite state
+automaton.  The default is 50.
+
 @end table
 @end table
 
diff --git a/gcc/params.def b/gcc/params.def
index d2d2add..55c287e 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1125,6 

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-17 Thread James Greenhalgh
On Tue, Nov 11, 2014 at 01:14:04AM +, Sebastian Pop wrote:
 Hi Jeff,
 
 I have adapted the code generation part from James' patch to current trunk, 
 and
 the resulting patch gets the 30% speedup on coremark and passes bootstrap of 
 GCC.

For what it is worth, I've bootstrapped and tested this patch on
aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and
both targets get the expected speedup in the interesting benchmark.
I've also thrown some of the larger popular benchmark suites at it, and
they've compiled and run without any compilation issues or miscompares.

I'm happy to help out with the testing and bug-triaging effort once this
patch goes in.

Some very shallow comments: you should document the new parameters
in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh
on the patch and clean up the coding style issues it highlights.

Thanks,
James Greenhalgh

 diff --git a/gcc/params.def b/gcc/params.def
 index d2d2add..749f962 100644
 --- a/gcc/params.def
 +++ b/gcc/params.def
 @@ -123,6 +123,25 @@ DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY,
 Maximum probability of the entry BB of split region (in percent 
 relative to entry BB of the function) to make partial inlining happen,
 70, 0, 0)
  
 +/* Maximum number of instructions to copy when duplicating blocks
 +   on a jump thread path.  */
 +DEFPARAM (PARAM_MAX_THREAD_PATH_INSNS,
 +   max-thread-path-insns,
 +   Maximum number of instructions to copy when duplicating blocks on a 
 jump thread path,
 +   100, 1, 99)
 +
 +/* Maximum length of a jump thread path.  */
 +DEFPARAM (PARAM_MAX_THREAD_LENGTH,
 +   max-thread-length,
 +   Maximum number of basic blocks on a jump thread path,
 +   10, 1, 99)
 +
 +/* Maximum number of jump thread paths to duplicate.  */
 +DEFPARAM (PARAM_MAX_THREAD_PATHS,
 +   max-thread-paths,
 +   Maximum number of new jump thread paths to create,
 +   50, 1, 99)
 +
  /* Limit the number of expansions created by the variable expansion
 optimization to avoid register pressure.  */
  DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS,
 diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c 
 b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 new file mode 100644
 index 000..f3ef725
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 @@ -0,0 +1,32 @@
 +int sum0, sum1, sum2, sum3;
 +int foo(char * s, char** ret)
 +{
 +  int state=0;
 +  char c;
 +
 +  for (; *s  state != 4; s++)
 +{
 +  c = *s;
 +  if (c == '*')
 + {
 +   s++;
 +   break;
 + }
 +  switch (state) {
 + case 0:
 +   if (c == '+') state = 1;
 +   else if (c != '-') sum0+=c;
 +   break;
 + case 1:
 +   if (c == '+') state = 2;
 +   else if (c == '-') state = 0;
 +   else sum1+=c;
 +   break;
 + default:
 +   break;
 +  }
 +
 +}
 +  *ret = s;
 +  return state;
 +}
 diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
 index ee10bc6..565cfe3 100644
 --- a/gcc/tree-cfg.c
 +++ b/gcc/tree-cfg.c
 @@ -5949,10 +5949,12 @@ gimple_duplicate_sese_region (edge entry, edge exit,
  {
unsigned i;
bool free_region_copy = false, copying_header = false;
 +  bool save_loop_details = false;
struct loop *loop = entry-dest-loop_father;
edge exit_copy;
vecbasic_block doms;
edge redirected;
 +  int memo_loop_header_no = 0, memo_loop_latch_no = 0;
int total_freq = 0, entry_freq = 0;
gcov_type total_count = 0, entry_count = 0;
  
 @@ -5970,9 +5972,15 @@ gimple_duplicate_sese_region (edge entry, edge exit,
if (region[i]-loop_father != loop)
   return false;
  
 -  if (region[i] != entry-dest
 -region[i] == loop-header)
 - return false;
 +  /* If we are copying an abnormally shaped region, keep track of
 +  which block will become our loop header.  */
 +  if ((region[i] != entry-dest  region[i] == loop-header)
 +   || (region[i] != entry-src  region[i] == loop-latch))
 + {
 +   save_loop_details = true;
 +   memo_loop_latch_no = i;
 +   memo_loop_header_no = i;
 + }
  }
  
/* In case the function is used for loop header copying (which is the 
 primary
 @@ -6055,6 +6063,13 @@ gimple_duplicate_sese_region (edge entry, edge exit,
loop-latch = exit-src;
  }
  
 +  /* Restore loop details if we were asked to save them.  */
 +  if (save_loop_details)
 +{
 +  loop-header = region_copy[memo_loop_header_no];
 +  loop-latch = region_copy[memo_loop_latch_no];
 +}
 +
/* Redirect the entry and add the phi node arguments.  */
redirected = redirect_edge_and_branch (entry, get_bb_copy (entry-dest));
gcc_assert (redirected != NULL);
 diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
 index d5b9696..de9b3fe 100644
 --- a/gcc/tree-ssa-threadedge.c
 +++ b/gcc/tree-ssa-threadedge.c
 @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
  

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-17 Thread Richard Biener
On Tue, Nov 11, 2014 at 2:14 AM, Sebastian Pop seb...@gmail.com wrote:
 Hi Jeff,

 I have adapted the code generation part from James' patch to current trunk, 
 and
 the resulting patch gets the 30% speedup on coremark and passes bootstrap of 
 GCC.

 Ok for trunk?

In addition to missing documentation for the parameters

+  /* If we are copying an abnormally shaped region, keep track of
+which block will become our loop header.  */
+  if ((region[i] != entry-dest  region[i] == loop-header)
+ || (region[i] != entry-src  region[i] == loop-latch))
+   {
+ save_loop_details = true;
+ memo_loop_latch_no = i;
+ memo_loop_header_no = i;
+   }

this looks bogus as you overwrite latch/header.  I wonder what you
tried to fix with this as abnormally shaped isn't sth we support
given the check before (all blocks must belong to the same loop
and thus entry is always the loop header or there is no loop)?

I'll leave the rest to Jeff but it looks good to me from an overall
structure.

Thanks,
Richard.



 Thanks,
 Sebastian

 Sebastian Pop wrote:
 Sebastian Pop wrote:
  Jeff Law wrote:
   On 08/21/14 04:30, Richard Biener wrote:
   It turns Jeff's jump-threading code in to a strange franken-pass of 
   bits and
   pieces of detection and optimisation, and would need some substantial
   reworking to fit in with Jeff's changes last Autumn, but if it is more
   likely to be acceptable for trunk then perhaps we could look to revive 
   it.
   It would be nice to reuse the path copy code Jeff added last year, but 
   I
   don't have much intuition as to how feasible that is.
   
   Was this the sort of thing that you were imagining?
   
   Yeah, didn't look too closely though.
   It'd be pretty ugly I suspect.  But it's probably worth pondering
   since that approach would eliminate the concerns about the cost of
   detection (which is problematical for the jump threader) by using
   Steve's code for that.
  
   On the update side, I suspect most, if not all of the framework is
   in place to handle this kind of update if the right threading paths
   were passed to the updater.  I can probably cobble together that
   by-hand and see what the tree-ssa-threadupdate does with it.  But
   it'll be a week or so before I could look at it.
 
  I adapted the patch James has sent last year to use the new update paths

 Attached an updated version of the patch.

  mechanism.  I verified that the attached patch does register all the paths 
  that
  need to be threaded.  Thread updater seems to have some problems handling 
  the
  attached testcase (a simplified version of the testcase attached to the 
  bug.)
 
  Jeff, could you please have a look at why the jump-thread updater is 
  crashing?

 I have tried to understand why the code generation part ICEs on coremark: the
 first problem that I have seen is that tree-ssa-threadupdate.c does not 
 handle
 more than a joiner block per path to be threaded, so we would not be able to
 jump thread accross the joiners of the if condition and the joiner of the 
 switch
 condition: i.e., these paths

 patch:   Registering jump thread: (7, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 14) 
 nocopy;
 patch:   Registering jump thread: (28, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 11) 
 nocopy;
 patch:   Registering jump thread: (8, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 17) 
 nocopy;
 patch:   Registering jump thread: (9, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 25) 
 nocopy;

 Another problem is that we attach the path to be threaded to the -aux field 
 of
 the first edge in the path, such that we would have to cancel some of the 
 paths
 because we cannot keep track of all the paths to be threaded.

 For coremark, we would discover some jump-thread paths from one of the switch
 cases over the loop exit condition, either to bb_27 outside the loop, or to 
 bb_4
 staying inside the loop.  Then with the patch: we would discover jump 
 threads
 that would thread switch cases to switch cases, and because these paths start
 with the same edges for which we have already assigned a path to e-aux, we
 would have to cancel the interesting threads added by the patch:

   Registering jump thread: (12, 25) incoming edge;  (25, 26) joiner;  (26, 
 4) nocopy;
   Registering jump thread: (13, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (29, 25) incoming edge;  (25, 26) joiner;  (26, 
 4) nocopy;
   Registering jump thread: (31, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (16, 25) incoming edge;  (25, 26) joiner;  (26, 
 4) nocopy;
   Registering jump thread: (15, 25) incoming edge;  (25, 26) joiner;  

Re: [Patch] Improving jump-thread pass for PR 54742

2014-11-10 Thread Sebastian Pop
Hi Jeff,

I have adapted the code generation part from James' patch to current trunk, and
the resulting patch gets the 30% speedup on coremark and passes bootstrap of 
GCC.

Ok for trunk?

Thanks,
Sebastian

Sebastian Pop wrote:
 Sebastian Pop wrote:
  Jeff Law wrote:
   On 08/21/14 04:30, Richard Biener wrote:
   It turns Jeff's jump-threading code in to a strange franken-pass of 
   bits and
   pieces of detection and optimisation, and would need some substantial
   reworking to fit in with Jeff's changes last Autumn, but if it is more
   likely to be acceptable for trunk then perhaps we could look to revive 
   it.
   It would be nice to reuse the path copy code Jeff added last year, but I
   don't have much intuition as to how feasible that is.
   
   Was this the sort of thing that you were imagining?
   
   Yeah, didn't look too closely though.
   It'd be pretty ugly I suspect.  But it's probably worth pondering
   since that approach would eliminate the concerns about the cost of
   detection (which is problematical for the jump threader) by using
   Steve's code for that.
   
   On the update side, I suspect most, if not all of the framework is
   in place to handle this kind of update if the right threading paths
   were passed to the updater.  I can probably cobble together that
   by-hand and see what the tree-ssa-threadupdate does with it.  But
   it'll be a week or so before I could look at it.
  
  I adapted the patch James has sent last year to use the new update paths
 
 Attached an updated version of the patch.
 
  mechanism.  I verified that the attached patch does register all the paths 
  that
  need to be threaded.  Thread updater seems to have some problems handling 
  the
  attached testcase (a simplified version of the testcase attached to the 
  bug.)
  
  Jeff, could you please have a look at why the jump-thread updater is 
  crashing?
 
 I have tried to understand why the code generation part ICEs on coremark: the
 first problem that I have seen is that tree-ssa-threadupdate.c does not handle
 more than a joiner block per path to be threaded, so we would not be able to
 jump thread accross the joiners of the if condition and the joiner of the 
 switch
 condition: i.e., these paths
 
 patch:   Registering jump thread: (7, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 14) 
 nocopy;
 patch:   Registering jump thread: (28, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 11) 
 nocopy;
 patch:   Registering jump thread: (8, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 17) 
 nocopy;
 patch:   Registering jump thread: (9, 10) incoming edge;  (10, 25) joiner;  
 (25, 26) joiner;  (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 25) 
 nocopy;
 
 Another problem is that we attach the path to be threaded to the -aux field 
 of
 the first edge in the path, such that we would have to cancel some of the 
 paths
 because we cannot keep track of all the paths to be threaded.
 
 For coremark, we would discover some jump-thread paths from one of the switch
 cases over the loop exit condition, either to bb_27 outside the loop, or to 
 bb_4
 staying inside the loop.  Then with the patch: we would discover jump 
 threads
 that would thread switch cases to switch cases, and because these paths start
 with the same edges for which we have already assigned a path to e-aux, we
 would have to cancel the interesting threads added by the patch:
 
   Registering jump thread: (12, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (13, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (29, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (31, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (16, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (15, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (32, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (19, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (18, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (22, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (21, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (34, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (33, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
   Registering jump thread: (35, 25) incoming edge;  (25, 26) joiner;  (26, 
 27) nocopy;
   Registering jump thread: (24, 25) incoming edge;  (25, 26) joiner;  (26, 4) 
 nocopy;
 patch:   Registering jump thread: (12, 25)