Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-15 Thread Aldy Hernandez

On 06/15/12 06:40, Eric Botcazou wrote:

Whoops, I forgot to commit that last patch.  Check now.


The warning is there on the 4.7 branch now.



Arghhh, that's the second time.  I wonder why the warning doesn't show 
up on my bootstraps.


Anyway, committed the attached patch to branch.
Backport from mainline:
2012-05-31  Aldy Hernandez  
* tree-ssa-loop-im.c (execute_sm): Do not check flag_tm.
* gimple.h (block_in_transaction): Check for flag_tm.

Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 188631)
+++ tree-ssa-loop-im.c  (working copy)
@@ -2154,7 +2154,7 @@ execute_sm (struct loop *loop, VEC (edge
   fmt_data.orig_loop = loop;
   for_each_index (&ref->mem, force_move_till, &fmt_data);
 
-  if ((flag_tm && block_in_transaction (loop_preheader_edge (loop)->src))
+  if (block_in_transaction (loop_preheader_edge (loop)->src)
   || !PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
 multi_threaded_model_p = true;
 
Index: gimple.h
===
--- gimple.h(revision 188631)
+++ gimple.h(working copy)
@@ -1587,7 +1587,7 @@ gimple_set_has_volatile_ops (gimple stmt
 static inline bool
 block_in_transaction (basic_block bb)
 {
-  return bb->flags & BB_IN_TRANSACTION;
+  return flag_tm && bb->flags & BB_IN_TRANSACTION;
 }
 
 /* Return true if STMT is in a transaction.  */


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-15 Thread Eric Botcazou
> Whoops, I forgot to commit that last patch.  Check now.

The warning is there on the 4.7 branch now.

-- 
Eric Botcazou


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-01 Thread Aldy Hernandez

On 06/01/12 10:11, Aldy Hernandez wrote:

On 06/01/12 01:22, Tobias Burnus wrote:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion
[-Woverflow]
return bb->flags & BB_IN_TRANSACTION;
^


Is this still the case with the code currently in mainline:

return flag_tm && bb->flags & BB_IN_TRANSACTION;

??


Whoops, I forgot to commit that last patch.  Check now.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-01 Thread Aldy Hernandez

On 06/01/12 01:22, Tobias Burnus wrote:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion
[-Woverflow]
return bb->flags & BB_IN_TRANSACTION;
^


Is this still the case with the code currently in mainline:

  return flag_tm && bb->flags & BB_IN_TRANSACTION;

??


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-31 Thread Tobias Burnus

Aldy Hernandez wrote:

PR tree-optimization/52558
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* basic-block.h (alloc_aux_for_edge): Protoize.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(tree_ssa_lim_initialize): Call alloc_aux_for_edges.
(tree_ssa_lim_finalize): Call free_aux_for_edges.
* gimple.h (block_in_transaction): New.
(gimple_in_transaction): Use block_in_transaction.



Index: gimple.h
===
--- gimple.h(revision 188080)
+++ gimple.h(working copy)



+/* Return true if BB is in a transaction.  */
+
+static inline bool
+block_in_transaction (basic_block bb)
+{
+  return bb->flags & BB_IN_TRANSACTION;


The compiler does not seem to like the latter very much and thus warns 
for every file which includes gimple.h:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion 
[-Woverflow]

   return bb->flags & BB_IN_TRANSACTION;
^

Tobias


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-31 Thread Aldy Hernandez

On 05/29/12 06:13, Richard Guenther wrote:

On Mon, 21 May 2012, Aldy Hernandez wrote:


On 05/16/12 07:53, Richard Guenther wrote:

On Mon, 7 May 2012, Aldy Hernandez wrote:




(flag_tm && loop_preheader_edge (loop)->src->flags & BB_IN_TRANSACTION)

can you encapsulate this into a predicate?  Like block_in_transaction ()
that also checks flag_tm?


Done.. whoops, forgot to check for flag_tm.  I will move this into the 
predicate after I do another round of testing.  I missed this bit, and 
I've just committed.




+  /* ?? FIXME TESTING TESTING ?? */
+  multi_threaded_model_p=true;
+  /* ?? FIXME TESTING TESTING ?? */

that of course needs fixing ;)  (and re-testing)


This was here on purpose, so you'd see how I was testing.  I have 
committed the attached patch, not before testing with _and_ without the 
above FIXME.


Thanks so much for the review.  I will follow up with the flag_tm 
abstraction.


Closing the PR...

Aldy
PR tree-optimization/52558
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* basic-block.h (alloc_aux_for_edge): Protoize.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(tree_ssa_lim_initialize): Call alloc_aux_for_edges.
(tree_ssa_lim_finalize): Call free_aux_for_edges.
* gimple.h (block_in_transaction): New.
(gimple_in_transaction): Use block_in_transaction.

Index: testsuite/gcc.dg/pr52558-1.c
===
--- testsuite/gcc.dg/pr52558-1.c(revision 0)
+++ testsuite/gcc.dg/pr52558-1.c(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data > 0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  for (p = q; p; p = p->next)
+if (p->data > 0)
+  count++;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" 
} } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/pr52558-2.c
===
--- testsuite/gcc.dg/pr52558-2.c(revision 0)
+++ testsuite/gcc.dg/pr52558-2.c(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that g_2 is not written to unless !g_1.  */
+
+int g_1 = 1;
+int g_2 = 0;
+
+int func_1(void)
+{
+ int l;
+ for (l = 0; l < 1234; l++)
+ {
+   if (g_1)
+ return l;
+   else
+ g_2 = 0;
+ }
+ return 999;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM.*g_2_lsm_flag" 1 "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/tm/reg-promotion.c
===
--- testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
+++ testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data>0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  __transaction_atomic {
+for (p = q; p; p = p->next)
+  if (p->data > 0)
+   count++;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" 
} } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 188080)
+++ tree-ssa-loop-im.c  (working copy)
@@ -52,7 +52,7 @@ along with GCC; see the file COPYING3.
 }
  }
 
-   Where COND and INV are is invariants, but evaluating INV may trap or be
+   Where COND and INV are invariants, but evaluating INV may trap or be
invalid from some other reason if !COND.  This may be transformed to
 
if (cond)
@@ -1626,6 +1626,7 @@ gather_mem_refs_stmt (struct loop *loop,
  fprintf (dump_file, "\n");
}
 }
+
   if (is_stored)
 mark_ref_stored (ref, loop);
 
@@ -1956,6 +1957,173 @@ get_lsm_tmp_name (tree ref, unsigned n)
   return lsm_tmp_name;
 }
 
+struct prev_flag_edges {
+  /* Edge to insert new flag comparison code.  */
+  edge append_cond_position;
+
+  /* Edge for fall through from previous flag comparison.  */
+  edge last_cond_fallthru;
+};
+
+/* Helper function for execute_sm.  Emit code to store TMP_VAR into
+   MEM along edge EX.
+
+   The store is only done if MEM has changed.  We do this so no
+   changes to MEM occur on code paths that did not originally store
+   into it.
+
+   The common case for execute_sm will transform:
+
+ for (...) {
+   if (foo)
+

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-29 Thread Richard Guenther
On Mon, 21 May 2012, Aldy Hernandez wrote:

> On 05/16/12 07:53, Richard Guenther wrote:
> > On Mon, 7 May 2012, Aldy Hernandez wrote:
> 
> [Sorry for the delay; I was on vacation.]
> 
> I am forgoing the load avoidance code altogether to simplify things. Thanks.
> 
> > +  /* Emit the load code into the latch, so that we are sure it will
> > + be processed after all dependencies.  */
> > +  latch_edge = loop_latch_edge (loop);
> >
> > inserting code into the latch block is bad - the vectorizer will
> > be confused by non-empty latches.
> 
> The code as is on trunk inserts the load on the latch.

Does it?  Ok ...

>  That's why I also
> inserted the supporting flag checking code there.  Do you want me to put the
> supporting code somewhere else?

Technically there isn't a good other place that does not add a
partial redundancy.

> > Your ChangeLog mentions changes that are no longer necessary
> > (the target hook).
> 
> Whoops.  Fixed.
> 
> >
> > I didn't quite get the store order issue - we _do_ preserve store
> > ordering right now, do we?  So how come introducing the flags
> > change that?
> 
> The current scheme on trunk works because it inserts one load with
> gsi_insert_on_edge() and subsequent ones get appended correctly, whereas my
> patch has to split the edge (which happens at the top of the block), so
> subsequent code inserts happen in reverse order (at the top).  If I remember
> correctly, that is... I can look again and report if it's still unclear.

Hmm, the old code (well, and the new one ...) walks stores to move
by walking a bitmap.  I suppose we rely on computing uids in the
original program order here then.

(flag_tm && loop_preheader_edge (loop)->src->flags & BB_IN_TRANSACTION)

can you encapsulate this into a predicate?  Like block_in_transaction ()
that also checks flag_tm?

+  /* ?? FIXME TESTING TESTING ?? */
+  multi_threaded_model_p=true;
+  /* ?? FIXME TESTING TESTING ?? */

that of course needs fixing ;)  (and re-testing)


> New patch attached.  Tested on x86-64 Linux.  No regressions.

Ok with the new predicate in basic-block.h and re-testing without the
fixme above.

Thanks,
Richard.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-21 Thread Aldy Hernandez

On 05/16/12 07:53, Richard Guenther wrote:

On Mon, 7 May 2012, Aldy Hernandez wrote:


[Sorry for the delay; I was on vacation.]

I am forgoing the load avoidance code altogether to simplify things. 
Thanks.



+  /* Emit the load code into the latch, so that we are sure it will
+ be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);

inserting code into the latch block is bad - the vectorizer will
be confused by non-empty latches.


The code as is on trunk inserts the load on the latch.  That's why I 
also inserted the supporting flag checking code there.  Do you want me 
to put the supporting code somewhere else?



Your ChangeLog mentions changes that are no longer necessary
(the target hook).


Whoops.  Fixed.



I didn't quite get the store order issue - we _do_ preserve store
ordering right now, do we?  So how come introducing the flags
change that?


The current scheme on trunk works because it inserts one load with 
gsi_insert_on_edge() and subsequent ones get appended correctly, whereas 
my patch has to split the edge (which happens at the top of the block), 
so subsequent code inserts happen in reverse order (at the top).  If I 
remember correctly, that is... I can look again and report if it's still 
unclear.


New patch attached.  Tested on x86-64 Linux.  No regressions.

Thanks.
Aldy
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* basic-block.h (alloc_aux_for_edge): Protoize.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(tree_ssa_lim_initialize): Call alloc_aux_for_edges.
(tree_ssa_lim_finalize): Call free_aux_for_edges.

Index: testsuite/gcc.dg/tm/reg-promotion.c
===
--- testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
+++ testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data>0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  __transaction_atomic {
+for (p = q; p; p = p->next)
+  if (p->data > 0)
+   count++;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" 
} } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/pr52558-1.c
===
--- testsuite/gcc.dg/pr52558-1.c(revision 0)
+++ testsuite/gcc.dg/pr52558-1.c(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data > 0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  for (p = q; p; p = p->next)
+if (p->data > 0)
+  count++;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" 
} } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/pr52558-2.c
===
--- testsuite/gcc.dg/pr52558-2.c(revision 0)
+++ testsuite/gcc.dg/pr52558-2.c(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that g_2 is not written to unless !g_1.  */
+
+int g_1 = 1;
+int g_2 = 0;
+
+int func_1(void)
+{
+ int l;
+ for (l = 0; l < 1234; l++)
+ {
+   if (g_1)
+ return l;
+   else
+ g_2 = 0;
+ }
+ return 999;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM.*g_2_lsm_flag" 1 "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: basic-block.h
===
--- basic-block.h   (revision 187729)
+++ basic-block.h   (working copy)
@@ -802,6 +802,7 @@ extern basic_block alloc_block (void);
 extern void alloc_aux_for_blocks (int);
 extern void clear_aux_for_blocks (void);
 extern void free_aux_for_blocks (void);
+extern void alloc_aux_for_edge (edge, int);
 extern void alloc_aux_for_edges (int);
 extern void clear_aux_for_edges (void);
 extern void free_aux_for_edges (void);
Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 187729)
+++ tree-ssa-loop-im.c  (working copy)
@@ -52,7 +52,7 @@ along with GCC; see the file COPYING3.
 }
  }
 
-   Where COND and INV are is invariants, but evaluating INV may trap or be
+   Where COND and INV are invariants, but evaluating INV may trap or be
invalid from some other reason if !COND.  This may be transformed to
 
if (cond)
@@ -1626,6 +1626,7

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-16 Thread Richard Guenther
On Mon, 7 May 2012, Aldy Hernandez wrote:

> Hi.  Sorry for the delay.  There were various tricky hiccups along the way to
> bootstrappability and regression cleanliness...
> 
> On 04/26/12 04:51, Richard Guenther wrote:
> > On Wed, 25 Apr 2012, Aldy Hernandez wrote:
> >
> > > On 04/25/12 06:45, Richard Guenther wrote:
> > > > On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez
> > > > wrote:
> > > > > On 04/13/12 03:46, Richard Guenther wrote:
> > > > > >
> > > > > > On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez
> > > > > > wrote:
> 
> > > Speak of loads, I am keeping the information as an additional bitmap in
> > > `memory_accesses', as ->refs_in_loop was set for stores as well, so I
> > > couldn't
> > > depend on it.  Let me know if you have another idea.
> >
> > Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of
> >
> > +  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
> > +   loop->num);
> > +  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);
> >
> > ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
> > (stores, ref->id);
> >
> > ?  But maybe that doesn't work if a ref is both read and stored to.
> > Btw, rather than adding a bitmap to memory_accesses I'd rather add
> > a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
> > both into one) and a bitmap to struct mem_ref.
> 
> I have done so as mark_ref_loaded().  I first tried to merge both into a
> generic mark_ref(), to avoid iterating twice, but I was concerned with the
> early loop exit of !bitmap_bit_p(ref->stored, loop->num), not knowing if I
> should exit the loop altogether in the merged case or what.  In either case,
> the code was somewhat convoluted, so I opted for a separate clean function.
> 
> In scanning the gimple stream for the loads I noticed that instances of two
> component refs (say foo.bar) were not pointer comparable, so I am asking the
> alias oracle with refs_may_alias_p.  It also seemed to make more sense wrt
> memory chunks that may alias.  What do you think?

No, it asks for equal must aliases (it will replace them after all!).
You should use operand_equal_p to compare reference trees.

But why do all the following dance

+
+  if (gimple_assign_single_p (stmt))
+{
+  tree t = gimple_assign_rhs1 (stmt);
+  /* ?? For some reason two exact COMPONENT_REFs cannot be
+compared with pointer equality, so ask the alias oracle.  */
+  if (ref->mem == t
+ || ((TREE_CODE (t) == SSA_NAME
+  || DECL_P (t)
+  || handled_component_p (t)
+  || TREE_CODE (t) == MEM_REF
+  || TREE_CODE (t) == TARGET_MEM_REF)
+ && refs_may_alias_p (t, ref->mem)))
+   mark_ref_loaded (ref, loop);
+}

at all and not just

   if (is_stored)
 mark_ref_stored (ref, loop);
   else
 mark_ref_loaded (ref, loop);

and similar in the !mem case mark the ref loaded unconditionally:

   mark_ref_loaded (ref, loop);
   if (gimple_vdef (stmt))
 mark_ref_stored (ref, loop);

> This patch is far more robust and fully tested.  Two wrenches to complicate
> things though...
> 
> First, while inserting the code writing the _LSM_ temporary writes into the
> original variables, I found a handful of tests where the order of the writes
> mattered (aliased locations, inlined code, etc, IIRC).  [This is in loops
> where multiple stores are moved.]  So iterating and inserting on the exit
> edges caused the stores to be saved in reverse.  I am now saving the edges and
> threading the code, so they happen in the correct order:
> 
>   if (foo_flag_lsm)
>   foo = foo_lsm;
>   if (foo2_flag_lsm)
>   foo2 = foo2_lsm;
>   actual_exit:
> 
> This required some edge redirection so we didn't jump to the original actual
> exit prematurely when handling multiple moved stores.  The dominator tree
> needed to be updated, and there is one instance where I couldn't find a clever
> way of saving the order without jumping through an inordinate amount of hoops.
> It seemed easier to call recompute_dominator.
> 
> Second, avoiding the load, unless it happens in the loop like you have
> suggested, brought about some bogus unused warnings with the LSM temporaries.
> What happens is that compute_ssa() creates uses of the LSM temporaries that
> are not yet defined, so we end up with PHI nodes with a definition entry (D).
> Since we are sure the eventual uses will be gated by their corresponding flag,
> I created phi nodes with an initial value of 0 in the preheader of the loops
> in which they are used.  This solved the problem.  I also played with
> initializing to 0 at the entry block, but that seemed a bit too invasive.

Hm.  Btw, see also PR39612 which you could solve with that?

> Andrew suggested the correct fix was to add a new pass that was able to do
> some ?? flow sensitive data flow analysis ?? that could discover these
> unreac

PING: Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-15 Thread Aldy Hernandez

PING.


Hi. Sorry for the delay. There were various tricky hiccups along the way
to bootstrappability and regression cleanliness...

On 04/26/12 04:51, Richard Guenther wrote:

On Wed, 25 Apr 2012, Aldy Hernandez wrote:


On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez
wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez
wrote:



Speak of loads, I am keeping the information as an additional bitmap in
`memory_accesses', as ->refs_in_loop was set for stores as well, so I
couldn't
depend on it. Let me know if you have another idea.


Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+ bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+ loop->num);
+ ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
(stores, ref->id);

? But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.


I have done so as mark_ref_loaded(). I first tried to merge both into a
generic mark_ref(), to avoid iterating twice, but I was concerned with
the early loop exit of !bitmap_bit_p(ref->stored, loop->num), not
knowing if I should exit the loop altogether in the merged case or what.
In either case, the code was somewhat convoluted, so I opted for a
separate clean function.

In scanning the gimple stream for the loads I noticed that instances of
two component refs (say foo.bar) were not pointer comparable, so I am
asking the alias oracle with refs_may_alias_p. It also seemed to make
more sense wrt memory chunks that may alias. What do you think?

This patch is far more robust and fully tested. Two wrenches to
complicate things though...

First, while inserting the code writing the _LSM_ temporary writes into
the original variables, I found a handful of tests where the order of
the writes mattered (aliased locations, inlined code, etc, IIRC). [This
is in loops where multiple stores are moved.] So iterating and inserting
on the exit edges caused the stores to be saved in reverse. I am now
saving the edges and threading the code, so they happen in the correct
order:

if (foo_flag_lsm)
foo = foo_lsm;
if (foo2_flag_lsm)
foo2 = foo2_lsm;
actual_exit:

This required some edge redirection so we didn't jump to the original
actual exit prematurely when handling multiple moved stores. The
dominator tree needed to be updated, and there is one instance where I
couldn't find a clever way of saving the order without jumping through
an inordinate amount of hoops. It seemed easier to call
recompute_dominator.

Second, avoiding the load, unless it happens in the loop like you have
suggested, brought about some bogus unused warnings with the LSM
temporaries. What happens is that compute_ssa() creates uses of the LSM
temporaries that are not yet defined, so we end up with PHI nodes with a
definition entry (D). Since we are sure the eventual uses will be gated
by their corresponding flag, I created phi nodes with an initial value
of 0 in the preheader of the loops in which they are used. This solved
the problem. I also played with initializing to 0 at the entry block,
but that seemed a bit too invasive.

Andrew suggested the correct fix was to add a new pass that was able to
do some ?? flow sensitive data flow analysis ?? that could discover
these unreachable paths and insert the 0 phis at the start of the blocks
automatically. But that seemed like far too much work, considering how
long it's taken me to get this far ;-).

Bootstrapped and fully tested with multi_threaded_model_p=true 
hard-coded. This is the revision I'd like to contribute, sans the
hardcoded value.

What do you think?




Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-08 Thread Aldy Hernandez

On 05/07/12 19:11, Andrew MacLeod wrote:

On 05/07/2012 07:04 PM, Aldy Hernandez wrote:



Andrew suggested the correct fix was to add a new pass that was able
to do some ?? flow sensitive data flow analysis ?? that could discover
these unreachable paths and insert the 0 phis at the start of the
blocks automatically. But that seemed like far too much work,
considering how long it's taken me to get this far ;-).



Wait, no. I didn't suggest writing an entire generic pass was the
correct fix for you... I said that this was a technique that a generic
pass which identified this sort of flow sensitive data flow info could
use to work within the CFG. Simply zeroing out uses of the load in PHI
nodes on paths it has determined are not actually reachable by that
value, and you were supposed to integrate just the bits of that technique
that you required... just replace any uses of your LSM temporary with 0.
I never intended you should write an entire pass...


Ah, well in that case, I've already done that.  Well I don't do it on 
any path, just in the loop header, and things propagate down from there 
quite nicely :).


Aldy


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-07 Thread Andrew MacLeod

On 05/07/2012 07:04 PM, Aldy Hernandez wrote:



Andrew suggested the correct fix was to add a new pass that was able 
to do some ?? flow sensitive data flow analysis ?? that could discover 
these unreachable paths and insert the 0 phis at the start of the 
blocks automatically.  But that seemed like far too much work, 
considering how long it's taken me to get this far ;-).




Wait, no. I didn't suggest writing an entire generic pass was the 
correct fix for you... I said that this was a technique that a generic 
pass  which identified this sort of flow sensitive data flow info could 
use to work within the CFG.  Simply zeroing out uses of the load in PHI 
nodes on paths it has determined are not actually reachable by that 
value, and you were supposed to integrate just the bits of that technique 
that you required... just replace any uses of your LSM temporary with 
0.  I never intended you should write an entire pass...


Andrew



Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-07 Thread Aldy Hernandez
Hi.  Sorry for the delay.  There were various tricky hiccups along the 
way to bootstrappability and regression cleanliness...


On 04/26/12 04:51, Richard Guenther wrote:

On Wed, 25 Apr 2012, Aldy Hernandez wrote:


On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez   wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez
wrote:



Speak of loads, I am keeping the information as an additional bitmap in
`memory_accesses', as ->refs_in_loop was set for stores as well, so I couldn't
depend on it.  Let me know if you have another idea.


Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+   loop->num);
+  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
(stores, ref->id);

?  But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.


I have done so as mark_ref_loaded().  I first tried to merge both into a 
generic mark_ref(), to avoid iterating twice, but I was concerned with 
the early loop exit of !bitmap_bit_p(ref->stored, loop->num), not 
knowing if I should exit the loop altogether in the merged case or what. 
 In either case, the code was somewhat convoluted, so I opted for a 
separate clean function.


In scanning the gimple stream for the loads I noticed that instances of 
two component refs (say foo.bar) were not pointer comparable, so I am 
asking the alias oracle with refs_may_alias_p.  It also seemed to make 
more sense wrt memory chunks that may alias.  What do you think?


This patch is far more robust and fully tested.  Two wrenches to 
complicate things though...


First, while inserting the code writing the _LSM_ temporary writes into 
the original variables, I found a handful of tests where the order of 
the writes mattered (aliased locations, inlined code, etc, IIRC).  [This 
is in loops where multiple stores are moved.]  So iterating and 
inserting on the exit edges caused the stores to be saved in reverse.  I 
am now saving the edges and threading the code, so they happen in the 
correct order:


if (foo_flag_lsm)
foo = foo_lsm;
if (foo2_flag_lsm)
foo2 = foo2_lsm;
actual_exit:

This required some edge redirection so we didn't jump to the original 
actual exit prematurely when handling multiple moved stores.  The 
dominator tree needed to be updated, and there is one instance where I 
couldn't find a clever way of saving the order without jumping through 
an inordinate amount of hoops.  It seemed easier to call 
recompute_dominator.


Second, avoiding the load, unless it happens in the loop like you have 
suggested, brought about some bogus unused warnings with the LSM 
temporaries.  What happens is that compute_ssa() creates uses of the LSM 
temporaries that are not yet defined, so we end up with PHI nodes with a 
definition entry (D).  Since we are sure the eventual uses will be gated 
by their corresponding flag, I created phi nodes with an initial value 
of 0 in the preheader of the loops in which they are used.  This solved 
the problem.  I also played with initializing to 0 at the entry block, 
but that seemed a bit too invasive.
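
At the source level, the workaround amounts to something like the sketch
below for the func_1 testcase; g_2_lsm and g_2_lsm_flag are illustrative
names mirroring the _lsm temporaries in the lim1 dumps, not the exact code
the pass emits.

int g_1 = 1;
int g_2 = 0;

int
func_1_sketch (void)
{
  int g_2_lsm = 0;        /* dummy 0 before the loop: every use now has a
                             dominating definition, so no (D) PHI arguments */
  int g_2_lsm_flag = 0;
  int l;

  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        {
          if (g_2_lsm_flag)     /* write back only if the loop stored */
            g_2 = g_2_lsm;
          return l;
        }
      g_2_lsm = 0;
      g_2_lsm_flag = 1;
    }
  if (g_2_lsm_flag)
    g_2 = g_2_lsm;
  return 999;
}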


Andrew suggested the correct fix was to add a new pass that was able to 
do some ?? flow sensitive data flow analysis ?? that could discover 
these unreachable paths and insert the 0 phis at the start of the blocks 
automatically.  But that seemed like far too much work, considering how 
long it's taken me to get this far ;-).


Bootstrapped and fully tested with multi_threaded_model_p=true 
hard-coded.  This is the revision I'd like to contribute, sans the 
hardcoded value.


What do you think?
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* gimple.h (gimple_set_in_transaction): Remove.
(gimple_in_transaction): Look in BB instead.
(gimple_statement_base): Remove in_transaction field.
* basic-block.h (enum bb_flags): Add BB_IN_TRANSACTION.
(alloc_aux_for_edge): Protoize.
* trans-mem.c (compute_transaction_bits): Place transaction bit
information into basic blocks.
* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
* doc/tm.texi: Regenerate.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(mark_ref): Rename from mark_ref_stored.  Set loaded bit.
(mem_ref_alloc): Initialize loaded field.
(memref_free): F

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-26 Thread Richard Guenther
On Wed, 25 Apr 2012, Aldy Hernandez wrote:

> On 04/25/12 06:45, Richard Guenther wrote:
> > On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez  wrote:
> > > On 04/13/12 03:46, Richard Guenther wrote:
> > > >
> > > > On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez
> > > > wrote:
> 
> > +  /* ?? Perhaps we should cache this somewhere in the BB, or are
> > + multiple levels of empty BB's common. ??  */
> > +  FOR_EACH_EDGE (e, ei, bb->preds)
> > +{
> > +  int res = bb_in_transaction_1 (e->src);
> > +  if (res != -1)
> > +   return (bool) res;
> > +}
> > +  return false;
> >
> > that's weird code - if predecessors are not all in or not in a transaction
> > you return a random value.  So it seems this is somewhat fragile.
> >
> > If bb status can be derived from looking at a single stmt why is the
> > transaction-ness not recorded as a basic-block flag then?  Thus,
> > why do we have a bit in gimple stmts?
> 
> How about we get rid of the GIMPLE bit altogether and keep it in the BB flags
> like you mention?  See patch.

Yeah, that looks good.

> > +  bool single_exit_p = single_pred_p (ex->dest);
> >
> > that's a strange variable name ;)
> 
> How about loop_has_only_one_exit? :)

Fine ;)

> >
> > +/* Helper function for execute_sm.  On every path that sets REF, set
> > +   an appropriate flag indicating the store.  */
> > +
> > +static tree
> > +execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
> > +{
> > ...
> > +  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
> > +{
> > +  gimple_stmt_iterator gsi;
> > +  gimple stmt;
> > +
> > +  gsi = gsi_start_bb (gimple_bb (loc->stmt));
> > +  for (; !gsi_end_p (gsi); gsi_next (&gsi))
> > +   if (gsi_stmt (gsi) == loc->stmt)
> >
> > does not seem to do it on every path but on every REF setter.  And instead
> > of the innermost loop just use gsi_for_stmt (loc->stmt).
> 
> Updated comment.  Fixed.
> 
> >
> > +  /* Emit the load code into the latch, so that we are sure it will
> > + be processed after all dependencies.  */
> > +  latch_edge = loop_latch_edge (loop);
> > +  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
> > lim_data = init_lim_data (load);
> > lim_data->max_loop = loop;
> > lim_data->tgt_loop = loop;
> > +  gsi_insert_on_edge (latch_edge, load);
> > +  load = gimple_build_assign (tmp_var, mem_ssa);
> > +  lim_data = init_lim_data (load);
> > +  lim_data->max_loop = loop;
> > +  lim_data->tgt_loop = loop;
> > +  gsi_insert_on_edge (latch_edge, load);
> >
> > the 2nd assign is no longer a "load", so I suppose you don't need any
> > lim_data
> > for it?
> 
> Re-did all this.  See patch.
> 
> >
> > +  else if (store == STORE_IF_CHANGED_FLAG)
> > +   execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
> > +  store_flag);
> >
> > and this can pass NULL instead of mem_ssa?
> 
> Got rid of mem_ssa altogether, as well as the if-changed approach which proved
> inferior to the flags based approach.

Good.

> > Overall this looks reasonable.  With also handling trapping things I meant
> > that we should be able to apply store-motion to
> >
> > int a[256];
> > void foo (int *p)
> > {
> >int i;
> >for (i= 0;i<256;i++)
> >  if (flag)
> >*p = a[i];
> > }
> >
> > but we cannot, because the transform
> >
> >lsm = *p;
>> >>    for (i = 0; i < 256; ++i)
> >  if (flag)
> >lsm = a[i];
> >*p = lsm;
> >
> > even when the store is properly conditional the load is not (and you do not
> > change that).  The unconditional load may trap here.  What I meant to say
> > is that we do not need the load at all unless there is also a load involved
> > inside the loop - if we use the flag-based approach.  If there is also a
> > load
> > inside the loop we'd need to perform the load there (or not handle this
> > case)
> 
> Ahh!  Yes, this sounds very profitable.  Would you mind me doing this in a
> separate patch?  I am already handling loads in this incarnation, so this
> should be straightforward.

No, it's fine to do it separately.

> Speak of loads, I am keeping the information as an additional bitmap in
> `memory_accesses', as ->refs_in_loop was set for stores as well, so I couldn't
> depend on it.  Let me know if you have another idea.

Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+   loop->num);
+  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p 
(stores, ref->id);

?  But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.

> Mildly tested.  Wadaya think?

Looks good overall, minus the detail of remembering refs loaded.

Richard.


> Thanks a

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-25 Thread Aldy Hernandez

On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez  wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez wrote:



+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+ multiple levels of empty BB's common. ??  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+{
+  int res = bb_in_transaction_1 (e->src);
+  if (res != -1)
+   return (bool) res;
+}
+  return false;

that's weird code - if predecessors are not all in or not in a transaction
you return a random value.  So it seems this is somewhat fragile.

If bb status can be derived from looking at a single stmt why is the
transaction-ness not recorded as a basic-block flag then?  Thus,
why do we have a bit in gimple stmts?


How about we get rid of the GIMPLE bit altogether and keep it in the BB 
flags like you mention?  See patch.



+  bool single_exit_p = single_pred_p (ex->dest);

that's a strange variable name ;)


How about loop_has_only_one_exit? :)



+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+{
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+
+  gsi = gsi_start_bb (gimple_bb (loc->stmt));
+  for (; !gsi_end_p (gsi); gsi_next (&gsi))
+   if (gsi_stmt (gsi) == loc->stmt)

does not seem to do it on every path but on every REF setter.  And instead
of the innermost loop just use gsi_for_stmt (loc->stmt).


Updated comment.  Fixed.



+  /* Emit the load code into the latch, so that we are sure it will
+ be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
lim_data = init_lim_data (load);
lim_data->max_loop = loop;
lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);

the 2nd assign is no longer a "load", so I suppose you don't need any lim_data
for it?


Re-did all this.  See patch.



+  else if (store == STORE_IF_CHANGED_FLAG)
+   execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+  store_flag);

and this can pass NULL instead of mem_ssa?


Got rid of mem_ssa altogether, as well as the if-changed approach which 
proved inferior to the flags based approach.



Overall this looks reasonable.  With also handling trapping things I meant
that we should be able to apply store-motion to

int a[256];
void foo (int *p)
{
   int i;
   for (i= 0;i<256;i++)
 if (flag)
   *p = a[i];
}

but we cannot, because the transform

   lsm = *p;
   for (i = 0; i < 256; ++i)
 if (flag)
   lsm = a[i];
   *p = lsm;

even when the store is properly conditional the load is not (and you do not
change that).  The unconditional load may trap here.  What I meant to say
is that we do not need the load at all unless there is also a load involved
inside the loop - if we use the flag-based approach.  If there is also a load
inside the loop we'd need to perform the load there (or not handle this
case)


Ahh!  Yes, this sounds very profitable.  Would you mind me doing this in 
a separate patch?  I am already handling loads in this incarnation, so 
this should be straightforward.
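
For your a[256] example, omitting the pre-loop load would look roughly like
this at the source level (a sketch only; `lsm' and `lsm_flag' are
illustrative names):

int a[256];
int flag;

void
foo (int *p)
{
  int lsm = 0;          /* no `lsm = *p' before the loop, so nothing can
                           trap if the store never happens */
  int lsm_flag = 0;
  int i;

  for (i = 0; i < 256; i++)
    if (flag)
      {
        lsm = a[i];
        lsm_flag = 1;
      }
  if (lsm_flag)
    *p = lsm;           /* p is dereferenced only if the loop stored */
}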


Speak of loads, I am keeping the information as an additional bitmap in 
`memory_accesses', as ->refs_in_loop was set for stores as well, so I 
couldn't depend on it.  Let me know if you have another idea.


Mildly tested.  Wadaya think?

Thanks again.
Aldy
* gimple.h (gimple_set_in_transaction): Remove.
(gimple_in_transaction): Look in BB instead.
(gimple_statement_base): Remove in_transaction field.
* basic-block.h (enum bb_flags): Add BB_IN_TRANSACTION.
* trans-mem.c (compute_transaction_bits): Place transaction bit
information into basic blocks.
* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
* doc/tm.texi: Regenerate.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
* targhooks.c (default_can_compare_bitwise_p): New.
* targhooks.h (default_can_compare_bitwise_p): Protoize.
* target.def (can_compare_bitwise_p): New hook.
* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
conditional.
* opts.c (finish_options): Do not allow store or load data races
with -fgnu-tm.

Index: trans-mem.c
===
--- trans-mem.c (revision 186108)
+++ tran

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-25 Thread Richard Guenther
On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez  wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez  wrote:
>
>
> Richard.  Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>
>> A similar effect could be obtained by keeping a flag whether we entered
>> the
>> path that stored the value before the transform.  Thus, do
>>
>>   lsm = g2;  // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>        if (g1)
>>          {
>>             if (flag)
>>               g2 = lsm;
>>          }
>>        else
>>          {
>>             lsm = 0;
>>             flag = true;
>>          }
>>     }
>>   if (flag)
>>     g2 = lsm;
>
>
> I have implemented this in the attached patch, and it seems to be generating
> better code than my other if-changed approach.  It also avoids generating
> unnecessary loads that may trap.
>
> For the original testcase:
>
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>  int l;
>  for (l = 0; l < 1234; l++)
>  {
>   if (g_1)
>     return l;
>   else
>     g_2 = 0;
>  }
>  return 999;
> }
>
> After all optimizations are done, we are now generating the following for
> the flags approach:
>
> new:
> func_1:
>        movl    g_1(%rip), %edx
>        xorl    %eax, %eax
>        testl   %edx, %edx
>        je      .L5
>        rep
>        ret
> .L5:
>        movl    $0, g_2(%rip)
>        movw    $999, %ax
>        ret
>
> Which is significantly better than the unmodified trunk of:
>
> func_1:
>        movl    g_1(%rip), %edx
>        movl    g_2(%rip), %eax
>        testl   %edx, %edx
>        je      .L5
>        movl    %eax, g_2(%rip)
>        xorl    %eax, %eax
>        ret
> .L5:
>        movl    $0, g_2(%rip)
>        movl    $999, %eax
>        ret
>
> And in other less contrived cases, we generate correct code that avoids
> writes on code paths that did not have it.  For example, in Hans register
> promotion example:
>
> int count;
>
> struct obj {
>    int data;
>    struct obj *next;
> } *q;
>
> void func()
> {
>  struct obj *p;
>  for (p = q; p; p = p->next)
>        if (p->data > 0)
>                count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a register
> and unilaterally writing to it upon exiting the loop.
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>        movq    q(%rip), %rax
>        xorl    %ecx, %ecx
>        movl    count(%rip), %edx
>        testq   %rax, %rax
>        je      .L1
> .L9:
>        movl    (%rax), %esi
>        testl   %esi, %esi
>        jle     .L3
>        addl    $1, %edx
>        movl    $1, %ecx
> .L3:
>        movq    8(%rax), %rax
>        testq   %rax, %rax
>        jne     .L9
>        testb   %cl, %cl
>        jne     .L12
> .L1:
>        rep
>        ret
> .L12:
>        movl    %edx, count(%rip)
>        ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>
>>    lsm = g2;  // technically not needed, if you also handle loads
>>
>> that would allow to extend this to cover the cases where the access may
>>
>> eventually trap (if you omit the load before the loop).
>
>
> Can't I just remove the lsm=g2?  What's this about also handling loads?
>
>
>>
>> Not sure which is more efficient (you can eventually combine flag
>> variables
>> for different store locations, combining the if-changed compare is not so
>> easy
>> I guess).
>
>
> So far, I see better code generated with the flags approach, so...
>
>
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
>>> g_2.
>>>  In the PR, Richard mentions that both FRE/PRE can figure this out, but
>>> they
>>> are not run after store motion.  DOM should also be able to, but
>>> apparently
>>> does not :(.  Tips?
>>
>>
>> Well.  Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it was
>> never
>> updated really after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>>> with complex floats/doubles.  do_jump is complaining that it can't
>>> compare
>>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare
>>> complex
>>> values since complex lowering has already happened (??). Is there a more
>>> canonical way of creating a GIMPLE_COND that may contain complex floats
>>> at
>>> this stage?
>>
>>
>> As you are handling redundant stores you want a bitwise comparison anyway,
>> but yes, complex compares need to be lowered.  But as said, you want a
>> bitwise comparison, not a value-comparison.  You can try using
>
>
> I can ignore all this.  I have implemented both alternatives, with a target
> hook as suggested, but I'm thinking of removing it altogether and just
> leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +         new_bb = split_edge (ex);
>> +         then_bb = create_empty_bb (new_b

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-24 Thread Aldy Hernandez

On 04/13/12 03:46, Richard Guenther wrote:

On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez  wrote:


Richard.  Thanks so much for reviewing and providing an alternative 
approach, which AFAICT provides superior results.



A similar effect could be obtained by keeping a flag whether we entered the
path that stored the value before the transform.  Thus, do

   lsm = g2;  // technically not needed, if you also handle loads
   flag = false;
   for (...)
     {
        if (g1)
          {
             if (flag)
               g2 = lsm;
          }
        else
          {
             lsm = 0;
             flag = true;
          }
     }
   if (flag)
     g2 = lsm;


I have implemented this in the attached patch, and it seems to be 
generating better code than my other if-changed approach.  It also 
avoids generating unnecessary loads that may trap.


For the original testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
 int l;
 for (l = 0; l < 1234; l++)
 {
   if (g_1)
 return l;
   else
 g_2 = 0;
 }
 return 999;
}

After all optimizations are done, we are now generating the following 
for the flags approach:


new:
func_1:
movlg_1(%rip), %edx
xorl%eax, %eax
testl   %edx, %edx
je  .L5
rep
ret
.L5:
movl$0, g_2(%rip)
movw$999, %ax
ret

Which is significantly better than the unmodified trunk of:

func_1:
movlg_1(%rip), %edx
movlg_2(%rip), %eax
testl   %edx, %edx
je  .L5
movl%eax, g_2(%rip)
xorl%eax, %eax
ret
.L5:
movl$0, g_2(%rip)
movl$999, %eax
ret

And in other less contrived cases, we generate correct code that avoids 
writes on code paths that did not have it.  For example, in Hans 
register promotion example:


int count;

struct obj {
int data;
struct obj *next;
} *q;

void func()
{
  struct obj *p;
  for (p = q; p; p = p->next)
if (p->data > 0)
count++;
}

Under the new memory model we should avoid promoting "count" to a 
register and unilaterally writing to it upon exiting the loop.


With the attached patch, we now generate:

func:
.LFB0:
movqq(%rip), %rax
xorl%ecx, %ecx
movlcount(%rip), %edx
testq   %rax, %rax
je  .L1
.L9:
movl(%rax), %esi
testl   %esi, %esi
jle .L3
addl$1, %edx
movl$1, %ecx
.L3:
movq8(%rax), %rax
testq   %rax, %rax
jne .L9
testb   %cl, %cl
jne .L12
.L1:
rep
ret
.L12:
movl%edx, count(%rip)
ret

Which is as expected.

I don't understand your suggestion of:

> lsm = g2;  // technically not needed, if you also handle loads

that would allow to extend this to cover the cases where the access may
eventually trap (if you omit the load before the loop).


Can't I just remove the lsm=g2?  What's this about also handling loads?



Not sure which is more efficient (you can eventually combine flag variables
for different store locations, combining the if-changed compare is not so easy
I guess).


So far, I see better code generated with the flags approach, so...


1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
  In the PR, Richard mentions that both FRE/PRE can figure this out, but they
are not run after store motion.  DOM should also be able to, but apparently
does not :(.  Tips?


Well.  Schedule FRE after loop opts ...

DOM is not prepared to handle loads/stores with differing VOPs - it was never
updated really after the alias-improvements merge.


2. The GIMPLE_CONDs I create in this patch are currently causing problems
with complex floats/doubles.  do_jump is complaining that it can't compare
them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
values since complex lowering has already happened (??). Is there a more
canonical way of creating a GIMPLE_COND that may contain complex floats at
this stage?


As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered.  But as said, you want a
bitwise comparison, not a value-comparison.  You can try using


I can ignore all this.  I have implemented both alternatives, with a 
target hook as suggested, but I'm thinking of removing it altogether and 
just leaving the flags approach.
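
For reference, the bitwise-vs-value point above can be illustrated with a
small sketch (bits_differ is a made-up helper, not anything in the tree):

#include <string.h>

/* A value compare is the wrong test for "did the store change memory":
   -0.0 == 0.0 although the bit patterns differ, and a NaN compares
   unequal to itself although its bits are unchanged.  A bitwise test
   avoids both problems.  */
static int
bits_differ (const void *a, const void *b, size_t len)
{
  return memcmp (a, b, len) != 0;
}

/* e.g.  if (bits_differ (&g_2_lsm, &g_2, sizeof g_2)) g_2 = g_2_lsm;  */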



Few comments on the patch itself.

+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->loop_father);

+ gsi = gsi_start_bb (new_bb);
+ t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);

Hmm, why do you need to re-load the value?  Don't you have both
the value as it was loaded before the loop and the new value (the lsm tmp reg)
already?  Why not simply remember the SSA name us

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-17 Thread Aldy Hernandez

On 04/13/12 18:22, Boehm, Hans wrote:




-Original Message-
From: Aldy Hernandez [mailto:al...@redhat.com]
Sent: Thursday, April 12, 2012 3:12 PM
To: Richard Guenther
Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
Subject: [PR tree-optimization/52558]: RFC: questions on store data
race

Here we have a testcase that affects both the C++ memory model and
transactional memory.

[Hans, this is caused by the same problem that is causing the
speculative register promotion issue you and Torvald pointed me at].




In the following testcase (adapted from the PR), the loop invariant
motion pass caches a pre-existing value for g_2, and then performs a
store to g_2 on every path, causing a store data race:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
int l;
for (l = 0; l<  1234; l++)
{
  if (g_1)
return l;
  else
g_2 = 0;<-- Store to g_2 should only happen if !g_1
}
return 999;
}

This gets transformed into something like:

g_2_lsm = g_2;
if (g_1) {
g_2 = g_2_lsm;  // boo! hiss!
return 0;
} else {
g_2_lsm = 0;
g_2 = g_2_lsm;
}

The spurious write to g_2 could cause a data race.

Presumably the g_2_lsm = g_2 is actually outside the loop?

Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it 
before the return.  Was the example just over-abbreviated?


There is some abbreviation going on :).  To be exact, this is what -O2 
currently produces for the lim1 pass.


:
  pretmp.4_1 = g_1;
  g_2_lsm.6_12 = g_2;

:
  # l_13 = PHI 
  # g_2_lsm.6_10 = PHI 
  g_1.0_4 = pretmp.4_1;
  if (g_1.0_4 != 0)
goto ;
  else
goto ;

:
  g_2_lsm.6_11 = 0;
  l_6 = l_13 + 1;
  if (l_6 != 1234)
goto ;
  else
goto ;

:
  # g_2_lsm.6_18 = PHI 
  g_2 = g_2_lsm.6_18;
  goto ;

:
  goto ;

:
  # g_2_lsm.6_17 = PHI 
  # l_19 = PHI 
  g_2 = g_2_lsm.6_17;

:
  # l_2 = PHI 
  return l_2;

So yes, there seems to be another write to g_2 inside the else, but 
probably because we have merged some basic blocks along the way.




Other than that, this sounds right to me.  So does Richard's flag-based 
version, which is the approach I would have originally expected.  But that 
clearly costs you a register.  It would be interesting to see how they compare.


I am working on the flag based approach.

Thanks to both of you.


RE: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-13 Thread Boehm, Hans
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Can we _remove_ a store we perceive as redundant (with a single-threaded
> view) with the memory model?
Generally yes, so long as synchronization operations are either conservatively 
treated as completely opaque, or are treated correctly in the "single-threaded 
view".  If there is no synchronization between the original store and the 
redundant one, then the redundant one changes things only if another thread 
writes to the same variable in-between.  That constitutes a data race, which 
invokes undefined behavior.  The general rule is that any sequentially correct 
transformation is OK between synchronization operations, so long as you don't 
store to anything you weren't supposed to modify, and the state at the next 
synchronization point is what would have been expected.  You can sometimes 
treat synchronizations more aggressively, but that should be safe.
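
A minimal illustration of that rule, assuming no synchronization operations
between the two stores (x and f are made-up names):

int x;

void
f (void)
{
  x = 1;
  /* ... no synchronization here ... */
  x = 1;   /* removable: another thread could only observe the difference by
              writing x in between, which is a data race and therefore
              undefined behavior */
}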

Hans


RE: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-13 Thread Boehm, Hans


> -Original Message-
> From: Aldy Hernandez [mailto:al...@redhat.com]
> Sent: Thursday, April 12, 2012 3:12 PM
> To: Richard Guenther
> Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
> Subject: [PR tree-optimization/52558]: RFC: questions on store data
> race
> 
> Here we have a testcase that affects both the C++ memory model and
> transactional memory.
> 
> [Hans, this is caused by the same problem that is causing the
> speculative register promotion issue you and Torvald pointed me at].
> 
> In the following testcase (adapted from the PR), the loop invariant
> motion pass caches a pre-existing value for g_2, and then performs a
> store to g_2 on every path, causing a store data race:
> 
> int g_1 = 1;
> int g_2 = 0;
> 
> int func_1(void)
> {
>int l;
>for (l = 0; l < 1234; l++)
>{
>  if (g_1)
>return l;
>  else
>g_2 = 0;   <-- Store to g_2 should only happen if !g_1
>}
>return 999;
> }
> 
> This gets transformed into something like:
> 
>   g_2_lsm = g_2;
>   if (g_1) {
>   g_2 = g_2_lsm;  // boo! hiss!
>   return 0;
>   } else {
>   g_2_lsm = 0;
>   g_2 = g_2_lsm;
>   }
> 
> The spurious write to g_2 could cause a data race.
Presumably the g_2_lsm = g_2 is actually outside the loop?

Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it 
before the return.  Was the example just over-abbreviated?

Other than that, this sounds right to me.  So does Richard's flag-based 
version, which is the approach I would have originally expected.  But that 
clearly costs you a register.  It would be interesting to see how they compare.

Hans

> 
> Andrew has suggested verifying that the store to g_2 was actually
> different than on entry, and letting PHI copy propagation optimize the
> redundant comparisons.  Like this:
> 
>   g_2_lsm = g_2;
>   if (g_1) {
>   if (g_2_lsm != g_2) // hopefully optimized away
>   g_2 = g_2_lsm;  // hopefully optimized away
>   return 0;
>   } else {
>   g_2_lsm = 0;
>   if (g_2_lsm != g_2) // hopefully optimized away
>   g_2 = g_2_lsm;
>   }
> 
> ...which PHI copy propagation and/or friends should be able to
> optimize.
> 
> For that matter, regardless of the memory model, the proposed solution
> would improve the existing code by removing the extra store (in this
> contrived test case anyhow).
> 
> Anyone see a problem with this approach (see attached patch)?
> 
> Assuming the unlikely scenario that everyone likes this :), I have two
> problems that I would like answered.  But feel free to ignore if the
> approach is a no go.
> 
> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
> g_2.  In the PR, Richard mentions that both FRE/PRE can figure this
> out, but they are not run after store motion.  DOM should also be able
> to, but apparently does not :(.  Tips?
> 
> 2. The GIMPLE_CONDs I create in this patch are currently causing
> problems with complex floats/doubles.  do_jump is complaining that it
> can't compare them, so I assume it is too late (in tree-ssa-loop-im.c)
> to compare complex values since complex lowering has already happened
> (??). Is there a more canonical way of creating a GIMPLE_COND that may
> contain complex floats at this stage?
> 
> Thanks folks.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-13 Thread Richard Guenther
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez  wrote:
> Here we have a testcase that affects both the C++ memory model and
> transactional memory.
>
> [Hans, this is caused by the same problem that is causing the speculative
> register promotion issue you and Torvald pointed me at].
>
> In the following testcase (adapted from the PR), the loop invariant motion
> pass caches a pre-existing value for g_2, and then performs a store to g_2
> on every path, causing a store data race:
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>  int l;
>  for (l = 0; l < 1234; l++)
>  {
>    if (g_1)
>      return l;
>    else
>      g_2 = 0;  <-- Store to g_2 should only happen if !g_1
>  }
>  return 999;
> }
>
> This gets transformed into something like:
>
>        g_2_lsm = g_2;
>        if (g_1) {
>                g_2 = g_2_lsm;  // boo! hiss!
>                return 0;
>        } else {
>                g_2_lsm = 0;
>                g_2 = g_2_lsm;
>        }
>
> The spurious write to g_2 could cause a data race.
>
> Andrew has suggested verifying that the store to g_2 was actually different
> than on entry, and letting PHI copy propagation optimize the redundant
> comparisons.  Like this:
>
>        g_2_lsm = g_2;
>        if (g_1) {
>                if (g_2_lsm != g_2)     // hopefully optimized away
>                        g_2 = g_2_lsm;  // hopefully optimized away
>                return 0;
>        } else {
>                g_2_lsm = 0;
>                if (g_2_lsm != g_2)     // hopefully optimized away
>                        g_2 = g_2_lsm;
>        }
>
> ...which PHI copy propagation and/or friends should be able to optimize.
>
> For that matter, regardless of the memory model, the proposed solution would
> improve the existing code by removing the extra store (in this contrived
> test case anyhow).
>
> Anyone see a problem with this approach (see attached patch)?

Can we _remove_ a store we perceive as redundant (from a single-threaded
view) under the memory model?  If so, then this sounds plausible (minus
the optimization that, if the variable is written to unconditionally, we avoid
this comparison -- see the places where we check whether we can apply
store motion to possibly trapping locations).

A similar effect could be obtained by keeping a flag that records whether we
entered the path that stored the value before the transform.  Thus, do

  lsm = g2;  // technically not needed, if you also handle loads
  flag = false;
  for (...)
    {
      if (g1)
        {
          if (flag)
            g2 = lsm;
        }
      else
        {
          lsm = 0;
          flag = true;
        }
    }
  if (flag)
    g2 = lsm;

that would allow extending this to cover the cases where the access may
eventually trap (if you omit the load before the loop).
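
By way of illustration, the flag-based form applied by hand, in plain C, to the
func_1 testcase from the PR, with the load before the loop omitted as suggested;
the names g_2_lsm and g_2_flag are made up for the example:

int g_1 = 1;
int g_2 = 0;

int func_1 (void)
{
  int l;
  int g_2_lsm = 0;      /* never read until g_2_flag is set */
  _Bool g_2_flag = 0;   /* "did the original program store g_2?" */

  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        {
          if (g_2_flag)         /* commit on the exit edge, only if dirty */
            g_2 = g_2_lsm;
          return l;
        }
      else
        {
          g_2_lsm = 0;          /* the store itself stays in a register */
          g_2_flag = 1;
        }
    }
  if (g_2_flag)                 /* commit on the fallthrough exit as well */
    g_2 = g_2_lsm;
  return 999;
}

Because g_2 is neither read nor written unless g_2_flag was set on some
iteration, the same shape would also be safe for a location that might trap.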

Not sure which is more efficient (you can eventually combine flag variables
for different store locations; combining the if-changed compares is not so
easy, I guess).

> Assuming the unlikely scenario that everyone likes this :), I have two
> problems that I would like answered.  But feel free to ignore if the
> approach is a no go.
>
> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>  In the PR, Richard mentions that both FRE/PRE can figure this out, but they
> are not run after store motion.  DOM should also be able to, but apparently
> does not :(.  Tips?

Well.  Schedule FRE after loop opts ...

DOM is not prepared to handle loads/stores with differing VOPs - it was never
really updated after the alias-improvements merge.

> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
> with complex floats/doubles.  do_jump is complaining that it can't compare
> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
> values since complex lowering has already happened (??). Is there a more
> canonical way of creating a GIMPLE_COND that may contain complex floats at
> this stage?

As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered.  As said, though, what you want
is a bitwise comparison, not a value comparison.  You can try using

  if (VIEW_CONVERT_EXPR < > (lsm_tmp_reg) != VIEW_CONVERT_EXPR < > (orig_value))
    ...

that can of course result in really awful codegen (think of fp <-> gp register
moves, or even stack spills).  Which maybe hints that the flag approach
would be better in some cases?  (You need to cache the original value
in a register anyway; the only thing you save is updating it when you would
update the flag value.)

Maybe you can combine both, possibly depending on a target hook
can_compare_bitwise (mode) that would tell you whether the target can
efficiently compare two registers in mode for bit equality.
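
To make the bitwise-versus-value distinction concrete, a small standalone C
sketch (purely illustrative, nothing GCC-specific):

#include <string.h>
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Bitwise equality: the kind of check redundant-store detection wants.  */
static int
bits_equal (double a, double b)
{
  uint64_t ia, ib;
  memcpy (&ia, &a, sizeof ia);
  memcpy (&ib, &b, sizeof ib);
  return ia == ib;
}

int main (void)
{
  double pz = 0.0, nz = -0.0, qnan = NAN;

  /* Value comparison says they are equal, but storing -0.0 over 0.0
     does change memory.  */
  printf ("%d %d\n", pz == nz, bits_equal (pz, nz));         /* prints: 1 0 */

  /* Value comparison says a NaN differs from itself, although storing
     it back over the same bits would change nothing.  */
  printf ("%d %d\n", qnan == qnan, bits_equal (qnan, qnan)); /* prints: 0 1 */
  return 0;
}

The flag approach sidesteps this entirely, since it never has to compare the
values at all.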

A few comments on the patch itself.

+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->lo

[PR tree-optimization/52558]: RFC: questions on store data race

2012-04-12 Thread Aldy Hernandez
Here we have a testcase that affects both the C++ memory model and 
transactional memory.


[Hans, this is caused by the same problem that is causing the 
speculative register promotion issue you and Torvald pointed me at].


In the following testcase (adapted from the PR), the loop invariant 
motion pass caches a pre-existing value for g_2, and then performs a 
store to g_2 on every path, causing a store data race:


int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
  {
if (g_1)
  return l;
else
  g_2 = 0;  <-- Store to g_2 should only happen if !g_1
  }
  return 999;
}

This gets transformed into something like:

g_2_lsm = g_2;
if (g_1) {
g_2 = g_2_lsm;  // boo! hiss!
return 0;
} else {
g_2_lsm = 0;
g_2 = g_2_lsm;
}

The spurious write to g_2 could cause a data race.

Andrew has suggested verifying that the store to g_2 was actually 
different than on entry, and letting PHI copy propagation optimize the 
redundant comparisons.  Like this:


g_2_lsm = g_2;
if (g_1) {
if (g_2_lsm != g_2) // hopefully optimized away
g_2 = g_2_lsm;  // hopefully optimized away
return 0;
} else {
g_2_lsm = 0;
if (g_2_lsm != g_2) // hopefully optimized away
g_2 = g_2_lsm;
}

...which PHI copy propagation and/or friends should be able to optimize.

For that matter, regardless of the memory model, the proposed solution 
would improve the existing code by removing the extra store (in this 
contrived test case anyhow).
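
For reference, here is the if-changed idea above written out by hand, in plain
C, for the whole func_1 loop; this only illustrates the intended shape of the
transform, not what the patch literally emits (the patch works on the loop
exit edges in GIMPLE):

int g_1 = 1;
int g_2 = 0;

int func_1 (void)
{
  int l;
  int g_2_lsm = g_2;            /* cached copy used inside the loop */

  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        {
          if (g_2_lsm != g_2)   /* exit edge: store only if the cache differs */
            g_2 = g_2_lsm;
          return l;
        }
      else
        g_2_lsm = 0;            /* the store itself stays in a register */
    }
  if (g_2_lsm != g_2)           /* same guard on the fallthrough exit */
    g_2 = g_2_lsm;
  return 999;
}

On the path that never takes the else branch, g_2_lsm still holds the value
just loaded from g_2, which is exactly the equality the questions below hope
FRE/DOM can prove so that the guard and the racy store fold away.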


Anyone see a problem with this approach (see attached patch)?

Assuming the unlikely scenario that everyone likes this :), I have two 
problems that I would like answered.  But feel free to ignore if the 
approach is a no go.


1. No pass can figure out the equality (or inequality) of g_2_lsm and 
g_2.  In the PR, Richard mentions that both FRE/PRE can figure this out, 
but they are not run after store motion.  DOM should also be able to, 
but apparently does not :(.  Tips?


2. The GIMPLE_CONDs I create in this patch are currently causing 
problems with complex floats/doubles.  do_jump is complaining that it 
can't compare them, so I assume it is too late (in tree-ssa-loop-im.c) 
to compare complex values since complex lowering has already happened 
(??). Is there a more canonical way of creating a GIMPLE_COND that may 
contain complex floats at this stage?


Thanks folks.
* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
conditional.
* opts.c (finish_options): Do not allow store or load data races
with -fgnu-tm.

Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 186108)
+++ tree-ssa-loop-im.c  (working copy)
@@ -1999,8 +1999,59 @@ execute_sm (struct loop *loop, VEC (edge
 
   FOR_EACH_VEC_ELT (edge, exits, i, ex)
 {
-  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
-  gsi_insert_on_edge (ex, store);
+  basic_block new_bb, then_bb, old_dest;
+  edge then_old_edge;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  tree t1;
+
+  if (PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+   {
+ store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+ gsi_insert_on_edge (ex, store);
+   }
+  else
+   {
+ old_dest = ex->dest;
+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->loop_father);
+
+ gsi = gsi_start_bb (new_bb);
+ t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
+ stmt = gimple_build_assign (t1, unshare_expr (ref->mem));
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+ stmt = gimple_build_cond (NE_EXPR, t1, tmp_var,
+   NULL_TREE, NULL_TREE);
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+ gsi = gsi_start_bb (then_bb);
+ store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+ gsi_insert_after (&gsi, store, GSI_CONTINUE_LINKING);
+
+ make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+ make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+ then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+ set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+ for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ gimple phi = gsi_stmt (gsi);
+ unsigned i;
+
+ for (i = 0; i < gimple_phi_num_args (phi); i++)
+   if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+ {
+   tree arg = gimple_phi_arg_def (phi, i)