Re: fix PR46029: reimplement if conversion of loads and stores

2015-07-08 Thread Richard Biener
On Tue, Jul 7, 2015 at 11:23 PM, Abe abe_skol...@yahoo.com wrote:
 (if-conversion could directly generate masked load/stores
  of course and not use a scratch-pad at all in that case).


 IMO that's a great idea, but I don't know how to do it.  Hints would be
 welcome.  In particular, how does one generate masked load/stores at the
 GIMPLE level?

It already does that, see predicate_mem_writes.  You should definitely
preserve that path (I think it does that only when versioning the loop
for if-conversion, non-vectorized loops will then not be if-converted).


 But are we correctly handling these cases in the current if conversion
 code?


 I'm uncertain to what that is intended to refer, but I believe Sebastian
 would agree that the new if converter is safer than the old one in terms of
 correctness at the time of running the code being compiled.


 Abe's changes would seem like a step forward from a correctness standpoint


 Not to argue, but as a point of humility: Sebastian did by far the most work
 on this patch.  I just modernized it and helped move it along.  Even _that_
 was done with Sebastian's help.


 even if they take us a step backwards from a performance standpoint.


 For now, we have a few performance regressions, and so far we have found
 that it's non-trivial to remove all of those regressions.

On what hardware did you test?

 We may be better off pushing the current patch to trunk and having the
 performance regressions currently introduced by the new if converter be
 fixed by later patches.
 Pushing to trunk gives us excellent visibility amongst GCC hackers, so the
 code will get more eyeballs than if it lingers in an uncommitted patch or
 in a branch.
 I, for one, would love some help in fixing these performance regressions.
 ;-)

 If fixing the performance regressions winds up taking too long, perhaps the
 current imperfect patch could be undone on trunk just before a release is
 tagged,
 and then we'll push it in again when trunk goes back to being allowed to be
 unstable?  According to my analysis of the data near the end of the page at
 https://gcc.gnu.org/develop.html, we have until roughly April of 2016 to
 work on not-yet-perfect patches in trunk.


 So the question is whether we get more non-vectorized if-converted
 code out of this (and thus whether we want to use --param
 allow-store-data-races to get the old code back which is nicer to less
 capable CPUs and probably faster than using scatter/gather or masked
 loads/stores).


 I do think conditionalizing some of this on the allow-store-data-races
 makes sense.


 I think having both the old if-converter and the new one live on in GCC is
 nontrivial, but not impossible.  I also don't think it's the best long-term
 goal,
 but only a short-term workaround.  In the long run, IMO there should be only
 one if converter, albeit perhaps with tweaking flags [e.g.
 -fallow-unsafe-if-conversion].


 I also wonder if we should really care about load data races (not sure
 your patch does).


 According to a recent long discussion I had with Sebastian, our current
 patch does not have the flaw I was concerned it might have in terms of loads
 because:

   [1] the scratchpad is only being used to if-convert assignments to
 thread-local scalars, never to globals/statics, and because...

   [2] the gimplifier is supposed to detect that the address of this scalar has
 been taken, and when that is detected in the code being compiled,
   it causes the scalar to no longer look like a scalar in GIMPLE, so that
 we are also safe from stale-data problems that could come from
   corner-case code that takes the address of a thread-local variable and
 gives that address to another thread [which then proceeds to
   overwrite the value in the supposedly-thread-local scalar that belongs
 to a different thread from the one doing the writing].


 Regards,

 Abe




Re: fix PR46029: reimplement if conversion of loads and stores

2015-07-08 Thread Alan Lawrence

Abe wrote:


I'm uncertain to what that is intended to refer, but I believe Sebastian would 
agree that the new if converter is safer than the old one in terms of 
correctness at the time of running the code being compiled.



even if they take us a step backwards from a performance standpoint.


For now, we have a few performance regressions, and so far we have found that 
it's non-trivial to remove all of those regressions.
We may be better off pushing the current patch to trunk and having the 
performance regressions currently introduced by the new if converter be fixed 
by later patches.
Pushing to trunk gives us excellent visibility amongst GCC hackers, so the code will 
get more eyeballs than if it lingers in an uncommitted patch or in a branch.
I, for one, would love some help in fixing these performance regressions. ;-)

If fixing the performance regressions winds up taking too long, perhaps the 
current imperfect patch could be undone on trunk just before a release is 
tagged,
and then we'll push it in again when trunk goes back to being allowed to be unstable? 


TBH my two cents would be that a performance-regressed but correct compiler
is far better to release than a performance-improved one that generates
unsafe code (e.g. with extra faults in the straightforward single-threaded case!)...


--Alan



Re: fix PR46029: reimplement if conversion of loads and stores

2015-07-08 Thread Abe

[Abe wrote:]

I believe Sebastian would agree that the new if converter is safer than the
old one in terms of correctness at the time of running the code being compiled.
[...]

For now, we have a few performance regressions

[snip]


[Alan wrote:]

TBH my two cents would be that a performance-regressed but correct compiler
is far better to release than a performance-improved one that generates
unsafe code (e.g. with extra faults in the straightforward single-threaded case!)...

I strongly agree that -- by default -- correctness trumps performance.  The
only time it is allowable to reverse that relationship, IMO,
is when the user* of the compiler has explicitly specified flags [e.g. -ffast-math,
-Ofast] that tell the compiler that the user*
currently cares more about performance than about
{correctness-according-to-spec and/or safety in all conditions, including null
pointers}.

['*': or Makefiles, build scripts, etc.]

FYI: TTBOMK, the old if converter was not unsafe with default flags or with only big
knobs like -O3; I'm unsure what it did
under -ffast-math and -Ofast, if anything of interest.  The main advantage
of the new if converter over the old one is that
the new one is safe in certain situations wherein the old one is unsafe, e.g.
the old one may cause the vectorized code to segfault
where the non-if-converted code would have run just fine all the way to program
completion with the same inputs.  This additional
safety allows the new converter to be used under more conditions, which in turn
allows it to be enabled by default.  We intend
for all the safe if-conversions to be done by default whenever the vectorizer
is on.  If there are any unsafe conversions left,
which I'm not sure there are, then we will enable them only when the user* specifies
something like -fif-conversion=allow-unsafe.
The "allows it to be enabled by default" property should help the code that GCC generates
under -O3 w/o any additional
flags to be faster than it currently is, for the relevant targets, *without
sacrificing even _one_ _bit_ of correctness*.

Regards,

Abe


Re: fix PR46029: reimplement if conversion of loads and stores

2015-07-08 Thread Abe

(if-conversion could directly generate masked load/stores
  of course and not use a scratch-pad at all in that case).


[Abe wrote:]

IMO that's a great idea, but I don't know how to do it.
Hints would be welcome.  In particular, how does one

 generate masked load/stores at the GIMPLE level?

[Richard Biener wrote:]

It already does that, see predicate_mem_writes.

 You should definitely preserve that path

Thanks.  Yes, we have not intentionally disabled that.



On what hardware did you test?


AMD64 arch., Intel implementation.  Nothing fancy AFAIK in the flags to make it 
super-specific,
e.g. -march=nocona or -march=native.  Occasionally using AVX-2 flags as 
specified by some test cases.

Regards,

Abe


RE: fix PR46029: reimplement if conversion of loads and stores

2015-07-07 Thread Abe

(if-conversion could directly generate masked load/stores
 of course and not use a scratch-pad at all in that case).


IMO that's a great idea, but I don't know how to do it.  Hints would be welcome.  In 
particular, how does one generate masked load/stores at the GIMPLE level?



But are we correctly handling these cases in the current if conversion code?


I'm uncertain to what that is intended to refer, but I believe Sebastian would 
agree that the new if converter is safer than the old one in terms of 
correctness at the time of running the code being compiled.



Abe's changes would seem like a step forward from a correctness standpoint


Not to argue, but as a point of humility: Sebastian did by far the most work on 
this patch.  I just modernized it and helped move it along.  Even _that_ was 
done with Sebastian's help.



even if they take us a step backwards from a performance standpoint.


For now, we have a few performance regressions, and so far we have found that 
it's non-trivial to remove all of those regressions.
We may be better off pushing the current patch to trunk and having the 
performance regressions currently introduced by the new if converter be fixed 
by later patches.
Pushing to trunk gives us excellent visibility amongst GCC hackers, so the code will 
get more eyeballs than if it lingers in an uncommitted patch or in a branch.
I, for one, would love some help in fixing these performance regressions. ;-)

If fixing the performance regressions winds up taking too long, perhaps the 
current imperfect patch could be undone on trunk just before a release is 
tagged,
and then we'll push it in again when trunk goes back to being allowed to be 
unstable?  According to my analysis of the data near the end of the page at
https://gcc.gnu.org/develop.html, we have until roughly April of 2016 to work 
on not-yet-perfect patches in trunk.



So the question is whether we get more non-vectorized if-converted
code out of this (and thus whether we want to use --param
allow-store-data-races to get the old code back which is nicer to less
capable CPUs and probably faster than using scatter/gather or masked 
loads/stores).



I do think conditionalizing some of this on the allow-store-data-races makes 
sense.


I think having both the old if-converter and the new one live on in GCC is 
nontrivial, but not impossible.  I also don't think it's the best long-term goal,
but only a short-term workaround.  In the long run, IMO there should be only one if 
converter, albeit perhaps with tweaking flags [e.g. 
-fallow-unsafe-if-conversion].



I also wonder if we should really care about load data races (not sure your 
patch does).


According to a recent long discussion I had with Sebastian, our current patch 
does not have the flaw I was concerned it might have in terms of loads because:

  [1] the scratchpad is only being used to if-convert assignments to 
thread-local scalars, never to globals/statics, and because...

  [2] the gimplifier is supposed to detect that the address of this scalar has
been taken, and when that is detected in the code being compiled,
  it causes the scalar to no longer look like a scalar in GIMPLE, so that we
are also safe from stale-data problems that could come from
  corner-case code that takes the address of a thread-local variable and
gives that address to another thread [which then proceeds to
  overwrite the value in the supposedly-thread-local scalar that belongs to
a different thread from the one doing the writing].
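
A minimal hand-written example of the corner case described in [2] (the
identifiers are illustrative, not from the patch or the testsuite): once the
address of the local escapes to another thread, the local stops being a plain
GIMPLE register, so the scratchpad-based if-conversion no longer applies to it.

  #include <pthread.h>

  static void *
  writer (void *p)
  {
    *(int *) p = 42;   /* another thread writes the "thread-local" scalar */
    return 0;
  }

  int
  corner_case (int cond)
  {
    int local = 0;                            /* looks like a scalar...       */
    pthread_t t;
    pthread_create (&t, 0, writer, &local);   /* ...until its address escapes */
    if (cond)
      local = 1;
    pthread_join (t, 0);
    return local;
  }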


Regards,

Abe




Re: fix PR46029: reimplement if conversion of loads and stores

2015-07-06 Thread Jeff Law

On 06/25/2015 03:43 AM, Richard Biener wrote:


Yes, and if you just look at scalar code then with the rewrite we can
enable store if-conversion unconditionally.

_But_

when the new scheme triggers vectorization cannot succeed on the
result as we get

   if (cond)
 *p = val;

if-converted to

   tem = cond ? p : scratch;
   *tem = val;

and

if (cond)
  val = *p;

if-converted to

   scratch = val;
   tem = cond ? p : scratch;
   val = *tem;

and thus the store and loads appear as scatter/gather ones to
the vectorizer (if-conversion could directly generate masked
load/stores of course and not use a scratch-pad at all in that case).
Right.  But are we correctly handling these cases in the current if 
conversion code?  If we're currently mis-handling them, then Abe's 
changes would seem like a step forward from a correctness standpoint, 
even if they take us a step backwards from a performance standpoint.




So the question is whether we get more non-vectorized if-converted
code out of this (and thus whether we want to use
--param allow-store-data-races to get the old code back which is
nicer to less capable CPUs and probably faster than using
scatter/gather or masked loads/stores).
I do think conditionalizing some of this on the allow-store-data-races 
makes sense.





  Note that I think scatter

support is still not implemented in the vectorizer.  I also wonder
if we should really care about load data races (not sure your patch
does).
I thought the consensus was that load data races weren't important from 
a correctness standpoint.  Torvald would be the man to know the answer 
on that one (trie...@redhat.com).



In general I like the idea.
Similarly.  Just trying to find the time to actually look at it is 
proving hard :(


jeff



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-30 Thread Alan Lawrence

Abe Skolnik wrote:


In tree-if-conv.c: […] if it doesn't trap, but has_non_addressable_refs, can't
we use ifcvt_can_use_mask_load_store there too?
If an access could trap, but is addressable, can't we use the scratchpad
technique to get round the trapping problem?


That's how we deal with loads.  We use the scratchpad in case the condition is 
false.  That way it doesn't matter if the original code was hypothetically 
dereferencing a null pointer in the condition-false section, for example: we 
aren't unconditionally executing both sides exactly as written.  On the false 
side, we load from the scratchpad and then ignore the result of that load.
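
A source-level sketch of that load path (hand-written; the names scratch and
tem are illustrative, not taken from the patch):

  int
  load_if_converted (int cond, const int *p, int oldval)
  {
    int scratch;                            /* compiler-created pad on the stack  */
    const int *tem = cond ? p : &scratch;   /* never dereferences a bad p         */
    int tmp = *tem;                         /* executed unconditionally, but safe */
    return cond ? tmp : oldval;             /* the dummy load is ignored          */
  }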


Ok, so I'm looking at tree-if-conv.c:

@@ -876,32 +726,40 @@ if_convertible_gimple_assign_stmt_p (gimple stmt,
   return false;
 }

   /* tree-into-ssa.c uses GF_PLF_1, so avoid it, because
  in between if_convertible_loop_p and combine_blocks
  we can perform loop versioning.  */
   gimple_set_plf (stmt, GF_PLF_2, false);

   if (flag_tree_loop_if_convert_stores)
 {
-  if (ifcvt_could_trap_p (stmt, refs))
+  if (ifcvt_could_trap_p (stmt))
{
  if (ifcvt_can_use_mask_load_store (stmt))
{
  gimple_set_plf (stmt, GF_PLF_2, true);
  *any_mask_load_store = true;
  return true;
}
  if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, "tree could trap...\n");
+   fprintf (dump_file, "tree could trap\n");
  return false;
}
+
+  if (has_non_addressable_refs (stmt))
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "has non-addressable memory references\n");
+ return false;
+   }
+
   return true;
 }


I feel I'm being stupid here, but... if ifcvt_could_trap_p &&
!ifcvt_can_use_mask_load_store, we return false here.  Specifically, we return
false if ifcvt_could_trap_p && !ifcvt_can_use_mask_load_store &&
!has_non_addressable_refs.  Shouldn't we return true in that case, in order to
use the scratchpad technique rather than bail out?


Thanks, Alan



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-26 Thread Richard Biener
On June 26, 2015 2:21:22 PM GMT+02:00, Alan Lawrence alan.lawre...@arm.com 
wrote:
Sebastian Pop wrote:
 On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 when the new scheme triggers vectorization cannot succeed on the
 result as we get

   if (cond)
 *p = val;

 if-converted to

   tem = cond ? p : scratch;
   *tem = val;
 
 That's correct.
 
 and

if (cond)
  val = *p;

 if-converted to

   scratch = val;
   tem = cond ? p : scratch;
   val = *tem;
 
 The patch does this slightly differently:
 
tem = cond ? p : scratch;
val = cond ? *tem : val;
 
 I think I like your version better as it has only one cond_expr.

Another slight concern... by reusing scratchpads, are we at risk of
creating 
lots of false dependencies here, that restrict instruction scheduling /
other 
opts later?

Another possibility is to not share them and make sure to insert clobbers for 
them when they become dead so later stack slot sharing can merge them again.

Richard.

 [...]
 and thus the store and loads appear as scatter/gather ones to
 the vectorizer (if-conversion could directly generate masked
 load/stores of course and not use a scratch-pad at all in that
case).

Thank you Richard for much better expressing what I was thinking here
too :)

 Abe also suggested to continue optimizing the other way in cases
 where we know to write or load from the same location on all
branches:
 
 if (c)
   A[i] = ...
 else
   A[i] = ...

The store here really should be commoned, yes.

(Related but different?: BZ 56625.)


I might add, I find this code much easier to follow than the old
(removed) code 
about data references, memrefs_read_unconditionally, and 
write_memrefs_written_at_least_once...

Thanks, Alan




Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-26 Thread Jeff Law

On 06/26/2015 08:57 AM, Richard Biener wrote:


Another possibility is to not share them and make sure to insert clobbers for 
them when they become dead so later stack slot sharing can merge them again.
We've certainly had cases in the past where re-using a stack slot of an
object that has gone out of scope for some other object in a different
scope with a different type was not seen as conflicting by the aliasing
machinery, resulting in incorrect code motions.


Creating distinct scratchpads and letting the existing machinery (which
has been fixed to deal with the issues noted above, I believe) handle the
sharing would make more sense than trying to reimplement the sharing
correctness stuff within if-conversion.
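
A hand-written C analogue of that approach (illustrative only, not from the
patch): each conversion site gets its own block-local pad, each pad dies at
the end of its block (where the gimplifier emits a clobber), and the existing
stack-slot-sharing machinery is then free to put both pads in the same slot.

  void
  two_sites (int c1, int c2, int *p, int *q, int v)
  {
    {
      int pad1;
      int *t = c1 ? p : &pad1;
      *t = v;
    }                     /* pad1 is dead (clobbered) here */
    {
      int pad2;
      int *t = c2 ? q : &pad2;
      *t = v;
    }                     /* pad2 is dead (clobbered) here */
  }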


jeff



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-26 Thread Alan Lawrence

Sebastian Pop wrote:

On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener
richard.guent...@gmail.com wrote:

when the new scheme triggers vectorization cannot succeed on the
result as we get

  if (cond)
*p = val;

if-converted to

  tem = cond ? p : scratch;
  *tem = val;


That's correct.


and

   if (cond)
 val = *p;

if-converted to

  scratch = val;
  tem = cond ? p : scratch;
  val = *tem;


The patch does this slightly differently:

   tem = cond ? p : scratch;
   val = cond ? *tem : val;

I think I like your version better as it has only one cond_expr.


Another slight concern... by reusing scratchpads, are we at risk of creating 
lots of false dependencies here, that restrict instruction scheduling / other 
opts later?



[...]
and thus the store and loads appear as scatter/gather ones to
the vectorizer (if-conversion could directly generate masked
load/stores of course and not use a scratch-pad at all in that case).


Thank you Richard for much better expressing what I was thinking here too :)


Abe also suggested to continue optimizing the other way in cases
where we know to write or load from the same location on all branches:

if (c)
  A[i] = ...
else
  A[i] = ...


The store here really should be commoned, yes.

(Related but different?: BZ 56625.)


I might add, I find this code much easier to follow than the old (removed) code 
about data references, memrefs_read_unconditionally, and 
write_memrefs_written_at_least_once...


Thanks, Alan



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-25 Thread Sebastian Pop
On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener
richard.guent...@gmail.com wrote:
 when the new scheme triggers vectorization cannot succeed on the
 result as we get

   if (cond)
 *p = val;

 if-converted to

   tem = cond ? p : scratch;
   *tem = val;

That's correct.


 and

if (cond)
  val = *p;

 if-converted to

   scratch = val;
   tem = cond ? p : scratch;
   val = *tem;

The patch does this slightly differently:

   tem = cond ? p : scratch;
   val = cond ? *tem : val;

I think I like your version better as it has only one cond_expr.


 and thus the store and loads appear as scatter/gather ones to
 the vectorizer (if-conversion could directly generate masked
 load/stores of course and not use a scratch-pad at all in that case).

 So the question is whether we get more non-vectorized if-converted
 code out of this

this is the case.

 (and thus whether we want to use
 --param allow-store-data-races to get the old code back which is
 nicer to less capable CPUs and probably faster than using
 scatter/gather or masked loads/stores).

A flag to allow load and store data-races is an interesting suggestion.

Abe also suggested to continue optimizing the other way in cases
where we know to write or load from the same location on all branches:

if (c)
  A[i] = ...
else
  A[i] = ...

 scatter support is still not implemented in the vectorizer.

Correct.

 I also wonder if we should really care about load data races
 (not sure your patch does).

The patch does both loads and stores.

 I didn't look at the patch in detail yet - please address Alan's comments
 and re-post an updated patch.

 In general I like the idea.

Thanks for your review,
Sebastian


 Thanks,
 Richard.



 Re. creation of scratchpads:
 (1) Should the '64' byte size be the result of scanning the
 function, for the largest data size to which we store? (ideally,
 conditionally store!)

 I suspect most functions have conditional stores, but far fewer have
 conditional stores that we'd like to if-convert.  ISTM that if we can lazily
 allocate the scratchpad that'd be best.   If this were an RTL pass, then I'd
 say query the backend for the widest mode store insn and use that to size
 the scratchpad.  We may have something similar we can do in gimple without
 resorting to querying insn backend capabilities.  Perhaps walking the in-scope
 addressable variables or somesuch.


 (2) Allocating only once per function: if we had one scratchpad per
 loop, it could/would live inside the test of gimple_build_call_internal
 (IFN_LOOP_VECTORIZED,  Otherwise, if we if-convert one or more
 loops in the function, but then fail to vectorize them, we'll leave the
 scratchpad around for later phases to clean up. Is that OK?

 If the scratchpad is local to a function, then I'd expect we'd clean it up
 just like any other unused local.  Once it's a global, then all bets are
 off.

 Anyway, I probably should just look at the patch before making more
 comments.

 jeff



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-25 Thread Richard Biener
On Wed, Jun 24, 2015 at 7:10 PM, Jeff Law l...@redhat.com wrote:
 On 06/22/2015 10:27 AM, Alan Lawrence wrote:


 My main thought concerns the direction we are travelling here. A major
 reason why we do if-conversion is to enable vectorization.  Is this targeted
 at gathering/scattering loads?  Following vectorization,
 different elements of the vector being loaded/stored may have to go
 to/from the scratchpad or to/from main memory.

 Or, are we aiming at the case where the predicate or address are
 invariant? That seems unlikely - loop unswitching would be better for
 the predicate; loading from an address, we'd just peel and hoist;
 storing, this'd result in the address holding the last value written, at
 exit from the loop, a curious idiom. Where the predicate/address is
 invariant across the vector? (!)

 Or, are we aiming at non-vectorized code?

 I think we're aiming at correctness issues, particularly WRT not allowing
 the optimizers to introduce new data races for C11/C++11.

Yes, and if you just look at scalar code then with the rewrite we can
enable store if-conversion unconditionally.

_But_

when the new scheme triggers vectorization cannot succeed on the
result as we get

  if (cond)
*p = val;

if-converted to

  tem = cond ? p : scratch;
  *tem = val;

and

   if (cond)
 val = *p;

if-converted to

  scratch = val;
  tem = cond ? p : scratch;
  val = *tem;

and thus the store and loads appear as scatter/gather ones to
the vectorizer (if-conversion could directly generate masked
load/stores of course and not use a scratch-pad at all in that case).

So the question is whether we get more non-vectorized if-converted
code out of this (and thus whether we want to use
--param allow-store-data-races to get the old code back which is
nicer to less capable CPUs and probably faster than using
scatter/gather or masked loads/stores).  Note that I think scatter
support is still not implemented in the vectorizer.  I also wonder
if we should really care about load data races (not sure your patch
does).

I didn't look at the patch in detail yet - please address Alan's comments
and re-post an updated patch.

In general I like the idea.

Thanks,
Richard.



 Re. creation of scratchpads:
 (1) Should the '64' byte size be the result of scanning the
 function, for the largest data size to which we store? (ideally,
 conditionally store!)

 I suspect most functions have conditional stores, but far fewer have
 conditional stores that we'd like to if-convert.  ISTM that if we can lazily
 allocate the scratchpad that'd be best.   If this were an RTL pass, then I'd
 say query the backend for the widest mode store insn and use that to size
 the scratchpad.  We may have something similar we can do in gimple without
 resorting to querying insn backend capabilities.  Perhaps walking the in-scope
 addressable variables or somesuch.


 (2) Allocating only once per function: if we had one scratchpad per
 loop, it could/would live inside the test of gimple_build_call_internal
 (IFN_LOOP_VECTORIZED,  Otherwise, if we if-convert one or more
 loops in the function, but then fail to vectorize them, we'll leave the
 scratchpad around for later phases to clean up. Is that OK?

 If the scratchpad is local to a function, then I'd expect we'd clean it up
 just like any other unused local.  Once it's a global, then all bets are
 off.

 Anyway, I probably should just look at the patch before making more
 comments.

 jeff



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-24 Thread Ramana Radhakrishnan



On 12/06/15 21:50, Abe Skolnik wrote:

Hi everybody!

In the current implementation of if conversion, loads and stores are
if-converted in a thread-unsafe way:

   * loads were always executed, even when they should have not been.
 Some source code could be rendered invalid due to null pointers
 that were OK in the original program because they were never
 dereferenced.

   * writes were if-converted via load/maybe-modify/store, which
 renders some code multithreading-unsafe.

This patch reimplements if-conversion of loads and stores in a safe
way using a scratchpad allocated by the compiler on the stack:

   * loads are done through an indirection, reading either the correct
 data from the correct source [if the condition is true] or reading
 from the scratchpad and later ignoring this read result [if the
 condition is false].

   * writes are also done through an indirection, writing either to the
 correct destination [if the condition is true] or to the
 scratchpad [if the condition is false].

Vectorization of if-cvt-stores-vect-ifcvt-18.c disabled because the
old if-conversion resulted in unsafe code that could fail under
multithreading even though the as-written code _was_ thread-safe.

Passed regression testing and bootstrap on amd64-linux.
Is this OK to commit to trunk?
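
For illustration, a hand-written C analogue of the store transformation just
described (the identifiers scratch and tem are illustrative, not taken from
the patch):

  void
  original (int cond, int *p, int val)
  {
    if (cond)
      *p = val;                   /* *p is never touched when cond is false   */
  }

  void
  old_if_conversion (int cond, int *p, int val)   /* load/maybe-modify/store  */
  {
    int tmp = *p;                 /* unconditional load: may fault            */
    *p = cond ? val : tmp;        /* unconditional store: introduces a race   */
  }

  void
  new_if_conversion (int cond, int *p, int val)   /* scratchpad indirection   */
  {
    int scratch;
    int *tem = cond ? p : &scratch;
    *tem = val;                   /* the pad absorbs the write when cond is false */
  }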


I can't approve or reject but this certainly looks like an improvement 
compared to where we are as we get rid of the data races.


The only gotcha I can think of with this approach is that it introduces 
false dependencies that would cause unnecessary write-after-write 
hazards with the writes to the scratchpad when you unroll the loop - but 
that's not necessarily worse than where we are today.
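
A hand-written sketch of that hazard (illustrative names; the patch allocates
one pad per function; assumes n is even): after unrolling by two, both copies
of the body may target the same pad when their conditions are false, so the
two stores carry a write-after-write dependency the original loop did not have.

  void
  unrolled_by_two (int *a, const int *c, int v, int n)
  {
    int scratch;                              /* single pad shared by the loop */
    for (int i = 0; i < n; i += 2)
      {
        int *t0 = c[i]     ? &a[i]     : &scratch;
        int *t1 = c[i + 1] ? &a[i + 1] : &scratch;
        *t0 = v;                              /* may write the pad...          */
        *t1 = v;                              /* ...and so may this one: WAW   */
      }
  }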


Some fun stats from a previous Friday afternoon poke at this without 
doing any benchmarking as such.


In a bootstrap with BOOT_CFLAGS=-O2 -ftree-loop-if-convert-stores and 
one without it, I see about 12.20% more csel's on an AArch64 bootstrap 
(goes from 7898 to 8862) vs plain old -O2.


And I did see the one case in libquantum get sorted with this, though 
the performance results were, let's say, funny (+5% in one case, -1.5% on
another core).  I haven't analysed it deeply yet, but it does look
interesting.


regards
Ramana




Regards,

Abe




2015-06-12  Sebastian Pop  s@samsung.com
 Abe Skolnik  a.skol...@samsung.com

PR tree-optimization/46029
* tree-data-ref.c (struct data_ref_loc_d): Moved...
(get_references_in_stmt): Exported.
* tree-data-ref.h (struct data_ref_loc_d): ... here.
(get_references_in_stmt): Declared.

* doc/invoke.texi (-ftree-loop-if-convert-stores): Update description.
* tree-if-conv.c (struct ifc_dr): Removed.
(IFC_DR): Removed.
(DR_WRITTEN_AT_LEAST_ONCE): Removed.
(DR_RW_UNCONDITIONALLY): Removed.
(memrefs_read_or_written_unconditionally): Removed.
(write_memrefs_written_at_least_once): Removed.
(ifcvt_could_trap_p): Does not take refs parameter anymore.
(ifcvt_memrefs_wont_trap): Removed.
(has_non_addressable_refs): New.
(if_convertible_gimple_assign_stmt_p): Call has_non_addressable_refs.
Removed use of refs.
(if_convertible_stmt_p): Removed use of refs.
(if_convertible_gimple_assign_stmt_p): Same.
(if_convertible_loop_p_1): Removed use of refs.  Remove initialization
of dr->aux, DR_WRITTEN_AT_LEAST_ONCE, and DR_RW_UNCONDITIONALLY.
(insert_address_of): New.
(create_scratchpad): New.
(create_indirect_cond_expr): New.
(predicate_mem_writes): Call create_indirect_cond_expr.  Take an extra
parameter for scratch_pad.
(combine_blocks): Same.
(tree_if_conversion): Same.

testsuite/
* g++.dg/tree-ssa/ifc-pr46029.C: New.
* gcc.dg/tree-ssa/ifc-5.c: Make it exactly like the FFmpeg kernel.
* gcc.dg/tree-ssa/ifc-8.c: New.
* gcc.dg/tree-ssa/ifc-9.c: New.
* gcc.dg/tree-ssa/ifc-10.c: New.
* gcc.dg/tree-ssa/ifc-11.c: New.
* gcc.dg/tree-ssa/ifc-12.c: New.
* gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c: Disabled.
* gcc.dg/vect/if-cvt-stores-vect-ifcvt-19.c: New.
---
  gcc/ChangeLog  |  28 ++
  gcc/doc/invoke.texi|  18 +-
  gcc/testsuite/g++.dg/tree-ssa/ifc-pr46029.C|  76 
  gcc/testsuite/gcc.dg/tree-ssa/ifc-10.c |  17 +
  gcc/testsuite/gcc.dg/tree-ssa/ifc-11.c |  16 +
  gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c |  13 +
  gcc/testsuite/gcc.dg/tree-ssa/ifc-5.c  |  19 +-
  gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c  |  29 ++
  gcc/testsuite/gcc.dg/tree-ssa/ifc-9.c  |  17 +
  .../gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c  |  10 +-
  .../gcc.dg/vect/if-cvt-stores-vect-ifcvt-19.c  |  46 +++
  gcc/tree-data-ref.c 

Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-24 Thread Jeff Law

On 06/22/2015 10:27 AM, Alan Lawrence wrote:



My main thought concerns the direction we are travelling here. A major
reason why we do if-conversion is to enable vectorization.  Is this targeted
at gathering/scattering loads?  Following vectorization,
different elements of the vector being loaded/stored may have to go
to/from the scratchpad or to/from main memory.

Or, are we aiming at the case where the predicate or address are
invariant? That seems unlikely - loop unswitching would be better for
the predicate; loading from an address, we'd just peel and hoist;
storing, this'd result in the address holding the last value written, at
exit from the loop, a curious idiom. Where the predicate/address is
invariant across the vector? (!)

Or, are we aiming at non-vectorized code?
I think we're aiming at correctness issues, particularly WRT not 
allowing the optimizers to introduce new data races for C11/C++11.





Re. creation of scratchpads:
(1) Should the '64' byte size be the result of scanning the
function, for the largest data size to which we store? (ideally,
conditionally store!)
I suspect most functions have conditional stores, but far fewer have 
conditional stores that we'd like to if-convert.  ISTM that if we can 
lazily allocate the scratchpad that'd be best.   If this were an RTL 
pass, then I'd say query the backend for the widest mode store insn and 
use that to size the scratchpad.  We may have something similar we can 
do in gimple without resorting to querying insn backend capabilities. 
Perhaps walking the in-scope addressable variables or somesuch.




(2) Allocating only once per function: if we had one scratchpad per
loop, it could/would live inside the test of gimple_build_call_internal
(IFN_LOOP_VECTORIZED,  Otherwise, if we if-convert one or more
loops in the function, but then fail to vectorize them, we'll leave the
scratchpad around for later phases to clean up. Is that OK?
If the scratchpad is local to a function, then I'd expect we'd clean it 
up just like any other unused local.  Once it's a global, then all bets 
are off.


Anyway, I probably should just look at the patch before making more 
comments.


jeff



Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-24 Thread Jeff Law

On 06/24/2015 10:50 AM, Ramana Radhakrishnan wrote:



On 12/06/15 21:50, Abe Skolnik wrote:

Hi everybody!

In the current implementation of if conversion, loads and stores are
if-converted in a thread-unsafe way:

   * loads were always executed, even when they should have not been.
 Some source code could be rendered invalid due to null pointers
 that were OK in the original program because they were never
 dereferenced.

   * writes were if-converted via load/maybe-modify/store, which
 renders some code multithreading-unsafe.

This patch reimplements if-conversion of loads and stores in a safe
way using a scratchpad allocated by the compiler on the stack:

   * loads are done through an indirection, reading either the correct
 data from the correct source [if the condition is true] or reading
 from the scratchpad and later ignoring this read result [if the
 condition is false].

   * writes are also done through an indirection, writing either to the
 correct destination [if the condition is true] or to the
 scratchpad [if the condition is false].

Vectorization of if-cvt-stores-vect-ifcvt-18.c disabled because the
old if-conversion resulted in unsafe code that could fail under
multithreading even though the as-written code _was_ thread-safe.

Passed regression testing and bootstrap on amd64-linux.
Is this OK to commit to trunk?


I can't approve or reject but this certainly looks like an improvement
compared to where we are as we get rid of the data races.
Right.  I was going to assume the primary purpose was to address 
correctness issues, not increase the amount of if-conversion for 
optimization purposes.


I have a couple of high level concerns around the scratchpad usage 
(aliasing, write-write hazards), but until I dig into the patch I don't 
know if they're real issues or not.



Jeff


Re: fix PR46029: reimplement if conversion of loads and stores

2015-06-22 Thread Alan Lawrence

Abe Skolnik wrote:

Hi everybody!

In the current implementation of if conversion, loads and stores are
if-converted in a thread-unsafe way:

  * loads were always executed, even when they should have not been.
Some source code could be rendered invalid due to null pointers
that were OK in the original program because they were never
dereferenced.

  * writes were if-converted via load/maybe-modify/store, which
renders some code multithreading-unsafe.

This patch reimplements if-conversion of loads and stores in a safe
way using a scratchpad allocated by the compiler on the stack:

  * loads are done through an indirection, reading either the correct
data from the correct source [if the condition is true] or reading
from the scratchpad and later ignoring this read result [if the
condition is false].

  * writes are also done through an indirection, writing either to the
correct destination [if the condition is true] or to the
scratchpad [if the condition is false].

Vectorization of if-cvt-stores-vect-ifcvt-18.c disabled because the
old if-conversion resulted in unsafe code that could fail under
multithreading even though the as-written code _was_ thread-safe.

Passed regression testing and bootstrap on amd64-linux.
Is this OK to commit to trunk?

Regards,

Abe



Thanks for getting back to this!

My main thought concerns the direction we are travelling here. A major reason 
why we do if-conversion is to enable vectorization. Is this is targetted at 
gathering/scattering loads? Following vectorization, different elements of the 
vector being loaded/stored may have to go to/from the scratchpad or to/from main 
memory.


Or, are we aiming at the case where the predicate or address are invariant? That 
seems unlikely - loop unswitching would be better for the predicate; loading 
from an address, we'd just peel and hoist; storing, this'd result in the address 
holding the last value written, at exit from the loop, a curious idiom. Where 
the predicate/address is invariant across the vector? (!)


Or, are we aiming at non-vectorized code?

Beyond that question...

Does the description for -ftree-loop-if-convert-stores in doc/invoke.texi 
describe what the flag now does? (It doesn't mention loads; the code doesn't 
look like we use scratchpads at all without -ftree-loop-if-convert-stores, or am 
I missing something?)


In tree-if-conv.c:
@@ -883,7 +733,7 @@ if_convertible_gimple_assign_stmt_p (gimple stmt,

   if (flag_tree_loop_if_convert_stores)
 {
-  if (ifcvt_could_trap_p (stmt, refs))
+  if (ifcvt_could_trap_p (stmt))
{
  if (ifcvt_can_use_mask_load_store (stmt))
{

and

+
+  if (has_non_addressable_refs (stmt))
+   {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "has non-addressable memory references\n");
+ return false;
+   }
+

if it doesn't trap, but has_non_addressable_refs, can't we use 
ifcvt_can_use_mask_load_store there too?


And/or, I think I may be misunderstanding here, but if an access could trap, but 
is addressable, can't we use the scratchpad technique to get round the trapping 
problem?


(Look at it another way - this patch makes strictly more things return true from 
ifcvt_could_trap_p, which always exits immediately from 
if_convertible_gimple_assign_stmt_p...?)



Re. creation of scratchpads:
   (1) Should the '64' byte size be the result of scanning the function, for 
the largest data size to which we store? (ideally, conditionally store!)
   (2) Allocating only once per function: if we had one scratchpad per loop, it 
could/would live inside the test of gimple_build_call_internal 
(IFN_LOOP_VECTORIZED,  Otherwise, if we if-convert one or more loops in the 
function, but then fail to vectorize them, we'll leave the scratchpad around for 
later phases to clean up. Is that OK?



Also some style nits:

@@ -1342,7 +1190,7 @@ if_convertible_loop_p_1 (struct loop *loop,
   /* Check the if-convertibility of statements in predicated BBs.  */
   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (itr))
- if (!if_convertible_stmt_p (gsi_stmt (itr), *refs,
+ if (!if_convertible_stmt_p (gsi_stmt (itr),
  any_mask_load_store))
return false;
 }

bet that fits on one line now.

+ * Returns a memory reference to the pointer defined by the
+conditional expression: pointer = cond ? A[i] : scratch_pad; and
+   inserts this code at GSI.  */
+
+static tree
+create_indirect_cond_expr (tree ai, tree cond, tree *scratch_pad,
+  gimple_stmt_iterator *gsi, bool swap)

In the comment, should A[i] just be AI?  I see nothing in
create_indirect_cond_expr that requires ai to be an array dereference.


@@ -2063,12 +1998,14 @@ mask_exists (int size, vec<int> vec)
| end_bb_1
|
| bb_2
+   |   cond =