Re: C++0x Memory model and gcc

2010-05-17 Thread Michael Matz
Hi,

On Wed, 12 May 2010, Andrew MacLeod wrote:

 Well, you get the same thing you get today.  Any synchronization done 
 via a function call will tend to be correct since we never move shared 
 memory operations across calls.  Depending on your application, the 
 types of data races the options deal with may not be an issue.  Using 
 the options will eliminate having to think whether they are issues or 
 not at a (hopefully) small cost.
 
 Since the atomic operations are being built into the compiler, the 
 intent is to eventually optimize and inline them for speed... and in the 
 best case, simply result in a load or store. That's further work of 
 course, but these options are laying some of the groundwork.

Are you and the other proponents of that memory model seriously proposing 
it as an alternative to explicit locking via atomic builtins (that map to 
some form of atomic instructions)?


Ciao,
Michael.


Re: C++0x Memory model and gcc

2010-05-17 Thread Ian Lance Taylor
Michael Matz m...@suse.de writes:

 On Wed, 12 May 2010, Andrew MacLeod wrote:

 Well, you get the same thing you get today.  Any synchronization done 
 via a function call will tend to be correct since we never move shared 
 memory operations across calls.  Depending on your application, the 
 types of data races the options deal with may not be an issue.  Using 
 the options will eliminate having to think whether they are issues or 
 not at a (hopefully) small cost.
 
 Since the atomic operations are being built into the compiler, the 
 intent is to eventually optimize and inline them for speed... and in the 
 best case, simply result in a load or store. That's further work of 
 course, but these options are laying some of the groundwork.

 Are you and the other proponents of that memory model seriously proposing 
 it as an alternative to explicit locking via atomic builtins (that map to 
 some form of atomic instructions)?

I'm not sure what you mean here.  Do you have an alternative way to
implement the C++0x proposed standard?  Or are you questioning the
approach taken by the standard?

Ian


Re: C++0x Memory model and gcc

2010-05-17 Thread Andrew MacLeod

Michael Matz wrote:

Hi,

On Wed, 12 May 2010, Andrew MacLeod wrote:

  
Well, you get the same thing you get today.  Any synchronization done 
via a function call will tend to be correct since we never move shared 
memory operations across calls.  Depending on your application, the 
types of data races the options deal with may not be an issue.  Using 
the options will eliminate having to think whether they are issues or 
not at a (hopefully) small cost.


Since the atomic operations are being built into the compiler, the 
intent is to eventually optimize and inline them for speed... and in the 
best case, simply result in a load or store. That's further work of 
course, but these options are laying some of the groundwork.



Are you and the other proponents of that memory model seriously proposing 
it as an alternative to explicit locking via atomic builtins (that map to 
some form of atomic instructions)?


  

Proposing what as an alternative?

These optimization restrictions defined by the memory model are there to 
create predictable memory behaviour across threads. This is applicable 
when you use the atomic built-ins for locking.  Especially in the case 
when the atomic operation is inlined.  One goal is to have unoptimized 
program behaviour be consistent with  the optimized version.  If the 
optimizers introduce new data races, there is a potential behaviour 
difference.


Lock-free data structures, which utilize the atomic built-ins but do not 
require explicit locking, are potential applications built on top of that.


Andrew



Re: C++0x Memory model and gcc

2010-05-17 Thread Michael Matz
Hi,

On Mon, 17 May 2010, Ian Lance Taylor wrote:

  Since the atomic operations are being built into the compiler, the 
  intent is to eventually optimize and inline them for speed... and in 
  the best case, simply result in a load or store. That's further work 
  of course, but these options are laying some of the groundwork.
 
  Are you and the other proponents of that memory model seriously 
  proposing it as an alternative to explicit locking via atomic builtins 
  (that map to some form of atomic instructions)?
 
 I'm not sure what you mean here.  Do you have an alternative way to
 implement the C++0x proposed standard?

I actually see no way to implement the proposed memory model on common 
hardware, except by emitting locked instructions and memory barriers for 
all memory accesses to potentially shared (and hence all non-stack) data.  
And even then it only works on a subset of types, namely those for which 
the hardware provides such instructions with the associated guarantees.

 Or are you questioning the approach taken by the standard?

I do, yes.


Ciao,
Michael.


Re: C++0x Memory model and gcc

2010-05-17 Thread Michael Matz
Hi,

On Mon, 17 May 2010, Andrew MacLeod wrote:

   Well, you get the same thing you get today.  Any synchronization 
   done via a function call will tend to be correct since we never move 
   shared memory operations across calls.  Depending on your 
   application, the types of data races the options deal with may not 
   be an issue.  Using the options will eliminate having to think 
   whether they are issues or not at a (hopefully) small cost.
  
   Since the atomic operations are being built into the compiler, the 
   intent is to eventually optimize and inline them for speed... and in 
   the best case, simply result in a load or store. That's further work 
   of course, but these options are laying some of the groundwork.
   
 
  Are you and the other proponents of that memory model seriously 
  proposing it as an alternative to explicit locking via atomic builtins 
  (that map to some form of atomic instructions)?
 
 Proposing what as an alternative?

The guarantees you seem to want to establish by the proposed memory model.  
Possibly I misunderstood.

I'm not 100% sure on the guarantees you want to establish.  The proposed 
model seems to merge multiple concepts together, all related to 
memory access ordering and atomicity, but with different scope and 
difficulty to guarantee.

The mail to which I reacted seemed to me to imply that you would believe 
the guarantees from the memory model alone would relieve users from 
writing explicit atomic instructions for data synchronization.

If you didn't imply that, then I'm also interested to learn what other 
advantages you expect to derive from the guarantees.

And third, I'm interested to learn how you intend to actually guarantee 
the guarantees given by the model.

So, in short, I'd like to know
  What (guarantees are established),
  Why (are those sensible and useful), and
  How (are those intended to be implemented)

I've tried to find this in the Wiki, but it seems to only state some broad 
goals (not introducing data races).  I also find the papers of Boehm 
somewhat lacking when it comes to how to actually implement the whole 
model on hardware, especially because he himself acknowledges the obvious 
problems on real hardware, like:
  * load/store reorder buffers,
  * store-load forwarding,
  * cache-line granularity even for strict coherency models,
  * existence of weak coherency machines (have to acquire the whole cache
line for exclusive write),
  * general slowness of locked (or atomic) instructions compared to normal
stores/loads,
  * existence of store granularity on some hardware (we don't even have to 
enter the bit-field business; alpha e.g. has only 64-bit accesses)

But for all of these to be relevant questions we first need to know what 
exactly are the intended guarantees of that model; say from the 
perspective of observable behaviour from other threads.

 These optimization restrictions defined by the memory model are there to 
 create predictable memory behaviour across threads.

With or without use of atomics?  I.e., is the memory behaviour supposed to 
be predictable also in the absence of all mentions of explicitly written 
atomic builtins?  And you need to define predictable.  Predictable == 
behaves according to rules.  What are the rules?

 This is applicable when you use the atomic built-ins for locking.  
 Especially in the case when the atomic operation is inlined.  One goal 
 is to have unoptimized program behaviour be consistent with the 
 optimized version.

We have that now (because atomics are memory barriers), so that's probably 
not why the model was devised.


Ciao,
Michael.


Re: C++0x Memory model and gcc

2010-05-17 Thread Andrew MacLeod

Michael Matz wrote:

Hi,

On Mon, 17 May 2010, Andrew MacLeod wrote:
  
The guarantees you seem to want to establish by the proposed memory model.  
Possibly I misunderstood.


I'm not 100% sure on the guarantees you want to establish.  The proposed 
model seems to merge multiple concepts together, all related to 
memory access ordering and atomicity, but with different scope and 
difficulty to guarantee.
  


I think the standard is excessively confusing and overly academic. I 
even find the term memory model adds to the confusion.  Some effort was 
clearly involved in defining behaviour for hardware which does not yet 
exist but which the language is prepared for.  I was particularly unhappy 
that they merged the whole synchronization thing into an atomic load or 
store, at least originally. I would hazard a guess that it evolved to 
this state based on an observation that synchronization is almost 
inevitably required when an atomic is being accessed. That's just a 
guess, however.


However, there is some fundamental goodness in it once you sort through it.

Let's see if I can paraphrase normal uses and map them to the standard :-)

The normal case would be when you have a system wide lock, and when you 
acquire the lock, you expect everything which occurred before the lock 
to be completed.

i.e.

  process1:  otherglob = 2;  global = 10;  set atomic_lock(1);
  process2:  wait (atomic_lock() == 1);  print (global);

you expect 'global' in process 2 to always be 10. You are in effect 
using the lock as a ready flag for global.


In order for that to happen in a consistent manner, there is more 
involved than just waiting for the lock.  If process 1 and 2 are running 
on different machines, process 1 will have to flush its cache all the 
way to memory, and process 2 will have to wait for that to complete and 
be visible before it can proceed with allowing the proper value of global 
to be loaded.  Otherwise the results will not be as expected.


That's the synchronization model which maps to the default, or 
'sequentially consistent', C++ model.  The cache flushing and whatever 
else is required is built into the library routines for performing 
atomic loads and stores. There is no mechanism to specify that this lock 
is for the value of 'global', so the standard extends the definition of 
the lock to say it applies to *all* shared memory before the atomic lock 
value is set.  So

  process3:  wait (atomic_lock() == 1);  print (otherglob);

will also work properly.  This memory model will always involve some 
form of synchronization instructions, and potentially waiting on other 
hardware to complete. I don't know much about this, but I'm told 
machines are starting to provide instructions to accomplish this type of 
synchronization. The obvious conclusion is that once the hardware starts 
to be able to do this synchronization with a few instructions, the 
entire library call to set or read an atomic and perform 
synchronization may be inlinable without having a call of any kind, 
just straight-line instructions.  At this point, the optimizer will need 
to understand that those instructions are barriers.


If you are using an atomic variable simply as a variable, and don't 
care about the synchronization aspects (i.e., you just want to always see 
a valid value for the variable), then that maps to the 'relaxed' mode.  
There may be some academic babble about certain provisions, but this is 
effectively what it boils down to. The relaxed mode is what you use when 
you don't care about all that memory flushing and just want to see the 
values of the atomic itself. So this is the fastest model, but don't 
depend on the values of other shared variables.  This is also what you 
get when you use the basic atomic store and load macros in C.


The sequential mode has the possibility of being VERY slow if you have a 
widely distributed system. That's where the third mode comes in, the 
release/acquire model.  Proper utilization of it can remove many of the 
waits present in the sequential model, since different processes don't 
have to wait for *all* cache flushes, just ones directly related to a 
specific atomic variable in a specific other process. The model is 
provided to allow code to run more efficiently, but requires a better 
understanding of the subtleties of multi-processor side effects in the 
code you write.  I still don't really get it completely, but I'm not 
implementing the synchronization parts, so I only need to understand 
some of it :-)  It is possible to optimize these operations, i.e. you can 
do CSE and dead store elimination, which can also help the code run 
faster. That comes later though.


The optimization flags I'm currently working on are orthogonal to all 
this, even though it uses the term memory-model.  When a program is 
written for multi-processing, the programmer usually attempts to write it 
such that there are no data races; otherwise there may be 
inconsistencies during execution.  If a program 

Re: C++0x Memory model and gcc

2010-05-17 Thread Ian Lance Taylor
Michael Matz m...@suse.de writes:

 On Mon, 17 May 2010, Ian Lance Taylor wrote:

  Since the atomic operations are being built into the compiler, the 
  intent is to eventually optimize and inline them for speed... and in 
  the best case, simply result in a load or store. That's further work 
  of course, but these options are laying some of the groundwork.
 
  Are you and the other proponents of that memory model seriously 
  proposing it as an alternative to explicit locking via atomic builtins 
  (that map to some form of atomic instructions)?
 
 I'm not sure what you mean here.  Do you have an alternative way to
 implement the C++0x proposed standard?

 I actually see no way to implement the proposed memory model on common 
 hardware, except by emitting locked instructions and memory barriers for 
 all memory accesses to potentially shared (and hence all non-stack) data.  
 And even then it only works on a subset of types, namely those for which 
 the hardware provides such instructions with the associated guarantees.

I'm sure the C++ standards committee would like to hear a case for why
the proposal is unusable.  The standard has not yet been voted out.

As far as I understand the proposal, though, your statement turns out
not to be the case.  Those locked instructions and memory barriers are
only required for loads and stores to atomic types, not to all types.

Ian


Re: C++0x Memory model and gcc

2010-05-12 Thread Andrew MacLeod

Miles Bader wrote:

Andrew MacLeod amacl...@redhat.com writes:
  

-fmemory-model=single - Enable all data race introductions, as they
are today. (relax all 4 internal restrictions.)


One could still use this mode with a multi-threaded program as long as
explicit synchronization is done, right?
  

Right.  It's just a single-processor memory model, so it doesn't limit
any optimizations.



Hmm, though now that I think about it, I'm not exactly sure what I mean
by explicit synchronization.  Standard libraries (boost threads, the
upcoming std::thread) provide things like mutexes and
condition variables, but does using those guarantee that the right
things happen with any shared data-structures they're used to
coordinate...?

  


Well, you get the same thing you get today.  Any synchronization done 
via a function call will tend to be correct since we never move shared 
memory operations across calls.   Depending on your application, the 
types of data races the options deal with may not be an issue.   Using 
the options will eliminate having to think whether they are issues or 
not at a (hopefully) small cost.


Since the atomic operations are being built into the compiler,  the 
intent is to eventually optimize and inline them for speed... and in the 
best case, simply result in a load or store. That's further work of 
course, but these options are laying some of the groundwork.


Andrew


Re: C++0x Memory model and gcc

2010-05-11 Thread Miles Bader
Andrew MacLeod amacl...@redhat.com writes:
 -fmemory-model=single - Enable all data race introductions, as they
 are today. (relax all 4 internal restrictions.)

One could still use this mode with a multi-threaded program as long as
explicit synchronization is done, right?

-Miles

-- 
Road, n. A strip of land along which one may pass from where it is too
tiresome to be to where it is futile to go.



Re: C++0x Memory model and gcc

2010-05-11 Thread Andrew MacLeod

Miles Bader wrote:

Andrew MacLeod amacl...@redhat.com writes:
  

-fmemory-model=single - Enable all data race introductions, as they
are today. (relax all 4 internal restrictions.)



One could still use this mode with a multi-threaded program as long as
explicit synchronization is done, right?
  


Right.  It's just a single-processor memory model, so it doesn't limit 
any optimizations.


Andrew


Re: C++0x Memory model and gcc

2010-05-11 Thread Miles Bader
Andrew MacLeod amacl...@redhat.com writes:
 -fmemory-model=single - Enable all data race introductions, as they
 are today. (relax all 4 internal restrictions.)

 One could still use this mode with a multi-threaded program as long as
 explicit synchronization is done, right?

 Right.  It's just a single-processor memory model, so it doesn't limit
 any optimizations.

Hmm, though now that I think about it, I'm not exactly sure what I mean
by explicit synchronization.  Standard libraries (boost threads, the
upcoming std::thread) provide things like mutexes and
condition variables, but does using those guarantee that the right
things happen with any shared data-structures they're used to
coordinate...?

Thanks,

-Miles

-- 
Vote, v. The instrument and symbol of a freeman's power to make a fool of
himself and a wreck of his country.



Re: C++0x Memory model and gcc

2010-05-10 Thread Andrew MacLeod

On 05/10/2010 12:39 AM, Ian Lance Taylor wrote:

Albert Cohenalbert.co...@inria.fr  writes:
   


I agree. Or even, =c++0x or =gnu++0x

On the other hand, I fail to see the difference between =single and
=fast, and the explanation about the same memory word is not really
relevant, as memory models typically tell you about concurrent accesses
to different memory words.
 

What I was thinking is that the difference between =single and =fast
is that =single permits store speculation.  The difference between
=fast and =safe/=conformant is that =fast permits writing to a byte by
loading a word, changing the byte, and storing the word; in
particular, =fast permits write combining in cases where =safe does
not.

Memory models may not talk about memory words, but they exist
nevertheless.

Ian
   


I've changed the documentation and code to the --params suggestion and the 
following, for now.  We can work out the exact wording and other options 
later.


-fmemory-model=c++0x - Disable data races as per architectural 
requirements to match the standard.
-fmemory-model=safe - Disable all data race introductions. 
(enforce all 4 internal restrictions.)
-fmemory-model=single - Enable all data race introductions, as they 
are today. (relax all 4 internal restrictions.)


Andrew




Re: C++0x Memory model and gcc

2010-05-09 Thread Ian Lance Taylor
Albert Cohen albert.co...@inria.fr writes:

 Jean-Marc Bourguet wrote:
 -fmemory-model=single
 Assume single threaded execution, which also means no signal
 handlers.
 -fmemory-model=fast
 The user is responsible for all synchronization.  Accessing
 the same memory words from different threads may break
 unpredictably.
 -fmemory-model=safe
 The compiler will do its best to protect you.

 With that description, I'd think that safe lets the user code assume
 the sequential consistency model.  I'd use -fmemory-model=conformant or
 something like that for the model where the compiler assumes that the user
 code respects the constraints laid out for it by the standard.  As the
 constraints put on user code depend on the language -- Java has its
 own memory model which AFAIK is more constraining than C++'s, and I think
 Ada has its own, but my Ada programming days are too far back for me to
 comment on it -- one may prefer some other name.

 I agree. Or even, =c++0x or =gnu++0x

 On the other hand, I fail to see the difference between =single and
 =fast, and the explanation about the same memory word is not really
 relevant, as memory models typically tell you about concurrent accesses
 to different memory words.

What I was thinking is that the difference between =single and =fast
is that =single permits store speculation.  The difference between
=fast and =safe/=conformant is that =fast permits writing to a byte by
loading a word, changing the byte, and storing the word; in
particular, =fast permits write combining in cases where =safe does
not.

Memory models may not talk about memory words, but they exist
nevertheless.

Ian


Re: C++0x Memory model and gcc

2010-05-08 Thread Jean-Marc Bourguet

-fmemory-model=single
Assume single threaded execution, which also means no signal
handlers.
-fmemory-model=fast
The user is responsible for all synchronization.  Accessing
the same memory words from different threads may break
unpredictably.
-fmemory-model=safe
The compiler will do its best to protect you.


With that description, I'd think that safe lets the user code assume
the sequential consistency model.  I'd use -fmemory-model=conformant or
something like that for the model where the compiler assumes that the user
code respects the constraints laid out for it by the standard.  As the
constraints put on user code depend on the language -- Java has its
own memory model which AFAIK is more constraining than C++'s, and I think
Ada has its own, but my Ada programming days are too far back for me to
comment on it -- one may prefer some other name.

Yours,

--
Jean-Marc Bourguet



Re: C++0x Memory model and gcc

2010-05-08 Thread Albert Cohen

Jean-Marc Bourguet wrote:

-fmemory-model=single
Assume single threaded execution, which also means no signal
handlers.
-fmemory-model=fast
The user is responsible for all synchronization.  Accessing
the same memory words from different threads may break
unpredictably.
-fmemory-model=safe
The compiler will do its best to protect you.


With that description, I'd think that safe lets the user code assume
the sequential consistency model.  I'd use -fmemory-model=conformant or
something like that for the model where the compiler assumes that the user
code respects the constraints laid out for it by the standard.  As the
constraints put on user code depend on the language -- Java has its
own memory model which AFAIK is more constraining than C++'s, and I think
Ada has its own, but my Ada programming days are too far back for me to
comment on it -- one may prefer some other name.


I agree. Or even, =c++0x or =gnu++0x

On the other hand, I fail to see the difference between =single and =fast, 
and the explanation about the same memory word is not really relevant, 
as memory models typically tell you about concurrent accesses to 
different memory words.


Albert


Re: C++0x Memory model and gcc

2010-05-07 Thread Richard Guenther
On Thu, May 6, 2010 at 6:22 PM, Andrew MacLeod amacl...@redhat.com wrote:
 Richard Guenther wrote:

 On Thu, May 6, 2010 at 5:50 PM, Richard Guenther
 richard.guent...@gmail.com wrote:


 First let me say that the C++ memory model is crap when it
 forces data-races to be avoided for unannotated data like
 the examples for packed data.


 And it isn't consistent across the board, since neighbouring bits normally
 don't qualify and can introduce data races. I don't like it when a solution
 has exceptions like that. It is what it is however, and last I heard the
 plan was for C to adopt the changes as well.

I would have hoped that only data races between independent
objects are covered, thus

 tmp = a.i;
 b.j = tmp;

would qualify as a load of a and a store to b as far as dependencies
are concerned.  That would have been consistent with the
exceptions for bitfields and much more friendly to architectures
with weak support for unaligned accesses.


 Well, I hope that instead of just disabling optimizations you
 will help to improve their implementation to be able to optimize
 in a conformant manner.


 I don't want to disable any more than required. SSA names aren't affected
 since they are local variables only; it's only operations on shared memory,
 and I am hopeful that I can minimize the restrictions placed on them.  Some
 will be more interesting than others... like CSE... you can still perform
 CSE on a global as long as you don't introduce a NEW load on some execution
 path that didn't have one before. What fun.

I don't understand that restriction anyway - how can an extra
load cause a data-race if the result is only used when it was
used before?  (You'd need to disable PPRE and GCSE completely
if that's really a problem)

Thus,

if (p)
  tmp = load;
...
if (q)
  use tmp;

how can transforming that to

tmp = load;
...
if (q)
  use tmp;

ever cause a problem?

 And btw, if you are thinking on how to represent the extra
 data-dependencies required for the consistency models think
 of how to extend whatever you need in infrastructure for that
 to also allow FENV dependencies - it's a quite similar problem
 (FENV query/set are the atomic operations, usual arithmetic
 is what the dependency is to).  It's completely non-trivial
 (because it's scalar code, not memory accesses).  For
 atomics you should be able to just massage the alias-oracle
 data-dependence routines (maybe).


 That's what I'm hoping actually..

We'll see.

Richard.

 Andrew.



Re: C++0x Memory model and gcc

2010-05-07 Thread Andrew MacLeod

Richard Guenther wrote:

On Thu, May 6, 2010 at 6:22 PM, Andrew MacLeod amacl...@redhat.com wrote:
  

Richard Guenther wrote:



I would have hoped that only data races between independent
objects are covered, thus

 tmp = a.i;
 b.j = tmp;

would qualify as a load of a and a store to b as far as dependencies
are concerned.  That would have been consistent with the
exceptions for bitfields and much more friendly to architectures
with weak support for unaligned accesses.
  

They are independent as far as dependencies within this compilation unit.
The problem is if thread number 2 is performing
 a.j = val
 b.i = val2

now there are data races on both A and B if we load/store full words and 
the struct was something like:

struct {
  char i;
  char j;
} a, b;

The store to B is particularly unpleasant since you may lose one of the 
2 stores.  The load data race on A is only in the territory of hardware 
or software race detectors.





I don't want to disable any more than required. SSA names aren't affected
since they are local variables only; it's only operations on shared memory,
and I am hopeful that I can minimize the restrictions placed on them.  Some
will be more interesting than others... like CSE... you can still perform
CSE on a global as long as you don't introduce a NEW load on some execution
path that didn't have one before. What fun.



I don't understand that restriction anyway - how can an extra
load cause a data-race if the result is only used when it was
used before?  (You'd need to disable PPRE and GCSE completely
if that's really a problem)

Thus,

if (p)
  tmp = load;
...
if (q)
  use tmp;

how can transforming that to

tmp = load;
...
if (q)
  use tmp;

ever cause a problem?
  


If the other thread is doing something like:

if (!p)
 load = something

then there was no data race before, since your thread wasn't performing a 
load when this thread was storing.  I.e., 'p' was being used as the 
synchronization guard that prevented a data race.  When you do the 
transformation, there is now a potential race that wasn't there before.


Some hardware can detect that and trigger an exception, or a software 
data race detector could trigger, and neither would have before, which 
means the behaviour is detectably different.


That's also why I've separated the loads and stores for handling 
separately.  Under normal circumstances, we want to allow this 
transformation.  If there aren't any detection abilities in play, then 
the transformation is fine... you can't tell that there was a race.  
With stores you can actually get different results, so we do need to 
monitor those.


Andrew





Re: C++0x Memory model and gcc

2010-05-07 Thread Ian Lance Taylor
Andrew MacLeod amacl...@redhat.com writes:

 They are independent as far as dependencies within this compilation unit.
 The problem is if thread number 2 is performing
  a.j = val
  b.i = val2

 now there are data races on both A and B if we load/store full words
 and the struct was something like:

 struct {
   char i;
   char j;
 } a, b;

 The store to B is particularly unpleasant since you may lose one of
 the 2 stores.  The load data race on A is only in the territory of
 hardware or software race detectors.

In this example, if we do a word access to a, then we are running past
the boundaries of the struct.  We can only assume that is OK if a is
aligned to a word boundary.  And if both a and b are aligned to word
boundaries, then there is no problem doing a word access to a.

So the only potential problem here is if we have two small variables
where one is aligned and the other is not.  This is an unusual
situation because small variables are not normally aligned.  We can
avoid trouble by forcing an alignment to a word boundary after every
aligned variable.

Or so it seems to me.

Ian


[Fwd: Re: C++0x Memory model and gcc]

2010-05-07 Thread Andrew MacLeod

Oops, didn't reply all...


 Original Message 
Subject:Re: C++0x Memory model and gcc
Date:   Fri, 07 May 2010 10:37:40 -0400
From:   Andrew MacLeod amacl...@redhat.com
To: Ian Lance Taylor i...@google.com
References: 	4be2e39a.5060...@redhat.com 
n2j84fc9c001005060850z942976d9n2b8431ff66cf9...@mail.gmail.com 
k2o84fc9c001005060910w30a6a9a6r4a89d26ab716d...@mail.gmail.com 
4be2ecdb.2040...@redhat.com 
m2n84fc9c001005070214u4ef0af75me6815cae6ef70...@mail.gmail.com 
4be41552.4000...@redhat.com 
mcr632zvld8@dhcp-172-17-9-151.mtv.corp.google.com




Ian Lance Taylor wrote:

Andrew MacLeod amacl...@redhat.com writes:

  

They are independent as far as dependencies within this compilation unit.
The problem is if thread number 2 is performing
 a.j = val
 b.i = val2

now there are data races on both A and B if we load/store full words
and the struct was something like:

struct {
  char i;
  char j;
} a, b;

The store to B is particularly unpleasant since you may lose one of
the 2 stores.  The load data race on A is only in the territory of
hardware or software race detectors.



In this example, if we do a word access to a, then we are running past
  


Yes, well that's not what I was getting at.  Add a short to the struct 
to pad it out to a word, or access the variables via a short load and 
store... and do the appropriate masking.  I was just trying not to 
make the example any bigger than need be :-P



So the only potential problem here is if we have two small variables
where one is aligned and the other is not.  This is an unusual
situation because small variables are not normally aligned.  We can
avoid trouble by forcing an alignment to a word boundary after every
aligned variable.

Or so it seems to me.
  


The problem is when you load or store a larger unit than the actual 
object you are loading or storing.  It's not specifically an alignment 
thing.


I don't know how many architectures still do this, but we need to 
disable those wider accesses on the ones that do.


Andrew





Re: C++0x Memory model and gcc

2010-05-06 Thread Richard Guenther
On Thu, May 6, 2010 at 5:43 PM, Andrew MacLeod amacl...@redhat.com wrote:
 I've been working for a while on understanding how the new memory model and
 Atomics work, and what the impacts are on GCC.

 It would be ideal to get as many of these changes into GCC 4.6 as possible.
 I've started work on some of the modifications and testing,  and the overall
 impact on GCC shouldn't be *too* bad :-)

 The plan is to localize the changes as much as possible, and any intrusive
 bits like optimization changes will be controlled by a flag enabling us to
 keep the current behaviour when we want it.

 I've put together a document summarizing how the memory model works, and how
 I propose to make the changes. I've converted it to wiki pages.  Maybe no
 one will laugh at my choice of document format this time :-)

 The document is linked off the Atomics wiki page, or directly  here:
  http://gcc.gnu.org/wiki/Atomic/GCCMM

It consists mainly of describing the 2 primary aspects of the memory model
which affect us
 - Optimization changes to avoid introducing new data races
 - Implementation of atomic variables and synchronization modes
 as well as a new infrastructure to test these types of things.

 I'm sure I've screwed something up while doing it, and I will proofread it
 later today again and tweak it further.

 Please point out anything that isn't clear,  or is downright wrong.
Especially in the testing methodology, since it's all new stuff.
 Suggestions for improvements on any of the plan are welcome as well.

First let me say that the C++ memory model is crap when it
forces data-races to be avoided for unannotated data like
the examples for packed data.

Well, I hope that instead of just disabling optimizations you
will help to improve their implementation to be able to optimize
in a conformant manner.

Richard.

 Andrew






Re: C++0x Memory model and gcc

2010-05-06 Thread Joseph S. Myers
On Thu, 6 May 2010, Andrew MacLeod wrote:

 - Implementation of atomic variables and synchronization modes
 as well as a new infrastructure to test these types of things.

I presume you've read the long thread starting at 
http://gcc.gnu.org/ml/gcc/2009-08/msg00199.html regarding the issues 
involved in implementing the atomics (involving compiler and libc 
cooperation to provide stdatomic.h), and in particular ensuring that code 
built for one CPU remains safe on later CPU variants that may have more 
native atomic operations.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: C++0x Memory model and gcc

2010-05-06 Thread Richard Guenther
On Thu, May 6, 2010 at 5:50 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, May 6, 2010 at 5:43 PM, Andrew MacLeod amacl...@redhat.com wrote:
 I've been working for a while on understanding how the new memory model and
 Atomics work, and what the impacts are on GCC.

 It would be ideal to get as many of these changes into GCC 4.6 as possible.
 I've started work on some of the modifications and testing,  and the overall
 impact on GCC shouldn't be *too* bad :-)

 The plan is to localize the changes as much as possible, and any intrusive
 bits like optimization changes will be controlled by a flag enabling us to
 keep the current behaviour when we want it.

 I've put together a document summarizing how the memory model works, and how
 I propose to make the changes. I've converted it to wiki pages.  Maybe no
 one will laugh at my choice of document format this time :-)

 The document is linked off the Atomics wiki page, or directly  here:
  http://gcc.gnu.org/wiki/Atomic/GCCMM

 It consists mainly of describing the 2 primary aspects of the memory model
 which affect us
 - Optimization changes to avoid introducing new data races
 - Implementation of atomic variables and synchronization modes
 as well as a new infrastructure to test these types of things.

 I'm sure I've screwed something up while doing it, and I will proofread it
 later today again and tweak it further.

 Please point out anything that isn't clear,  or is downright wrong.
 Especially in the testing methodology, since it's all new stuff.
 Suggestions for improvements on any of the plan are welcome as well.

 First let me say that the C++ memory model is crap when it
 forces data-races to be avoided for unannotated data like
 the examples for packed data.

 Well, I hope that instead of just disabling optimizations you
 will help to improve their implementation to be able to optimize
 in a conformant manner.

And btw, if you are thinking about how to represent the extra
data-dependencies required for the consistency models, think
of how to extend whatever infrastructure you need for that
to also allow FENV dependencies - it's a quite similar problem
(FENV query/set are the atomic operations, usual arithmetic
is what the dependency is to).  It's completely non-trivial
(because it's scalar code, not memory accesses).  For
atomics you should be able to just massage the alias-oracle
data-dependence routines (maybe).

Richard.


Re: C++0x Memory model and gcc

2010-05-06 Thread Andrew MacLeod

Joseph S. Myers wrote:

On Thu, 6 May 2010, Andrew MacLeod wrote:

  

- Implementation of atomic variables and synchronization modes
as well as a new infrastructure to test these types of things.



I presume you've read the long thread starting at 
http://gcc.gnu.org/ml/gcc/2009-08/msg00199.html regarding the issues 
involved in implementing the atomics (involving compiler and libc 
cooperation to provide stdatomic.h), and in particular ensuring that code 
built for one CPU remains safe on later CPU variants that may have more 
native atomic operations.


  
I'm not actually doing the implementation of the atomics themselves right 
now; Lawrence is looking at that.  I'm focusing on the GCC optimization 
requirements, changes, and testing.  I'll leave issues like the ones you 
point out there to the guys who like that stuff :-)


I couldn't understand a lot of the atomic synchronization stuff from 
the existing documentation, so I figured that part might help others 
understand it better too.  It still gives me a headache.


Andrew




Re: C++0x Memory model and gcc

2010-05-06 Thread Andrew MacLeod

Richard Guenther wrote:

On Thu, May 6, 2010 at 5:50 PM, Richard Guenther
richard.guent...@gmail.com wrote:
  

First let me say that the C++ memory model is crap when it
forces data-races to be avoided for unannotated data like
the examples for packed data.



And it isn't consistent across the board, since neighbouring bit-fields 
normally don't qualify and can introduce data races.  I don't like it 
when a solution has exceptions like that.  It is what it is, however, and 
last I heard the plan was for C to adopt the changes as well.

Well, I hope that instead of just disabling optimizations you
will help to improve their implementation to be able to optimize
in a conformant manner.

I don't want to disable any more than required.  SSA names aren't 
affected since they are local variables only; it's only operations on 
shared memory, and I am hopeful that I can minimize the restrictions 
placed on them.  Some will be more interesting than others... like 
CSE... you can still perform CSE on a global as long as you don't 
introduce a NEW load on some execution path that didn't have one before. 
What fun.


And btw, if you are thinking about how to represent the extra
data-dependencies required for the consistency models, think
of how to extend whatever infrastructure you need for that
to also allow FENV dependencies - it's a quite similar problem
(FENV query/set are the atomic operations, usual arithmetic
is what the dependency is to).  It's completely non-trivial
(because it's scalar code, not memory accesses).  For
atomics you should be able to just massage the alias-oracle
data-dependence routines (maybe).
  


That's what I'm hoping actually..

Andrew.


Re: C++0x Memory model and gcc

2010-05-06 Thread Ian Lance Taylor
Andrew MacLeod amacl...@redhat.com writes:

 I've been working for a while on understanding how the new memory
 model and Atomics work, and what the impacts are on GCC.

Thanks for looking at this.

One issue I didn't see clearly was how to actually implement this in
the compiler.  For example, speculated stores are fine for local stack
variables, but not for global variables or heap memory.  We can
implement that in the compiler via a set of tests at each potential
speculated store.  Or we can implement it via a constraint expressed
directly in the IR--perhaps some indicator that this specific store
may not merge with conditionals.  The latter approach is harder to
design but I suspect will be more likely to be reliable over time.
The former approach is straightforward to patch into the compiler but
can easily degrade as people who don't understand the issues work on
the code.

I don't agree with your proposed command line options.  They seem fine
for internal use, but I think very very few users would know when or
whether they should use -fno-data-race-stores.  I think you should
downgrade those options to a --param value, and think about a
multi-layered -fmemory-model option.  E.g.,
-fmemory-model=single
Assume single threaded execution, which also means no signal
handlers.
-fmemory-model=fast
The user is responsible for all synchronization.  Accessing
the same memory words from different threads may break
unpredictably.
-fmemory-model=safe
The compiler will do its best to protect you.

Ian


Re: C++0x Memory model and gcc

2010-05-06 Thread Andrew MacLeod

Ian Lance Taylor wrote:

Andrew MacLeod amacl...@redhat.com writes:

  

I've been working for a while on understanding how the new memory
model and Atomics work, and what the impacts are on GCC.



Thanks for looking at this.

One issue I didn't see clearly was how to actually implement this in
the compiler.  For example, speculated stores are fine for local stack
variables, but not for global variables or heap memory.  We can
implement that in the compiler via a set of tests at each potential
speculated store.  Or we can implement it via a constraint expressed
directly in the IR--perhaps some indicator that this specific store
may not merge with conditionals.  The latter approach is harder to
design but I suspect will be more likely to be reliable over time.
The former approach is straightforward to patch into the compiler but
can easily degrade as people who don't understand the issues work on
the code.
  


which is why the ability to regression test it is so important :-).  

Right now it's my intention to modify the optimizations based on the flag 
settings.  Some cases will be quite tricky.  If we're CSE'ing something 
in the absence of atomics, and it is shared memory, it is still possible 
to move it if there is already a load from that location on all paths.  
So the optimization itself will need to be taught how to figure that out.


ie

if ()
  a_1 = glob
else
  if ()
    b_2 = glob
  else
    c_3 = glob

we can still common glob and produce

tmp_4 = glob
if ()
  a_1 = tmp_4
else
  if ()
    b_2 = tmp_4
  else
    c_3 = tmp_4

all paths loaded glob before, so we can do this safely.

but if we had:

if ()
  a_1 = glob
else
  if ()
    b_2 = notglob
  else
    c_3 = glob

then we can no longer do anything, since we'd be introducing a new load 
of 'glob' on the path that sets b_2 which wasn't performed before.  If 
there were another load of glob somewhere before the first 'if', then 
commoning would become possible again.


Some other cases won't be nearly so tricky, thankfully :-). I do think 
we need to do it in the optimizations because of some of the complex 
situations which can arise. We can at least try to do a good job and 
then punt if it gets too hard.


Now, thankfully, on most architectures we care about, hardware detection 
of data-race loads isn't an issue.  So most of the time it's only the 
stores that we need to be careful about introducing.  I'm hoping the 
actual impact on codegen is low most of the time.





I don't agree with your proposed command line options.  They seem fine
for internal use, but I think very very few users would know when or
whether they should use -fno-data-race-stores.  I think you should
  


I'm fine with alternatives.  I'm focused mostly on the internals, and I 
want an individual flag for each of those things to cleanly separate 
them out.  How we expose it I'm ambivalent about, as long as testing can 
turn them on and off individually.

There will be people using software data race detectors who may want 
to be able to turn things on or off from the system default.  I think 
-fmemory-model= with options enabling, at a minimum, some form of 'off', 
'system default', and 'on' would probably work for external exposure.


Andrew