Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-22 Thread Torvald Riegel
On Thu, 2013-06-20 at 09:53 +0200, Paolo Bonzini wrote:
> Il 19/06/2013 22:25, Torvald Riegel ha scritto:
> > On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> >> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> >> but RMW shouldn't be that bad.  I don't care if reference counting is a
> >> little slower than it could be, for example);
> > 
> > I doubt relaxed RMW ops are sufficient even for reference counting.
> 
> They are enough on the increment side, or so says boost...
> 
> http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters

Oh, right, for this kind of refcounting it's okay on the increment side.
But I think the explanation on the page you referenced isn't correct:
"...passing an existing reference from one thread to another must
already provide any required synchronization." is not sufficient, because
that would just create a happens-before edge from the reference-passing
source to the thread that receives the reference.
The relaxed RMW increment works because the modification order is
consistent with happens-before (see 6.17 in the model), so the refcount
can never reach zero once we have incremented it, even with a relaxed
RMW.

IMO, the acquire fence in the release is not 100% correct according to
my understanding of the memory model:
if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
  boost::atomic_thread_fence(boost::memory_order_acquire);
  delete x;
}
"delete x" is unconditional, and I guess not specified to read all of
what x points to.  The acquire fence would only result in a
synchronizes-with edge if there is a reads-from edge between the release
store and a load that reads the stores value and is sequenced after the
acquire fence.
Thus, I think the compiler could be allowed to reorder the fence after
the delete in some case (e.g., when there's no destructor called or it
doesn't have any conditionals in it), but I guess it's likely to not
ever try to do that in practice.
Regarding the hardware fences that this maps, I suppose this just
happens to work fine on most architectures, perhaps just because
"delete" will access some of the memory when releasing the memory.

Changing the release to the following would be correct, and would
probably add little overhead:
if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
  if (x->refcount_.load(boost::memory_order_acquire) == 0)
    delete x;
}

That makes the delete conditional, so it has to happen after we have
established the happens-before edge that we need.

> >> By contrast, Java volatile semantics are easily converted to a sequence
> >> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> > 
> > The same holds for C11/C++11.  If you look at either the standard or the
> > Batty model, you'll see that for every pair like store(rel)--load(acq),
> > there is also store(rel)--fence(acq)+load(relaxed),
> > store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> > giving the same semantics.  Likewise for SC.
> 
> Do you have a pointer to that?  It would help.

In the full model (n3132.pdf), see 6.12 (which then references which
parts in the standard lead to those parts of the model).  SC fences are
also acquire and release fences, so this covers synchronizes-with via
reads-from too.  6.17 has more constraints on SC fences and modification
order, so we get something similar for the ordering of just writes.


Torvald




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-20 Thread Paul E. McKenney
On Wed, Jun 19, 2013 at 09:11:36AM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 18:53 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> > > On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > > > First, I am not a fan of SC, mostly because there don't seem to be many
> > > > (any?) production-quality algorithms that need SC.  But if you really
> > > > want to take a parallel-programming trip back to the 1980s, let's go!  
> > > > ;-)
> > > 
> > > Dekker-style mutual exclusion is useful for things like read-mostly
> > > multiple-reader single-writer locks, or similar "asymmetric" cases of
> > > synchronization.  SC fences are needed for this.
> > 
> > They definitely need Power hwsync rather than lwsync, but they need
> > fewer fences than would be emitted by slavishly following either of the
> > SC recipes for Power.  (Another example needing store-to-load ordering
> > is hazard pointers.)
> 
> The C++11 seq-cst fence expands to hwsync; combined with a relaxed
> store / load, that should be minimal.  Or are you saying that on Power,
> there is a weaker HW barrier available that still constrains store-load
> reordering sufficiently?

Your use of a seq-cst fence is a very good one for this example.
But most people I have talked to think of C++11 SC as being SC atomic
accesses, and SC atomics would get you a bunch of redundant fences
in this example -- some but not all of which could be easily optimized
away.

Thanx, Paul




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-20 Thread Paolo Bonzini
Il 19/06/2013 22:25, Torvald Riegel ha scritto:
> On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
>> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
>> but RMW shouldn't be that bad.  I don't care if reference counting is a
>> little slower than it could be, for example);
> 
> I doubt relaxed RMW ops are sufficient even for reference counting.

They are enough on the increment side, or so says boost...

http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters

>>[An aside: Java guarantees that volatile stores are not reordered
>>with volatile loads.  This is not guaranteed by just using release
>>stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].
>
> Or maybe Java volatile is acq for loads and seq_cst for stores...

Perhaps (but I'm not 100% sure).

>> As long as you only have a producer and a consumer, C11 is fine, because
>> all you need is load-acquire/store-release.  In fact, if it weren't for
>> the experience factor, C11 is easier than manually placing acquire and
>> release barriers.  But as soon as two or more threads are reading _and_
>> writing the shared memory, it gets complicated and I want to provide
>> something simple that people can use.  This is the reason for (2) above.
> 
> I can't quite follow you here.  There is a total order for all
> modifications to a single variable, and if you use acq/rel combined with
> loads and stores on this variable, then you basically can make use of
> the total order.  (All loads that read-from a certain store get a
> synchronized-with (and thus happens-before edge) with the store, and the
> stores are in a total order.)  This is independent of the number of
> readers and writers.  The difference starts once you want to sync with
> more than one variable, and need to establish an order between those
> accesses.

You're right, of course.  More specifically, the difficulty starts when
there is a thread where some variables are stored while others are loaded.

>> There will still be a few cases that need to be optimized, and here are
>> where the difficult requirements come:
>>
>> (R1) the primitives *should* not be alien to people who know Linux.
>>
>> (R2) those optimizations *must* be easy to do and review; at least as
>> easy as these things go.
>>
>> The two are obviously related.  Ease of review is why it is important to
>> make things familiar to people who know Linux.
>>
>> In C11, relaxing SC loads and stores is complicated, and more
>> specifically hard to explain!
> 
> I can't see why that would be harder than reasoning about equally weaker
> Java semantics.  But you obviously know your community, and I don't :)

Because Java semantics are "almost" SC, and as Paul mentioned the
difference doesn't matter in practice (IRIW/RWC is where it matters, WRC
works even on Power; see
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/ppc051.html#toc5, row
WRC+lwsyncs).  It hasn't ever mattered for Linux, at least.

>> By contrast, Java volatile semantics are easily converted to a sequence
>> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> 
> The same holds for C11/C++11.  If you look at either the standard or the
> Batty model, you'll see that for every pair like store(rel)--load(acq),
> there is also store(rel)--fence(acq)+load(relaxed),
> store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> giving the same semantics.  Likewise for SC.

Do you have a pointer to that?  It would help.

> You can also build Dekker with SC stores and acq loads, if I'm not
> mistaken.  Typically one would probably use SC fences and relaxed
> stores/loads.

Yes.

>>> I guess so.  But you also have to consider the legacy that you create.
>>> I do think the C11/C++11 model will be used widely, and more and more
>>> people will get used to it.
>>
>> I don't think many people will learn how to use the various non-seqcst
>> modes...  At least so far I punted. :)
> 
> But you already use similarly weaker orderings that the other
> abstractions provide (e.g., Java), so you're half-way there :)

True.  On the other hand you can treat Java like "kinda SC but don't
worry, you won't see the difference".  It is both worrisome and appealing...

Paolo



Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Torvald Riegel
On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> Il 19/06/2013 15:15, Torvald Riegel ha scritto:
> >> One reason is that implementing SC for POWER is quite expensive,
> > 
> > Sure, but you don't have to use SC fences or atomics if you don't want
> > them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> > to specify a memory order.  It's perfectly fine to use acquire fences or
> > release fences.  There shouldn't be just one kind of barrier/fence.
> 
> Agreed.  For example Linux uses four: consume (read_barrier_depends),
> acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
> stores are always relaxed, some RMW ops are SC but others are relaxed.
> 
> I want to do something similar in QEMU, with as few changes as possible.
>  In the end I settled for the following:
> 
> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> but RMW shouldn't be that bad.  I don't care if reference counting is a
> little slower than it could be, for example);

I doubt relaxed RMW ops are sufficient even for reference counting.
Typically, the reference counter is used conceptually similar to a lock,
so you need the acquire/release (modulo funny optimizations).  The only
use case that comes to my mind right now for relaxed RMW is really just
statistics counters or such, or cases where you can "re-use" another
barrier.

> (2) I'd like to have some kind of non-reordering load/store too, either
> SC (which I've improperly referred to as C11/C++11 in my previous email)
> or Java volatile.

Often you probably don't need more than acq/rel, as Paul pointed out.
SC becomes important once you do something like Dekker-style sync, so
cases where you sync via several separate variables to avoid the cache
misses in some common case.  Once you go through one variable in the
end, acq/rel should be fine.

>[An aside: Java guarantees that volatile stores are not reordered
>with volatile loads.  This is not guaranteed by just using release
>stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].
   Or maybe Java volatile is acq for loads and seq_cst for stores...
> 
> As long as you only have a producer and a consumer, C11 is fine, because
> all you need is load-acquire/store-release.  In fact, if it weren't for
> the experience factor, C11 is easier than manually placing acquire and
> release barriers.  But as soon as two or more threads are reading _and_
> writing the shared memory, it gets complicated and I want to provide
> something simple that people can use.  This is the reason for (2) above.

I can't quite follow you here.  There is a total order for all
modifications to a single variable, and if you use acq/rel combined with
loads and stores on this variable, then you basically can make use of
the total order.  (All loads that read-from a certain store get a
synchronized-with (and thus happens-before edge) with the store, and the
stores are in a total order.)  This is independent of the number of
readers and writers.  The difference starts once you want to sync with
more than one variable, and need to establish an order between those
accesses.

> There will still be a few cases that need to be optimized, and here are
> where the difficult requirements come:
> 
> (R1) the primitives *should* not be alien to people who know Linux.
> 
> (R2) those optimizations *must* be easy to do and review; at least as
> easy as these things go.
> 
> The two are obviously related.  Ease of review is why it is important to
> make things familiar to people who know Linux.
> 
> In C11, relaxing SC loads and stores is complicated, and more
> specifically hard to explain!

I can't see why that would be harder than reasoning about equally weaker
Java semantics.  But you obviously know your community, and I don't :)

> I cannot do that myself, and much less
> explain that to the community.  I cannot make them do that.
> Unfortunately, relaxing SC loads and stores is important on POWER which
> has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
> fails both requirements. :(
> 
> By contrast, Java volatile semantics are easily converted to a sequence
> of relaxed loads, relaxed stores, and acq/rel/sc fences.

The same holds for C11/C++11.  If you look at either the standard or the
Batty model, you'll see that for every pair like store(rel)--load(acq),
there is also store(rel)--fence(acq)+load(relaxed),
store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
giving the same semantics.  Likewise for SC.

> It's almost an
> algorithm; I tried to do that myself and succeeded, I could document it
> nicely.  Even better, there are authoritative sources that confirm my
> writing and should be accessible to people who did synchronization
> "stuff" in Linux (no formal models :)).  In this respect, Java satisfies
> both requirements.
> 
> And the loss is limited, since things such as Dekker's algorithm are
> rare in practice.  (In particular, RCU can be implemented just fine with
> Java volatile semantics, but load-acquire/store-release is not enough).

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Andrew Haley
On 06/19/2013 10:30 AM, Paolo Bonzini wrote:
> Il 18/06/2013 19:38, Andrew Haley ha scritto:
>>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>>> paragraph of
>>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>>> seems to suggest?
>> As far as I know, the Java semantics are acq/rel.  I can't see anything
>> there that suggests otherwise.  If we'd wanted to know for certain we
>> should have CC'd Doug Lea.
> 
> acq/rel wouldn't have a full store-load barrier between a volatile store
> and a volatile load.

Ahhh, okay.  I had to check the C++11 spec to see the difference.  I'm
so deep into the Java world that I hadn't noticed that C++11 was any
different.

Andrew.



Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Paolo Bonzini
Il 19/06/2013 15:15, Torvald Riegel ha scritto:
>> One reason is that implementing SC for POWER is quite expensive,
> 
> Sure, but you don't have to use SC fences or atomics if you don't want
> them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> to specify a memory order.  It's perfectly fine to use acquire fences or
> release fences.  There shouldn't be just one kind of barrier/fence.

Agreed.  For example Linux uses four: consume (read_barrier_depends),
acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
stores are always relaxed, some RMW ops are SC but others are relaxed.

I want to do something similar in QEMU, with as few changes as possible.
 In the end I settled for the following:

(1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
but RMW shouldn't be that bad.  I don't care if reference counting is a
little slower than it could be, for example);

(2) I'd like to have some kind of non-reordering load/store too, either
SC (which I've improperly referred to as C11/C++11 in my previous email)
or Java volatile.

   [An aside: Java guarantees that volatile stores are not reordered
   with volatile loads.  This is not guaranteed by just using release
   stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].

As long as you only have a producer and a consumer, C11 is fine, because
all you need is load-acquire/store-release.  In fact, if it weren't for
the experience factor, C11 is easier than manually placing acquire and
release barriers.  But as soon as two or more threads are reading _and_
writing the shared memory, it gets complicated and I want to provide
something simple that people can use.  This is the reason for (2) above.

There will still be a few cases that need to be optimized, and here are
where the difficult requirements come:

(R1) the primitives *should* not be alien to people who know Linux.

(R2) those optimizations *must* be easy to do and review; at least as
easy as these things go.

The two are obviously related.  Ease of review is why it is important to
make things familiar to people who know Linux.

In C11, relaxing SC loads and stores is complicated, and more
specifically hard to explain!  I cannot do that myself, and much less
explain that to the community.  I cannot make them do that.
Unfortunately, relaxing SC loads and stores is important on POWER which
has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
fails both requirements. :(

By contrast, Java volatile semantics are easily converted to a sequence
of relaxed loads, relaxed stores, and acq/rel/sc fences.  It's almost an
algorithm; I tried to do it myself and succeeded, and I could document it
nicely.  Even better, there are authoritative sources that confirm my
writing and should be accessible to people who did synchronization
"stuff" in Linux (no formal models :)).  In this respect, Java satisfies
both requirements.

And the loss is limited, since things such as Dekker's algorithm are
rare in practice.  (In particular, RCU can be implemented just fine with
Java volatile semantics, but load-acquire/store-release is not enough).

[Nothing really important after this point, I think].

> Note that there is a reason why C11/C++11 don't just have barriers
> combined with ordinary memory accesses: The compiler needs to be aware
> which accesses are sequential code (so it can assume that they are
> data-race-free) and which are potentially concurrent with other accesses
> to the same data.  [...]
> you can try to make this very likely be correct by careful
> placement of asm compiler barriers, but this is likely to be more
> difficult than just using atomics, which will do the right thing.

Note that asm is just for older compilers (and even then I try to use
GCC intrinsics as much as possible).

On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
thread fences) to properly annotate references.  rth also suggested that
I use load/store(relaxed) instead of C volatile.

> Maybe the issue that you see with C11/C++11 is that it offers more than
> you actually need.  If you can summarize what kind of synchronization /
> concurrent code you are primarily looking at, I can try to help outline
> a subset of it (i.e., something like code style but just for
> synchronization).

Is the above explanation clearer?

>> I obviously trust Cambridge for
>> C11/C++11, but their material is very concise or just refers to the
>> formal model.
> 
> Yes, their publications are really about the model.  It's not a
> tutorial, but useful for reference.  BTW, have you read their C++ paper
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> or the POPL paper?  The former has more detail (no page limit).

I know it, but I cannot say I tried hard to understand it.

> If you haven't yet, I suggest giving their cppmem tool a try too:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

I saw and tried the similar tool for POWER.  The pr

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Torvald Riegel
On Wed, 2013-06-19 at 11:31 +0200, Paolo Bonzini wrote:
> Il 18/06/2013 18:38, Torvald Riegel ha scritto:
> > I don't think that this is the conclusion here.  I strongly suggest to
> > just go with the C11/C++11 model, instead of rolling your own or trying
> > to replicate the Java model.  That would also allow you to just point to
> > the C11 model and any information / tutorials about it instead of having
> > to document your own (see the patch), and you can make use of any
> > (future) tool support (e.g., race detectors).
> 
> I'm definitely not rolling my own, but I think there is some value in
> using the Java model.  Warning: the explanation came out quite
> verbose... tl;dr at the end.

That's fine -- it is a nontrival topic after all.  My reply is equally
lengthy :)

> 
> 
> One reason is that implementing SC for POWER is quite expensive,

Sure, but you don't have to use SC fences or atomics if you don't want
them.  Note that C11/C++11 as well as the __atomic* builtins allow you
to specify a memory order.  It's perfectly fine to use acquire fences or
release fences.  There shouldn't be just one kind of barrier/fence.

> while
> this is not the case for Java volatile (which I'm still not convinced is
> acq-rel, because it also orders volatile stores and volatile loads).

But you have the same choice with C11/C++11.  You seemed to be fine with
using acquire/release in your code, so if that's what you want, just use
it :)

I vaguely remember that the Java model gives stronger guarantees than
C/C++ in the presence of data races; something about no values
out-of-thin-air.  Maybe that's where the stronger-than-acq/rel memory
orders for volatile stores come from.

> People working on QEMU are often used to manually placed barriers on
> Linux, and Linux barriers do not fully give you seq-cst semantics.  They
> give you something much more similar to the Java model.

Note that there is a reason why C11/C++11 don't just have barriers
combined with ordinary memory accesses: The compiler needs to be aware
which accesses are sequential code (so it can assume that they are
data-race-free) and which are potentially concurrent with other accesses
to the same data.  When you just use normal memory accesses with
barriers, you're relying on non-specified behavior in a compiler, and
you have no guarantee that the compiler reads a value just once, for
example; you can try to make this very likely be correct by careful
placement of asm compiler barriers, but this is likely to be more
difficult than just using atomics, which will do the right thing.

Linux folks seem to be doing fine in this area, but they also seem to
mostly use existing concurrency abstractions such as locks or RCU.
Thus, maybe it's not too good an indication of the ease of use of their
style of expressing low-level synchronization.

> The Java model gives good performance and is easier to understand than
> the non-seqcst modes of atomic builtins.  It is pretty much impossible
> to understand the latter without a formal model;

I'm certainly biased because I've been looking at this a lot.  But I
believe that there are common synchronization idioms which are
relatively straightforward to understand.  Acquire/release pairs should
be one such case.  Locks are essentially the same everywhere.

> I see the importance of
> a formal model, but at the same time it is hard not to appreciate the
> detailed-but-practical style of the Linux documentation.

I don't think the inherent complexity programmers *need* to deal with is
different because in the end, if you have two ways to express the same
ordering guarantees, you have to reason about the very same ordering
guarantees.  This is what makes up most of the complexity.  For example,
if you want to use acq/rel widely, you need to understand what this
means for your concurrent code; this won't change significantly
depending on how you express it.

I think the C++11 model is pretty compact, meaning no unnecessary fluff
(maybe one would need fewer memory orders, but those aren't too hard to
ignore).  I believe that if you'd want to specify the Linux model
including any implicit assumptions on the compilers, you'd end up with
the same level of detail in the model.

Maybe the issue that you see with C11/C++11 is that it offers more than
you actually need.  If you can summarize what kind of synchronization /
concurrent code you are primarily looking at, I can try to help outline
a subset of it (i.e., something like code style but just for
synchronization).

> Second, the Java model has very good "practical" documentation from
> sources I trust.  Note the part about trust: I found way too many Java
> tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.
> 
> Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
> (especially now that Richard and Paul completed my understanding of
> them).  There are substantially fewer practical documents for C11/C++11
that are similarly authoritative.

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Paolo Bonzini
Il 18/06/2013 18:38, Torvald Riegel ha scritto:
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).

I'm definitely not rolling my own, but I think there is some value in
using the Java model.  Warning: the explanation came out quite
verbose... tl;dr at the end.


One reason is that implementing SC for POWER is quite expensive, while
this is not the case for Java volatile (which I'm still not convinced is
acq-rel, because it also orders volatile stores and volatile loads).
People working on QEMU are often used to manually placed barriers on
Linux, and Linux barriers do not fully give you seq-cst semantics.  They
give you something much more similar to the Java model.

The Java model gives good performance and is easier to understand than
the non-seqcst modes of atomic builtins.  It is pretty much impossible
to understand the latter without a formal model; I see the importance of
a formal model, but at the same time it is hard not to appreciate the
detailed-but-practical style of the Linux documentation.


Second, the Java model has very good "practical" documentation from
sources I trust.  Note the part about trust: I found way too many Java
tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.

Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
(especially now that Richard and Paul completed my understanding of
them).  There are substantially fewer practical documents for C11/C++11
that are similarly authoritative.  I obviously trust Cambridge for
C11/C++11, but their material is very concise or just refers to the
formal model.  The formal model is not what I want when my question is
simply "why is lwsync good for acquire and release, but not for
seqcst?", for example.  And the papers sometime refer to "private
communication" between the authors and other people, which can be
annoying.  Hans Boehm and Herb Sutter have good poster and slide
material, but they do not have the same level of completeness as Paul's
Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...


Third, we must support old GCC (even as old as 4.2), so we need
hand-written assembly for atomics anyway.  This again goes back to
documentation and the JSR-133 cookbook.  It not only gives you
instructions on how to implement the model (which is also true for the
Cambridge web pages on C11/C++11), but is also a good base for writing
our own documentation.  It helped me understand existing code using
barriers, optimize it, and put this knowledge into words.  I just
couldn't find anything as useful for C11/C++11.


In short, the C11/C++11 model is not what most developers are used to
here, hardware is not 100% mature for it (for example ARMv8 has seqcst
load/store; perhaps POWER will grow that in time), is harder to
optimize, and has (as of 2013) less "practical" documentation from
sources I trust.

Besides, since what I'm using is weaker than SC, there's always the
possibility of switching to SC in the future when enough of these issues
are solved.  In case you really need SC _now_, it is easy to do it using
fetch-and-add (for loads) or xchg (for stores).

>> I will just not use __atomic_load/__atomic_store to implement the 
>> primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU..)

I guess I mentioned the QEMU-specific reasons above.

> I would assume that using the __atomic* builtins is just fine if they're
> available.

It would implement slightly different semantics based on the compiler
version, so I think it's dangerous.

Paolo



Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Paolo Bonzini
Il 18/06/2013 19:38, Andrew Haley ha scritto:
>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> paragraph of
>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> seems to suggest?
> As far as I know, the Java semantics are acq/rel.  I can't see anything
> there that suggests otherwise.  If we'd wanted to know for certain we
> should have CC'd Doug Lea.

acq/rel wouldn't have a full store-load barrier between a volatile store
and a volatile load.

Paolo



Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-19 Thread Torvald Riegel
On Tue, 2013-06-18 at 18:53 -0700, Paul E. McKenney wrote:
> On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> > On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > > First, I am not a fan of SC, mostly because there don't seem to be many
> > > (any?) production-quality algorithms that need SC.  But if you really
> > > want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
> > 
> > Dekker-style mutual exclusion is useful for things like read-mostly
> > multiple-reader single-writer locks, or similar "asymmetric" cases of
> > synchronization.  SC fences are needed for this.
> 
> They definitely need Power hwsync rather than lwsync, but they need
> fewer fences than would be emitted by slavishly following either of the
> SC recipes for Power.  (Another example needing store-to-load ordering
> is hazard pointers.)

The C++11 seq-cst fence expands to hwsync; combined with a relaxed
store / load, that should be minimal.  Or are you saying that on Power,
there is a weaker HW barrier available that still constrains store-load
reordering sufficiently?


Torvald




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Paul E. McKenney
On Tue, Jun 18, 2013 at 06:38:38PM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> > Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> > > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> > >  only because I stand with one foot in theory and the other in
> > >  practice.  If I were a pure practitioner, I would probably strongly
> > >  prefer the Java formulation.
> > 
> > Awesome answer, and this last paragraph sums it up pretty well.
> 
> I disagree that for non-Java code the Java model should be better.  Both
> C11 and C++11 use the same model, and I don't see a reason to not use it
> if you're writing C/C++ code anyway.
> 
> The C++ model is definitely useful for practitioners; just because it
> uses seq-cst memory order as safe default doesn't mean that programmers
> that can deal with weaker ordering guarantees can't make use of those
> weaker ones.
> I thought Paul was referring to seq-cst as default; if that wasn't the
> point he wanted to make, I actually don't understand his theory/practice
> comparison (never mind that whenever you need to reason about concurrent
> stuff, having a solid formal framework as the one by the Cambridge group
> is definitely helpful).  Seq-cst and acq-rel are just different
> guarantees -- this doesn't mean that one is better than the other; you
> need to understand anyway what you're doing and which one you need.
> Often, ensuring a synchronized-with edge by pairing release/acquire will
> be sufficient, but that doesn't say anything about the Java vs. C/C++
> model.

Having the Cambridge group's model and tooling around has definitely
made my life much easier!

And yes, I do very much see a need for a wide range of ordering/overhead
tradeoffs, at least until such time as someone figures a way around
either the atomic nature of matter or the finite speed of light.  ;-)

Thanx, Paul

> > That was basically my understanding, too.  I still do not completely 
> > get the relationship between Java semantics and ACQ_REL, but I can 
> > sidestep the issue for adding portable atomics to QEMU.  QEMU 
> > developers and Linux developers have some overlap, and Java volatiles 
> > are simple to understand in terms of memory barriers (which Linux 
> > uses); hence, I'll treat ourselves as pure practitioners.
> 
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).
> 
> > I will just not use __atomic_load/__atomic_store to implement the 
> > primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU...)
> I would assume that using the __atomic* builtins is just fine if they're
> available.
> 
> 
> Torvald
> 




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Paul E. McKenney
On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > First, I am not a fan of SC, mostly because there don't seem to be many
> > (any?) production-quality algorithms that need SC.  But if you really
> > want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
> 
> Dekker-style mutual exclusion is useful for things like read-mostly
> multiple-reader single-writer locks, or similar "asymmetric" cases of
> synchronization.  SC fences are needed for this.

They definitely need Power hwsync rather than lwsync, but they need
fewer fences than would be emitted by slavishly following either of the
SC recipes for Power.  (Another example needing store-to-load ordering
is hazard pointers.)

> > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> >  only because I stand with one foot in theory and the other in
> >  practice.  If I were a pure practitioner, I would probably strongly
> >  prefer the Java formulation.
> 
> That's because you're a practitioner with experience :)

I knew I had all this grey hair for some reason or another...  ;-)

Thanx, Paul




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Torvald Riegel
On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC.  But if you really
> want to take a parallel-programming trip back to the 1980s, let's go!  ;-)

Dekker-style mutual exclusion is useful for things like read-mostly
multiple-reader single-writer locks, or similar "asymmetric" cases of
synchronization.  SC fences are needed for this.

> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>  only because I stand with one foot in theory and the other in
>  practice.  If I were a pure practitioner, I would probably strongly
>  prefer the Java formulation.

That's because you're a practitioner with experience :)




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Andrew Haley
On 06/18/2013 02:24 PM, Paolo Bonzini wrote:
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

As far as I know, the Java semantics are acq/rel.  I can't see anything
there that suggests otherwise.  If we'd wanted to know for certain we
should have CC'd Doug Lea.

Andrew.




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Torvald Riegel
On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> >  only because I stand with one foot in theory and the other in
> >  practice.  If I were a pure practitioner, I would probably strongly
> >  prefer the Java formulation.
> 
> Awesome answer, and this last paragraph sums it up pretty well.

I disagree that for non-Java code the Java model should be better.  Both
C11 and C++11 use the same model, and I don't see a reason to not use it
if you're writing C/C++ code anyway.

The C++ model is definitely useful for practitioners; just because it
uses seq-cst memory order as safe default doesn't mean that programmers
that can deal with weaker ordering guarantees can't make use of those
weaker ones.
I thought Paul was referring to seq-cst as default; if that wasn't the
point he wanted to make, I actually don't understand his theory/practice
comparison (never mind that whenever you need to reason about concurrent
stuff, having a solid formal framework as the one by the Cambridge group
is definitely helpful).  Seq-cst and acq-rel are just different
guarantees -- this doesn't mean that one is better than the other; you
need to understand anyway what you're doing and which one you need.
Often, ensuring a synchronized-with edge by pairing release/acquire will
be sufficient, but that doesn't say anything about the Java vs. C/C++
model.

> That was basically my understanding, too.  I still do not completely 
> get the relationship between Java semantics and ACQ_REL, but I can 
> sidestep the issue for adding portable atomics to QEMU.  QEMU 
> developers and Linux developers have some overlap, and Java volatiles 
> are simple to understand in terms of memory barriers (which Linux 
> uses); hence, I'll treat ourselves as pure practitioners.

I don't think that this is the conclusion here.  I strongly suggest to
just go with the C11/C++11 model, instead of rolling your own or trying
to replicate the Java model.  That would also allow you to just point to
the C11 model and any information / tutorials about it instead of having
to document your own (see the patch), and you can make use of any
(future) tool support (e.g., race detectors).

> I will just not use __atomic_load/__atomic_store to implement the 
> primitives, and always express them in terms of memory barriers.

Why?  (If there's some QEMU-specific reason, just let me know; I know
little about QEMU...)
I would assume that using the __atomic* builtins is just fine if they're
available.


Torvald




Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Torvald Riegel
On Tue, 2013-06-18 at 15:24 +0200, Paolo Bonzini wrote:
> Il 17/06/2013 20:57, Richard Henderson ha scritto:
> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> > ...
> >> + */
> >> +#ifndef atomic_mb_read
> >> +#if QEMU_GNUC_PREREQ(4, 8)
> >> +#define atomic_mb_read(ptr)   ({ \
> >> +typeof(*ptr) _val;   \
> >> +__atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> >> +_val;\
> >> +})
> >> +#else
> >> +#define atomic_mb_read(ptr)({   \
> >> +typeof(*ptr) _val = atomic_read(ptr);   \
> >> +smp_rmb();  \
> >> +_val;   \
> > 
> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> > load_acquire, one needs barriers before and after the atomic_read in order 
> > to
> > implement SEQ_CST.
> 
> The store-load barrier between atomic_mb_set and atomic_mb_read are
> provided by the atomic_mb_set.  The load-load barrier between two
> atomic_mb_reads are provided by the first read.
> 
> > So again I have to ask, what semantics are you actually looking for here?
> 
> So, everything I found points to Java volatile being sequentially
> consistent, though I'm still not sure why C11 suggests hwsync for load
> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
> code (based on the JSR-133 cookbook) generates with the C11 suggestions
> you get:
> 
> Load seqcst   hwsync; ld; cmp; bc; isync
> Store seqcst  hwsync; st
> (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> 
> Load Java volatile   ld; lwsync
> Store Java volatile   lwsync; st; hwsync
> (source: http://g.oswego.edu/dl/jmm/cookbook.html)
> 
> where the lwsync in loads acts as a load-load barrier, while the one in
> stores acts as load-store + store-store barrier.
> 
> Is the cookbook known to be wrong?
> 
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

That section seems to indeed state that Java volatile accesses have
acquire/release semantics (I don't know too much about the Java model
though).

That would also match the load-acquire code on the C++11 mappings page
(if using an acquire fence after the load).  For the stores, the
cookbook mapping then has an extra hwsync at the end, which would
translate to an extra seqcst C++11 fence.  (Again, I can't say why.)

Perhaps http://www.cl.cam.ac.uk/~pes20/cc could help explain parts
of it.


Torvald





Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Peter Sewell
On 18 June 2013 15:50, Paul E. McKenney  wrote:
> On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
>> Il 17/06/2013 20:57, Richard Henderson ha scritto:
>> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
>> > ...
>> >> + */
>> >> +#ifndef atomic_mb_read
>> >> +#if QEMU_GNUC_PREREQ(4, 8)
>> >> +#define atomic_mb_read(ptr)   ({ \
>> >> +typeof(*ptr) _val;   \
>> >> +__atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> >> +_val;\
>> >> +})
>> >> +#else
>> >> +#define atomic_mb_read(ptr)({   \
>> >> +typeof(*ptr) _val = atomic_read(ptr);   \
>> >> +smp_rmb();  \
>> >> +_val;   \
>> >
>> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
>> > load_acquire, one needs barriers before and after the atomic_read in order 
>> > to
>> > implement SEQ_CST.
>>
>> The store-load barrier between atomic_mb_set and atomic_mb_read are
>> provided by the atomic_mb_set.  The load-load barrier between two
>> atomic_mb_reads are provided by the first read.
>>
>> > So again I have to ask, what semantics are you actually looking for here?
>>
>> So, everything I found points to Java volatile being sequentially
>> consistent, though I'm still not sure why C11 suggests hwsync for load
>> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
>> code (based on the JSR-133 cookbook) generates with the C11 suggestions
>> you get:
>>
>> Load seqcst   hwsync; ld; cmp; bc; isync
>> Store seqcst  hwsync; st
>> (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>
>> Load Java volatile   ld; lwsync
>> Store Java volatile   lwsync; st; hwsync
>> (source: http://g.oswego.edu/dl/jmm/cookbook.html)
>>
>> where the lwsync in loads acts as a load-load barrier, while the one in
>> stores acts as load-store + store-store barrier.

(I don't know the context of these mails, but let me add a gloss to
correct a common misconception there: it's not sufficient to think
about these things solely in terms of thread-local reordering, as the
tutorial that Paul points to below explains.  One has to think about
the lack of multi-copy atomicity too, as seen in that IRIW example.)

>> Is the cookbook known to be wrong?
>>
>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> paragraph of
>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> seems to suggest?
>
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC.  But if you really
> want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
>
> The canonical litmus test for SC is independent reads of independent
> writes, or IRIW for short.  It is as follows, with x and y both initially
> zero:
>
> Thread 0    Thread 1    Thread 2    Thread 3
> x=1;        y=1;        r1=x;       r2=y;
>                         r3=y;       r4=x;
>
> On a sequentially consistent system, r1==1 && r2==1 && r3==0 && r4==0 is
> a forbidden final outcome, where "final" means that we wait for all
> four threads to terminate and for their effects on memory to propagate
> fully through the system before checking the final values.
>
> The first mapping has been demonstrated to be correct at the URL
> you provide.  So let's apply the second mapping:
>
> Thread 0    Thread 1    Thread 2    Thread 3
> lwsync;     lwsync;     r1=x;       r2=y;
> x=1;        y=1;        lwsync;     lwsync;
> hwsync;     hwsync;     r3=y;       r4=x;
> lwsync;     lwsync;
>
> Now barriers operate by forcing ordering between accesses on either
> side of the barrier.  This means that barriers at the very start or
> the very end of a given thread's execution have no effect and may
> be ignored.  This results in the following:
>
> Thread 0    Thread 1    Thread 2    Thread 3
> x=1;        y=1;        r1=x;       r2=y;
>                         lwsync;     lwsync;
>                         r3=y;       r4=x;
>
> This sequence does -not- result in SC execution, as documented in the
> table at the beginning of Section 6.3 on page 23 of:
>
> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> See the row with "IRIW+lwsync", which shows that this code sequence does
> not preserve SC ordering, neither in theory nor on real hardware.  (And I
> strongly recommend this paper, even before having read it thoroughly
> myself, hence my taking the precaution of CCing the authors in ca

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Paul E. McKenney
On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
> Il 17/06/2013 20:57, Richard Henderson ha scritto:
> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> > ...
> >> + */
> >> +#ifndef atomic_mb_read
> >> +#if QEMU_GNUC_PREREQ(4, 8)
> >> +#define atomic_mb_read(ptr)   ({ \
> >> +typeof(*ptr) _val;   \
> >> +__atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> >> +_val;\
> >> +})
> >> +#else
> >> +#define atomic_mb_read(ptr)({   \
> >> +typeof(*ptr) _val = atomic_read(ptr);   \
> >> +smp_rmb();  \
> >> +_val;   \
> > 
> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> > load_acquire, one needs barriers before and after the atomic_read in order 
> > to
> > implement SEQ_CST.
> 
> The store-load barrier between atomic_mb_set and atomic_mb_read are
> provided by the atomic_mb_set.  The load-load barrier between two
> atomic_mb_reads are provided by the first read.
> 
> > So again I have to ask, what semantics are you actually looking for here?
> 
> So, everything I found points to Java volatile being sequentially
> consistent, though I'm still not sure why C11 suggests hwsync for load
> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
> code (based on the JSR-133 cookbook) generates with the C11 suggestions
> you get:
> 
> Load seqcst   hwsync; ld; cmp; bc; isync
> Store seqcst  hwsync; st
> (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> 
> Load Java volatile   ld; lwsync
> Store Java volatile   lwsync; st; hwsync
> (source: http://g.oswego.edu/dl/jmm/cookbook.html)
> 
> where the lwsync in loads acts as a load-load barrier, while the one in
> stores acts as load-store + store-store barrier.
> 
> Is the cookbook known to be wrong?
> 
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

First, I am not a fan of SC, mostly because there don't seem to be many
(any?) production-quality algorithms that need SC.  But if you really
want to take a parallel-programming trip back to the 1980s, let's go!  ;-)

The canonical litmus test for SC is independent reads of independent
writes, or IRIW for short.  It is as follows, with x and y both initially
zero:

Thread 0    Thread 1    Thread 2    Thread 3
x=1;        y=1;        r1=x;       r2=y;
                        r3=y;       r4=x;

On a sequentially consistent system, r1==1 && r2==1 && r3==0 && r4==0 is
a forbidden final outcome, where "final" means that we wait for all
four threads to terminate and for their effects on memory to propagate
fully through the system before checking the final values.

The first mapping has been demonstrated to be correct at the URL
you provide.  So let's apply the second mapping:

Thread 0    Thread 1    Thread 2    Thread 3
lwsync;     lwsync;     r1=x;       r2=y;
x=1;        y=1;        lwsync;     lwsync;
hwsync;     hwsync;     r3=y;       r4=x;
lwsync;     lwsync;

Now barriers operate by forcing ordering between accesses on either
side of the barrier.  This means that barriers at the very start or
the very end of a given thread's execution have no effect and may
be ignored.  This results in the following:

Thread 0    Thread 1    Thread 2    Thread 3
x=1;        y=1;        r1=x;       r2=y;
                        lwsync;     lwsync;
                        r3=y;       r4=x;

This sequence does -not- result in SC execution, as documented in the
table at the beginning of Section 6.3 on page 23 of:

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

See the row with "IRIW+lwsync", which shows that this code sequence does
not preserve SC ordering, neither in theory nor on real hardware.  (And I
strongly recommend this paper, even before having read it thoroughly
myself, hence my taking the precaution of CCing the authors in case I
am misinterpreting something.)  If you want to prove it for yourself,
use the software tool called out at http://lwn.net/Articles/470681/.

So is this a real issue?  Show me a parallel algorithm that relies on
SC, demonstrate that it is the best algorithm for the problem that it
solves, and then demonstrate that your problem statement is the best
match for some important real-world problem.  Until then, no, this is
not a real issue.

Thanx, Paul

PS:

Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)

2013-06-18 Thread Paolo Bonzini
Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>  only because I stand with one foot in theory and the other in
>  practice.  If I were a pure practitioner, I would probably strongly
>  prefer the Java formulation.

Awesome answer, and this last paragraph sums it up pretty well.

That was basically my understanding, too.  I still do not completely 
get the relationship between Java semantics and ACQ_REL, but I can 
sidestep the issue for adding portable atomics to QEMU.  QEMU 
developers and Linux developers have some overlap, and Java volatiles 
are simple to understand in terms of memory barriers (which Linux 
uses); hence, I'll treat ourselves as pure practitioners.

I will just not use __atomic_load/__atomic_store to implement the 
primitives, and always express them in terms of memory barriers.  Of 
course, the memory barriers themselves are defined using 
__atomic_thread_fence.

I attach v2 of the patch below; Ping Fan, feel free to pick it up.

Thanks again!

Paolo

--- 8< -

From 5c71a423156d1f0d61f17afa99b099bf139c64ae Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Mon, 13 May 2013 13:29:47 +0200
Subject: [PATCH] add a header file for atomic operations

We're already using them in several places, but __sync builtins are just
too ugly to type, and do not provide seqcst load/store operations.

Signed-off-by: Paolo Bonzini 
---
 docs/atomics.txt | 345 +++
 hw/display/qxl.c |   3 +-
 hw/virtio/vhost.c|   9 +-
 include/qemu/atomic.h| 190 +-
 migration.c  |   3 +-
 tests/test-thread-pool.c |   8 +-
 6 files changed, 514 insertions(+), 44 deletions(-)
 create mode 100644 docs/atomics.txt

diff --git a/docs/atomics.txt b/docs/atomics.txt
new file mode 100644
index 000..e2ce04b
--- /dev/null
+++ b/docs/atomics.txt
@@ -0,0 +1,345 @@
+CPUs perform independent memory operations effectively in random order,
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest).  Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking.  Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization.  Anything else is considerably harder, but it's
+also justified more often than one would like.  The two tools that
+are provided by qemu/atomic.h are memory barriers and atomic operations.
+
+Macros defined by qemu/atomic.h fall into three camps:
+
+- compiler barriers: barrier();
+
+- weak atomic access and manual memory barriers: atomic_read(),
+  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
+
+- sequentially consistent atomic access: everything else.
+
+
+COMPILER MEMORY BARRIER
+===
+
+barrier() prevents the compiler from moving the memory accesses either
+side of it to the other side.  The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+barrier() is mostly used within qemu/atomic.h itself.  On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution.  In this
+case, qemu/atomic.h will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, barrier() can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+SEQUENTIALLY CONSISTENT ATOMIC ACCESS
+=
+
+Most of the operations in the qemu/atomic.h header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+qemu/atomic.h provides the following set of atomic read-modify-write
+operations:
+
+typeof(*ptr) atomic_inc(ptr)
+typeof(*ptr) atomic_dec(ptr)
+typeof(*ptr) atomic_add(ptr, val)
+typeof(*ptr) atomic_sub(ptr, val)
+typeof(*ptr) atomic_and(ptr, val)
+typeof(*ptr) atomic_or(ptr, val)
+typeof(*ptr) atomic_xchg(ptr, val)
+typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of *ptr.  These operations are
+polymorphic; they operate on any type that is as wide as an int.
+
+Sequentially consistent loads and stores can be done using:
+
+atomic_add(ptr, 0) for loads
+atomic_xchg(ptr, val) for stores
+
+However, they are quite expensive on some platforms, notably POWER and
+ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
+weaker constraints:
+
+typeof(*ptr) atomic_mb_read(ptr)
+void atomic_mb_set(ptr, val)
+
+The semantics of these