Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>p = atomic_read(consume);
>X;
>q = atomic_read(consume);
>Y;
>if (p == q)
> data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.
> 
> I cannot for the life of me come up with a situation where this would
> matter, though. If "X" contains a fence, then that fence will be a
> stronger ordering than anything the consume through "p" would
> guarantee anyway. And if "X" does *not* contain a fence, then the
> atomic reads of p and q are unordered *anyway*, so then whether the
> ordering to the access through "p" is through p or q is kind of
> irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

T1:	p = kmalloc(...);
	if (p == NULL)
		deal_with_it();	/* Assume this does not return. */
	p->a = 42;	/* Each field in its own cache line. */
	p->b = 43;
	p->c = 44;
	atomic_store_explicit(&gp1, p, memory_order_release);
	p->b = 143;
	p->c = 144;
	atomic_store_explicit(&gp2, p, memory_order_release);

T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
	r1 = p->b;	/* Guaranteed to get 143. */
	q = atomic_load_explicit(&gp1, memory_order_consume);
	if (p == q) {
		/* The compiler decides that q->c is the same as p->c. */
		r2 = p->c;	/* Could get 44 on a weakly ordered system. */
	}

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.
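Concretely, the lock-based alternative could be sketched as follows in userspace C (pthread mutexes standing in for kernel spinlocks; the struct and helper names are made up for illustration). Because readers take the structure's own lock, they always observe ->b and ->c from the same update, unlike the gp1/gp2 litmus test above.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical structure: the lock lives in the structure it guards. */
struct foo {
	pthread_mutex_t lock;
	int a;
	int b;
	int c;
};

struct foo *alloc_foo(void)
{
	struct foo *p = malloc(sizeof(*p));

	if (p != NULL) {
		pthread_mutex_init(&p->lock, NULL);
		p->a = 42;
		p->b = 43;
		p->c = 44;
	}
	return p;
}

/* Updater: fields change only while holding the lock. */
void update_foo(struct foo *p, int b, int c)
{
	pthread_mutex_lock(&p->lock);
	p->b = b;
	p->c = c;
	pthread_mutex_unlock(&p->lock);
}

/* Reader: snapshot ->b and ->c under the lock, so the pair is
 * guaranteed to come from the same update. */
void read_foo(struct foo *p, int *b, int *c)
{
	pthread_mutex_lock(&p->lock);
	*b = p->b;
	*c = p->c;
	pthread_mutex_unlock(&p->lock);
}
```

Compile with -pthread; the point is only that cross-field consistency comes from mutual exclusion, not from dependency ordering.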

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > > results in another chained pointer in that same pointer chain.
> > > > The results of addition and subtraction operations that cancel
> > > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > > is a pointer to char) are implementation defined.
> > > >
> > > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > > applied to a chained pointer and an integer for the purposes
> > > > of alignment and pointer translation results in another
> > > > chained pointer in that same pointer chain.  Other uses
> > > > of bitwise operators on chained pointers (for example,
> > > > "p|~0") are implementation defined.
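A minimal C11 sketch of rule 4 (illustrative names, and assuming the low pointer bits are used as tags, as in common tagged-pointer idioms): clearing only the alignment bits preserves the pointer chain, because the result still depends on essentially every bit of the loaded value.

```c
#include <stdatomic.h>
#include <stdint.h>

struct foo { int a; };

/* Hypothetical: the low two bits of the stored pointer carry flags. */
_Atomic(uintptr_t) gp;

/* Masking off the low alignment bits is a "for the purposes of
 * alignment" use: the dependency on the consume load survives. */
int reader_aligned(void)
{
	uintptr_t v = atomic_load_explicit(&gp, memory_order_consume);
	struct foo *p = (struct foo *)(v & ~(uintptr_t)0x3);

	return p->a;
}
```

By contrast, something like "v & 0x3" leaves only two bits of information; the compiler could replace the computation with a four-way branch on those bits, at which point only a control dependency remains.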
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
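The distinction between a value being "necessary" and being replaceable can be sketched in C11 (illustrative names, not from the thread): in the first reader the loaded value feeds the address computation, so a data dependency remains; in the second, comparison against a known pointer lets the compiler substitute that pointer, leaving only an unordered control dependency.

```c
#include <stdatomic.h>

struct foo { int a; };

_Atomic(struct foo *) gp;
struct foo default_foo;

/* The loaded value is needed to compute the address of ->a: a
 * data/address dependency that weakly ordered hardware (other than
 * Alpha) orders without any fence. */
int reader_dependent(void)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

	return p->a;	/* Address derived from the value of p. */
}

/* After the comparison, the compiler may use &default_foo instead of
 * p in the taken branch; the access then no longer depends on the
 * value returned by the consume load. */
int reader_broken(void)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == &default_foo)
		return default_foo.a;	/* No data dependency on p. */
	return p->a;
}
```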
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started s

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-28 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >  wrote:
> > >
> > > 3.  The comparison was against another RCU-protected pointer,
> > > where that other pointer was properly fetched using one
> > > of the RCU primitives.  Here it doesn't matter which pointer
> > > you use.  At least as long as the rcu_assign_pointer() for
> > > that other pointer happened after the last update to the
> > > pointed-to structure.
> > >
> > > I am a bit nervous about #3.  Any thoughts on it?
> > 
> > I think that it might be worth pointing out as an example, and saying
> > that code like
> > 
> >p = atomic_read(consume);
> >X;
> >q = atomic_read(consume);
> >Y;
> >if (p == q)
> > data = p->val;
> > 
> > then the access of "p->val" is constrained to be data-dependent on
> > *either* p or q, but you can't really tell which, since the compiler
> > can decide that the values are interchangeable.
> > 
> > I cannot for the life of me come up with a situation where this would
> > matter, though. If "X" contains a fence, then that fence will be a
> > stronger ordering than anything the consume through "p" would
> > guarantee anyway. And if "X" does *not* contain a fence, then the
> > atomic reads of p and q are unordered *anyway*, so then whether the
> > ordering to the access through "p" is through p or q is kind of
> > irrelevant. No?
> 
> I can make a contrived litmus test for it, but you are right, the only
> time you can see it happen is when X has no barriers, in which case
> you don't have any ordering anyway -- both the compiler and the CPU can
> reorder the loads into p and q, and the read from p->val can, as you say,
> come from either pointer.
> 
> For whatever it is worth, here is the litmus test:
> 
> T1:   p = kmalloc(...);
>   if (p == NULL)
>   deal_with_it();
>   p->a = 42;  /* Each field in its own cache line. */
>   p->b = 43;
>   p->c = 44;
>   atomic_store_explicit(&gp1, p, memory_order_release);
>   p->b = 143;
>   p->c = 144;
>   atomic_store_explicit(&gp2, p, memory_order_release);
> 
> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>   r1 = p->b;  /* Guaranteed to get 143. */
>   q = atomic_load_explicit(&gp1, memory_order_consume);
>   if (p == q) {
>   /* The compiler decides that q->c is same as p->c. */
>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>   }
> 
> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> you get.
> 
> And publishing a structure via one RCU-protected pointer, updating it,
> then publishing it via another pointer seems to me to be asking for
> trouble anyway.  If you really want to do something like that and still
> see consistency across all the fields in the structure, please put a lock
> in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(&p, memory_order_consume).

Thoughts?

Thanx, Paul



documentation: Record rcu_dereference() value mishandling

Recent LKML discussions (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
- RCU Lockdep splats explained.
 NMI-RCU.txt
- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+   - Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are 
always welcome!
http://www.openvms.compaq.com/wizard/wiz_2637.html
 
The rcu_dereference() primitive is also an excellent
-   documentation aid, letting the person reading the code
-   know exactly which pointers are protected by RCU.
+   documentation aid, letting the person reading the
+ 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Peter Sewell
Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney  wrote:
> On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >  wrote:
>> > >
>> > > 3.  The comparison was against another RCU-protected pointer,
>> > > where that other pointer was properly fetched using one
>> > > of the RCU primitives.  Here it doesn't matter which pointer
>> > > you use.  At least as long as the rcu_assign_pointer() for
>> > > that other pointer happened after the last update to the
>> > > pointed-to structure.
>> > >
>> > > I am a bit nervous about #3.  Any thoughts on it?
>> >
>> > I think that it might be worth pointing out as an example, and saying
>> > that code like
>> >
>> >p = atomic_read(consume);
>> >X;
>> >q = atomic_read(consume);
>> >Y;
>> >if (p == q)
>> > data = p->val;
>> >
>> > then the access of "p->val" is constrained to be data-dependent on
>> > *either* p or q, but you can't really tell which, since the compiler
>> > can decide that the values are interchangeable.
>> >
>> > I cannot for the life of me come up with a situation where this would
>> > matter, though. If "X" contains a fence, then that fence will be a
>> > stronger ordering than anything the consume through "p" would
>> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> > atomic reads of p and q are unordered *anyway*, so then whether the
>> > ordering to the access through "p" is through p or q is kind of
>> > irrelevant. No?
>>
>> I can make a contrived litmus test for it, but you are right, the only
>> time you can see it happen is when X has no barriers, in which case
>> you don't have any ordering anyway -- both the compiler and the CPU can
>> reorder the loads into p and q, and the read from p->val can, as you say,
>> come from either pointer.
>>
>> For whatever it is worth, here is the litmus test:
>>
>> T1:   p = kmalloc(...);
>>   if (p == NULL)
>>   deal_with_it();
>>   p->a = 42;  /* Each field in its own cache line. */
>>   p->b = 43;
>>   p->c = 44;
>>   atomic_store_explicit(&gp1, p, memory_order_release);
>>   p->b = 143;
>>   p->c = 144;
>>   atomic_store_explicit(&gp2, p, memory_order_release);
>>
>> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>>   r1 = p->b;  /* Guaranteed to get 143. */
>>   q = atomic_load_explicit(&gp1, memory_order_consume);
>>   if (p == q) {
>>   /* The compiler decides that q->c is same as p->c. */
>>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>>   }
>>
>> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> you get.
>>
>> And publishing a structure via one RCU-protected pointer, updating it,
>> then publishing it via another pointer seems to me to be asking for
>> trouble anyway.  If you really want to do something like that and still
>> see consistency across all the fields in the structure, please put a lock
>> in the structure and use it to guard updates and accesses to those fields.
>
> And here is a patch documenting the restrictions for the current Linux
> kernel.  The rules change a bit due to rcu_dereference() acting a bit
> differently than atomic_load_explicit(&p, memory_order_consume).
>
> Thoughts?

That might serve as informal documentation for linux kernel
programmers about the bounds on the optimisations that you expect
compilers to do for common-case RCU code - and I guess that's what you
intend it to be for.   But I don't see how one can make it precise
enough to serve as a language definition, so that compiler people
could confidently say "yes, we respect that", which I guess is what
you really need.  As a useful criterion, we should aim for something
precise enough that in a verified-compiler context you can
mathematically prove that the compiler will satisfy it  (even though
that won't happen anytime soon for GCC), and that analysis tool
authors can actually know what they're working with.   All this stuff
about "you should avoid cancellation", and "avoid masking with just a
small number of bits" is just too vague.

The basic problem is that the compiler may be doing sophisticated
reasoning with a bunch of non-local knowledge that it's deduced from
the code, neither of which are well-understood, and here we have to
identify some envelope, expressive enough for RCU idioms, in which
that reasoning doesn't allow data/address dependencies to be removed
(and hence the hardware guarantee about them will be maintained at the
source level).

The C11 syntactic notion of dependency, whatever its faults, was at
least precise, could be reasoned about locally (just looking at the
syntactic code in question), and did do that.  The fact that current
compilers do optimisations that remove dependencies and will likely
have many bugs at present 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Paul E. McKenney
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> Hi Paul,
> 
> On 28 February 2014 18:50, Paul E. McKenney  
> wrote:
> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >  wrote:
> >> > >
> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> > > where that other pointer was properly fetched using one
> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> > > that other pointer happened after the last update to the
> >> > > pointed-to structure.
> >> > >
> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >
> >> > I think that it might be worth pointing out as an example, and saying
> >> > that code like
> >> >
> >> >p = atomic_read(consume);
> >> >X;
> >> >q = atomic_read(consume);
> >> >Y;
> >> >if (p == q)
> >> > data = p->val;
> >> >
> >> > then the access of "p->val" is constrained to be data-dependent on
> >> > *either* p or q, but you can't really tell which, since the compiler
> >> > can decide that the values are interchangeable.
> >> >
> >> > I cannot for the life of me come up with a situation where this would
> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> > stronger ordering than anything the consume through "p" would
> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> > ordering to the access through "p" is through p or q is kind of
> >> > irrelevant. No?
> >>
> >> I can make a contrived litmus test for it, but you are right, the only
> >> time you can see it happen is when X has no barriers, in which case
> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> come from either pointer.
> >>
> >> For whatever it is worth, here is the litmus test:
> >>
> >> T1:   p = kmalloc(...);
> >>   if (p == NULL)
> >>   deal_with_it();
> >>   p->a = 42;  /* Each field in its own cache line. */
> >>   p->b = 43;
> >>   p->c = 44;
> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >>   p->b = 143;
> >>   p->c = 144;
> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >>
> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >>   if (p == q) {
> >>   /* The compiler decides that q->c is same as p->c. */
> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >>   }
> >>
> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> you get.
> >>
> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> then publishing it via another pointer seems to me to be asking for
> >> trouble anyway.  If you really want to do something like that and still
> >> see consistency across all the fields in the structure, please put a lock
> >> in the structure and use it to guard updates and accesses to those fields.
> >
> > And here is a patch documenting the restrictions for the current Linux
> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> > differently than atomic_load_explicit(&p, memory_order_consume).
> >
> > Thoughts?
> 
> That might serve as informal documentation for linux kernel
> programmers about the bounds on the optimisations that you expect
> compilers to do for common-case RCU code - and I guess that's what you
> intend it to be for.   But I don't see how one can make it precise
> enough to serve as a language definition, so that compiler people
> could confidently say "yes, we respect that", which I guess is what
> you really need.  As a useful criterion, we should aim for something
> precise enough that in a verified-compiler context you can
> mathematically prove that the compiler will satisfy it  (even though
> that won't happen anytime soon for GCC), and that analysis tool
> authors can actually know what they're working with.   All this stuff
> about "you should avoid cancellation", and "avoid masking with just a
> small number of bits" is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

> The basic problem is that the compiler may be doing sophisticated
> reasoning with a bunch of non-local knowledge that it's deduced from
> the code, neither of which are well-understood, and here we have to
> identify some envelop

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 1 March 2014 08:03, Paul E. McKenney  wrote:
> On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> Hi Paul,
>>
>> On 28 February 2014 18:50, Paul E. McKenney  
>> wrote:
>> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >  wrote:
>> >> > >
>> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> > > where that other pointer was properly fetched using one
>> >> > > of the RCU primitives.  Here it doesn't matter which pointer
>> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> > > that other pointer happened after the last update to the
>> >> > > pointed-to structure.
>> >> > >
>> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >
>> >> > I think that it might be worth pointing out as an example, and saying
>> >> > that code like
>> >> >
>> >> >p = atomic_read(consume);
>> >> >X;
>> >> >q = atomic_read(consume);
>> >> >Y;
>> >> >if (p == q)
>> >> > data = p->val;
>> >> >
>> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> > can decide that the values are interchangeable.
>> >> >
>> >> > I cannot for the life of me come up with a situation where this would
>> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> > stronger ordering than anything the consume through "p" would
>> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> > ordering to the access through "p" is through p or q is kind of
>> >> > irrelevant. No?
>> >>
>> >> I can make a contrived litmus test for it, but you are right, the only
>> >> time you can see it happen is when X has no barriers, in which case
>> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> reorder the loads into p and q, and the read from p->val can, as you say,
>> >> come from either pointer.
>> >>
>> >> For whatever it is worth, here is the litmus test:
>> >>
>> >> T1:   p = kmalloc(...);
>> >>   if (p == NULL)
>> >>   deal_with_it();
>> >>   p->a = 42;  /* Each field in its own cache line. */
>> >>   p->b = 43;
>> >>   p->c = 44;
>> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >>   p->b = 143;
>> >>   p->c = 144;
>> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >>
>> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >>   if (p == q) {
>> >>   /* The compiler decides that q->c is same as p->c. */
>> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> >>   }
>> >>
>> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> you get.
>> >>
>> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> then publishing it via another pointer seems to me to be asking for
>> >> trouble anyway.  If you really want to do something like that and still
>> >> see consistency across all the fields in the structure, please put a lock
>> >> in the structure and use it to guard updates and accesses to those fields.
>> >
>> > And here is a patch documenting the restrictions for the current Linux
>> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >
>> > Thoughts?
>>
>> That might serve as informal documentation for linux kernel
>> programmers about the bounds on the optimisations that you expect
>> compilers to do for common-case RCU code - and I guess that's what you
>> intend it to be for.   But I don't see how one can make it precise
>> enough to serve as a language definition, so that compiler people
>> could confidently say "yes, we respect that", which I guess is what
>> you really need.  As a useful criterion, we should aim for something
>> precise enough that in a verified-compiler context you can
>> mathematically prove that the compiler will satisfy it  (even though
>> that won't happen anytime soon for GCC), and that analysis tool
>> authors can actually know what they're working with.   All this stuff
>> about "you should avoid cancellation", and "avoid masking with just a
>> small number of bits" is just too vague.
>
> Understood, and yes, this is intended to document current compiler
> behavior for the Linux kernel community.  It would not make sense to show
> it to the C11 or C++11 communities, except perhaps as an informational
> piece on current practice.
>
>> The basic problem is that the compiler may be doing sophisticated
>> reasoni

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney  
> >> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >> > >
> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> > > where that other pointer was properly fetched using one
> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> > > that other pointer happened after the last update to the
> >> >> > > pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >p = atomic_read(consume);
> >> >> >X;
> >> >> >q = atomic_read(consume);
> >> >> >Y;
> >> >> >if (p == q)
> >> >> > data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>   if (p == NULL)
> >> >>   deal_with_it();
> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >>   p->b = 43;
> >> >>   p->c = 44;
> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>   p->b = 143;
> >> >>   p->c = 144;
> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>   if (p == q) {
> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >>   }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a 
> >> >> lock
> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is inte

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Peter Sewell
On 2 March 2014 23:20, Paul E. McKenney  wrote:
> On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney  wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney  
>> >> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >  wrote:
>> >> >> > >
>> >> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> >> > > where that other pointer was properly fetched using one
>> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> > > pointer
>> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > > that other pointer happened after the last update to the
>> >> >> > > pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >p = atomic_read(consume);
>> >> >> >X;
>> >> >> >q = atomic_read(consume);
>> >> >> >Y;
>> >> >> >if (p == q)
>> >> >> > data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you 
>> >> >> say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>   if (p == NULL)
>> >> >>   deal_with_it();
>> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >>   p->b = 43;
>> >> >>   p->c = 44;
>> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>   p->b = 143;
>> >> >>   p->c = 144;
>> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>   if (p == q) {
> >> >> >>   /* The compiler decides that q->c is the same as p->c. */
> >> >> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
>> >> >>   }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a 
>> >> >> lock
>> >> >> in the structure and use it to guard updates and accesses to those 
>> >> >> fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> authors can actually know what they're working with.   All this stuff

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 11:44:52PM +, Peter Sewell wrote:
> On 2 March 2014 23:20, Paul E. McKenney  wrote:
> > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney 
> >> >>  wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> >  wrote:
> >> >> >> > >
> >> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> >> > > where that other pointer was properly fetched using one
> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
> >> >> >> > > pointer
> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
> >> >> >> > > for
> >> >> >> > > that other pointer happened after the last update to the
> >> >> >> > > pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and 
> >> >> >> > saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >p = atomic_read(consume);
> >> >> >> >X;
> >> >> >> >q = atomic_read(consume);
> >> >> >> >Y;
> >> >> >> >if (p == q)
> >> >> >> > data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the 
> >> >> >> > compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this 
> >> >> >> > would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the 
> >> >> >> only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
> >> >> >> can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> >> say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>   if (p == NULL)
> >> >> >>   deal_with_it();
> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >> >>   p->b = 43;
> >> >> >>   p->c = 44;
> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>   p->b = 143;
> >> >> >>   p->c = 144;
> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>   if (p == q) {
> >> >> >>   /* The compiler decides that q->c is the same as p->c. */
> >> >> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >> >>   }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get 
> >> >> >> what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating 
> >> >> >> it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and 
> >> >> >> still
> >> >> >> see consistency across all the fields in the structure, please put a 
> >> >> >> lock
> >> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> >> fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul bright up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.
> 
> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).

The other example that comes to mind would be feedback-directed JIT
compilation.  I don't think that's widely used today, and it might never
be for the kernel -- but *in the standard*, we at least have to consider
what the future might bring.

> The good thing is that afaik, value-prediction is largely not used in
> real life. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is build up around it).
> 
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.

Thus, if a compiler ends up at a memory load after walking SSA, it needs
to prove that the load cannot read a value that (1) was produced by a
store sequenced-before the load and (2) might carry a value dependency
(e.g., by being a mo_consume load) that the value prediction in question
would break.  This, in general, requires alias analysis.
Deciding whether a prediction would break a value dependency has to
consider what later stages in a compiler would be doing, including LTO
or further rounds of inlining/optimizations.  OTOH, if the compiler can
treat an mo_consume load as returning all possible values (eg, by
ignoring all knowledge about it), then it can certainly do so with other
memory loads too.

So, I think that the constraints due to value dependencies can matter in
practice.  However, the impact on optimizations on
non-mo_consume-related code are hard to estimate -- I don't see a huge
amount of impact right now, but I also wouldn't want to predict that
this can't change in the future.

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
> 
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
> 
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
> 
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering. So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
> 
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
> 
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help. The
> information is alrea

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote:
> Your proposal looks quite promising at first glance.  But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results.
> 
> Fair enough?

Sure.  Thanks for doing the cross-check!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>p = atomic_read(consume);
>X;
>q = atomic_read(consume);
>Y;
>if (p == q)
> data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.

The wording I proposed would make the p dereference have a value
dependency unless X and Y would somehow restrict p and q.  The reasoning
is that if the atomic loads return potentially more than one value, then
even if we find out that two such loads did return the same value, we
still don't know what the exact value was.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> +o  Do not use the results from the boolean "&&" and "||" when
> + dereferencing.  For example, the following (rather improbable)
> + code is buggy:
> +
> + int a[2];
> + int index;
> + int force_zero_index = 1;
> +
> + ...
> +
> + r1 = rcu_dereference(i1);
> + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> + The reason this is buggy is that "&&" and "||" are often compiled
> + using branches.  While weak-memory machines such as ARM or PowerPC
> + do order stores after such branches, they can speculate loads,
> + which can result in misordering bugs.
> +
> +o  Do not use the results from relational operators ("==", "!=",
> + ">", ">=", "<", or "<=") when dereferencing.  For example,
> + the following (quite strange) code is buggy:
> +
> + int a[2];
> + int index;
> + int flip_index = 0;
> +
> + ...
> +
> + r1 = rcu_dereference(i1);
> + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> + As before, the reason this is buggy is that relational operators
> + are often compiled using branches.  And as before, although
> + weak-memory machines such as ARM or PowerPC do order stores
> + after such branches, they can speculate loads, which can again
> + result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).

I don't think the wording is flawed.  We could raise the requirement of
having more than one value left for r1 to having more than N with N > 1
values left, but the fundamental problem remains in that a compiler
could try to generate a (big) switch statement.

Instead, I think that this indicates that the value_dep_preserving type
modifier would be useful: It would tell the compiler that it shouldn't
transform this into a branch in this case, yet allow that optimization
for all other code.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> 
> On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > +o  Do not use the results from the boolean "&&" and "||" when
> > +   dereferencing.  For example, the following (rather improbable)
> > +   code is buggy:
> > +
> > +   int a[2];
> > +   int index;
> > +   int force_zero_index = 1;
> > +
> > +   ...
> > +
> > +   r1 = rcu_dereference(i1);
> > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > +
> > +   The reason this is buggy is that "&&" and "||" are often compiled
> > +   using branches.  While weak-memory machines such as ARM or PowerPC
> > +   do order stores after such branches, they can speculate loads,
> > +   which can result in misordering bugs.
> > +
> > +o  Do not use the results from relational operators ("==", "!=",
> > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > +   the following (quite strange) code is buggy:
> > +
> > +   int a[2];
> > +   int index;
> > +   int flip_index = 0;
> > +
> > +   ...
> > +
> > +   r1 = rcu_dereference(i1);
> > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > +
> > +   As before, the reason this is buggy is that relational operators
> > +   are often compiled using branches.  And as before, although
> > +   weak-memory machines such as ARM or PowerPC do order stores
> > +   after such branches, they can speculate loads, which can again
> > +   result in misordering bugs.
> 
> Those two would be allowed by the wording I have recently proposed,
> AFAICS.  r1 != flip_index would result in two possible values (unless
> there are further constraints due to the type of r1 and the values that
> flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)

> I don't think the wording is flawed.  We could raise the requirement of
> having more than one value left for r1 to having more than N with N > 1
> values left, but the fundamental problem remains in that a compiler
> could try to generate a (big) switch statement.
> 
> Instead, I think that this indicates that the value_dep_preserving type
> modifier would be useful: It would tell the compiler that it shouldn't
> transform this into a branch in this case, yet allow that optimization
> for all other code.

Understood!

BTW, my current task is generating examples using the value_dep_preserving
type for RCU-protected array indexes.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> Hi Paul,
> >>
> >> On 28 February 2014 18:50, Paul E. McKenney  
> >> wrote:
> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >  wrote:
> >> >> > >
> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> > > where that other pointer was properly fetched using one
> >> >> > > of the RCU primitives.  Here it doesn't matter which pointer
> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
> >> >> > > that other pointer happened after the last update to the
> >> >> > > pointed-to structure.
> >> >> > >
> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >
> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> > that code like
> >> >> >
> >> >> >p = atomic_read(consume);
> >> >> >X;
> >> >> >q = atomic_read(consume);
> >> >> >Y;
> >> >> >if (p == q)
> >> >> > data = p->val;
> >> >> >
> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> > can decide that the values are interchangeable.
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1:   p = kmalloc(...);
> >> >>   if (p == NULL)
> >> >>   deal_with_it();
> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >>   p->b = 43;
> >> >>   p->c = 44;
> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>   p->b = 143;
> >> >>   p->c = 144;
> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>   if (p == q) {
> >> >>   /* The compiler decides that q->c is the same as p->c. */
> >> >>   r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >>   }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a 
> >> >> lock
> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for.   But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need.  As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it  (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with.   All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to docum

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Torvald Riegel
On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > 
> > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > +o  Do not use the results from the boolean "&&" and "||" when
> > > + dereferencing.  For example, the following (rather improbable)
> > > + code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int force_zero_index = 1;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1);
> > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > +
> > > + The reason this is buggy is that "&&" and "||" are often compiled
> > > + using branches.  While weak-memory machines such as ARM or PowerPC
> > > + do order stores after such branches, they can speculate loads,
> > > + which can result in misordering bugs.
> > > +
> > > +o  Do not use the results from relational operators ("==", "!=",
> > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > + the following (quite strange) code is buggy:
> > > +
> > > + int a[2];
> > > + int index;
> > > + int flip_index = 0;
> > > +
> > > + ...
> > > +
> > > + r1 = rcu_dereference(i1);
> > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > +
> > > + As before, the reason this is buggy is that relational operators
> > > + are often compiled using branches.  And as before, although
> > > + weak-memory machines such as ARM or PowerPC do order stores
> > > + after such branches, they can speculate loads, which can again
> > > + result in misordering bugs.
> > 
> > Those two would be allowed by the wording I have recently proposed,
> > AFAICS.  r1 != flip_index would result in two possible values (unless
> > there are further constraints due to the type of r1 and the values that
> > flip_index can have).
> 
> And I am OK with the value_dep_preserving type providing more/better
> guarantees than we get by default from current compilers.
> 
> One question, though.  Suppose that the code did not want a value
> dependency to be tracked through a comparison operator.  What does
> the developer do in that case?  (The reason I ask is that I have
> not yet found a use case in the Linux kernel that expects a value
> dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
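In the made-up value_dep_preserving syntax used earlier in the thread (no compiler implements this qualifier; `gp`, `flip_ptr`, and `a` are illustrative names), that explicit cast might look like:

```
struct foo value_dep_preserving *p = rcu_dereference(gp);

/* This comparison should NOT be dependency-tracked, so strip the
 * (hypothetical) qualifier first. */
struct foo *plain = (struct foo *)p;   /* cast drops vdp */
r2 = a[plain != flip_ptr];             /* ordinary comparison, branch OK */

/* Dereferences still go through the vdp pointer. */
r3 = p->field;                         /* value dependency preserved */
```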



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> 
> On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > 
> > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > +   dereferencing.  For example, the following (rather improbable)
> > > > +   code is buggy:
> > > > +
> > > > +   int a[2];
> > > > +   int index;
> > > > +   int force_zero_index = 1;
> > > > +
> > > > +   ...
> > > > +
> > > > +   r1 = rcu_dereference(i1);
> > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > +
> > > > +   The reason this is buggy is that "&&" and "||" are often 
> > > > compiled
> > > > +   using branches.  While weak-memory machines such as ARM or 
> > > > PowerPC
> > > > +   do order stores after such branches, they can speculate loads,
> > > > +   which can result in misordering bugs.
> > > > +
> > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > +   the following (quite strange) code is buggy:
> > > > +
> > > > +   int a[2];
> > > > +   int index;
> > > > +   int flip_index = 0;
> > > > +
> > > > +   ...
> > > > +
> > > > +   r1 = rcu_dereference(i1)
> > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > +
> > > > +   As before, the reason this is buggy is that relational operators
> > > > +   are often compiled using branches.  And as before, although
> > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > +   after such branches, but can speculate loads, which can again
> > > > +   result in misordering bugs.
> > > 
> > > Those two would be allowed by the wording I have recently proposed,
> > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > there are further constraints due to the type of r1 and the values that
> > > flip_index can have).
> > 
> > And I am OK with the value_dep_preserving type providing more/better
> > guarantees than we get by default from current compilers.
> > 
> > One question, though.  Suppose that the code did not want a value
> > dependency to be tracked through a comparison operator.  What does
> > the developer do in that case?  (The reason I ask is that I have
> > not yet found a use case in the Linux kernel that expects a value
> > dependency to be tracked through a comparison.)
> 
> Hmm.  I suppose use an explicit cast to non-vdp before or after the
> comparison?

That should work well assuming that things like "if", "while", and "?:"
conditions are happy to take a vdp.  This assumes that p->a only returns
vdp if field "a" is declared vdp, otherwise we have vdps running wild
through the program.  ;-)

The other thing that can happen is that a vdp can get handed off to
another synchronization mechanism, for example, to reference counting:

	p = atomic_load_explicit(&gp, memory_order_consume);
	if (do_something_with(p->a)) {
		/* fast path protected by RCU. */
		return 0;
	}
	if (atomic_inc_not_zero(&p->refcnt)) {
		/* slow path protected by reference counting. */
		return do_something_else_with((struct foo *)p);  /* CHANGE */
	}
	/* Needed slow path, but raced with deletion. */
	return -EAGAIN;

I am guessing that the cast ends the vdp.  Is that the case?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +oDo not use the results from the boolean "&&" and "||" when
> > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > + code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int force_zero_index = 1;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > compiled
> > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > PowerPC
> > > > > + do order stores after such branches, they can speculate loads,
> > > > > + which can result in misordering bugs.
> > > > > +
> > > > > +oDo not use the results from relational operators ("==", "!=",
> > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > + the following (quite strange) code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int flip_index = 0;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > + As before, the reason this is buggy is that relational operators
> > > > > + are often compiled using branches.  And as before, although
> > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > + after such branches, but can speculate loads, which can again
> > > > > + result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.  This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)
> 
> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
>   p = atomic_load_explicit(&gp, memory_order_consume);
>   if (do_something_with(p->a)) {
>   /* fast path protected by RCU. */
>   return 0;
>   }
>   if (atomic_inc_not_zero(&p->refcnt) {
>   /* slow path protected by reference counting. */
>   return do_something_else_with((struct foo *)p);  /* CHANGE */
>   }
>   /* Needed slow path, but raced with deletion. */
>   return -EAGAIN;
> 
> I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

	struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

	rdev = rcu_dereference(conf->mirrors[disk].rdev);
	if (r1_bio->bios[disk] == IO_BLOCKED
	    || rdev == NULL
	    || test_bit(Unmerged, &rdev->flags)
	    || test_bit(Faulty, &rdev->flags))
		continue;

The fact that the "rdev == NULL" returns vdp does not force the "||"
operators to be evaluated arithmetically because the entire function
is an "if" condition, correct?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Peter Sewell
On 3 March 2014 20:44, Torvald Riegel  wrote:
> On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> On 1 March 2014 08:03, Paul E. McKenney  wrote:
>> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> Hi Paul,
>> >>
>> >> On 28 February 2014 18:50, Paul E. McKenney  
>> >> wrote:
>> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >  wrote:
>> >> >> > >
>> >> >> > > 3.  The comparison was against another RCU-protected pointer,
>> >> >> > > where that other pointer was properly fetched using one
>> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> > > pointer
>> >> >> > > you use.  At least as long as the rcu_assign_pointer() for
>> >> >> > > that other pointer happened after the last update to the
>> >> >> > > pointed-to structure.
>> >> >> > >
>> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >
>> >> >> > I think that it might be worth pointing out as an example, and saying
>> >> >> > that code like
>> >> >> >
>> >> >> >p = atomic_read(consume);
>> >> >> >X;
>> >> >> >q = atomic_read(consume);
>> >> >> >Y;
>> >> >> >if (p == q)
>> >> >> > data = p->val;
>> >> >> >
>> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> > *either* p or q, but you can't really tell which, since the compiler
>> >> >> > can decide that the values are interchangeable.
>> >> >> >
>> >> >> > I cannot for the life of me come up with a situation where this would
>> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
>> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> > irrelevant. No?
>> >> >>
>> >> >> I can make a contrived litmus test for it, but you are right, the only
>> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
>> >> >> reorder the loads into p and q, and the read from p->val can, as you 
>> >> >> say,
>> >> >> come from either pointer.
>> >> >>
>> >> >> For whatever it is worth, here is the litmus test:
>> >> >>
>> >> >> T1:   p = kmalloc(...);
>> >> >>   if (p == NULL)
>> >> >>   deal_with_it();
>> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >>   p->b = 43;
>> >> >>   p->c = 44;
>> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >>   p->b = 143;
>> >> >>   p->c = 144;
>> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >>
>> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >>   if (p == q) {
>> >> >>   /* The compiler decides that q->c is same as p->c. */
>> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
>> >> >>   }
>> >> >>
>> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> >> >> you get.
>> >> >>
>> >> >> And publishing a structure via one RCU-protected pointer, updating it,
>> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> trouble anyway.  If you really want to do something like that and still
>> >> >> see consistency across all the fields in the structure, please put a 
>> >> >> lock
>> >> >> in the structure and use it to guard updates and accesses to those 
>> >> >> fields.
>> >> >
>> >> > And here is a patch documenting the restrictions for the current Linux
>> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >
>> >> > Thoughts?
>> >>
>> >> That might serve as informal documentation for linux kernel
>> >> programmers about the bounds on the optimisations that you expect
>> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> intend it to be for.   But I don't see how one can make it precise
>> >> enough to serve as a language definition, so that compiler people
>> >> could confidently say "yes, we respect that", which I guess is what
>> >> you really need.  As a useful criterion, we should aim for something
>> >> precise enough that in a verified-compiler context you can
>> >> mathematically prove that the compiler will satisfy it  (even though
>> >> that won't happen anytime soon for GCC), and that analysis tool
>> >> authors can actually know what they're working with.   All this stuff
>> >> about "yo

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > +oDo not use the results from the boolean "&&" and "||" when
> > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > + code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int force_zero_index = 1;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > +
> > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > compiled
> > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > PowerPC
> > > > > + do order stores after such branches, they can speculate loads,
> > > > > + which can result in misordering bugs.
> > > > > +
> > > > > +oDo not use the results from relational operators ("==", "!=",
> > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > + the following (quite strange) code is buggy:
> > > > > +
> > > > > + int a[2];
> > > > > + int index;
> > > > > + int flip_index = 0;
> > > > > +
> > > > > + ...
> > > > > +
> > > > > + r1 = rcu_dereference(i1)
> > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > +
> > > > > + As before, the reason this is buggy is that relational operators
> > > > > + are often compiled using branches.  And as before, although
> > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > + after such branches, but can speculate loads, which can again
> > > > > + result in misordering bugs.
> > > > 
> > > > Those two would be allowed by the wording I have recently proposed,
> > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > there are further constraints due to the type of r1 and the values that
> > > > flip_index can have).
> > > 
> > > And I am OK with the value_dep_preserving type providing more/better
> > > guarantees than we get by default from current compilers.
> > > 
> > > One question, though.  Suppose that the code did not want a value
> > > dependency to be tracked through a comparison operator.  What does
> > > the developer do in that case?  (The reason I ask is that I have
> > > not yet found a use case in the Linux kernel that expects a value
> > > dependency to be tracked through a comparison.)
> > 
> > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > comparison?
> 
> That should work well assuming that things like "if", "while", and "?:"
> conditions are happy to take a vdp.

I currently don't see a reason why that should be disallowed.  If we
have allowed an implicit conversion to non-vdp, I believe that should
follow.  ?: could be somewhat special, in that the type depends on the
2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
disallowed if we don't provide for implicit casts from non-vdp to vdp.

> This assumes that p->a only returns
> vdp if field "a" is declared vdp, otherwise we have vdps running wild
> through the program.  ;-)

That's a good question.  For the scheme I had in mind, I'm not concerned
about vdps running wild because one needs to assign to explicitly
vdp-typed variables (or function arguments, etc.) to let vdp extend to
beyond single expressions.

Nonetheless, I think it's a good question how -> should behave if the
field is not vdp; in particular, should vdp->non_vdp automatically be
vdp?  One concern might be that we already know something about the
non-vdp field -- OTOH, we shouldn't be able to, because we assume we
know nothing about the vdp pointer, and so cannot infer anything about
what it points to.

> The other thing that can happen is that a vdp can get handed off to
> another synchronization mechanism, for example, to reference counting:
> 
>   p = atomic_load_explicit(&gp, memory_order_consume);
>   if (do_something_with(p->a)) {
>   /* fast path protected by RCU. */
>   return 0;
>   }
>   if (atomic_inc_not_zero(&p->refcnt) {

Is the argument to atomic_inc_not_zero() vdp or non-vdp?

>   /* slow path protected by reference counting. */
>   return do_something_else_with((struct foo *)p);  /* 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > +   code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int force_zero_index = 1;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   The reason this is buggy is that "&&" and "||" are often 
> > > > > > compiled
> > > > > > +   using branches.  While weak-memory machines such as ARM or 
> > > > > > PowerPC
> > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > +   which can result in misordering bugs.
> > > > > > +
> > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +   the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int flip_index = 0;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > +   are often compiled using branches.  And as before, although
> > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +   after such branches, but can speculate loads, which can again
> > > > > > +   result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values 
> > > > > that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.  This assumes that p->a only returns
> > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > through the program.  ;-)
> > 
> > The other thing that can happen is that a vdp can get handed off to
> > another synchronization mechanism, for example, to reference counting:
> > 
> > p = atomic_load_explicit(&gp, memory_order_consume);
> > if (do_something_with(p->a)) {
> > /* fast path protected by RCU. */
> > return 0;
> > }
> > if (atomic_inc_not_zero(&p->refcnt) {
> > /* slow path protected by reference counting. */
> > return do_something_else_with((struct foo *)p);  /* CHANGE */
> > }
> > /* Needed slow path, but raced with deletion. */
> > return -EAGAIN;
> > 
> > I am guessing that the cast ends the vdp.  Is that the case?
> 
> And here is a more elaborate example from the Linux kernel:
> 
>   struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> 
>   rdev = rcu_dereference(conf->mirrors[disk].rdev);
>   if (r1_bio->bios[disk] == IO_BLOCKED
>   || rdev == NULL
>   || test_bit(Unmerged, &rdev->flags)
>   || test_bit(Faulty, &rdev->flags))
>   continue;
> 
> The fact that the "rdev == NULL" returns vdp does not force the "||"
> operators to be evaluated arithmetically because the entire function
> is an "if" condition, correct?

That's a good question, and one that, as far as I understand currently,
essentially boils down to whether we want to have tight restrictions on
which operations are still vdp

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Torvald Riegel
On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel  wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney  wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney 
> >> >>  wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> >  wrote:
> >> >> >> > >
> >> >> >> > > 3.  The comparison was against another RCU-protected pointer,
> >> >> >> > > where that other pointer was properly fetched using one
> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
> >> >> >> > > pointer
> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
> >> >> >> > > for
> >> >> >> > > that other pointer happened after the last update to the
> >> >> >> > > pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and 
> >> >> >> > saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >p = atomic_read(consume);
> >> >> >> >X;
> >> >> >> >q = atomic_read(consume);
> >> >> >> >Y;
> >> >> >> >if (p == q)
> >> >> >> > data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the 
> >> >> >> > compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this 
> >> >> >> > would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the 
> >> >> >> only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
> >> >> >> can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you 
> >> >> >> say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>   if (p == NULL)
> >> >> >>   deal_with_it();
> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
> >> >> >>   p->b = 43;
> >> >> >>   p->c = 44;
> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>   p->b = 143;
> >> >> >>   p->c = 144;
> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>   if (p == q) {
> >> >> >>   /* The compiler decides that q->c is same as p->c. */
> >> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
> >> >> >>   }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get 
> >> >> >> what
> >> >> >> you get.
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating 
> >> >> >> it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and 
> >> >> >> still
> >> >> >> see consistency across all the fields in the structure, please put a 
> >> >> >> lock
> >> >> >> in the structure and use it to guard updates and accesses to those 
> >> >> >> fields.
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> 
> On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > 
> > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > +o  Do not use the results from the boolean "&&" and "||" when
> > > > > > +   dereferencing.  For example, the following (rather improbable)
> > > > > > +   code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int force_zero_index = 1;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   The reason this is buggy is that "&&" and "||" are often 
> > > > > > compiled
> > > > > > +   using branches.  While weak-memory machines such as ARM or 
> > > > > > PowerPC
> > > > > > +   do order stores after such branches, they can speculate loads,
> > > > > > +   which can result in misordering bugs.
> > > > > > +
> > > > > > +o  Do not use the results from relational operators ("==", "!=",
> > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > +   the following (quite strange) code is buggy:
> > > > > > +
> > > > > > +   int a[2];
> > > > > > +   int index;
> > > > > > +   int flip_index = 0;
> > > > > > +
> > > > > > +   ...
> > > > > > +
> > > > > > +   r1 = rcu_dereference(i1)
> > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > +
> > > > > > +   As before, the reason this is buggy is that relational operators
> > > > > > +   are often compiled using branches.  And as before, although
> > > > > > +   weak-memory machines such as ARM or PowerPC do order stores
> > > > > > +   after such branches, but can speculate loads, which can again
> > > > > > +   result in misordering bugs.
> > > > > 
> > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > there are further constraints due to the type of r1 and the values 
> > > > > that
> > > > > flip_index can have).
> > > > 
> > > > And I am OK with the value_dep_preserving type providing more/better
> > > > guarantees than we get by default from current compilers.
> > > > 
> > > > One question, though.  Suppose that the code did not want a value
> > > > dependency to be tracked through a comparison operator.  What does
> > > > the developer do in that case?  (The reason I ask is that I have
> > > > not yet found a use case in the Linux kernel that expects a value
> > > > dependency to be tracked through a comparison.)
> > > 
> > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > comparison?
> > 
> > That should work well assuming that things like "if", "while", and "?:"
> > conditions are happy to take a vdp.
> 
> I currently don't see a reason why that should be disallowed.  If we
> have allowed an implicit conversion to non-vdp, I believe that should
> follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp
is very useful for common code that can be invoked both by RCU
readers and by updaters.

>  ?: could be somewhat special, in that the type depends on the
> 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able
to silently convert from non-vdp to vdp in order to permit common code
that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
This common code must be compiled conservatively to allow vdp, but
should be just fine with non-vdp.

Going through the combinations...

 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */
 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */
 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */
 3. vdp 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o Do not use the results from the boolean "&&" and "||" 
> > > > > > > when
> > > > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > > > + code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int force_zero_index = 1;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > > > compiled
> > > > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > > > PowerPC
> > > > > > > + do order stores after such branches, they can speculate loads,
> > > > > > > + which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o Do not use the results from relational operators ("==", 
> > > > > > > "!=",
> > > > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > + the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int flip_index = 0;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + As before, the reason this is buggy is that relational operators
> > > > > > > + are often compiled using branches.  And as before, although
> > > > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > + after such branches, they can speculate loads, which can again
> > > > > > > + result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > (unless
> > > > > > there are further constraints due to the type of r1 and the values 
> > > > > > that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > through the program.  ;-)
> > > 
> > > The other thing that can happen is that a vdp can get handed off to
> > > another synchronization mechanism, for example, to reference counting:
> > > 
> > >   p = atomic_load_explicit(&gp, memory_order_consume);
> > >   if (do_something_with(p->a)) {
> > >   /* fast path protected by RCU. */
> > >   return 0;
> > >   }
> > >   if (atomic_inc_not_zero(&p->refcnt)) {
> > >   /* slow path protected by reference counting. */
> > >   return do_something_else_with((struct foo *)p);  /* CHANGE */
> > >   }
> > >   /* Needed slow path, but raced with deletion. */
> > >   return -EAGAIN;
> > > 
> > > I am guessing that the cast ends the vdp.  Is that the case?
> > 
> > And here is a more elaborate example from the Linux kernel:
> > 
> > struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
> > 
> > rdev = rcu_dereference(conf->mirrors[disk].rdev);
> > if (r1_bio->bios[disk] == IO_BLOCKED
> > || rdev == NULL
> > || test_bit(Unmerged, &rdev->flags)
> > || test_bit(Faulty, &rdev->flags))
> > continue;
> > 
> > The fact that the "rdev == NULL" returns vdp does not force the "||"
> > operators to be evaluated arithmetically.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Peter Sewell
On 5 March 2014 17:15, Torvald Riegel  wrote:
> On Tue, 2014-03-04 at 22:11 +, Peter Sewell wrote:
>> On 3 March 2014 20:44, Torvald Riegel  wrote:
>> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
>> >> On 1 March 2014 08:03, Paul E. McKenney  
>> >> wrote:
>> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
>> >> >> Hi Paul,
>> >> >>
>> >> >> On 28 February 2014 18:50, Paul E. McKenney 
>> >> >>  wrote:
>> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> >> >> >> >  wrote:
>> >> >> >> > >
>> >> >> >> > > 3.  The comparison was against another RCU-protected 
>> >> >> >> > > pointer,
>> >> >> >> > > where that other pointer was properly fetched using one
>> >> >> >> > > of the RCU primitives.  Here it doesn't matter which 
>> >> >> >> > > pointer
>> >> >> >> > > you use.  At least as long as the rcu_assign_pointer() 
>> >> >> >> > > for
>> >> >> >> > > that other pointer happened after the last update to the
>> >> >> >> > > pointed-to structure.
>> >> >> >> > >
>> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
>> >> >> >> >
>> >> >> >> > I think that it might be worth pointing out as an example, and 
>> >> >> >> > saying
>> >> >> >> > that code like
>> >> >> >> >
>> >> >> >> >p = atomic_read(consume);
>> >> >> >> >X;
>> >> >> >> >q = atomic_read(consume);
>> >> >> >> >Y;
>> >> >> >> >if (p == q)
>> >> >> >> > data = p->val;
>> >> >> >> >
>> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
>> >> >> >> > *either* p or q, but you can't really tell which, since the 
>> >> >> >> > compiler
>> >> >> >> > can decide that the values are interchangeable.
>> >> >> >> >
>> >> >> >> > I cannot for the life of me come up with a situation where this 
>> >> >> >> > would
>> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
>> >> >> >> > stronger ordering than anything the consume through "p" would
>> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
>> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether 
>> >> >> >> > the
>> >> >> >> > ordering to the access through "p" is through p or q is kind of
>> >> >> >> > irrelevant. No?
>> >> >> >>
>> >> >> >> I can make a contrived litmus test for it, but you are right, the 
>> >> >> >> only
>> >> >> >> time you can see it happen is when X has no barriers, in which case
>> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU 
>> >> >> >> can
>> >> >> >> reorder the loads into p and q, and the read from p->val can, as 
>> >> >> >> you say,
>> >> >> >> come from either pointer.
>> >> >> >>
>> >> >> >> For whatever it is worth, here is the litmus test:
>> >> >> >>
>> >> >> >> T1:   p = kmalloc(...);
>> >> >> >>   if (p == NULL)
>> >> >> >>   deal_with_it();
>> >> >> >>   p->a = 42;  /* Each field in its own cache line. */
>> >> >> >>   p->b = 43;
>> >> >> >>   p->c = 44;
>> >> >> >>   atomic_store_explicit(&gp1, p, memory_order_release);
>> >> >> >>   p->b = 143;
>> >> >> >>   p->c = 144;
>> >> >> >>   atomic_store_explicit(&gp2, p, memory_order_release);
>> >> >> >>
>> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
>> >> >> >>   r1 = p->b;  /* Guaranteed to get 143. */
>> >> >> >>   q = atomic_load_explicit(&gp1, memory_order_consume);
>> >> >> >>   if (p == q) {
>> >> >> >>   /* The compiler decides that q->c is same as p->c. */
>> >> >> >>   r2 = p->c; /* Could get 44 on weakly order system. */
>> >> >> >>   }
>> >> >> >>
>> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get 
>> >> >> >> what
>> >> >> >> you get.
>> >> >> >>
>> >> >> >> And publishing a structure via one RCU-protected pointer, updating 
>> >> >> >> it,
>> >> >> >> then publishing it via another pointer seems to me to be asking for
>> >> >> >> trouble anyway.  If you really want to do something like that and 
>> >> >> >> still
>> >> >> >> see consistency across all the fields in the structure, please put 
>> >> >> >> a lock
>> >> >> >> in the structure and use it to guard updates and accesses to those 
>> >> >> >> fields.
>> >> >> >
>> >> >> > And here is a patch documenting the restrictions for the current 
>> >> >> > Linux
>> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
>> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
>> >> >> >
>> >> >> > Thoughts?
>> >> >>
>> >> >> That might serve as informal documentation for linux kernel
>> >> >> programmers about the bounds on the optimisations that you expect
>> >> >> compilers to do for common-case RCU code - and I guess that's what you
>> >> >> intend it to be for.   But

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > +o Do not use the results from the boolean "&&" and "||" 
> > > > > > > when
> > > > > > > + dereferencing.  For example, the following (rather improbable)
> > > > > > > + code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int force_zero_index = 1;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + The reason this is buggy is that "&&" and "||" are often 
> > > > > > > compiled
> > > > > > > + using branches.  While weak-memory machines such as ARM or 
> > > > > > > PowerPC
> > > > > > > + do order stores after such branches, they can speculate loads,
> > > > > > > + which can result in misordering bugs.
> > > > > > > +
> > > > > > > +o Do not use the results from relational operators ("==", 
> > > > > > > "!=",
> > > > > > > + ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > + the following (quite strange) code is buggy:
> > > > > > > +
> > > > > > > + int a[2];
> > > > > > > + int index;
> > > > > > > + int flip_index = 0;
> > > > > > > +
> > > > > > > + ...
> > > > > > > +
> > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > +
> > > > > > > + As before, the reason this is buggy is that relational operators
> > > > > > > + are often compiled using branches.  And as before, although
> > > > > > > + weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > + after such branches, they can speculate loads, which can again
> > > > > > > + result in misordering bugs.
> > > > > > 
> > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > (unless
> > > > > > there are further constraints due to the type of r1 and the values 
> > > > > > that
> > > > > > flip_index can have).
> > > > > 
> > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > guarantees than we get by default from current compilers.
> > > > > 
> > > > > One question, though.  Suppose that the code did not want a value
> > > > > dependency to be tracked through a comparison operator.  What does
> > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > dependency to be tracked through a comparison.)
> > > > 
> > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > comparison?
> > > 
> > > That should work well assuming that things like "if", "while", and "?:"
> > > conditions are happy to take a vdp.
> > 
> > I currently don't see a reason why that should be disallowed.  If we
> > have allowed an implicit conversion to non-vdp, I believe that should
> > follow.
> 
> I am a bit nervous about a silent implicit conversion from vdp to
> non-vdp in the general case.

Why are you nervous about it?

> However, when the result is being used by
> a conditional, the silent implicit conversion makes a lot of sense.
> Is that distinction something that the compiler can handle easily?

I think so. I'm not a language lawyer, but we have other such
conversions in the standard (e.g., int to boolean, between int and
float) and I currently don't see a fundamental difference to those.  But
we'll have to ask the language folks (or SG1 or LEWG) to really verify
that.

> On the other hand, silent implicit conversion from non-vdp to vdp
> is very useful for common code that can be invoked both by RCU
> readers and by updaters.

I'd be more nervous about that because then there's less obstacles to
one programmer expecting a vdp to indicate a dependency vs. another
programmer putting non-vdp into vdp.

For this case of common code (which I agree is a valid concern), would
it be a lot of programmer overhead to add explicit casts from non-vdp to
vdp?  

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Torvald Riegel
On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o  Do not use the results from the boolean "&&" and "||" 
> > > > > > > > when
> > > > > > > > +   dereferencing.  For example, the following (rather 
> > > > > > > > improbable)
> > > > > > > > +   code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   The reason this is buggy is that "&&" and "||" are 
> > > > > > > > often compiled
> > > > > > > > +   using branches.  While weak-memory machines such as ARM 
> > > > > > > > or PowerPC
> > > > > > > > +   do order stores after such branches, they can speculate 
> > > > > > > > loads,
> > > > > > > > +   which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o  Do not use the results from relational operators ("==", 
> > > > > > > > "!=",
> > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For 
> > > > > > > > example,
> > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int flip_index = 0;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   As before, the reason this is buggy is that relational 
> > > > > > > > operators
> > > > > > > > +   are often compiled using branches.  And as before, 
> > > > > > > > although
> > > > > > > > +   weak-memory machines such as ARM or PowerPC do order 
> > > > > > > > stores
> > > > > > > > +   after such branches, they can speculate loads, which can 
> > > > > > > > again
> > > > > > > > +   result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > (unless
> > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.  This assumes that p->a only returns
> > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > through the program.  ;-)
> > > > 
> > > > The other thing that can happen is that a vdp can get handed off to
> > > > another synchronization mechanism, for example, to reference counting:
> > > > 
> > > > p = atomic_load_explicit(&gp, memory_order_consume);
> > > > if (do_something_with(p->a)) {
> > > > /* fast path protected by RCU. */
> > > > return 0;
> > > > }
> > > > if (atomic_inc_not_zero(&p->refcnt)) {
> > > > /* slow path protected by reference counting. */
> > > > return do_something_else_with((struct foo *)p);  /* 
> > > > CHANGE */
> > > > }
> > > > /* Needed slow path, but raced with deletion. */
> > > > retur

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote:
> 
> On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
> > > 
> > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
> > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > 
> > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > 
> > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > +o  Do not use the results from the boolean "&&" and "||" 
> > > > > > > > when
> > > > > > > > +   dereferencing.  For example, the following (rather 
> > > > > > > > improbable)
> > > > > > > > +   code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int force_zero_index = 1;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   The reason this is buggy is that "&&" and "||" are 
> > > > > > > > often compiled
> > > > > > > > +   using branches.  While weak-memory machines such as ARM 
> > > > > > > > or PowerPC
> > > > > > > > +   do order stores after such branches, they can speculate 
> > > > > > > > loads,
> > > > > > > > +   which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o  Do not use the results from relational operators ("==", 
> > > > > > > > "!=",
> > > > > > > > +   ">", ">=", "<", or "<=") when dereferencing.  For 
> > > > > > > > example,
> > > > > > > > +   the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +   int a[2];
> > > > > > > > +   int index;
> > > > > > > > +   int flip_index = 0;
> > > > > > > > +
> > > > > > > > +   ...
> > > > > > > > +
> > > > > > > > +   r1 = rcu_dereference(i1);
> > > > > > > > +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +   As before, the reason this is buggy is that relational 
> > > > > > > > operators
> > > > > > > > +   are often compiled using branches.  And as before, 
> > > > > > > > although
> > > > > > > > +   weak-memory machines such as ARM or PowerPC do order 
> > > > > > > > stores
> > > > > > > > +   after such branches, they can speculate loads, which can 
> > > > > > > > again
> > > > > > > > +   result in misordering bugs.
> > > > > > > 
> > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > (unless
> > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > values that
> > > > > > > flip_index can have).
> > > > > > 
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > > 
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > dependency to be tracked through a comparison.)
> > > > > 
> > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > comparison?
> > > > 
> > > > That should work well assuming that things like "if", "while", and "?:"
> > > > conditions are happy to take a vdp.
> > > 
> > > I currently don't see a reason why that should be disallowed.  If we
> > > have allowed an implicit conversion to non-vdp, I believe that should
> > > follow.
> > 
> > I am a bit nervous about a silent implicit conversion from vdp to
> > non-vdp in the general case.
> 
> Why are you nervous about it?

If someone expects the vdp to propagate into some function that might
be compiled with aggressive optimizations that break this expectation,
it would be good for that someone to know about it.

Ah!  I am assuming that the compiler is -not- emitting memory barriers
at vdp-to-non-vdp transitions.  In that case, warn

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote:
> On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
> > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
> > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
> > > > > > 
> > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
> > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
> > > > > > > > > +o Do not use the results from the boolean "&&" and "||" 
> > > > > > > > > when
> > > > > > > > > + dereferencing.  For example, the following (rather 
> > > > > > > > > improbable)
> > > > > > > > > + code is buggy:
> > > > > > > > > +
> > > > > > > > > + int a[2];
> > > > > > > > > + int index;
> > > > > > > > > + int force_zero_index = 1;
> > > > > > > > > +
> > > > > > > > > + ...
> > > > > > > > > +
> > > > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > > > + r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> > > > > > > > > +
> > > > > > > > > + The reason this is buggy is that "&&" and "||" are 
> > > > > > > > > often compiled
> > > > > > > > > + using branches.  While weak-memory machines such as ARM 
> > > > > > > > > or PowerPC
> > > > > > > > > + do order stores after such branches, they can speculate 
> > > > > > > > > loads,
> > > > > > > > > + which can result in misordering bugs.
> > > > > > > > > +
> > > > > > > > > +o Do not use the results from relational operators ("==", 
> > > > > > > > > "!=",
> > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing.  For 
> > > > > > > > > example,
> > > > > > > > > + the following (quite strange) code is buggy:
> > > > > > > > > +
> > > > > > > > > + int a[2];
> > > > > > > > > + int index;
> > > > > > > > > + int flip_index = 0;
> > > > > > > > > +
> > > > > > > > > + ...
> > > > > > > > > +
> > > > > > > > > + r1 = rcu_dereference(i1);
> > > > > > > > > + r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > > +
> > > > > > > > > + As before, the reason this is buggy is that relational 
> > > > > > > > > operators
> > > > > > > > > + are often compiled using branches.  And as before, 
> > > > > > > > > although
> > > > > > > > > + weak-memory machines such as ARM or PowerPC do order 
> > > > > > > > > stores
> > > > > > > > > + after such branches, they can speculate loads, which can 
> > > > > > > > > again
> > > > > > > > > + result in misordering bugs.
> > > > > > > > 
> > > > > > > > Those two would be allowed by the wording I have recently 
> > > > > > > > proposed,
> > > > > > > > AFAICS.  r1 != flip_index would result in two possible values 
> > > > > > > > (unless
> > > > > > > > there are further constraints due to the type of r1 and the 
> > > > > > > > values that
> > > > > > > > flip_index can have).
> > > > > > > 
> > > > > > > And I am OK with the value_dep_preserving type providing 
> > > > > > > more/better
> > > > > > > guarantees than we get by default from current compilers.
> > > > > > > 
> > > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > > the developer do in that case?  (The reason I ask is that I have
> > > > > > > not yet found a use case in the Linux kernel that expects a value
> > > > > > > dependency to be tracked through a comparison.)
> > > > > > 
> > > > > > Hmm.  I suppose use an explicit cast to non-vdp before or after the
> > > > > > comparison?
> > > > > 
> > > > > That should work well assuming that things like "if", "while", and 
> > > > > "?:"
> > > > > conditions are happy to take a vdp.  This assumes that p->a only 
> > > > > returns
> > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > > through the program.  ;-)
> > > > > 
> > > > > The other thing that can happen is that a vdp can get handed off to
> > > > > another synchronization mechanism, for example, to reference counting:
> > > > > 
> > > > >   p = atomic_load_explicit(&gp, memory_order_consume);
> > > > >   if (do_something_with(p->a)) {
> > > > >   /* fast path protected by RCU. */
> > > > >   return 0;
> > > > >   }
> > > > >   if (atomic_inc_not_zero(&p->refcnt)) {
> > > > >   /* slow path protecte

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > Hopefully some discussion of out-of-thin-air values as well.
> > > 
> > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > a wooden stake through its heart.
> > > 
> > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > is sorted.
> > 
> > There actually is a proposal being put forward, but it might not make ARM
> > and Power people happy because it involves adding a compare, a branch,
> > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > much preferring the no-store-speculation approach.
> 
> Can you elaborate a bit on this please? We don't permit speculative stores
> in the ARM architecture, so it seems counter-intuitive that GCC needs to
> emit any additional instructions to prevent that from happening.

Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
proof that out-of-thin-air values cannot be observed in the face of any
compiler optimization that refrains from reordering a prior relaxed load
with a subsequent relaxed store.
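The proposed compare/branch/ISB sequence can be sketched roughly in
C11.  This is only a hedged illustration: relaxed_load_hardened() is an
invented name, and the acquire fence merely stands in for the
architecture-specific ISB/isync that the never-taken branch would pair
with on real hardware (a real implementation would need inline
assembly):

	#include <assert.h>
	#include <stdatomic.h>
	#include <stdlib.h>

	static _Atomic int x;

	/*
	 * After the relaxed load, compare the loaded value and branch on
	 * it.  On ARM/Power, a branch that depends on the load plus an
	 * ISB/isync would forbid making any subsequent store visible
	 * before the load completes -- which is what rules out
	 * out-of-thin-air results.
	 */
	static int relaxed_load_hardened(_Atomic int *p)
	{
		int v = atomic_load_explicit(p, memory_order_relaxed);

		if (v != v)		/* branch depending on v... */
			abort();	/* ...never taken */
		atomic_thread_fence(memory_order_acquire); /* stand-in for ISB/isync */
		return v;
	}

	int main(void)
	{
		atomic_store_explicit(&x, 7, memory_order_relaxed);
		assert(relaxed_load_hardened(&x) == 7);
		return 0;
	}

The per-load compare, branch, and ISB/isync are exactly the overhead
that would make ARM and Power people unhappy.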

> Stores can, of course, be observed out-of-order but that's a lot more
> reasonable :)

So let me try an example.  I am sure that Torvald Riegel will jump in
with any needed corrections or amplifications:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
atomic_store_explicit(r2, x, memory_order_relaxed);

One would intuitively expect r1 == r2 == 0 as the only possible outcome.
But suppose that the compiler used specialization optimizations, as it
would if there were a function that has a very lightweight implementation
for some values and a very heavyweight one for others.  In particular,
suppose that the lightweight implementation was for the value 42.
Then the compiler might do something like the following:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
if (r1 == 42)
atomic_store_explicit(42, y, memory_order_relaxed);
else
atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
atomic_store_explicit(r2, x, memory_order_relaxed);

Suddenly we have an explicit constant 42 showing up.  Of course, if
the compiler carefully avoided speculative stores (as both Peter and
I believe that it should if its code generation is to be regarded as
anything other than an act of vandalism, the words in the standard
notwithstanding), there would be no problem.  But currently, a number
of compiler writers see absolutely nothing wrong with transforming
the optimized-for-42 version above with something like this:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
    atomic_store_explicit(42, y, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

And then it is a short and uncontroversial step to the following:

Initial state: x == y == 0

T1: atomic_store_explicit(42, y, memory_order_relaxed);
    r1 = atomic_load_explicit(x, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

This can of course result in r1 == r2 == 42, even though the constant
42 never appeared in the original code.  This is one way to generate
an out-of-thin-air value.

As near as I can tell, compiler writers hate the idea of prohibiting
speculative-store optimizations because it requires them to introduce
both control and data dependency tracking into their compilers.  Many of
them seem to hate dependency tracking with a purple passion.  At least,
such a hatred would go a long way towards explaining the incomplete
and high-overhead implementations of memory_order_consume, the long
and successful use of idioms based on the memory_order_consume pattern
notwithstanding [*].  ;-)

That said, the Java guys are talking about introducing something
vaguely resembling memory_order_consume (and thus resembling the
rcu_assign_pointer() and rcu_dereference() portions of RCU) to solve Java
out-of-thin-air issues involving initialization, so perhaps there is hope.

Thanx, Paul

[*] http://queue.acm.org/detail.cfm?id=2488549
http://www.rdrop.com/users/paulmck/RCU/rclockpdcsproo

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Fri, 2014-02-07 at 18:06 +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its hart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> > 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> This is more about the compiler speculating on stores; imagine:
> 
>   if (x)
>   y = 1;
>   else
>   y = 2;
> 
> The compiler is allowed to change that into:
> 
>   y = 2;
>   if (x)
>   y = 1;

If you write the example like that, this is indeed allowed, because it's
all sequential code (and there are no volatiles in there, at least you
didn't show any :).  A store to y would happen in either case.  You
cannot observe the difference between the two versions in a
data-race-free program.

Are there supposed to be atomic/non-sequential accesses in there?  If
so, please update the example.

> Which is of course a big problem when you want to rely on the ordering.
> 
> There's further problems where things like memset() can write outside
> the specified address range. Examples are memset() using single
> instructions to wipe entire cachelines and then 'restoring' the tail
> bit.

As Joseph said, this would be a bug IMO.

> While valid for single threaded, its a complete disaster for concurrent
> code.
> 
> There's more, but it all boils down to doing stores you don't expect in
> a 'sane' concurrent environment and/or don't respect the control flow.

A few of those got fixed already, because they violated the memory
model's requirements.  If you have further examples that are valid code
in the C11/C++11 model, please report them.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Fri, 2014-02-07 at 08:50 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> > 
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its hart.
> > 
> > C11/C++11 should not be allowed to claim itself a memory model until that
> > is sorted.
> 
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load...  Me, I agree with you,
> much preferring the no-store-speculation approach.

My vague recollection is that everyone agrees that out-of-thin-air
values shouldn't be allowed, but that it's surprisingly complex to
actually specify this properly.

However, the example that Peter posted further down in the thread seems
to be unrelated to out-of-thin-air.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 05:13:36PM +, Will Deacon wrote:
> On Fri, Feb 07, 2014 at 05:06:54PM +, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > > 
> > > > > Yes, absolutely shoot store speculation in the head already. Then 
> > > > > drive
> > > > > a wooden stake through its hart.
> > > > > 
> > > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > > that
> > > > > is sorted.
> > > > 
> > > > There actually is a proposal being put forward, but it might not make 
> > > > ARM
> > > > and Power people happy because it involves adding a compare, a branch,
> > > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > > much preferring the no-store-speculation approach.
> > > 
> > > Can you elaborate a bit on this please? We don't permit speculative stores
> > > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > > emit any additional instructions to prevent that from happening.
> > > 
> > > Stores can, of course, be observed out-of-order but that's a lot more
> > > reasonable :)
> > 
> > This is more about the compiler speculating on stores; imagine:
> > 
> >   if (x)
> > y = 1;
> >   else
> > y = 2;
> > 
> > The compiler is allowed to change that into:
> > 
> >   y = 2;
> >   if (x)
> > y = 1;
> > 
> > Which is of course a big problem when you want to rely on the ordering.
> 
> Understood, but that doesn't explain why Paul wants to add ISB/isync
> instructions which affect the *CPU* rather than the compiler!

Hey!!!  -I- don't want to add those instructions!  Others do.
Unfortunately, lots of others.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Joseph S. Myers
On Fri, 7 Feb 2014, Peter Zijlstra wrote:

> There's further problems where things like memset() can write outside
> the specified address range. Examples are memset() using single
> instructions to wipe entire cachelines and then 'restoring' the tail
> bit.

If memset (or any C library function) modifies bytes it's not permitted to 
modify in the abstract machine, that's a simple bug and should be reported 
as usual.  We've made GCC follow that part of the memory model by default 
(so a store to a non-bit-field structure field doesn't do a 
read-modify-write to a word containing another field, for example) and I 
think it's pretty obvious that glibc should do so as well.

(Of course, memset is not an atomic operation, and you need to allow for 
that if you use it on an _Atomic object - which is I think valid, unless 
the object is also volatile, but perhaps ill-advised.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > There are also so many ways to blow your head off it's untrue. For 
> > > > > > example,
> > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > success, but
> > > > > > then there are restrictions on the sets you can use for each. It's 
> > > > > > not hard
> > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then 
> > > > > > there's
> > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > ignores consume
> > > > > > atm and optimises all of the data dependencies away) as well as the 
> > > > > > definition
> > > > > > of "data races", which seem to be used as an excuse to miscompile a 
> > > > > > program
> > > > > > at the earliest opportunity.
> > > > > 
> > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > memory_order_consume until the compilers implement it both correctly 
> > > > > and
> > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > shortage
> > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > 
> > > > Do you have any input on
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > language standard's definition of dependencies?
> > > 
> > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > 
> > > — B is an invocation of any specialization of std::kill_dependency 
> > > (29.3), or
> > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > logical OR (||, see 5.15) operator,
> > > or
> > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > 
> > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > after the "?" will carry a dependency, so the code fragment in 59448
> > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > of existence.  One way to do that on both ARM and Power is to actually
> > > emit code for "flag - flag", but there are a number of other ways to
> > > make that work.
> > 
> > And that's what would concern me, considering that these requirements
> > seem to be able to creep out easily.  Also, whereas the other atomics
> > just constrain compilers wrt. reordering across atomic accesses or
> > changes to the atomic accesses themselves, the dependencies are new
> > requirements on pieces of otherwise non-synchronizing code.  The latter
> > seems far more involved to me.
> 
> Well, the wording of 1.10p9 is pretty explicit on this point.
> There are only a few exceptions to the rule that dependencies from
> memory_order_consume loads must be tracked.  And to your point about
> requirements being placed on pieces of otherwise non-synchronizing code,
> we already have that with plain old load acquire and store release --
> both of these put ordering constraints that affect the surrounding
> non-synchronizing code.

I think there's a significant difference.  With acquire/release or more
general memory orders, it's true that we can't reorder _across_ the
atomic access.  However, as long as we don't reorder across it, we can
reorder and optimize the surrounding code without additional
constraints.  This is not the case with consume memory order, as the
(p + flag - flag) example shows.

> This issue got a lot of discussion, and the compromise is that
> dependencies cannot leak into or out of functions unless the relevant
> parameters or return values are annotated with [[carries_dependency]].
> This means that the compiler can see all the places where dependencies
> must be tracked.  This is described in 7.6.4.

I wasn't aware of 7.6.4 (but it isn't referred to as an additional
constraint--what it is--in 1.10, so I guess at least that should be
fixed).
Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
difference, at least if one assumes that normal optimization of
sequential code is the default and that maintaining things such as
(flag - flag) is not; if optimizing away (flag - flag) required the
insertion of fences unless the carries_dependency attribute is present,
that would be bad, I think.

IMHO, the dependency constructs (carries_dependency, kill_dependency)
seem backwards to me.  They assume that the compiler preserves all
those dependencies including (flag-flag) by default, which prohibits
meaningful optimizations.  Instead, I guess there should be a construct
for explicitly exploiting the dependencies (e.g.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its hart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> 
> Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> proof that out-of-thin-air values cannot be observed in the face of any
> compiler optimization that refrains from reordering a prior relaxed load
> with a subsequent relaxed store.
> 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> So let me try an example.  I am sure that Torvald Riegel will jump in
> with any needed corrections or amplifications:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> But suppose that the compiler used specialization optimizations, as it
> would if there was a function that has a very lightweight implementation
> for some values and a very heavyweight one for other.  In particular,
> suppose that the lightweight implementation was for the value 42.
> Then the compiler might do something like the following:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   if (r1 == 42)
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   else
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Suddenly we have an explicit constant 42 showing up.  Of course, if
> the compiler carefully avoided speculative stores (as both Peter and
> I believe that it should if its code generation is to be regarded as
> anything other than an act of vandalism, the words in the standard
> notwithstanding), there would be no problem.  But currently, a number
> of compiler writers see absolutely nothing wrong with transforming
> the optimized-for-42 version above with something like this:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that
the speculative store is "as-if correct", which in turn would mean that
it needs to be aware of all potential observers, and check whether those
observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that.
If they do, then I don't see why this should be okay, unless you think
out-of-thin-air values are something good (which I wouldn't agree with).

> And then it is a short and uncontroversial step to the following:
> 
> Initial state: x == y == 0
> 
> T1:   atomic_store_explicit(42, y, memory_order_relaxed);
>   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> This can of course result in r1 == r2 == 42, even though the constant
> 42 never appeared in the original code.  This is one way to generate
> an out-of-thin-air value.
>

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.

What BS is that? If you use an "atomic_store_explicit()", by
definition you're either

 (a) f*cking insane
 (b) not doing sequential non-synchronizing code

and a compiler that assumes that the programmer is insane may actually
be correct more often than not, but it's still a shit compiler.
Agreed?

So I don't see how any sane person can say that speculative writes are
ok. They are clearly not ok.

Speculative stores are a bad idea in general. They are completely
invalid for anything that says "atomic". This is not even worth
discussing.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > I wouldn't characterize the situation like this (although I can't speak
> > for others, obviously).  IMHO, it's perfectly fine on sequential /
> > non-synchronizing code, because we know the difference isn't observable
> > by a correct program.
> 
> What BS is that? If you use an "atomic_store_explicit()", by
> definition you're either
> 
>  (a) f*cking insane
>  (b) not doing sequential non-synchronizing code
> 
> and a compiler that assumes that the programmer is insane may actually
> be correct more often than not, but it's still a shit compiler.
> Agreed?

Due to all the expletives, I can't really understand what you are
saying.  Nor does what I guess it might mean seem to relate to what I
said.

(a) seems to say that you don't like requiring programmers to mark
atomic accesses specially.  Is that the case?

If so, I have to disagree.  If you're writing concurrent code, marking
the atomic accesses is the least of your problem.  Instead, for the
concurrent code I've written, it rather improved readability; it made
code more verbose, but in turn it made the synchronization easier to
see.

Beside this question of taste (and I don't care what the Kernel style
guides are), there is a real technical argument here, though:  Requiring
all synchronizing memory accesses to be annotated makes the compiler
aware of what is sequential code and what is not.  Without that, one
can't exploit data-race-freedom.  So unless we want to slow down optimization
of sequential code, we need to tell the compiler what is what.  If every
access is potentially synchronizing, then this doesn't just prevent
speculative stores.

(b) seems as if you are saying that if there is a specially annotated
atomic access in the code, that this isn't sequential/non-synchronizing
code anymore.  I don't agree with that, obviously.

> So I don't see how any sane person can say that speculative writes are
> ok. They are clearly not ok.

We are discussing programs written against the C11/C++11 memory model
here.  At least that's the part that got forwarded to g...@gcc.gnu.org,
and was subject of the nearest emails in the thread.

This memory model requires programs to be data-race-free.  Thus, we can
optimize accordingly.  If you don't like that, then this isn't C11/C++11
anymore.  Which would be fine, but then complain about that
specifically.

> Speculative stores are a bad idea in general.

I disagree in the context of C11/C++11 programs.  At least from a
correctness point of view.

> They are completely
> invalid for anything that says "atomic".

I agree, as you will see when you read the other emails I posted as part
of this thread.  But those two things are separate.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
>
> (a) seems to say that you don't like requiring programmers to mark
> atomic accesses specially.  Is that the case?

In Paul's example, they were marked specially.

And you seemed to argue that Paul's example could possibly return
anything but 0/0.

If you really think that, I hope to God that you have nothing to do
with the C standard or any actual compiler I ever use.

Because such a standard or compiler would be shit. It's sadly not too uncommon.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
> >
> > (a) seems to say that you don't like requiring programmers to mark
> > atomic accesses specially.  Is that the case?
> 
> In Paul's example, they were marked specially.
> 
> And you seemed to argue that Paul's example could possibly return
> anything but 0/0.

Just read my reply to Paul again.  Here's an excerpt:

> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.


IOW, I wrote that such a compiler transformation would be wrong in my
opinion.  Thus, it should *not* return 42.
If you still see how what I wrote could be misunderstood, please let me
know because I really don't see it.

Yes, I don't do so by swearing or calling other insane, and I try to see
the reasoning that those compilers' authors might have had, even if I
don't agree with it.  In my personal opinion, that's a good thing.

> If you really think that, I hope to God that you have nothing to do
> with the C standard or any actual compiler I ever use.
> 
> Because such a standard or compiler would be shit. It's sadly not too 
> uncommon.

Thanks for the kind words.  Perhaps writing that was quicker than
reading what I actually wrote.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel  wrote:
>
> IOW, I wrote that such a compiler transformation would be wrong in my
> opinion.  Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me,
because I thought that was building up to a "but", and misread the
rest.

I then react strongly, because I've seen so much total crap (the
type-based C aliasing rules topping my list) etc coming out of
standards groups because it allows them to generate wrong code that
goes faster, that I just assume compiler people are out to do stupid
things in the name of "..but the standard allows it".

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > > 
> > > > > Yes, absolutely shoot store speculation in the head already. Then 
> > > > > drive
> > > > > a wooden stake through its hart.
> > > > > 
> > > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > > that
> > > > > is sorted.
> > > > 
> > > > There actually is a proposal being put forward, but it might not make 
> > > > ARM
> > > > and Power people happy because it involves adding a compare, a branch,
> > > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > > much preferring the no-store-speculation approach.
> > > 
> > > Can you elaborate a bit on this please? We don't permit speculative stores
> > > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > > emit any additional instructions to prevent that from happening.
> > 
> > Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> > proof that out-of-thin-air values cannot be observed in the face of any
> > compiler optimization that refrains from reordering a prior relaxed load
> > with a subsequent relaxed store.
> > 
> > > Stores can, of course, be observed out-of-order but that's a lot more
> > > reasonable :)
> > 
> > So let me try an example.  I am sure that Torvald Riegel will jump in
> > with any needed corrections or amplifications:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> > But suppose that the compiler used specialization optimizations, as it
> > would if there was a function that has a very lightweight implementation
> > for some values and a very heavyweight one for other.  In particular,
> > suppose that the lightweight implementation was for the value 42.
> > Then the compiler might do something like the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > if (r1 == 42)
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > else
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > Suddenly we have an explicit constant 42 showing up.  Of course, if
> > the compiler carefully avoided speculative stores (as both Peter and
> > I believe that it should if its code generation is to be regarded as
> > anything other than an act of vandalism, the words in the standard
> > notwithstanding), there would be no problem.  But currently, a number
> > of compiler writers see absolutely nothing wrong with transforming
> > the optimized-for-42 version above with something like this:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Intuitively, this is wrong because this let's the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.
> 
> For this to be correct, the compiler would actually have to prove that
> the speculative store is "as-if correct", which in turn would mean that
> it needs to be aware of all potential observers, and check whether those
> observers aren't actually affected by the speculative store.
> 
> I would guess that the compilers you have in mind don't really do that.
> If they do, then I don't see why this should be okay, unless you think
> out-of-thin-air values are something good (which I wouldn't agree with).

OK, we agree that pulling the atomic store to y out of its "if" statement
is a bad thing.  Very good!  Now we just have to convince others on
the committee.  ;-)

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:

[ . . . ]

> > And then it is a short and uncontroversial step to the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1: atomic_store_explicit(42, y, memory_order_relaxed);
> > r1 = atomic_load_explicit(x, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > This can of course result in r1 == r2 == 42, even though the constant
> > 42 never appeared in the original code.  This is one way to generate
> > an out-of-thin-air value.
> > 
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.
> 
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.  For synchronizing code, compilers just shouldn't
> do it, or they would have to truly prove that speculation is harmless.
> That will be hard, so I think it should just be avoided.
> 
> Synchronization code will likely have been tuned anyway (especially if
> it uses relaxed MO), so I don't see a large need for trying to optimize
> using speculative atomic stores.
> 
> Thus, I think there's an easy and practical solution.

I like this approach, but there has been resistance to it in the past.
Definitely worth a good try, though!

> > Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> I still think that's different because it blurs the difference between
> sequential code and synchronizing code (ie, atomic accesses).  With
> consume MO, the simple solution above doesn't work anymore, because
> suddenly synchronizing code does affect optimizations in sequential
> code, even if that wouldn't reorder across the synchronizing code (which
> would be clearly "visible" to the implementation of the optimization).

I understand that memory_order_consume is a bit harder on compiler
writers than the other memory orders, but it is also pretty valuable.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > There are also so many ways to blow your head off it's untrue. 
> > > > > > > For example,
> > > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > > success, but
> > > > > > > then there are restrictions on the sets you can use for each. 
> > > > > > > It's not hard
> > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". 
> > > > > > > Then there's
> > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > > ignores consume
> > > > > > > atm and optimises all of the data dependencies away) as well as 
> > > > > > > the definition
> > > > > > > of "data races", which seem to be used as an excuse to miscompile 
> > > > > > > a program
> > > > > > > at the earliest opportunity.
> > > > > > 
> > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > memory_order_consume until the compilers implement it both 
> > > > > > correctly and
> > > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > > shortage
> > > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > > 
> > > > > Do you have any input on
> > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > > language standard's definition of dependencies?
> > > > 
> > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > 
> > > > — B is an invocation of any specialization of std::kill_dependency 
> > > > (29.3), or
> > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > > logical OR (||, see 5.15) operator,
> > > > or
> > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > 
> > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > emit code for "flag - flag", but there are a number of other ways to
> > > > make that work.
> > > 
> > > And that's what would concern me, considering that these requirements
> > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > just constrain compilers wrt. reordering across atomic accesses or
> > > changes to the atomic accesses themselves, the dependencies are new
> > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > seems far more involved to me.
> > 
> > Well, the wording of 1.10p9 is pretty explicit on this point.
> > There are only a few exceptions to the rule that dependencies from
> > memory_order_consume loads must be tracked.  And to your point about
> > requirements being placed on pieces of otherwise non-synchronizing code,
> > we already have that with plain old load acquire and store release --
> > both of these put ordering constraints that affect the surrounding
> > non-synchronizing code.
> 
> I think there's a significant difference.  With acquire/release or more
> general memory orders, it's true that we can't order _across_ the atomic
> access.  However, we can reorder and optimize without additional
> constraints if we do not reorder.  This is not the case with consume
> memory order, as the (p + flag - flag) example shows.

Agreed, memory_order_consume does introduce additional restrictions.
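To make the "(p + flag - flag)" shape concrete, here is a sketch reconstructing the GCC PR 59448 fragment as a standalone C11 function (variable names hypothetical).  The arithmetic is vacuous, but by 1.10p9 it carries the dependency from the consume load to the dereference, which is the extra constraint on otherwise non-synchronizing code:

```c
#include <assert.h>
#include <stdatomic.h>

static int data = 42;
static atomic_int pub = 1;	/* stands in for the published flag */

int reader(void)
{
	/* The consume load; dependency ordering applies to expressions
	 * that carry a dependency on the loaded value. */
	int flag = atomic_load_explicit(&pub, memory_order_consume);
	int *p = &data;

	/* Per 1.10p9, the use of flag before "?" carries no dependency,
	 * but "flag - flag" after it does -- so a compiler may not simply
	 * fold it to zero without preserving the ordering some other way
	 * (e.g. by emitting a fence on ARM/Power). */
	return flag ? *(p + flag - flag) : -1;
}
```

In practice, current compilers sidestep this tracking by promoting memory_order_consume to memory_order_acquire.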

> > This issue got a lot of discussion, and the compromise is that
> > dependencies cannot leak into or out of functions unless the relevant
> > parameters or return values are annotated with [[carries_dependency]].
> > This means that the compiler can see all the places where dependencies
> > must be tracked.  This is described in 7.6.4.
> 
> I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> constraint--what it is--in 1.10, so I guess at least that should be
> fixed).
> Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> difference, at least if one is assuming that normal optimization of
> sequential code is the default, and that maintaining things such as
> (flag-flag) is not; if optimizing away (flag-flag) would require the
> insertion of fences unless there is the carries_dependency attribute,
> then this would be bad I think.

No, the attribute does not 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Yeah, my bad for not being familiar with the atrocious crap C11 made of
atomics :/



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> As near as I can tell, compiler writers hate the idea of prohibiting
> speculative-store optimizations because it requires them to introduce
> both control and data dependency tracking into their compilers.  Many of
> them seem to hate dependency tracking with a purple passion.  At least,
> such a hatred would go a long way towards explaining the incomplete
> and high-overhead implementations of memory_order_consume, the long
> and successful use of idioms based on the memory_order_consume pattern
> notwithstanding [*].  ;-)

Just tell them that because the hardware provides control dependencies
we actually use and rely on them.

Not that I expect they care too much what we do, given the current state
of things.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.  Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> Just tell them that because the hardware provides control dependencies
> we actually use and rely on them.

s/control/address/ ?

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history we can use it to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra 
Date:   Mon Nov 25 11:49:10 2013 +0100

perf: Optimize ring-buffer write by depending on control dependencies

Remove a full barrier from the ring-buffer write path by relying on
a control dependency to order a LOAD -> STORE scenario.

Cc: "Paul E. McKenney" 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
Signed-off-by: Ingo Molnar 

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 *
 *   kernel user
 *
-*   READ ->data_tail   READ ->data_head
-*   smp_mb()   (A) smp_rmb()   (C)
-*   WRITE $dataREAD $data
-*   smp_wmb()  (B) smp_mb()(D)
-*   STORE ->data_head  WRITE ->data_tail
+*   if (LOAD ->data_tail) {LOAD ->data_head
+*  (A) smp_rmb()   (C)
+*  STORE $data LOAD $data
+*  smp_wmb()   (B) smp_mb()(D)
+*  STORE ->data_head   STORE ->data_tail
+*   }
 *
 * Where A pairs with D, and B pairs with C.
 *
-* I don't think A needs to be a full barrier because we won't in fact
-* write data until we see the store from userspace. So we simply don't
-* issue the data WRITE until we observe it. Be conservative for now.
+* In our case (A) is a control dependency that separates the load of
+* the ->data_tail and the stores of $data. In case ->data_tail
+* indicates there is no room in the buffer to store $data we do not.
 *
-* OTOH, D needs to be a full barrier since it separates the data READ
+* D needs to be a full barrier since it separates the data READ
 * from the tail WRITE.
 *
 * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 *
 * See perf_output_begin().
 */
-   smp_wmb();
+   smp_wmb(); /* B, matches C */
rb->user_page->data_head = head;
 
/*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
if (!rb->overwrite &&
unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
goto fail;
+
+   /*
+* The above forms a control dependency barrier separating the
+* @tail load above from the data stores below. Since the @tail
+* load is required to compute the branch to fail below.
+*
+* A, matches D; the full memory barrier userspace SHOULD issue
+* after reading the data and before storing the new tail
+* position.
+*
+* See perf_output_put_handle().
+*/
+
head += size;
} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
/*
-* Separate the userpage->tail read from the data stores below.
-* Matches the MB userspace SHOULD issue after reading the data
-* and before storing the new tail position.
-*
-* See perf_output_put_handle().
+* We rely o

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Both are important, but as Peter's reply noted, it was control
dependencies under discussion.  Data dependencies (which include the
ARM/PowerPC notion of address dependencies) are called out by the standard
already, but control dependencies are not.  I am not all that satisfied
by current implementations of data dependencies, admittedly.  Should
be an interesting discussion.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 03:04:43PM +, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> > On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > > speculative-store optimizations because it requires them to introduce
> > > > both control and data dependency tracking into their compilers.  Many of
> > > > them seem to hate dependency tracking with a purple passion.  At least,
> > > > such a hatred would go a long way towards explaining the incomplete
> > > > and high-overhead implementations of memory_order_consume, the long
> > > > and successful use of idioms based on the memory_order_consume pattern
> > > > notwithstanding [*].  ;-)
> > > 
> > > Just tell them that because the hardware provides control dependencies
> > > we actually use and rely on them.
> > 
> > s/control/address/ ?
> 
> Both are important, but as Peter's reply noted, it was control
> dependencies under discussion.  Data dependencies (which include the
> ARM/PowerPC notion of address dependencies) are called out by the standard
> already, but control dependencies are not.  I am not all that satisfied
> by current implementations of data dependencies, admittedly.  Should
> be an interesting discussion.  ;-)

Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's
a pretty big ask of the compiler to make use of them for things like
consume, where a data dependency will suffice for any combination of
accesses.

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design.

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.

Does the standard allow for that kind of behavior?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Chris Metcalf
(+LKML again)

On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
>> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
>>> It's been compiled on everything I have a compiler for, however frv and
>>> tile are missing because they're special and I was tired.
>> So what's the specialness on tile?
> It's not doing the atomic work in ASM but uses magic builtins or such.
>
> I got the list of magic funcs for tilegx, but didn't look into the 32bit
> chips.

Oh, I see.  The  files on tile are already reasonably 
well-factored.

It's possible you could do better, but I think not by too much, other than 
possibly
by using  for some of the common idioms like "subtraction
is addition with a negative second argument", etc., which hasn't been done 
elsewhere.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote:
> (+LKML again)
> 
> On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> >>> It's been compiled on everything I have a compiler for, however frv and
> >>> tile are missing because they're special and I was tired.
> >> So what's the specialness on tile?
> > It's not doing the atomic work in ASM but uses magic builtins or such.
> >
> > I got the list of magic funcs for tilegx, but didn't look into the 32bit
> > chips.
> 
> Oh, I see.  The  files on tile are already reasonably 
> well-factored.
> 
> It's possible you could do better, but I think not by too much, other than 
> possibly
> by using  for some of the common idioms like 
> "subtraction
> is addition with a negative second argument", etc., which hasn't been done 
> elsewhere.

The last patch 5/5 adds a few atomic ops; I could of course use
cmpxchg() loops for everything, but I found tilegx actually has fetch_or
and fetch_and to implement atomic_or() / atomic_and().

It doesn't have fetch_xor() from what I've been told so atomic_xor()
will have to become a cmpxchg() loop.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.
> 
> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),
> but *combining* them would be possible and good.
> 
> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.
> 
> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..
> 
> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.
> 
> Does the standard allow for that kind of behavior?

You asked!  ;-)

So the current standard allows merging of both loads and stores, unless of
course ordering constraints prevent the merging.  Volatile semantics may be
used to prevent this merging, if desired, for example, for real-time code.
Infinite merging is intended to be prohibited, but I am not certain that
the current wording is bullet-proof (1.10p24 and 1.10p25).

The only prohibition against speculative stores that I can see is in a
non-normative note, and it can be argued to apply only to things that are
not atomics (1.10p22).  I don't see any prohibition against reordering
a store to precede a load preceding a conditional branch -- which would
not be speculative if the branch was known to be taken and the load
hit in the store buffer.  In a system where stores could be reordered,
some other CPU might perceive the store as happening before the load
that controlled the conditional branch.  This needs to be addressed.

Why this hole?  At the time, the current formalizations of popular
CPU architectures did not exist, and it was not clear that all popular
hardware avoided speculative stores.  

There is also fun with "out of thin air" values, which everyone agrees
should be prohibited, but where there is not agreement on how to prohibit
them in a mathematically constructive manner.  The current draft contains
a clause simply stating that out-of-thin-air values are prohibited,
which doesn't help someone constructing tools to analyze C++ code.
One proposal requires that subsequent atomic stores never be reordered
before prior atomic loads, which requires useless ordering code to be
emitted on ARM and PowerPC (you may have seen Will Deacon's and Peter
Zijlstra's reaction to this proposal a few days ago).  Note that Itanium
already pays this price in order to provide full single-variable cache
coherence.  This out-of-thin-air discussion is also happening in the
Java community in preparation for a new rev of the Java memory model.

There will also be some discussions on memory_order_consume, which is
intended to (eventually) implement rcu_dereference().  The compiler
writers don't like tracking dependencies, but there may be some ways
of constraining optimizations to preserve the common dependencies,
while providing some syntax to force preservation of dependencies that
would normally be optimized out.  One example of this is where you have an
RCU-protected array that might sometimes contain only a single element.
In the single-element case, the compiler knows a priori which element
will be used, and will therefore optimize the dependency away, so that
the reader might see pre-initialization state.  But this is rare, so
if additional syntax needs to be added in this case, I believe we should be
OK with it.  (If syntax is 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > > There are also so many ways to blow your head off it's untrue. 
> > > > > > > > For example,
> > > > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > > > success, but
> > > > > > > > then there are restrictions on the sets you can use for each. 
> > > > > > > > It's not hard
> > > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". 
> > > > > > > > Then there's
> > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > > > ignores consume
> > > > > > > > atm and optimises all of the data dependencies away) as well as 
> > > > > > > > the definition
> > > > > > > > of "data races", which seem to be used as an excuse to 
> > > > > > > > miscompile a program
> > > > > > > > at the earliest opportunity.
> > > > > > > 
> > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > > memory_order_consume until the compilers implement it both 
> > > > > > > correctly and
> > > > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > > > shortage
> > > > > > > of compiler writers who would prefer to ignore 
> > > > > > > memory_order_consume.
> > > > > > 
> > > > > > Do you have any input on
> > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, 
> > > > > > the
> > > > > > language standard's definition of dependencies?
> > > > > 
> > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > 
> > > > > — B is an invocation of any specialization of std::kill_dependency 
> > > > > (29.3), or
> > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > > > logical OR (||, see 5.15) operator,
> > > > > or
> > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > 
> > > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > make that work.
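The fragment at issue can be sketched in C11 roughly as follows (a sketch only; the names gp, val, and flag are illustrative, and the exact code in bug 59448 may differ):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative globals: gp is an RCU-style shared pointer, flag a plain int. */
static int val = 42;
static _Atomic(int *) gp = &val;
static int flag = 1;

int read_with_dependency(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_consume);

	/* Per 1.10p9, "flag" as the left operand of "?:" does not carry a
	 * dependency, but the "flag - flag" after the "?" does: even though
	 * it always evaluates to zero, the compiler may not simply fold it
	 * away on targets that rely on address dependencies for ordering. */
	return *(p + (flag ? flag - flag : 0));
}
```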
> > > > 
> > > > And that's what would concern me, considering that these requirements
> > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > changes to the atomic accesses themselves, the dependencies are new
> > > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > > seems far more involved to me.
> > > 
> > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > There are only a few exceptions to the rule that dependencies from
> > > memory_order_consume loads must be tracked.  And to your point about
> > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > we already have that with plain old load acquire and store release --
> > > both of these put ordering constraints that affect the surrounding
> > > non-synchronizing code.
> > 
> > I think there's a significant difference.  With acquire/release or more
> > general memory orders, it's true that we can't order _across_ the atomic
> > access.  However, we can reorder and optimize without additional
> > constraints if we do not reorder.  This is not the case with consume
> > memory order, as the (p + flag - flag) example shows.
> 
> Agreed, memory_order_consume does introduce additional restrictions.
> 
> > > This issue got a lot of discussion, and the compromise is that
> > > dependencies cannot leak into or out of functions unless the relevant
> > > parameters or return values are annotated with [[carries_dependency]].
> > > This means that the compiler can see all the places where dependencies
> > > must be tracked.  This is described in 7.6.4.
> > 
> > I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> > constraint--what it is--in 1.10, so I guess at least that should be
> > fixed).
> > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> > difference, at least if one is assuming that normal optimization of
> > sequential c

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.

No, atomics aren't an observable behavior of the abstract machine
(unless they are volatile).  See 1.8.p8 (citing the C++ standard).

> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),

Agreed, unless the compiler can prove that this doesn't make a
difference in the program at hand and it's not volatile atomics.  In
general, that will be hard and thus won't happen often I suppose, but if
correctly proved it would fall under the as-if rule I think.

> but *combining* them would be possible and good.

Agreed.

> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.

Agreed.  In your example, the compiler would have to prove that the
abstract machine would always be able to run the two loads atomically
(ie, as one load) without running into impossible/disallowed behavior of
the program.  But if there's no loop or branch or such in-between, this
should be straightforward because any hardware oddity or similar could
merge those loads and it wouldn't be disallowed by the standard
(considering that we're talking about a finite number of loads), so the
compiler would be allowed to do it as well.
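Linus's bit-testing pattern can be sketched in C11 as follows (helper names are illustrative); the two relaxed loads are back-to-back with no intervening branch, so the merged single-load form is an execution the hardware could produce anyway:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned long word = 5;	/* bits 0 and 2 set */

/* Each helper performs its own relaxed load of the same atomic. */
static bool test_bit(int n)
{
	return atomic_load_explicit(&word, memory_order_relaxed) & (1UL << n);
}

bool bits_0_and_2_set(void)
{
	/* "load atomic, test bit, load same atomic, test another bit":
	 * a compiler could merge the two loads into one without any
	 * observable difference on a real machine. */
	return test_bit(0) && test_bit(2);
}
```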

> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..

Agreed.  As Paul points out, this being correct assumes that there are
no other ordering guarantees or memory accesses "interfering", but if
the stores are to the same memory location and adjacent to each other in
the program, then I don't see a reason why they wouldn't be combinable.

> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.

Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
although those might not be bullet-proof as Paul points out.  Forward
progress is rather vaguely specified in the standard, but at least parts
of the committee (and people in ISO C++ SG1, in particular) are working
on trying to improve this.

> Does the standard allow for that kind of behavior?

I think the standard requires (or intends to require) the behavior that
you (and I) seem to prefer in these examples.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> > 
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> > 
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> > 
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> > 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> > 
> > Does the standard allow for that kind of behavior?
> 
> You asked!  ;-)
> 
> So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging.  Volatile semantics may be
> used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not.  But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that are
> not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races.  While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers will
just not do such things.  In GCC, bugs in that category were fixed after
researchers doing fuzz-testing found them (IIRC, speculative stores by
loops).
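A minimal sketch of the kind of invalid transformation meant here (function names are illustrative, not taken from an actual GCC bug report):

```c
/* Abstract machine: *p is written only if the loop body runs at all. */
void sum_into(int *p, int n)
{
	for (int i = 0; i < n; i++)
		*p += i;
}

/* Invalid "optimization": hoisting the load and sinking one
 * unconditional store out of the loop. */
void sum_into_broken(int *p, int n)
{
	int tmp = *p;		/* hoisted load */
	for (int i = 0; i < n; i++)
		tmp += i;
	*p = tmp;		/* speculative store: executes even when
				 * n <= 0, a write the abstract machine
				 * never performs, so it can introduce a
				 * data race with a concurrent writer. */
}
```

Single-threaded the two are equivalent, which is exactly why fuzz-testing against a concurrent observer was needed to flush such bugs out.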

> I don't see any prohibition against reordering
> a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was known to be taken and the load
> hit in the store buffer.  In a system where stores could be reordered,
> some other CPU might perceive the store as happening before the load
> that controlled the conditional branch.  This needs to be addressed.

I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)

> Why this hole?  At the time, the current formalizations of popular
> CPU architectures did not exist, and it was n

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread David Howells

Is it worth considering a move towards using C11 atomics and barriers and
compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
these.

One thing I'm not sure of, though, is how well gcc's atomics will cope with
interrupt handlers touching atomics on CPUs without suitable atomic
instructions - that said, userspace does have to deal with signals getting
underfoot, but then userspace can't normally disable interrupts.

David


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Peter Zijlstra
On Thu, Feb 06, 2014 at 06:25:49PM +, David Howells wrote:
> 
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.
> 
> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.

I can do an asm-generic/atomic_c11.h if people want.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Thu, Feb 06, 2014 at 06:25:49PM +, David Howells wrote:
> 
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.

Makes sense to me!

> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.

Perhaps make the C11 definitions so that any arch can override any
specific definition?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Ramana Radhakrishnan

On 02/06/14 18:25, David Howells wrote:
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.

It sounds interesting to me, if we can make it work properly and 
reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.

> One thing I'm not sure of, though, is how well gcc's atomics will cope with
> interrupt handlers touching atomics on CPUs without suitable atomic
> instructions - that said, userspace does have to deal with signals getting
> underfoot, but then userspace can't normally disable interrupts.
> 
> David

Ramana



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Will Deacon
On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> On 02/06/14 18:25, David Howells wrote:
> >
> > Is it worth considering a move towards using C11 atomics and barriers and
> > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to 
> > do
> > these.
> 
> 
> It sounds interesting to me, if we can make it work properly and 
> reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.

Given my (albeit limited) experience playing with the C11 spec and GCC, I
really think this is a bad idea for the kernel. It seems that nobody really
agrees on exactly how the C11 atomics map to real architectural
instructions on anything but the trivial architectures. For example, should
the following code fire the assert?


extern atomic<int> foo, bar, baz;

void thread1(void)
{
	foo.store(42, memory_order_relaxed);
	bar.fetch_add(1, memory_order_seq_cst);
	baz.store(42, memory_order_relaxed);
}

void thread2(void)
{
	while (baz.load(memory_order_seq_cst) != 42) {
		/* do nothing */
	}

	assert(foo.load(memory_order_seq_cst) == 42);
}


To answer that question, you need to go and look at the definitions of
synchronises-with, happens-before, dependency_ordered_before and a whole
pile of vaguely written waffle to realise that you don't know. Certainly,
the code that arm64 GCC currently spits out would allow the assertion to fire
on some microarchitectures.

There are also so many ways to blow your head off it's untrue. For example,
cmpxchg takes a separate memory model parameter for failure and success, but
then there are restrictions on the sets you can use for each. It's not hard
to find well-known memory-ordering experts shouting "Just use
memory_model_seq_cst for everything, it's too hard otherwise". Then there's
the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
atm and optimises all of the data dependencies away) as well as the definition
of "data races", which seem to be used as an excuse to miscompile a program
at the earliest opportunity.

Trying to introduce system concepts (writes to devices, interrupts,
non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
just rather stick to the semantics we have and the asm volatile barriers.

That's not to say I don't think there's room for improvement in what we have
in the kernel. Certainly, I'd welcome allowing more relaxed operations on
architectures that support them, but it needs to be something that at least
the different architecture maintainers can understand how to implement
efficiently behind an uncomplicated interface. I don't think that interface is
C11.

Just my thoughts on the matter...

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Linus Torvalds
On Thu, Feb 6, 2014 at 10:25 AM, David Howells  wrote:
>
> Is it worth considering a move towards using C11 atomics and barriers and
> compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> these.

I think that's a bad idea as a generic implementation, but it's quite
possibly worth doing on an architecture-by-architecture basis.

The thing is, gcc builtins sometimes suck. Sometimes it's because
certain architectures want particular gcc versions that don't do a
good job, sometimes it's because the architecture doesn't lend itself
to doing what we want done and we end up better off with some tweaked
thing, sometimes it's because the language feature is just badly
designed. I'm not convinced that the C11 people got things right.

But in specific cases, maybe the architecture wants to use the
builtin. And the gcc version checks are likely to be
architecture-specific for a longish while. So some particular
architectures might want to use them, but I really doubt it makes
sense to make it the default with arch overrides.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able 
> > > to do
> > > these.
> > 
> > 
> > It sounds interesting to me, if we can make it work properly and 
> > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> 
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel. It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures. For example, should
> the following code fire the assert?
> 
> 
> > extern atomic<int> foo, bar, baz;
> 
> void thread1(void)
> {
>   foo.store(42, memory_order_relaxed);
>   bar.fetch_add(1, memory_order_seq_cst);
>   baz.store(42, memory_order_relaxed);
> }
> 
> void thread2(void)
> {
>   while (baz.load(memory_order_seq_cst) != 42) {
>   /* do nothing */
>   }
> 
>   assert(foo.load(memory_order_seq_cst) == 42);
> }
> 
> 
> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know. Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.

Yep!  I believe that a memory_order_seq_cst fence in combination with the
fetch_add() would do the trick on many architectures, however.  All of
this is one reason that any C11 definitions need to be individually
overridable by individual architectures.
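As a sketch only (not a verified fix; whether it actually suffices depends on each architecture's C11 mapping), thread1 from Will's example with the extra fence might look like this in C11:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int foo, bar, baz;

void thread1(void)
{
	atomic_store_explicit(&foo, 42, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the added fence */
	atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
	atomic_store_explicit(&baz, 42, memory_order_relaxed);
}

int thread2(void)
{
	/* Spins until it sees baz == 42, then reads foo; with the fence,
	 * the seq_cst load of baz synchronizes such that 0 should no
	 * longer be observable for foo on conforming implementations. */
	while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42)
		;	/* do nothing */
	return atomic_load_explicit(&foo, memory_order_seq_cst);
}
```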

> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each. It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away) as well as the definition
> of "data races", which seem to be used as an excuse to miscompile a program
> at the earliest opportunity.

Trust me, rcu_dereference() is not going to be defined in terms of
memory_order_consume until the compilers implement it both correctly and
efficiently.  They are not there yet, and there is currently no shortage
of compiler writers who would prefer to ignore memory_order_consume.
And rcu_dereference() will need per-arch overrides for some time during
any transition to memory_order_consume.

> Trying to introduce system concepts (writes to devices, interrupts,
> non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> just rather stick to the semantics we have and the asm volatile barriers.

And barrier() isn't going to go away any time soon, either.  And
ACCESS_ONCE() needs to keep volatile semantics until there is some
memory_order_whatever that prevents loads and stores from being coalesced.
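For reference, the kernel's ACCESS_ONCE() gets those volatile semantics from a cast; a minimal sketch of the macro and of the coalescing it prevents (read_twice and set_shared are illustrative helpers, not kernel code):

```c
#include <assert.h>

/* Forcing the access through a volatile-qualified lvalue makes the
 * compiler emit exactly one load or store per use -- no merging, no
 * refetching -- which is the guarantee Paul says must be preserved. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int shared;

int read_twice(void)
{
	/* Two ACCESS_ONCE() reads must remain two loads; plain reads of
	 * 'shared' here could legally be coalesced into one. */
	int a = ACCESS_ONCE(shared);
	int b = ACCESS_ONCE(shared);
	return a + b;
}

void set_shared(int v)
{
	ACCESS_ONCE(shared) = v;
}
```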

> That's not to say I don't think there's room for improvement in what we have
> in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> architectures that support them, but it needs to be something that at least
> the different architecture maintainers can understand how to implement
> efficiently behind an uncomplicated interface. I don't think that interface is
> C11.
> 
> Just my thoughts on the matter...

C11 does not provide a good interface for the Linux kernel, nor was it
intended to do so.  It might provide good implementations for some of
the atomic ops for some architectures.  This could reduce the amount of
assembly written for new architectures, and could potentially allow the
compiler to do a better job of optimizing (scary thought!).  But for this
to work, that architecture's Linux-kernel maintainer and gcc maintainer
would need to be working together.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able 
> > > to do
> > > these.
> > 
> > 
> > It sounds interesting to me, if we can make it work properly and 
> > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> 
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel.

I'm not going to comment on what's best for the kernel (simply because I
don't work on it), but I disagree with several of your statements.

> It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures.

There are certainly different ways to implement the memory model and those
have to be specified elsewhere, but I don't see how this differs much
from other things specified in the ABI(s) for each architecture.

> For example, should
> the following code fire the assert?

I don't see how your example (which is about what the language requires
or not) relates to the statement about the mapping above?

> 
> extern atomic<int> foo, bar, baz;
> 
> void thread1(void)
> {
>   foo.store(42, memory_order_relaxed);
>   bar.fetch_add(1, memory_order_seq_cst);
>   baz.store(42, memory_order_relaxed);
> }
> 
> void thread2(void)
> {
>   while (baz.load(memory_order_seq_cst) != 42) {
>   /* do nothing */
>   }
> 
>   assert(foo.load(memory_order_seq_cst) == 42);
> }
> 

It's a good example.  My first gut feeling was that the assertion should
never fire, but that was wrong because (as I seem to usually forget) the
seq-cst total order is just a constraint but doesn't itself contribute
to synchronizes-with -- but this is different for seq-cst fences.

> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know.

Are you familiar with the formalization of the C11/C++11 model by Batty
et al.?
http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

They also have a nice tool that can run condensed examples and show you
all allowed (and forbidden) executions (it runs in the browser, so is
slow for larger examples), including nice annotated graphs for those:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

It requires somewhat special syntax, but the following, which should be
equivalent to your example above, runs just fine:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  atomic_int baz = 0; 
  {{{ {
foo.store(42, memory_order_relaxed);
bar.store(1, memory_order_seq_cst);
baz.store(42, memory_order_relaxed);
  }
  ||| {
r1=baz.load(memory_order_seq_cst).readsvalue(42);
r2=foo.load(memory_order_seq_cst).readsvalue(0);
  }
  }}};
  return 0; }

That yields 3 consistent executions for me, and likewise if the last
readsvalue() is using 42 as argument.

If you add a "fence(memory_order_seq_cst);" after the store to foo, the
program can't observe != 42 for foo anymore, because the seq-cst fence
is adding a synchronizes-with edge via the baz reads-from.

I think this is a really neat tool, and very helpful to answer such
questions as in your example.

> Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.
> 
> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each.

That's in there for the architectures without a single-instruction
CAS/cmpxchg, I believe.

> It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_model_seq_cst for everything, it's too hard otherwise".

Everyone I've heard saying this meant this as advice to people new to
synchronization or just dealing infrequently with it.  The advice is the
simple and safe fallback, and I don't think it's meant as an
acknowledgment that the model itself would be too hard.  If the
language's memory model is supposed to represent weak HW memory models
to at least some extent, there's only so much you can do in terms of
keeping it simple.  If all architectures had x86-like models, the
language's model would certainly be simpler... :)

> Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away)

AFAIK consume memory order was added to model Power/ARM-specific
behavior.  I agree that the way the standard specifies how dependencies
are to be p

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > On 02/06/14 18:25, David Howells wrote:
> > > >
> > > > Is it worth considering a move towards using C11 atomics and barriers 
> > > > and
> > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able 
> > > > to do
> > > > these.
> > > 
> > > 
> > > It sounds interesting to me, if we can make it work properly and 
> > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > 
> > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > really think this is a bad idea for the kernel. It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures. For example, should
> > the following code fire the assert?
> > 
> > 
> > extern atomic<int> foo, bar, baz;
> > 
> > void thread1(void)
> > {
> > foo.store(42, memory_order_relaxed);
> > bar.fetch_add(1, memory_order_seq_cst);
> > baz.store(42, memory_order_relaxed);
> > }
> > 
> > void thread2(void)
> > {
> > while (baz.load(memory_order_seq_cst) != 42) {
> > /* do nothing */
> > }
> > 
> > assert(foo.load(memory_order_seq_cst) == 42);
> > }
> > 
> > 
> > To answer that question, you need to go and look at the definitions of
> > synchronises-with, happens-before, dependency_ordered_before and a whole
> > pile of vaguely written waffle to realise that you don't know. Certainly,
> > the code that arm64 GCC currently spits out would allow the assertion to 
> > fire
> > on some microarchitectures.
> 
> Yep!  I believe that a memory_order_seq_cst fence in combination with the
> fetch_add() would do the trick on many architectures, however.  All of
> this is one reason that any C11 definitions need to be individually
> overridable by individual architectures.

"Overridable" in which sense?  Do you want to change the semantics on
the language level in the sense of altering the memory model, or rather
use a different implementation under the hood to, for example, fix
deficiencies in the compilers?

> > There are also so many ways to blow your head off it's untrue. For example,
> > cmpxchg takes a separate memory model parameter for failure and success, but
> > then there are restrictions on the sets you can use for each. It's not hard
> > to find well-known memory-ordering experts shouting "Just use
> > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > atm and optimises all of the data dependencies away) as well as the definition
> > of "data races", which seem to be used as an excuse to miscompile a program
> > at the earliest opportunity.
> 
> Trust me, rcu_dereference() is not going to be defined in terms of
> memory_order_consume until the compilers implement it both correctly and
> efficiently.  They are not there yet, and there is currently no shortage
> of compiler writers who would prefer to ignore memory_order_consume.

Do you have any input on
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
language standard's definition of dependencies?

> And rcu_dereference() will need per-arch overrides for some time during
> any transition to memory_order_consume.
> 
> > Trying to introduce system concepts (writes to devices, interrupts,
> > non-coherent agents) into this mess is going to be an uphill battle IMHO.
> > I'd just rather stick to the semantics we have and the asm volatile barriers.
> 
> And barrier() isn't going to go away any time soon, either.  And
> ACCESS_ONCE() needs to keep volatile semantics until there is some
> memory_order_whatever that prevents loads and stores from being coalesced.

I'd be happy to discuss something like this in ISO C++ SG1 (or has this
been discussed in the past already?).  But it needs to have a paper I
suppose.

Will you be in Issaquah for the C++ meeting next week?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > On 02/06/14 18:25, David Howells wrote:
> > > >
> > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able
> > > > to do these.
> > > 
> > > 
> > > It sounds interesting to me, if we can make it work properly and 
> > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > 
> > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > really think this is a bad idea for the kernel.
> 
> I'm not going to comment on what's best for the kernel (simply because I
> don't work on it), but I disagree with several of your statements.
> 
> > It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures.
> 
> There's certainly different ways to implement the memory model and those
> have to be specified elsewhere, but I don't see how this differs much
> from other things specified in the ABI(s) for each architecture.
> 
> > For example, should
> > the following code fire the assert?
> 
> I don't see how your example (which is about what the language requires
> or not) relates to the statement about the mapping above?
> 
> > 
> > extern atomic foo, bar, baz;
> > 
> > void thread1(void)
> > {
> > foo.store(42, memory_order_relaxed);
> > bar.fetch_add(1, memory_order_seq_cst);
> > baz.store(42, memory_order_relaxed);
> > }
> > 
> > void thread2(void)
> > {
> > while (baz.load(memory_order_seq_cst) != 42) {
> > /* do nothing */
> > }
> > 
> > assert(foo.load(memory_order_seq_cst) == 42);
> > }
> > 
> 
> It's a good example.  My first gut feeling was that the assertion should
> never fire, but that was wrong because (as I seem to usually forget) the
> seq-cst total order is just a constraint but doesn't itself contribute
> to synchronizes-with -- but this is different for seq-cst fences.

From what I can see, Will's point is that mapping the Linux kernel's
atomic_add_return() primitive into fetch_add() does not work because
atomic_add_return()'s ordering properties require that the assert()
never fire.

Augmenting the fetch_add() with a seq_cst fence would work on many
architectures, but not for all similar examples.  The reason is that
the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
Power's sync.  To your point, I believe that it would make the above
example work, but there are some IRIW-like examples that would fail
according to the standard (though a number of specific implementations
would in fact work correctly).
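The fence-augmented mapping under discussion can be written out as a runnable C11 sketch (this is the proposed strengthening, not a verified-correct mapping; the pthread scaffolding and the run_litmus() wrapper are illustrative additions):

```c
#include <assert.h>
#include <stdatomic.h>
#include <pthread.h>

/* C11 rendition of Will's litmus test, augmented with a seq_cst
 * fence after the fetch_add.  Per the discussion, this is believed
 * (not proven) sufficient for this shape of test on many
 * architectures, though not for all IRIW-like variants. */

static atomic_int foo, bar, baz;

static void *writer(void *unused)
{
    (void)unused;
    atomic_store_explicit(&foo, 42, memory_order_relaxed);
    atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
    atomic_thread_fence(memory_order_seq_cst);   /* the augmentation */
    atomic_store_explicit(&baz, 42, memory_order_relaxed);
    return NULL;
}

static void *reader(void *result)
{
    /* Spin until the writer's final store becomes visible. */
    while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42)
        ;
    *(int *)result = atomic_load_explicit(&foo, memory_order_seq_cst);
    return NULL;
}

/* Runs the litmus test once; with the fence, 42 is the intended
 * (assertion-safe) outcome. */
static int run_litmus(void)
{
    pthread_t t1, t2;
    int seen = 0;

    atomic_store(&foo, 0);
    atomic_store(&bar, 0);
    atomic_store(&baz, 0);
    pthread_create(&t2, NULL, reader, &seen);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

A single run exercising one interleaving is of course no proof of the ordering claim; a model checker such as cppmem is the right tool for that.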

> > To answer that question, you need to go and look at the definitions of
> > synchronises-with, happens-before, dependency_ordered_before and a whole
> > pile of vaguely written waffle to realise that you don't know.
> 
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> It requires somewhat special syntax, but the following, which should be
> equivalent to your example above, runs just fine:
> 
> int main() {
>   atomic_int foo = 0; 
>   atomic_int bar = 0; 
>   atomic_int baz = 0; 
>   {{{ {
> foo.store(42, memory_order_relaxed);
> bar.store(1, memory_order_seq_cst);
> baz.store(42, memory_order_relaxed);
>   }
>   ||| {
> r1=baz.load(memory_order_seq_cst).readsvalue(42);
> r2=foo.load(memory_order_seq_cst).readsvalue(0);
>   }
>   }}};
>   return 0; }
> 
> That yields 3 consistent executions for me, and likewise if the last
> readsvalue() is using 42 as argument.
> 
> If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> program can't observe != 42 for foo anymore, because the seq-cst fence
> is adding a synchronizes-with edge via the baz reads-from.
> 
> I think this is a really neat tool, and very helpful to answer such
> questions as in your example.

Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
your substitution of store() for fetch_add() is correct.  Then this shows
that we cannot substitute fetch_add() for atomic_add_return().

Adding atomic_thread_fence(memory_order_seq_cst) after the bar.store
gives me "192 executions; no consistent", so perhaps there is hope for
augmenting the fetch_add() with a fence.  Except, as noted above, for
any number of IRIW-like examples such as the following:

int main(

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > > On 02/06/14 18:25, David Howells wrote:
> > > > >
> > > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able
> > > > > to do these.
> > > > 
> > > > 
> > > > It sounds interesting to me, if we can make it work properly and 
> > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > > 
> > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > really think this is a bad idea for the kernel. It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures. For example,
> > > should the following code fire the assert?
> > > 
> > > 
> > > extern atomic foo, bar, baz;
> > > 
> > > void thread1(void)
> > > {
> > >   foo.store(42, memory_order_relaxed);
> > >   bar.fetch_add(1, memory_order_seq_cst);
> > >   baz.store(42, memory_order_relaxed);
> > > }
> > > 
> > > void thread2(void)
> > > {
> > >   while (baz.load(memory_order_seq_cst) != 42) {
> > >   /* do nothing */
> > >   }
> > > 
> > >   assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> > > 
> > > 
> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know. Certainly,
> > > the code that arm64 GCC currently spits out would allow the assertion to
> > > fire on some microarchitectures.
> > 
> > Yep!  I believe that a memory_order_seq_cst fence in combination with the
> > fetch_add() would do the trick on many architectures, however.  All of
> > this is one reason that any C11 definitions need to be individually
> > overridable by individual architectures.
> 
> "Overridable" in which sense?  Do you want to change the semantics on
> the language level in the sense of altering the memory model, or rather
> use a different implementation under the hood to, for example, fix
> deficiencies in the compilers?

We need the architecture maintainer to be able to select either an
assembly-language implementation or a C11-atomics implementation for any
given Linux-kernel operation.  For example, a given architecture might
be able to use fetch_add(1, memory_order_relaxed) for atomic_inc() but
assembly for atomic_add_return().  This is because atomic_inc() is not
required to have any particular ordering properties, while as discussed
previously, atomic_add_return() requires tighter ordering than the C11
standard provides.
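A hedged sketch of what such per-architecture selection might look like (every name here is hypothetical, not the kernel's actual macros; the seq_cst-plus-fence mapping is the debated proposal from this thread, not an established one):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-architecture override machinery: an arch header
 * may define ARCH_HAS_ADD_RETURN and supply its own assembly
 * implementation; otherwise a generic C11 fallback is used. */

typedef struct { atomic_int counter; } my_atomic_t;

/* atomic_inc() has no required ordering, so a relaxed fetch_add is
 * a plausible generic mapping. */
static inline void my_atomic_inc(my_atomic_t *v)
{
    atomic_fetch_add_explicit(&v->counter, 1, memory_order_relaxed);
}

#ifndef ARCH_HAS_ADD_RETURN
/* atomic_add_return() needs full-barrier semantics; a seq_cst RMW
 * followed by a seq_cst fence is one proposed (but debated) mapping.
 * An architecture whose maintainer distrusts it would define
 * ARCH_HAS_ADD_RETURN and provide assembly instead. */
static inline int my_atomic_add_return(int i, my_atomic_t *v)
{
    int old = atomic_fetch_add_explicit(&v->counter, i,
                                        memory_order_seq_cst);
    atomic_thread_fence(memory_order_seq_cst);
    return old + i;
}
#endif
```

The point of the #ifndef split is exactly the one made above: the generic C11 body is a default, and any architecture can opt out per operation without disturbing the others.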

> > > There are also so many ways to blow your head off it's untrue. For example,
> > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > then there are restrictions on the sets you can use for each. It's not hard
> > > to find well-known memory-ordering experts shouting "Just use
> > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > atm and optimises all of the data dependencies away) as well as the definition
> > > of "data races", which seem to be used as an excuse to miscompile a program
> > > at the earliest opportunity.
> > 
> > Trust me, rcu_dereference() is not going to be defined in terms of
> > memory_order_consume until the compilers implement it both correctly and
> > efficiently.  They are not there yet, and there is currently no shortage
> > of compiler writers who would prefer to ignore memory_order_consume.
> 
> Do you have any input on
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> language standard's definition of dependencies?

Let's see...  1.10p9 says that a dependency must be carried unless:

— B is an invocation of any specialization of std::kill_dependency (29.3), or
— A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR 
(||, see 5.15) operator,
or
— A is the left operand of a conditional (?:, see 5.16) operator, or
— A is the left operand of the built-in comma (,) operator (5.18);

So the use of "flag" before the "?" is ignored.  But the "flag - flag"
after the "?" will carry a dependency, so the code fragment in 59448
needs to do the ordering rather than just optimizing "flag - flag" out
of existence.  One way to do that on both ARM and Power is to actually
emit code for "flag - flag", but there are a number of other ways to
make that work.
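The 1.10p9 carve-outs can be illustrated with a small consume-ordered reader (a reconstruction in the spirit of the PR 59448 fragment; the data layout and function name are assumptions, not the PR's actual code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustration of the 1.10p9 carve-outs: the use of 'f' as the left
 * operand of ?: does NOT carry a dependency, but 'f - f' inside the
 * selected operand DOES.  A conforming compiler therefore may not
 * simply fold 'f - f' to 0 and discard the ordering it provides on
 * dependency-respecting hardware (ARM, Power). */

static _Atomic int flag;
static int data[2] = { 10, 20 };

int consume_reader(void)
{
    int f = atomic_load_explicit(&flag, memory_order_consume);

    /* Condition: dependency on f dropped (left operand of ?:).
     * Index: dependency carried through 'f - f'. */
    return f ? data[f - f + 1] : data[0];
}
```

Note that current compilers typically promote memory_order_consume to memory_order_acquire precisely to sidestep tracking such dependencies, which is the implementability worry raised in the PR.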

BTW, there is some discussion on 1.10p9's handling of && and ||, and
that clause is likely to change.  And yes, I am behind on analyzing
usage in the Linux kernel to find out if Linux cares...

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > > On 02/06/14 18:25, David Howells wrote:
> > > > >
> > > > > Is it worth considering a move towards using C11 atomics and barriers and
> > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be able
> > > > > to do these.
> > > > 
> > > > 
> > > > It sounds interesting to me, if we can make it work properly and 
> > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip in.
> > > 
> > > Given my (albeit limited) experience playing with the C11 spec and GCC, I
> > > really think this is a bad idea for the kernel.
> > 
> > I'm not going to comment on what's best for the kernel (simply because I
> > don't work on it), but I disagree with several of your statements.
> > 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> > 
> > > For example, should
> > > the following code fire the assert?
> > 
> > I don't see how your example (which is about what the language requires
> > or not) relates to the statement about the mapping above?
> > 
> > > 
> > > extern atomic foo, bar, baz;
> > > 
> > > void thread1(void)
> > > {
> > >   foo.store(42, memory_order_relaxed);
> > >   bar.fetch_add(1, memory_order_seq_cst);
> > >   baz.store(42, memory_order_relaxed);
> > > }
> > > 
> > > void thread2(void)
> > > {
> > >   while (baz.load(memory_order_seq_cst) != 42) {
> > >   /* do nothing */
> > >   }
> > > 
> > >   assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> > > 
> > 
> > It's a good example.  My first gut feeling was that the assertion should
> > never fire, but that was wrong because (as I seem to usually forget) the
> > seq-cst total order is just a constraint but doesn't itself contribute
> > to synchronizes-with -- but this is different for seq-cst fences.
> 
> From what I can see, Will's point is that mapping the Linux kernel's
> atomic_add_return() primitive into fetch_add() does not work because
> atomic_add_return()'s ordering properties require that the assert()
> never fire.
> 
> Augmenting the fetch_add() with a seq_cst fence would work on many
> architectures, but not for all similar examples.  The reason is that
> the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> Power's sync.  To your point, I believe that it would make the above
> example work, but there are some IRIW-like examples that would fail
> according to the standard (though a number of specific implementations
> would in fact work correctly).

Thanks for the background.  I don't read LKML, and it wasn't obvious
from reading just the part of the thread posted to gcc@ that achieving
these semantics is the goal.

> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know.
> > 
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > 
> > They also have a nice tool that can run condensed examples and show you
> > all allowed (and forbidden) executions (it runs in the browser, so is
> > slow for larger examples), including nice annotated graphs for those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > 
> > It requires somewhat special syntax, but the following, which should be
> > equivalent to your example above, runs just fine:
> > 
> > int main() {
> >   atomic_int foo = 0; 
> >   atomic_int bar = 0; 
> >   atomic_int baz = 0; 
> >   {{{ {
> > foo.store(42, memory_order_relaxed);
> > bar.store(1, memory_order_seq_cst);
> > baz.store(42, memory_order_relaxed);
> >   }
> >   ||| {
> > r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > r2=foo.load(memory_order_seq_cst).readsvalue(0);
> >   }
> >   }}};
> >   return 0; }
> > 
> > That yields 3 consistent executions for me, and likewise if the last
> > readsvalue() is using 42 as argument.
> > 
> > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > program can't observe != 42 for foo anymore, because the seq-cst fence
> > is adding a synchronizes-with edge via the baz reads-from.
> > 
> > I think this is a really neat tool, and very helpful to answer such
> > questions as in your example.
> 
> Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> your substitution of store() for fetch_add() is correct.  Then this shows
> that we cannot substitute fetch_add() for atomic_add_return().

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Joseph S. Myers
On Thu, 6 Feb 2014, Torvald Riegel wrote:

> > It seems that nobody really
> > agrees on exactly how the C11 atomics map to real architectural
> > instructions on anything but the trivial architectures.
> 
> There's certainly different ways to implement the memory model and those
> have to be specified elsewhere, but I don't see how this differs much
> from other things specified in the ABI(s) for each architecture.

It is not clear to me that there is any good consensus understanding of 
how to specify this in an ABI, or how, given an ABI, to determine whether 
an implementation is valid.

For ABIs not considering atomics / concurrency, it's well understood, for 
example, that the ABI specifies observable conditions at function call 
boundaries ("these arguments are in these registers", "the stack pointer 
has this alignment", "on return from the function, the values of these 
registers are unchanged from the values they had on entry").  It may 
sometimes specify things at other times (e.g. presence or absence of a red 
zone - whether memory beyond the stack pointer may be overwritten on an 
interrupt).  But if it gives a code sequence, it's clear this is just an 
example rather than a requirement for particular code to be generated - 
any code sequence suffices if it meets the observable conditions at the 
points where code generated by one implementation may pass control to code 
generated by another implementation.

When atomics are involved, you no longer have a limited set of 
well-defined points where control may pass from code generated by one 
implementation to code generated by another - the code generated by the 
two may be running concurrently.

We know of certain cases where there are choices of the mapping of atomic
operations to particular instructions.
But I'm not sure there's much evidence that these are the only ABI issues 
arising from concurrency - that there aren't any other ways in which an 
implementation may transform code, consistent with the as-if rule of ISO 
C, that may run into incompatibilities of different choices.  And even if 
those are the only issues, it's not clear there are well-understood ways 
to define the mapping from the C11 memory model to the architecture's 
model, which provide a good way to reason about whether a particular 
choice of instructions is valid according to the mapping.

> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

These discuss, as well as the model itself, proving the validity of a 
particular choice of x86 instructions.  I imagine that might be a starting 
point towards an understanding of how to describe the relevant issues in 
an ABI, and how to determine whether a choice of instructions is 
consistent with what an ABI says.  But I don't get the impression this is 
yet at the level where people not doing research in the area can routinely 
read and write such ABIs and work out whether a code sequence is 
consistent with them.

(If an ABI says "use instruction X", then you can't use a more efficient 
X' added by a new version of the instruction set.  But it can't 
necessarily be as loose as saying "use X and Y, or other instructions that 
achieve semantics when the other thread is using X or Y", because it might 
be the case that Y' interoperates with X, X' interoperates with Y, but X' 
and Y' don't interoperate with each other.  I'd envisage something more 
like mapping not to instructions, but to concepts within the 
architecture's own memory model - but that requires the architecture's 
memory model to be as well defined, and probably formalized, as the C11 
one.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 22:13 +, Joseph S. Myers wrote:
> On Thu, 6 Feb 2014, Torvald Riegel wrote:
> 
> > > It seems that nobody really
> > > agrees on exactly how the C11 atomics map to real architectural
> > > instructions on anything but the trivial architectures.
> > 
> > There's certainly different ways to implement the memory model and those
> > have to be specified elsewhere, but I don't see how this differs much
> > from other things specified in the ABI(s) for each architecture.
> 
> It is not clear to me that there is any good consensus understanding of 
> how to specify this in an ABI, or how, given an ABI, to determine whether 
> an implementation is valid.
> 
> For ABIs not considering atomics / concurrency, it's well understood, for 
> example, that the ABI specifies observable conditions at function call 
> boundaries ("these arguments are in these registers", "the stack pointer 
> has this alignment", "on return from the function, the values of these 
> registers are unchanged from the values they had on entry").  It may 
> sometimes specify things at other times (e.g. presence or absence of a red 
> zone - whether memory beyond the stack pointer may be overwritten on an 
> interrupt).  But if it gives a code sequence, it's clear this is just an 
> example rather than a requirement for particular code to be generated - 
> any code sequence suffices if it meets the observable conditions at the 
> points where code generated by one implementation may pass control to code 
> generated by another implementation.
> 
> When atomics are involved, you no longer have a limited set of 
> well-defined points where control may pass from code generated by one 
> implementation to code generated by another - the code generated by the 
> two may be running concurrently.

Agreed.

> We know of certain cases where there are choices of the mapping of atomic
> operations to particular instructions.
> But I'm not sure there's much evidence that these are the only ABI issues 
> arising from concurrency - that there aren't any other ways in which an 
> implementation may transform code, consistent with the as-if rule of ISO 
> C, that may run into incompatibilities of different choices.

I can't think of other incompatibilities with high likelihood --
provided we ignore consume memory order and the handling of dependencies
(see below).  But I would doubt there is a high risk of such because (1)
the data race definition should hopefully not cause subtle
incompatibilities and (2) there is a clear "hand-off point" from the
compiler to a specific instruction representing an atomic access.  For
example, if we have a release store, and we agree on the instructions
used for that, then compilers will have to ensure happens-before for
anything before the release store; for example, as long as stores
sequenced-before the release store are performed, it doesn't matter in
which order that happens.  Subsequently, an acquire-load somewhere else
can pick this sequence of events up just by using the agreed-upon
acquire-load; like with the stores, it can order subsequent loads in any
way it sees fit, including different optimizations.  That's obviously
not a formal proof, though.  But it seems likely to me, at least :)

I'm more concerned about consume and dependencies because as far as I
understand the standard's requirements, dependencies need to be tracked
across function calls.  Thus, we might have several compilers involved
in that, and we can't just "condense" things to happens-before; instead,
the dependencies themselves must be kept intact.  Because of that, I'm
wondering whether this is actually implementable practically.  (See
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10)
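The cross-function concern can be sketched as follows (illustrative names; imagine identity() living in a separate translation unit, possibly compiled by a different compiler, which is what makes the requirement hard):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of the concern above: per the standard, a consume
 * dependency must be carried through ordinary function calls, so
 * identity() below -- possibly compiled separately, by a different
 * compiler -- must preserve the chain rather than break it.
 * std::kill_dependency (C++) is the only sanctioned escape hatch. */

struct node { int val; };

static struct node n1 = { 42 };
static struct node *_Atomic gp = &n1;

/* Imagine this is defined in another translation unit. */
static struct node *identity(struct node *p)
{
    return p;   /* must still carry the dependency on return */
}

int consume_then_call(void)
{
    struct node *p = atomic_load_explicit(&gp, memory_order_consume);
    return identity(p)->val;   /* dependency crosses the call */
}
```

With separate compilation, neither side knows whether the other tracks dependencies, so either the ABI must mandate it for every function, or the compiler must conservatively emit an acquire barrier at such boundaries, which is why practical implementability is in doubt.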

> And even if 
> those are the only issues, it's not clear there are well-understood ways 
> to define the mapping from the C11 memory model to the architecture's 
> model, which provide a good way to reason about whether a particular 
> choice of instructions is valid according to the mapping.

I think that if we have different options, there needs to be agreement
on which to choose across the compilers, at the very least.  I don't
quite know what this looks like for GCC and LLVM, for example.

> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> These discuss, as well as the model itself, proving the validity of a 
> particular choice of x86 instructions.  I imagine that might be a starting 
> point towards an understanding of how to describe the relevant issues in 
> an ABI, and how to determine whether a choice of instructions is 
> consistent with what an ABI says.  But I don't get the impression this is 
> yet at the level where people not doing research in the area can routinely 
> read and write such ABIs and work out whether a code sequence is
> consistent with them.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Joseph S. Myers
On Fri, 7 Feb 2014, Torvald Riegel wrote:

> I think that if we have different options, there needs to be agreement
> on which to choose across the compilers, at the very least.  I don't
> quite know what this looks like for GCC and LLVM, for example.

I'm not sure we even necessarily get compatibility for the alignment of 
_Atomic types yet (and no ABI document I've seen discusses that issue).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Torvald Riegel
On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > There are also so many ways to blow your head off it's untrue. For example,
> > > > cmpxchg takes a separate memory model parameter for failure and success, but
> > > > then there are restrictions on the sets you can use for each. It's not hard
> > > > to find well-known memory-ordering experts shouting "Just use
> > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's
> > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> > > > atm and optimises all of the data dependencies away) as well as the definition
> > > > of "data races", which seem to be used as an excuse to miscompile a program
> > > > at the earliest opportunity.
> > > 
> > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > memory_order_consume until the compilers implement it both correctly and
> > > efficiently.  They are not there yet, and there is currently no shortage
> > > of compiler writers who would prefer to ignore memory_order_consume.
> > 
> > Do you have any input on
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > language standard's definition of dependencies?
> 
> Let's see...  1.10p9 says that a dependency must be carried unless:
> 
> — B is an invocation of any specialization of std::kill_dependency (29.3), or
> — A is the left operand of a built-in logical AND (&&, see 5.14) or logical 
> OR (||, see 5.15) operator,
> or
> — A is the left operand of a conditional (?:, see 5.16) operator, or
> — A is the left operand of the built-in comma (,) operator (5.18);
> 
> So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> after the "?" will carry a dependency, so the code fragment in 59448
> needs to do the ordering rather than just optimizing "flag - flag" out
> of existence.  One way to do that on both ARM and Power is to actually
> emit code for "flag - flag", but there are a number of other ways to
> make that work.

And that's what would concern me, considering that these requirements
seem to be able to creep out easily.  Also, whereas the other atomics
just constrain compilers wrt. reordering across atomic accesses or
changes to the atomic accesses themselves, the dependencies are new
requirements on pieces of otherwise non-synchronizing code.  The latter
seems far more involved to me.

> BTW, there is some discussion on 1.10p9's handling of && and ||, and
> that clause is likely to change.  And yes, I am behind on analyzing
> usage in the Linux kernel to find out if Linux cares...

Do you have any pointers to these discussions (e.g., LWG issues)?

> > > And rcu_dereference() will need per-arch overrides for some time during
> > > any transition to memory_order_consume.
> > > 
> > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > non-coherent agents) into this mess is going to be an uphill battle IMHO.
> > > > I'd just rather stick to the semantics we have and the asm volatile barriers.
> > > 
> > > And barrier() isn't going to go away any time soon, either.  And
> > > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > > memory_order_whatever that prevents loads and stores from being coalesced.
> > 
> > I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> > been discussed in the past already?).  But it needs to have a paper I
> > suppose.
> 
> The current position of the usual suspects other than me is that this
> falls into the category of forward-progress guarantees, which are
> considered (again, by the usual suspects other than me) to be out
> of scope.

But I think we need to better describe forward progress, even though
that might be tricky.  We made at least some progress on
http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even
though we can't constrain the OS schedulers too much, and for lock-free
we're in this weird position that on most general-purpose schedulers and
machines, obstruction-free algorithms are likely to work just fine like
lock-free, most of the time, in practice...

We also need to discuss forward progress guarantees for any
parallelism/concurrency abstractions, I believe:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3874.pdf

Hopefully we'll get some more acceptance of this being in scope...

> > Will you be in Issaquah for the C++ meeting next week?
> 
> Weather permitting, I will be there!

Great, maybe we can find some time in SG1 to discuss this then.  Even if
the standard doesn't want to include it, SG1 should be a good forum to
understand everyone's concerns around that, with the hope that this
would help potential non-standard ex

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > > On Thu, Feb 06, 2014 at 06:55:01PM +, Ramana Radhakrishnan wrote:
> > > > > On 02/06/14 18:25, David Howells wrote:
> > > > > >
> > > > > > Is it worth considering a move towards using C11 atomics and 
> > > > > > barriers and
> > > > > > compiler intrinsics inside the kernel?  The compiler _ought_ to be 
> > > > > > able to do
> > > > > > these.
> > > > > 
> > > > > 
> > > > > It sounds interesting to me, if we can make it work properly and 
> > > > > reliably. + g...@gcc.gnu.org for others in the GCC community to chip 
> > > > > in.
> > > > 
> > > > Given my (albeit limited) experience playing with the C11 spec and GCC, 
> > > > I
> > > > really think this is a bad idea for the kernel.
> > > 
> > > I'm not going to comment on what's best for the kernel (simply because I
> > > don't work on it), but I disagree with several of your statements.
> > > 
> > > > It seems that nobody really
> > > > agrees on exactly how the C11 atomics map to real architectural
> > > > instructions on anything but the trivial architectures.
> > > 
> > > There's certainly different ways to implement the memory model and those
> > > have to be specified elsewhere, but I don't see how this differs much
> > > from other things specified in the ABI(s) for each architecture.
> > > 
> > > > For example, should
> > > > the following code fire the assert?
> > > 
> > > I don't see how your example (which is about what the language requires
> > > or not) relates to the statement about the mapping above?
> > > 
> > > > 
> > > > extern atomic foo, bar, baz;
> > > > 
> > > > void thread1(void)
> > > > {
> > > > foo.store(42, memory_order_relaxed);
> > > > bar.fetch_add(1, memory_order_seq_cst);
> > > > baz.store(42, memory_order_relaxed);
> > > > }
> > > > 
> > > > void thread2(void)
> > > > {
> > > > while (baz.load(memory_order_seq_cst) != 42) {
> > > > /* do nothing */
> > > > }
> > > > 
> > > > assert(foo.load(memory_order_seq_cst) == 42);
> > > > }
> > > > 
> > > 
> > > It's a good example.  My first gut feeling was that the assertion should
> > > never fire, but that was wrong because (as I seem to usually forget) the
> > > seq-cst total order is just a constraint but doesn't itself contribute
> > > to synchronizes-with -- but this is different for seq-cst fences.
> > 
> > From what I can see, Will's point is that mapping the Linux kernel's
> > atomic_add_return() primitive into fetch_add() does not work because
> > atomic_add_return()'s ordering properties require that the assert()
> > never fire.
> > 
> > Augmenting the fetch_add() with a seq_cst fence would work on many
> > architectures, but not for all similar examples.  The reason is that
> > the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> > Power's sync.  To your point, I believe that it would make the above
> > example work, but there are some IRIW-like examples that would fail
> > according to the standard (though a number of specific implementations
> > would in fact work correctly).
> 
> Thanks for the background.  I don't read LKML, and it wasn't obvious
> from reading just the part of the thread posted to gcc@ that achieving
> these semantics is the goal.

Lots of history on both sides of this one, I am afraid!  ;-)

> > > > To answer that question, you need to go and look at the definitions of
> > > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > > pile of vaguely written waffle to realise that you don't know.
> > > 
> > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > et al.?
> > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > 
> > > They also have a nice tool that can run condensed examples and show you
> > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > slow for larger examples), including nice annotated graphs for those:
> > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > 
> > > It requires somewhat special syntax, but the following, which should be
> > > equivalent to your example above, runs just fine:
> > > 
> > > int main() {
> > >   atomic_int foo = 0; 
> > >   atomic_int bar = 0; 
> > >   atomic_int baz = 0; 
> > >   {{{ {
> > > foo.store(42, memory_order_relaxed);
> > > bar.store(1, memory_order_seq_cst);
> > > baz.store(42, memory_order_relaxed);
> > >   }
> > >   ||| {
> > > r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > >   }
> > >   }}};
> > >   return 0; }
> > > 
> > > That yields 3 consistent executions for me, and likewise if the last
> > > read

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > There are also so many ways to blow your head off it's untrue. For 
> > > > > example,
> > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > success, but
> > > > > then there are restrictions on the sets you can use for each. It's 
> > > > > not hard
> > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then 
> > > > > there's
> > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores 
> > > > > consume
> > > > > atm and optimises all of the data dependencies away) as well as the 
> > > > > definition
> > > > > of "data races", which seem to be used as an excuse to miscompile a 
> > > > > program
> > > > > at the earliest opportunity.
> > > > 
> > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > memory_order_consume until the compilers implement it both correctly and
> > > > efficiently.  They are not there yet, and there is currently no shortage
> > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > 
> > > Do you have any input on
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > language standard's definition of dependencies?
> > 
> > Let's see...  1.10p9 says that a dependency must be carried unless:
> > 
> > — B is an invocation of any specialization of std::kill_dependency (29.3), 
> > or
> > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical 
> > OR (||, see 5.15) operator,
> > or
> > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > — A is the left operand of the built-in comma (,) operator (5.18);
> > 
> > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > after the "?" will carry a dependency, so the code fragment in 59448
> > needs to do the ordering rather than just optimizing "flag - flag" out
> > of existence.  One way to do that on both ARM and Power is to actually
> > emit code for "flag - flag", but there are a number of other ways to
> > make that work.
> 
> And that's what would concern me, considering that these requirements
> seem to be able to creep out easily.  Also, whereas the other atomics
> just constrain compilers wrt. reordering across atomic accesses or
> changes to the atomic accesses themselves, the dependencies are new
> requirements on pieces of otherwise non-synchronizing code.  The latter
> seems far more involved to me.

Well, the wording of 1.10p9 is pretty explicit on this point.
There are only a few exceptions to the rule that dependencies from
memory_order_consume loads must be tracked.  And to your point about
requirements being placed on pieces of otherwise non-synchronizing code,
we already have that with plain old load acquire and store release --
both of these put ordering constraints that affect the surrounding
non-synchronizing code.

This issue got a lot of discussion, and the compromise is that
dependencies cannot leak into or out of functions unless the relevant
parameters or return values are annotated with [[carries_dependency]].
This means that the compiler can see all the places where dependencies
must be tracked.  This is described in 7.6.4.  If a dependency chain
headed by a memory_order_consume load goes into or out of a function
without the aid of the [[carries_dependency]] attribute, the compiler
needs to do something else to enforce ordering, e.g., emit a memory
barrier.

From a Linux-kernel viewpoint, this is a bit ugly, as it requires
annotations and use of kill_dependency, but it was the best I could do
at the time.  If things go as they usually do, there will be some other
reason why those are needed...

> > BTW, there is some discussion on 1.10p9's handling of && and ||, and
> > that clause is likely to change.  And yes, I am behind on analyzing
> > usage in the Linux kernel to find out if Linux cares...
> 
> Do you have any pointers to these discussions (e.g., LWG issues)?

Nope, just a bare email thread.  I would guess that it will come up
next week.

The question is whether dependencies should be carried through && or ||
at all, and if so how.  My current guess is that && and || should not
carry dependencies.

> > > > And rcu_dereference() will need per-arch overrides for some time during
> > > > any transition to memory_order_consume.
> > > > 
> > > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > > non-coherent agents) into this mess is going to be an uphill battle 
> > > > > IMHO. I'd
> > > > > just rather stick to the semantics we have and the asm volatile 
> > > > > barrie

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-06 Thread Peter Zijlstra
On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> Hopefully some discussion of out-of-thin-air values as well.

Yes, absolutely shoot store speculation in the head already. Then drive
a wooden stake through its heart.

C11/C++11 should not be allowed to claim itself a memory model until that
is sorted.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote:
> On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > > > To answer that question, you need to go and look at the definitions of
> > > > > synchronises-with, happens-before, dependency_ordered_before and a 
> > > > > whole
> > > > > pile of vaguely written waffle to realise that you don't know.
> > > > 
> > > > Are you familiar with the formalization of the C11/C++11 model by Batty
> > > > et al.?
> > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > 
> > > > They also have a nice tool that can run condensed examples and show you
> > > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > > slow for larger examples), including nice annotated graphs for those:
> > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > > 
> > > > It requires somewhat special syntax, but the following, which should be
> > > > equivalent to your example above, runs just fine:
> > > > 
> > > > int main() {
> > > >   atomic_int foo = 0; 
> > > >   atomic_int bar = 0; 
> > > >   atomic_int baz = 0; 
> > > >   {{{ {
> > > > foo.store(42, memory_order_relaxed);
> > > > bar.store(1, memory_order_seq_cst);
> > > > baz.store(42, memory_order_relaxed);
> > > >   }
> > > >   ||| {
> > > > r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > > r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > > >   }
> > > >   }}};
> > > >   return 0; }
> > > > 
> > > > That yields 3 consistent executions for me, and likewise if the last
> > > > readsvalue() is using 42 as argument.
> > > > 
> > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the
> > > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > > is adding a synchronizes-with edge via the baz reads-from.
> > > > 
> > > > I think this is a really neat tool, and very helpful to answer such
> > > > questions as in your example.
> > > 
> > > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume that
> > > your substitution of store() for fetch_add() is correct.  Then this shows
> > > that we cannot substitute fetch_add() for atomic_add_return().
> > 
> > It should be in this example, I believe.
> 
> You lost me on this one.

I mean that in this example, substituting fetch_add() with store()
should not change meaning, given that what the fetch_add reads-from
seems irrelevant.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Will Deacon
Hello Torvald,

It looks like Paul clarified most of the points I was trying to make
(thanks Paul!), so I won't go back over them here.

On Thu, Feb 06, 2014 at 09:09:25PM +, Torvald Riegel wrote:
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> 
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

Thanks for the link, that's incredibly helpful. I've used ppcmem and armmem
in the past, but I didn't realise they have a version for C++11 too.
Actually, the armmem backend doesn't implement our atomic instructions or
the acquire/release accessors, so it's not been as useful as it could be.
I should probably try to learn OCaml...

> IMHO, one thing worth considering is that for C/C++, the C11/C++11 is
> the only memory model that has widespread support.  So, even though it's
> a fairly weak memory model (unless you go for the "only seq-cst"
> beginners advice) and thus comes with a higher complexity, this model is
> what likely most people will be familiar with over time.  Deviating from
> the "standard" model can have valid reasons, but it also has a cost in
> that new contributors are more likely to be familiar with the "standard"
> model.

Indeed, I wasn't trying to write-off the C11 memory model as something we
can never use in the kernel. I just don't think the current situation is
anywhere close to usable for a project such as Linux. If a greater
understanding of the memory model does eventually manifest amongst C/C++
developers (by which I mean, the beginners advice is really treated as
such and there is a widespread intuition about ordering guarantees, as
opposed to the need to use formal tools), then surely the tools and libraries
will stabilise and provide uniform semantics across the 25+ architectures
that Linux currently supports. If *that* happens, this discussion is certainly
worth having again.

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 10:13:40AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote:
> > On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 18:59 +, Will Deacon wrote:
> > > > > > To answer that question, you need to go and look at the definitions 
> > > > > > of
> > > > > > synchronises-with, happens-before, dependency_ordered_before and a 
> > > > > > whole
> > > > > > pile of vaguely written waffle to realise that you don't know.
> > > > > 
> > > > > Are you familiar with the formalization of the C11/C++11 model by 
> > > > > Batty
> > > > > et al.?
> > > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > > > 
> > > > > They also have a nice tool that can run condensed examples and show 
> > > > > you
> > > > > all allowed (and forbidden) executions (it runs in the browser, so is
> > > > > slow for larger examples), including nice annotated graphs for those:
> > > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> > > > > 
> > > > > It requires somewhat special syntax, but the following, which should 
> > > > > be
> > > > > equivalent to your example above, runs just fine:
> > > > > 
> > > > > int main() {
> > > > >   atomic_int foo = 0; 
> > > > >   atomic_int bar = 0; 
> > > > >   atomic_int baz = 0; 
> > > > >   {{{ {
> > > > > foo.store(42, memory_order_relaxed);
> > > > > bar.store(1, memory_order_seq_cst);
> > > > > baz.store(42, memory_order_relaxed);
> > > > >   }
> > > > >   ||| {
> > > > > r1=baz.load(memory_order_seq_cst).readsvalue(42);
> > > > > r2=foo.load(memory_order_seq_cst).readsvalue(0);
> > > > >   }
> > > > >   }}};
> > > > >   return 0; }
> > > > > 
> > > > > That yields 3 consistent executions for me, and likewise if the last
> > > > > readsvalue() is using 42 as argument.
> > > > > 
> > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, 
> > > > > the
> > > > > program can't observe != 42 for foo anymore, because the seq-cst fence
> > > > > is adding a synchronizes-with edge via the baz reads-from.
> > > > > 
> > > > > I think this is a really neat tool, and very helpful to answer such
> > > > > questions as in your example.
> > > > 
> > > > Hmmm...  The tool doesn't seem to like fetch_add().  But let's assume 
> > > > that
> > > > your substitution of store() for fetch_add() is correct.  Then this 
> > > > shows
> > > > that we cannot substitute fetch_add() for atomic_add_return().
> > > 
> > > It should be in this example, I believe.
> > 
> > You lost me on this one.
> 
> I mean that in this example, substituting fetch_add() with store()
> should not change meaning, given that what the fetch_add reads-from
> seems irrelevant.

Got it.  Agreed, though your other suggestion of substituting CAS is
more convincing.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 12:01:25PM +, Will Deacon wrote:
> Hello Torvald,
> 
> It looks like Paul clarified most of the points I was trying to make
> (thanks Paul!), so I won't go back over them here.
> 
> On Thu, Feb 06, 2014 at 09:09:25PM +, Torvald Riegel wrote:
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > 
> > They also have a nice tool that can run condensed examples and show you
> > all allowed (and forbidden) executions (it runs in the browser, so is
> > slow for larger examples), including nice annotated graphs for those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> Thanks for the link, that's incredibly helpful. I've used ppcmem and armmem
> in the past, but I didn't realise they have a version for C++11 too.
> Actually, the armmem backend doesn't implement our atomic instructions or
> the acquire/release accessors, so it's not been as useful as it could be.
> I should probably try to learn OCaml...

That would be very cool!

Another option would be to recruit a grad student to take on that project
for Peter Sewell.  He might already have one, for all I know.

> > IMHO, one thing worth considering is that for C/C++, the C11/C++11 is
> > the only memory model that has widespread support.  So, even though it's
> > a fairly weak memory model (unless you go for the "only seq-cst"
> > beginners advice) and thus comes with a higher complexity, this model is
> > what likely most people will be familiar with over time.  Deviating from
> > the "standard" model can have valid reasons, but it also has a cost in
> > that new contributors are more likely to be familiar with the "standard"
> > model.
> 
> Indeed, I wasn't trying to write-off the C11 memory model as something we
> can never use in the kernel. I just don't think the current situation is
> anywhere close to usable for a project such as Linux. If a greater
> understanding of the memory model does eventually manifest amongst C/C++
> developers (by which I mean, the beginners advice is really treated as
> such and there is a widespread intuition about ordering guarantees, as
> opposed to the need to use formal tools), then surely the tools and libraries
> will stabilise and provide uniform semantics across the 25+ architectures
> that Linux currently supports. If *that* happens, this discussion is certainly
> worth having again.

And it is likely to be worthwhile even before then on a piecemeal
basis, where architecture maintainers pick and choose which primitive
is in inline assembly and which the compiler can deal with properly.
For example, I bet that atomic_inc() can be implemented just fine by C11
in the very near future.  However, atomic_add_return() is another story.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Paul E. McKenney
On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > Hopefully some discussion of out-of-thin-air values as well.
> 
> Yes, absolutely shoot store speculation in the head already. Then drive
> a wooden stake through its hart.
> 
> C11/C++11 should not be allowed to claim itself a memory model until that
> is sorted.

There actually is a proposal being put forward, but it might not make ARM
and Power people happy because it involves adding a compare, a branch,
and an ISB/isync after every relaxed load...  Me, I agree with you,
much preferring the no-store-speculation approach.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Will Deacon
Hi Paul,

On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> > 
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its hart.
> > 
> > C11/C++11 should not be allowed to claim itself a memory model until that
> > is sorted.
> 
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load...  Me, I agree with you,
> much preferring the no-store-speculation approach.

Can you elaborate a bit on this please? We don't permit speculative stores
in the ARM architecture, so it seems counter-intuitive that GCC needs to
emit any additional instructions to prevent that from happening.

Stores can, of course, be observed out-of-order but that's a lot more
reasonable :)

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Peter Zijlstra
On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > Hopefully some discussion of out-of-thin-air values as well.
> > > 
> > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > a wooden stake through its hart.
> > > 
> > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > is sorted.
> > 
> > There actually is a proposal being put forward, but it might not make ARM
> > and Power people happy because it involves adding a compare, a branch,
> > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > much preferring the no-store-speculation approach.
> 
> Can you elaborate a bit on this please? We don't permit speculative stores
> in the ARM architecture, so it seems counter-intuitive that GCC needs to
> emit any additional instructions to prevent that from happening.
> 
> Stores can, of course, be observed out-of-order but that's a lot more
> reasonable :)

This is more about the compiler speculating on stores; imagine:

  if (x)
y = 1;
  else
y = 2;

The compiler is allowed to change that into:

  y = 2;
  if (x)
y = 1;

Which is of course a big problem when you want to rely on the ordering.

There are further problems where things like memset() can write outside
the specified address range. Examples are memset() using single
instructions to wipe entire cachelines and then 'restoring' the tail
bit.

While valid for single-threaded code, it's a complete disaster for
concurrent code.

There's more, but it all boils down to doing stores you don't expect in
a 'sane' concurrent environment and/or don't respect the control flow.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Will Deacon
On Fri, Feb 07, 2014 at 05:06:54PM +, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its hart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> > 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> This is more about the compiler speculating on stores; imagine:
> 
>   if (x)
>   y = 1;
>   else
>   y = 2;
> 
> The compiler is allowed to change that into:
> 
>   y = 2;
>   if (x)
>   y = 1;
> 
> Which is of course a big problem when you want to rely on the ordering.

Understood, but that doesn't explain why Paul wants to add ISB/isync
instructions which affect the *CPU* rather than the compiler!

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-07 Thread Peter Zijlstra
On Fri, Feb 07, 2014 at 05:13:36PM +, Will Deacon wrote:
> Understood, but that doesn't explain why Paul wants to add ISB/isync
> instructions which affect the *CPU* rather than the compiler!

I doubt Paul wants it, but yeah, I'm curious about that proposal as
well, sounds like someone took a big toke from the bong again; it seems
a favourite pastime amongst committees.

