Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 09, 2016 at 12:23:58PM +0100, Ingo Molnar wrote:
> * Will Deacon wrote:
> > On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> > > On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > > > In fact I'd suggest to test this via a quick runtime hack like this in
> > > > rcupdate.h:
> > > >
> > > > 	extern int panic_timeout;
> > > >
> > > > 	...
> > > >
> > > > 	if (panic_timeout)
> > > > 		smp_load_acquire(p);
> > > > 	else
> > > > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> > > >
> > > > (or so)
> > >
> > > So the problem with this is that a LOAD LOAD sequence isn't an
> > > ordering hazard on ARM, so you're potentially at the mercy of the branch
> > > predictor as to whether you get an acquire. That's not to say it won't
> > > be discarded as soon as the conditional is resolved, but it could
> > > screw up the benchmarking.
> > >
> > > I'd be better off doing some runtime patching, but that's not something
> > > I can knock up in a couple of minutes (so I'll add it to my list).
> >
> > ... so I actually got that up and running, believe it or not. Filthy stuff.
>
> Wow!
>
> I tried to implement the simpler solution by hacking rcupdate.h, but got
> drowned in nasty circular header file dependencies and gave up...

I hacked linux/compiler.h, but got similar problems and ended up copying
what I need under a new namespace (posh way of saying I added _WILL to
everything until it built).

> If you are not overly embarrassed by posting hacky patches, mind posting
> your solution?

Ok, since you asked! I'm thoroughly ashamed of the hacks, but maybe it
will keep those spammy Microsoft recruiters at bay ;)

Note that the "trigger" is changing the loglevel, but you need to do this
by echoing to /proc/sysrq-trigger, otherwise the whole thing gets invoked
in irq context which ends badly. It's also heavily arm64-specific.
> > The good news is that you're right, and I'm now seeing ~1% difference
> > between the runs with ~0.3% noise for either of them. I still think
> > that's significant, but it's a lot more reassuring than 4%.
>
> hm, so for such marginal effects I think we could improve the testing
> method a bit: we could improve 'perf bench sched messaging' to allow
> 'steady state testing': to not exit+restart all the processes between
> test iterations, but to continuously measure and print out current
> performance figures.
>
> I.e. every 10 seconds it could print a decaying running average of
> current throughput.
>
> That way you could patch/unpatch the instructions without having to
> restart the tasks. If you still see an effect (in the numbers reported
> every 10 seconds), then that's a guaranteed result.
>
> [ We have such functionality in 'perf bench numa' (the
>   --show-convergence option), for similar reasons, to allow runtime
>   monitoring and tweaking of kernel parameters. ]

That sounds handy.
Will

--->8

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 8f271b83f910..1a1e353983be 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -30,8 +30,8 @@
 #define ARM64_HAS_LSE_ATOMICS		5
 #define ARM64_WORKAROUND_CAVIUM_23154	6
 #define ARM64_WORKAROUND_834220		7
-
-#define ARM64_NCAPS			8
+#define ARM64_RCU_USES_ACQUIRE		8
+#define ARM64_NCAPS			9
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/alternative.c b/arch/arm64/kernel/alternative.c
index d2ee1b21a10d..edbf188a8541 100644
--- a/arch/arm64/kernel/alternative.c
+++ b/arch/arm64/kernel/alternative.c
@@ -143,6 +143,66 @@ static int __apply_alternatives_multi_stop(void *unused)
 	return 0;
 }
 
+static void __apply_alternatives_rcu(void *alt_region)
+{
+	struct alt_instr *alt;
+	struct alt_region *region = alt_region;
+	u32 *origptr, *replptr;
+
+	for (alt = region->begin; alt < region->end; alt++) {
+		u32 insn, orig_insn;
+		int nr_inst;
+
+		if (alt->cpufeature != ARM64_RCU_USES_ACQUIRE)
+			continue;
+
+		BUG_ON(alt->alt_len != alt->orig_len);
+
+		origptr = ALT_ORIG_PTR(alt);
+		replptr = ALT_REPL_PTR(alt);
+		nr_inst = alt->alt_len / sizeof(insn);
+
+		BUG_ON(nr_inst != 1);
+
+		insn = le32_to_cpu(*replptr);
+		orig_insn = le32_to_cpu(*origptr);
+		*(origptr) = cpu_to_le32(insn);
+		*replptr = cpu_to_le32(orig_insn);
+
+		flush_icache_range((uintptr_t)origptr, (uintptr_t)(origptr + 1));
+		pr_info_ratelimited("%p: 0x%x => 0x%x\n", origptr, orig_insn, insn);
+	}
+}
+
+static int will_patched;
+
+static int __apply_alternatives
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
* Will Deacon wrote:

> On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> > On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > > In fact I'd suggest to test this via a quick runtime hack like this in
> > > rcupdate.h:
> > >
> > > 	extern int panic_timeout;
> > >
> > > 	...
> > >
> > > 	if (panic_timeout)
> > > 		smp_load_acquire(p);
> > > 	else
> > > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> > >
> > > (or so)
> >
> > So the problem with this is that a LOAD LOAD sequence isn't an
> > ordering hazard on ARM, so you're potentially at the mercy of the branch
> > predictor as to whether you get an acquire. That's not to say it won't
> > be discarded as soon as the conditional is resolved, but it could
> > screw up the benchmarking.
> >
> > I'd be better off doing some runtime patching, but that's not something
> > I can knock up in a couple of minutes (so I'll add it to my list).
>
> ... so I actually got that up and running, believe it or not. Filthy stuff.

Wow!

I tried to implement the simpler solution by hacking rcupdate.h, but got
drowned in nasty circular header file dependencies and gave up...

If you are not overly embarrassed by posting hacky patches, mind posting
your solution?

> The good news is that you're right, and I'm now seeing ~1% difference
> between the runs with ~0.3% noise for either of them. I still think
> that's significant, but it's a lot more reassuring than 4%.

hm, so for such marginal effects I think we could improve the testing
method a bit: we could improve 'perf bench sched messaging' to allow
'steady state testing': to not exit+restart all the processes between
test iterations, but to continuously measure and print out current
performance figures.

I.e. every 10 seconds it could print a decaying running average of
current throughput.

That way you could patch/unpatch the instructions without having to
restart the tasks. If you still see an effect (in the numbers reported
every 10 seconds), then that's a guaranteed result.

[ We have such functionality in 'perf bench numa' (the --show-convergence
  option), for similar reasons, to allow runtime monitoring and tweaking
  of kernel parameters. ]

Thanks,

	Ingo
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 11:55:57AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 11:30 AM, Will Deacon wrote:
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and
> > ran hackbench a few times on the nearest idle arm64 system. The results
> > were consistently ~4% slower using acquire for rcu_dereference.
>
> Ok, that's *much* more noticeable than I would have expected. I take
> it that load-acquire is really really slow on current arm64
> implementations.

See my reply to Ingo, but it seems a bunch of this was down to rebooting
the system between runs and hackbench being particularly susceptible to
that.

> Just out of interest, is store-release slow too? Because that should
> be easy to make fast.

There's a slight gotcha with arm64's store-release instruction in that
it's RCsc and therefore orders against a subsequent load-acquire. That's
not to say you can't make it fast, but it's potentially more involved
than posting a flag in a store buffer (or whatever you were envisaging :)

Measuring store-release is much more difficult, because you can't replace
it with a dependency or the like, only other barrier constructs.

Will
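[ For readers not steeped in the arm64 memory model: "RCsc" here means the
  store-release additionally orders against a *later* load-acquire, even to
  an unrelated location. A user-space C11 sketch of the instruction pair
  Will is describing (names and values are illustrative, not kernel code): ]

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int x, y;

	int main(void)
	{
		/* compiles to STLR on arm64 */
		atomic_store_explicit(&x, 1, memory_order_release);
		/* compiles to LDAR on arm64; the RCsc rule says this load
		 * may not be reordered before the STLR above, which is why
		 * a cheap "drain the store buffer later" implementation of
		 * store-release is not quite enough */
		int r = atomic_load_explicit(&y, memory_order_acquire);
		printf("x=%d r=%d\n", atomic_load(&x), r);
		return 0;
	}

[ An RCpc model would permit that reordering; arm64 deliberately does not. ]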
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > In fact I'd suggest to test this via a quick runtime hack like this in
> > rcupdate.h:
> >
> > 	extern int panic_timeout;
> >
> > 	...
> >
> > 	if (panic_timeout)
> > 		smp_load_acquire(p);
> > 	else
> > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> >
> > (or so)
>
> So the problem with this is that a LOAD LOAD sequence isn't an
> ordering hazard on ARM, so you're potentially at the mercy of the branch
> predictor as to whether you get an acquire. That's not to say it won't
> be discarded as soon as the conditional is resolved, but it could
> screw up the benchmarking.
>
> I'd be better off doing some runtime patching, but that's not something
> I can knock up in a couple of minutes (so I'll add it to my list).

... so I actually got that up and running, believe it or not. Filthy stuff.

The good news is that you're right, and I'm now seeing ~1% difference
between the runs with ~0.3% noise for either of them. I still think
that's significant, but it's a lot more reassuring than 4%.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> * Will Deacon wrote:
> > On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> > > >
> > > > Given that the vast majority of weakly ordered architectures respect
> > > > address dependencies, I would expect all of them to be hurt if they
> > > > were forced to use barrier instructions instead, even those where the
> > > > microarchitecture is fairly strongly ordered in practice.
> > >
> > > I do wonder if it would be all that noticeable, though. I don't think
> > > we've really had benchmarks.
> > >
> > > For example, most of the RCU list traversal shows up on x86 - where
> > > loads are already acquires. But they show up not because of that, but
> > > because a RCU list traversal is pretty much always going to take the
> > > cache miss.
> > >
> > > So it would actually be interesting to just try it - what happens to
> > > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > > change the rcu_dereference() to be a smp_load_acquire()?
> > >
> > > Because maybe nothing happens at all. I don't think we've ever tried it.
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and
> > ran hackbench a few times on the nearest idle arm64 system. The results
> > were consistently ~4% slower using acquire for rcu_dereference.
>
> Could you please double check that? The thing is that hackbench is a
> _notoriously_ unstable workload and very dependent on various small
> details such as kernel image layout and random per-bootup cache/memory
> layouts details.

Yup. FWIW, I did try across a few reboots to arrive at the 4% figure,
but it's not conclusive (as I said :).

> In fact I'd suggest to test this via a quick runtime hack like this in
> rcupdate.h:
>
> 	extern int panic_timeout;
>
> 	...
>
> 	if (panic_timeout)
> 		smp_load_acquire(p);
> 	else
> 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
>
> (or so)

So the problem with this is that a LOAD LOAD sequence isn't an
ordering hazard on ARM, so you're potentially at the mercy of the branch
predictor as to whether you get an acquire. That's not to say it won't
be discarded as soon as the conditional is resolved, but it could
screw up the benchmarking.

I'd be better off doing some runtime patching, but that's not something
I can knock up in a couple of minutes (so I'll add it to my list).

> and you could get specific numbers of noise estimations via:
>
> 	triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 1
>
> 	[...]
>
> 	 Performance counter stats for 'perf bench sched messaging -l 1' (10 runs):
>
> 	       4.616404309 seconds time elapsed	( +- 1.67% )
>
> note that even with a repeat count of 10 runs and a loop count 100 times
> larger than the hackbench default, the intrinsic noise of this workload
> was still 1.6% - and that does not include boot-to-boot systematic noise.

That's helpful, thanks.

Will
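[ Pulled out of the kernel headers into standalone C11, the hack and its
  problem look roughly like this (an illustrative sketch only: `flag`
  stands in for panic_timeout, and memory_order_consume stands in for
  lockless_dereference()): ]

	#include <stdatomic.h>
	#include <assert.h>
	#include <stdio.h>

	struct foo { int a; };
	static _Atomic(struct foo *) gp;
	static int flag;	/* stands in for panic_timeout */

	/* One path does a load-acquire, the other a dependency-ordered
	 * load. Will's point: the CPU can speculate past the branch, so
	 * which ordering you actually measure is at the mercy of the
	 * predictor until the conditional resolves. */
	static struct foo *cond_dereference(void)
	{
		if (flag)
			return atomic_load_explicit(&gp, memory_order_acquire);
		return atomic_load_explicit(&gp, memory_order_consume);
	}

	int main(void)
	{
		static struct foo f = { .a = 42 };

		atomic_store_explicit(&gp, &f, memory_order_release);
		flag = 1;
		assert(cond_dereference()->a == 42);	/* acquire path */
		flag = 0;
		assert(cond_dereference()->a == 42);	/* dependency path */
		printf("ok\n");
		return 0;
	}

[ Both paths are functionally correct; the whole argument is about which
  one the hardware executes while benchmarking, which a branch can't pin
  down - hence the runtime-patching approach instead. ]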
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
* Will Deacon wrote:

> On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> > >
> > > Given that the vast majority of weakly ordered architectures respect
> > > address dependencies, I would expect all of them to be hurt if they
> > > were forced to use barrier instructions instead, even those where the
> > > microarchitecture is fairly strongly ordered in practice.
> >
> > I do wonder if it would be all that noticeable, though. I don't think
> > we've really had benchmarks.
> >
> > For example, most of the RCU list traversal shows up on x86 - where
> > loads are already acquires. But they show up not because of that, but
> > because a RCU list traversal is pretty much always going to take the
> > cache miss.
> >
> > So it would actually be interesting to just try it - what happens to
> > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > change the rcu_dereference() to be a smp_load_acquire()?
> >
> > Because maybe nothing happens at all. I don't think we've ever tried it.
>
> FWIW, and this is by no means conclusive, I hacked that up quickly and
> ran hackbench a few times on the nearest idle arm64 system. The results
> were consistently ~4% slower using acquire for rcu_dereference.

Could you please double check that? The thing is that hackbench is a
_notoriously_ unstable workload and very dependent on various small
details such as kernel image layout and random per-bootup cache/memory
layouts details.

In fact I'd suggest to test this via a quick runtime hack like this in
rcupdate.h:

	extern int panic_timeout;

	...

	if (panic_timeout)
		smp_load_acquire(p);
	else
		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);

(or so)

and then you can start a loop of hackbench runs, and in another terminal
change the ordering primitive via:

	echo 1 > /proc/sys/kernel/panic		# smp_load_acquire()
	echo 0 > /proc/sys/kernel/panic		# smp_read_barrier_depends()

without having to reboot the kernel.

Also, instead of using hackbench which has a too short runtime that makes
it sensitive to scheduling micro-details, you could try the perf-bench
hackbench work-alike where the number of loops is parametric:

	triton:~/tip> perf bench sched messaging -l 1
	# Running 'sched/messaging' benchmark:
	# 20 sender and receiver processes per group
	# 10 groups == 400 processes run

	     Total time: 4.532 [sec]

and you could get specific numbers of noise estimations via:

	triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 1

	[...]

	 Performance counter stats for 'perf bench sched messaging -l 1' (10 runs):

	       4.616404309 seconds time elapsed	( +- 1.67% )

note that even with a repeat count of 10 runs and a loop count 100 times
larger than the hackbench default, the intrinsic noise of this workload
was still 1.6% - and that does not include boot-to-boot systematic noise.
It's very easy to get systemic noise with hackbench workloads and go down
the entirely wrong road.

Of course, the numbers might also confirm your 4% figure!

Thanks,

	Ingo
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 06:23:37PM +0100, Peter Zijlstra wrote:
>
> On 2 February 2016 07:44:33 CET, "Paul E. McKenney" wrote:
>
> > > If so, do we also need to take the following pairing into
> > > consideration?
> > >
> > > o	smp_store_release() -> READ_ONCE(); if ;smp_rmb();
>
> We rely on this with smp_cond_aquire() to extend the chain.

We do indeed! And it is not possible to use smp_load_acquire() given
this API because of the "!". Hmmm... If this one turns out to be
problematic, there are some ways of dealing with it.

But thank you for reminding me of it. I think, anyway. ;-)

							Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:56:14AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 4:02 AM, Paul E. McKenney wrote:
> >
> > The sorts of things I am really worried about are abominations like this
> > (and far worse):
>
> That one doesn't have any causal chain that I can see, so I agree that
> it's an abomination, but it also doesn't act as an argument.
>
> > r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???
>
> What do you see as the problem here? The above can happen in a
> strictly ordered situation: thread2 runs first (c == 2, r3 = 0), then
> thread3 runs (d = 1, a = 1) then thread0 runs (r1 = 1) and then
> thread1 starts running but the store to c doesn't complete (now r2 = 1).

Apologies, I should have added that the condition does not get evaluated
until all the dust settles. At that point both stores to c would have
completed, so that c == 1.

> So there's no reason for your case to not happen, but the real issue
> is that there is no causal relationship that your example describes,
> so it's not even interesting.

Because of the write-to-write relationship between thread1() and
thread2(), yes. And I am very glad that you find this one uninteresting,
because including it would make things -really- complicated.

> Causality breaking is what really screws with peoples minds. The
> reason transitivity is important (and why smp_read_barrier_depends()
> is so annoying) is because causal breaks make peoples minds twist in
> bad ways.

Agreed, which means it is very important that the various flavors of
release-acquire chains be transitive. In addition, smp_mb() is
transitive, as are synchronize_rcu() and friends.

> Sadly, memory orderings are very seldom described as honoring
> causality, and instead people have the crazy litmus tests.

Indeed, a memory model defined solely by litmus tests would qualify as
an exotic form of torture. What we do instead is use sets of litmus
tests as test cases for the prototype memory model under consideration.
It is all too easy to create a set of rules that look good and sound
good, but which mess something up. The litmus tests help catch these
sorts of errors.

							Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 11:30 AM, Will Deacon wrote:
>
> FWIW, and this is by no means conclusive, I hacked that up quickly and
> ran hackbench a few times on the nearest idle arm64 system. The results
> were consistently ~4% slower using acquire for rcu_dereference.

Ok, that's *much* more noticeable than I would have expected. I take
it that load-acquire is really really slow on current arm64
implementations.

That, btw, is one reason why I despise weak memory ordering. In most
cases I've ever seen, the hardware designers have said "barriers don't
much matter" and just made them crazy slow. My old alpha was the worst
of the worst.

It's shades of my least favorite x86 microarchitecture ever: netburst.
Make common cases fast, but make exceptional cases _so_ slow that it's
actually a noticeable pain.

Just out of interest, is store-release slow too? Because that should
be easy to make fast.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> >
> > Given that the vast majority of weakly ordered architectures respect
> > address dependencies, I would expect all of them to be hurt if they
> > were forced to use barrier instructions instead, even those where the
> > microarchitecture is fairly strongly ordered in practice.
>
> I do wonder if it would be all that noticeable, though. I don't think
> we've really had benchmarks.
>
> For example, most of the RCU list traversal shows up on x86 - where
> loads are already acquires. But they show up not because of that, but
> because a RCU list traversal is pretty much always going to take the
> cache miss.
>
> So it would actually be interesting to just try it - what happens to
> kernel-centric benchmarks (which are already fairly rare) on arm if we
> change the rcu_dereference() to be a smp_load_acquire()?
>
> Because maybe nothing happens at all. I don't think we've ever tried it.

FWIW, and this is by no means conclusive, I hacked that up quickly and
ran hackbench a few times on the nearest idle arm64 system. The results
were consistently ~4% slower using acquire for rcu_dereference.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
>
> Given that the vast majority of weakly ordered architectures respect
> address dependencies, I would expect all of them to be hurt if they
> were forced to use barrier instructions instead, even those where the
> microarchitecture is fairly strongly ordered in practice.

I do wonder if it would be all that noticeable, though. I don't think
we've really had benchmarks.

For example, most of the RCU list traversal shows up on x86 - where
loads are already acquires. But they show up not because of that, but
because a RCU list traversal is pretty much always going to take the
cache miss.

So it would actually be interesting to just try it - what happens to
kernel-centric benchmarks (which are already fairly rare) on arm if we
change the rcu_dereference() to be a smp_load_acquire()?

Because maybe nothing happens at all. I don't think we've ever tried it.

> As far as I understand it, the problems with "consume" have centred
> largely around compiler and specification issues, which we don't have
> with rcu_dereference (i.e. we ignore thin-air and use volatile
> casts/barrier() to keep the optimizer at bay).

Oh, I agree. The C++ consume orderings have been different from the
kernel worries.

But if it turns out that we have situations where we lose transitivity
because of rcu_dereference not being an acquire, then we have kernel
problems.

I do see that your later email said that the pointer dependency (which
we assume in rcu) should retain the transitivity, so maybe there is no
real reason to strengthen things.

But I _would_ argue that transitivity is so important (because of the
whole "individual orderings make sense and are causal, so the end result
must make sense and be causal") that if that were to break, then such an
architecture really should just strengthen the orderings.

Because the *worst* kinds of bugs are exactly the ones where the code
makes sense and works locally, but then the combination of two or three
pieces of code that are individually sensible ends up not working for
some crazy non-transitivity reason.

That really is not something that people can cope with. And maybe we
will one day have widely available automated ordering proofs that work
across the whole kernel, and we don't even need to worry about "this
breaks peoples minds", but I don't think we are there yet.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 4:02 AM, Paul E. McKenney wrote:
>
> The sorts of things I am really worried about are abominations like this
> (and far worse):

That one doesn't have any causal chain that I can see, so I agree that
it's an abomination, but it also doesn't act as an argument.

> r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???

What do you see as the problem here? The above can happen in a strictly
ordered situation: thread2 runs first (c == 2, r3 = 0), then thread3
runs (d = 1, a = 1) then thread0 runs (r1 = 1) and then thread1 starts
running but the store to c doesn't complete (now r2 = 1).

So there's no reason for your case to not happen, but the real issue is
that there is no causal relationship that your example describes, so
it's not even interesting.

Causality breaking is what really screws with peoples minds. The reason
transitivity is important (and why smp_read_barrier_depends() is so
annoying) is because causal breaks make peoples minds twist in bad ways.

Sadly, memory orderings are very seldom described as honoring causality,
and instead people have the crazy litmus tests.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:30:26AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 1:34 AM, Boqun Feng wrote:
> >
> > Just to be clear, what Will, Paul and I are discussing here is about
> > local transitivity,
>
> I really don't think that changes the picture. Given that
>
> (a) we already mix ordering methods and there are good reasons for
> it, and I'd expect transitivity only makes that more likely
>
> (b) we expect transitivity from the individual ordering methods
>
> (c) I don't think that there are any relevant CPU's that violate this anyway
>
> I really think that not expecting that to hold for mixed accesses
> would be a complete disaster. It will confuse the hell out of people.
>
> And the basic argument really stands: we should make the memory
> ordering expectations as strong as we can, given the existing relevant
> architecture constraints (ie x86/arm/power).
>
> If that then means that some other architecture might need to add
> extra serialization that that architecture doesn't _want_ to add,
> tough luck. I absolutely hate the fact that alpha forced us to add
> that crazy read-depends barrier, and I want to discourage that a lot.
>
> In fact, I'd be willing to strengthen our existing orderings just in
> the name of sanity, and say that "rcu_dereference()" should just be an
> acquire, and say that if the architecture makes that more expensive,
> then who the hell cares? I have not been very happy with the "consume"
> memory ordering discussions for C++. Yes, it would hurt pre-lwsync
> power a bit, and it would hurt 32-bit arm, but enough that we should
> have the headache of the existing semantics?

Given that the vast majority of weakly ordered architectures respect
address dependencies, I would expect all of them to be hurt if they
were forced to use barrier instructions instead, even those where the
microarchitecture is fairly strongly ordered in practice.

Even load-acquire on ARMv8 has more work to do than a plain old address
dependency, so I'd be sad to see us upgrading rcu_dereference like this,
particularly when it's a relatively uncontentious, easy to understand
part of the kernel memory model.

As far as I understand it, the problems with "consume" have centred
largely around compiler and specification issues, which we don't have
with rcu_dereference (i.e. we ignore thin-air and use volatile
casts/barrier() to keep the optimizer at bay).

Will
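[ The distinction being weighed here, in user-space C11 terms (a sketch
  only - in the kernel the two readers are rcu_dereference() and
  smp_load_acquire(), and the names below are illustrative): ]

	#include <stdatomic.h>
	#include <stdio.h>

	struct foo { int a; };
	static _Atomic(struct foo *) gp;

	/* Address dependency: only accesses *through* the loaded pointer
	 * are ordered, which most weakly ordered CPUs provide for free.
	 * (Compilers today typically promote consume to acquire, but it
	 * models what rcu_dereference() relies on in hardware.) */
	static struct foo *read_dep(void)
	{
		return atomic_load_explicit(&gp, memory_order_consume);
	}

	/* Load-acquire: also orders *independent* later loads and stores,
	 * so on ARMv8 it needs LDAR rather than a plain load. */
	static struct foo *read_acq(void)
	{
		return atomic_load_explicit(&gp, memory_order_acquire);
	}

	int main(void)
	{
		static struct foo f = { .a = 1 };

		atomic_store_explicit(&gp, &f, memory_order_release);
		printf("dep=%d acq=%d\n", read_dep()->a, read_acq()->a);
		return 0;
	}

[ Both readers return the same pointer; the cost difference is entirely in
  how much *other* memory traffic each one forces the CPU to order. ]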
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:12:37AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote:
> > On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote:
> > > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote:
> > > > Well, the following ISA2 test is permitted on ARM:
> > > >
> > > > P0:
> > > > Wx=1
> > > > WyRel=1	// rcu_assign_pointer
> > > >
> > > > P1:
> > > > Ry=1	// rcu_dereference
> > >
> > > What if a dependency is added here? Same result?
> >
> > Right, that fixes it. So if we're only considering things like:
> >
> >   rcu_dereference
> >
> >   RELEASE
> >
> > then local transitivity should be preserved.
>
> Whew!!! ;-)
>
> > I think the same applies to , which seems to match your later
> > example.
>
> Could you please check?

I should've been more concrete here: it does indeed apply if you replace
the with a in the snippet above, I just hadn't got Boqun's example paged
in and didn't want to commit on the examples being identical.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 1:34 AM, Boqun Feng wrote:
>
> Just to be clear, what Will, Paul and I are discussing here is about
> local transitivity,

I really don't think that changes the picture.

Given that

 (a) we already mix ordering methods and there are good reasons for it,
     and I'd expect transitivity only makes that more likely

 (b) we expect transitivity from the individual ordering methods

 (c) I don't think that there are any relevant CPU's that violate this anyway

I really think that not expecting that to hold for mixed accesses would
be a complete disaster. It will confuse the hell out of people.

And the basic argument really stands: we should make the memory ordering
expectations as strong as we can, given the existing relevant
architecture constraints (ie x86/arm/power).

If that then means that some other architecture might need to add extra
serialization that that architecture doesn't _want_ to add, tough luck.
I absolutely hate the fact that alpha forced us to add that crazy
read-depends barrier, and I want to discourage that a lot.

In fact, I'd be willing to strengthen our existing orderings just in the
name of sanity, and say that "rcu_dereference()" should just be an
acquire, and say that if the architecture makes that more expensive,
then who the hell cares? I have not been very happy with the "consume"
memory ordering discussions for C++. Yes, it would hurt pre-lwsync power
a bit, and it would hurt 32-bit arm, but enough that we should have the
headache of the existing semantics?

So I think our current memory orderings are potentially too _weak_, and
we sure as hell shouldn't strive to weaken them further. I think doing
"smp_read_barrier_depends()" was a mistake, but it was a mistake driven
by the fact that our memory ordering _used_ to be barrier-centric.

If alpha were to have happened today, I would say that rather than have
smp_read_barrier_depends(), we should just say that anything that
requires it should use smp_load_acquire() (or rcu_dereference - which I
think could have the same semantics), and be done with it.

And if I would make that choice today, why isn't the right thing to just
get rid of that weak nasty thing, and convert the existing (few - there
really aren't that many) smp_read_barriers() away from that model.

See what I'm saying? We've been pandering to weak memory ordering
before. We may have had our reasons to do so, but I think it's a
mistake. It results in code that is hard to think about.

So the fact that people worry about "rcu_dereference()" really makes me
think that one is just too weak, and we should just bite the bullet and
say it's an acquire. I'd much rather take a few barriers than make our
programming model hard to understand.

                Linus
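[ The two shapes Linus is contrasting, side by side in a user-space C11
  sketch. The macros here are local stand-ins for the kernel ones, not
  the real implementations - in particular smp_read_barrier_depends() is
  a no-op on everything except Alpha, and is modelled below as an acquire
  fence purely for illustration: ]

	#include <stdatomic.h>
	#include <assert.h>
	#include <stdio.h>

	#define READ_ONCE(x) \
		atomic_load_explicit(&(x), memory_order_relaxed)
	#define smp_read_barrier_depends() \
		atomic_thread_fence(memory_order_acquire)

	struct foo { int a; };
	static _Atomic(struct foo *) gp;

	int main(void)
	{
		static struct foo f = { .a = 7 };
		struct foo *p;

		atomic_store_explicit(&gp, &f, memory_order_release);

		/* the barrier-centric, Alpha-era pattern: load, then the
		 * dependency barrier, then dereference */
		p = READ_ONCE(gp);
		smp_read_barrier_depends();
		assert(p->a == 7);

		/* the shape Linus prefers: one load-acquire, no separate
		 * barrier for readers to forget */
		p = atomic_load_explicit(&gp, memory_order_acquire);
		assert(p->a == 7);

		printf("ok\n");
		return 0;
	}

[ The argument isn't that the first form is wrong - it's that splitting
  the ordering into a separate barrier is exactly the kind of thing that
  is easy to omit and hard to review. ]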
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 2 February 2016 07:44:33 CET, "Paul E. McKenney" wrote:
>
> > If so, do we also need to take the following pairing into
> > consideration?
> >
> > o	smp_store_release() -> READ_ONCE(); if ;smp_rmb();

We rely on this with smp_cond_aquire() to extend the chain.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote: > On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote: > > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > > > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > > Locally transitive chain termination: > > > > > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > > > > > Agreed. > > > > > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > > > o smp_store_release() -> READ_ONCE(); if > > > > > > > > Just want to make sure, this one is actually: > > > > > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > > > > > right? Because control dependency only orders READ->WRITE. > > > > > > > > If so, do we also need to take the following pairing into consideration? > > > > > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > > > > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > > > case needs to be locally transitive. > > > > > > > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > > > lockless_dereference() right now, which means we used to think > > > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > > not be locally transitive? 
> > > > > > Well, the following ISA2 test is permitted on ARM: > > > > > > > > > P0: > > > Wx=1 > > > WyRel=1 // rcu_assign_pointer > > > > > > P1: > > > Ry=1 // rcu_dereference > > > > What if a dependency is added here? Same result? > > Right, that fixes it. So if we're only considering things like: > > rcu_dereference > > RELEASE > > then local transitivity should be preserved. Whew!!! ;-) > I think the same applies to <ctrl dependency>, which seems to match your later > example. Could you please check? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Ralf Baechle writes: > On Tue, Feb 02, 2016 at 02:54:06PM +, Måns Rullgård wrote: > >> We don't really have much choice given the reality of existing hardware. > > No, of course not - but I want us to discourage new weakly ordered > platforms as much as possible. Where do you draw the line though? You can't exactly go and impose some kind of global ordering in a multi-master system. -- Måns Rullgård
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 02:54:06PM +, Måns Rullgård wrote: > We don't really have much choice given the reality of existing hardware. No, of course not - but I want us to discourage new weakly ordered platforms as much as possible. Ralf
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Ralf Baechle writes: > On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > >> Memory ordering is confusing enough as it is. We should not make >> people worry more than they already have to. Strong rules are good. > > Confusing and the resulting bugs can be very hard to debug. > > One of the problems I've experienced is that Linux does support liberal > memory ordering, even as extreme as the Alpha. We don't really have much choice given the reality of existing hardware. -- Måns Rullgård
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. Confusing, and the resulting bugs can be very hard to debug. One of the problems I've experienced is that Linux does support liberal memory ordering, even as extreme as the Alpha. And every once in a while a hardware guy asks me if Linux supports their preferred variant of weak ordering, and answering honestly I have to say yes. Even if I advise against it in strong words, guess what will happen - another weakly ordered core implementation. To this point we've only fixed theoretical memory ordering issues on MIPS, but it's only a matter of time until we get bitten by a bug that does actual damage. SGI and a few others have managed to build large systems with significant memory latencies and still get decent performance with strong ordering. Ralf
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote: [...] > > > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > > not be locally transitive? > > > > > > Well, the following ISA2 test is permitted on ARM: > > > > > > > > > P0: > > > Wx=1 > > > WyRel=1 // rcu_assign_pointer > > > > > > P1: > > > Ry=1 // rcu_dereference > > > > What if a dependency is added here? Same result? > > Right, that fixes it. So if we're only considering things like: > > rcu_dereference > > RELEASE > > then local transitivity should be preserved. > > I think the same applies to <ctrl dependency>, which seems to match your later > example. > Thank you ;-) Now I understand why you want those pairings to be locally transitive chain terminations: they have more subtle requirements to extend locally transitive chains and slightly different behaviors on different architectures. It's better for us to put them aside until we figure out those subtle requirements and different behaviors. Regards, Boqun > Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote: > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > Locally transitive chain termination: > > > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > > > Agreed. > > > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > > o smp_store_release() -> READ_ONCE(); if > > > > > > Just want to make sure, this one is actually: > > > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > > > right? Because control dependency only orders READ->WRITE. > > > > > > If so, do we also need to take the following pairing into consideration? > > > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > > case needs to be locally transitive. > > > > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > > lockless_dereference() right now, which means we used to think > > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > not be locally transitive? 
> > > > Well, the following ISA2 test is permitted on ARM: > > > > > > P0: > > Wx=1 > > WyRel=1 // rcu_assign_pointer > > > > P1: > > Ry=1 // rcu_dereference > > What if a dependency is added here? Right, that fixes it. So if we're only considering things like:

rcu_dereference
<dependency>
RELEASE

then local transitivity should be preserved. I think the same applies to <ctrl dependency>, which seems to match your later example. Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > Locally transitive chain termination: > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > Agreed. > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > o smp_store_release() -> READ_ONCE(); if > > > > Just want to make sure, this one is actually: > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > right? Because control dependency only orders READ->WRITE. > > > > If so, do we also need to take the following pairing into consideration? > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > case needs to be locally transitive. > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > lockless_dereference() right now, which means we used to think > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > Besides, Will, what's the reason of having a locally transitive chain > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > not be locally transitive? > > Well, the following ISA2 test is permitted on ARM: > > > P0: > Wx=1 > WyRel=1 // rcu_assign_pointer > > P1: > Ry=1 // rcu_dereference What if a dependency is added here? Same result? 
> WzRel=1 // rcu_assign_pointer > > P2: > Rz=1 // rcu_dereference > > Rx=0 > > > Make one of the rcu_dereferences an ACQUIRE and the behaviour is > forbidden. > > Paul: what use-cases did you have in mind? > > Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds > wrote: > > > > So we *absolutely* should say that *OF COURSE* these things work: > > > > - CPU A: > > > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > > > - CPU B: > > > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > > initialized > > That's a bad example, btw. I shouldn't have made it be a "pointer", > because then we get the whole address dependency chain ordering > anyway. > > So instead of "ptr", read "state flag". It might just be an "int" that > says "data has been initialized". > > So > > .. initialize memory .. > smp_wmb(); > WRITE_ONCE(&is_initialized, 1); > > should pair with > > if (smp_load_acquire(&is_initialized)) > ... we can read and write the data, knowing it has been initialized .. > > exactly because "smp_wmb()" (cheap write barrier) might be cheaper > than "smp_store_release()" (expensive full barrier) and thus > preferred. > > So mixing ordering metaphors actually does make sense, and should be > entirely well-defined. I don't believe that anyone is arguing that this particular example should not work the way that you want it to. > There's likely less reason to do it the other way (ie > "smp_store_release()" on one side pairing with "LOAD_ONCE() + > smp_rmb()" on the other) since there likely isn't the same kind of > performance reason for that pairing. But even if we would never > necessarily want to do it, I think our memory ordering rules would be > *much* better for strongly stating that it has to work, than being > timid and trying to make the rules weak. > > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. 
The sorts of things I am really worried about are abominations like this (and far worse):

void thread0(void)
{
	r1 = smp_load_acquire(&a);
	smp_store_release(&b, 1);
}

void thread1(void)
{
	r2 = smp_load_acquire(&b);
	smp_store_release(&c, 1);
}

void thread2(void)
{
	WRITE_ONCE(c, 2);
	smp_mb();
	r3 = READ_ONCE(d);
}

void thread3(void)
{
	WRITE_ONCE(d, 1);
	smp_store_release(&a, 1);
}

r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???

I advise discouraging this sort of thing. But it is your kernel, so what is your preference?

Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > Locally transitive chain termination: > > > > > > (i.e. these can't be used to extend a chain) > > > > Agreed. > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > o smp_store_release() -> READ_ONCE(); if > > Just want to make sure, this one is actually: > > o smp_store_release() -> READ_ONCE(); if ; > > right? Because control dependency only orders READ->WRITE. > > If so, do we also need to take the following pairing into consideration? > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > I am OK with the first and last, but I believe that the middle one > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > case needs to be locally transitive. > > > > Hmm... I don't think we should differ rcu_dereference() and > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > lockless_dereference() right now, which means we used to think > rcu_dereference() and lockless_dereference() are interchangeable, right? > > Besides, Will, what's the reason of having a locally transitive chain > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > not be locally transitive? Well, the following ISA2 test is permitted on ARM:

P0:
Wx=1
WyRel=1  // rcu_assign_pointer

P1:
Ry=1     // rcu_dereference
WzRel=1  // rcu_assign_pointer

P2:
Rz=1     // rcu_dereference
Rx=0

Make one of the rcu_dereferences an ACQUIRE and the behaviour is forbidden.

Paul: what use-cases did you have in mind?

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hello Linus, On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds > wrote: > > > > So we *absolutely* should say that *OF COURSE* these things work: > > > > - CPU A: > > > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > > > - CPU B: > > > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > > initialized > > That's a bad example, btw. I shouldn't have made it be a "pointer", > because then we get the whole address dependency chain ordering > anyway. > > So instead of "ptr", read "state flag". It might just be an "int" that > says "data has been initialized". > > So > > .. initialize memory .. > smp_wmb(); > WRITE_ONCE(&is_initialized, 1); > > should pair with > > if (smp_load_acquire(&is_initialized)) > ... we can read and write the data, knowing it has been initialized .. > > exactly because "smp_wmb()" (cheap write barrier) might be cheaper > than "smp_store_release()" (expensive full barrier) and thus > preferred. > Just to be clear, what Will, Paul and I are discussing here is local transitivity, which refers to something like the following example (a, b and is_initialized are all initially zero):

P0:
WRITE_ONCE(a, 1);
smp_store_release(&is_initialized, 1);

P1:
r1 = smp_load_acquire(&is_initialized);
smp_store_release(&b, 1);

P2:
r2 = smp_load_acquire(&b);
r3 = READ_ONCE(a);

In this case, r1 == 1 && r2 == 1 && r3 == 0 cannot happen, because RELEASE+ACQUIRE pairs guarantee local transitivity. More on local transitivity: http://article.gmane.org/gmane.linux.kernel.virtualization/26856

And what I'm asking here is about something like the following example (a, b and is_initialized are all initially zero):

P0:
WRITE_ONCE(a, 1);
smp_store_release(&is_initialized, 1);

P1:
if (r1 = READ_ONCE(is_initialized))
	smp_store_release(&b, 1);

P2:
if (r2 = READ_ONCE(b)) {
	smp_rmb();
	r3 = READ_ONCE(a);
}

In this case, can r1 == 1 && r2 == 1 && r3 == 0 happen?
Please note this example is about two questions on local transitivity:

1. Could "READ_ONCE(); if" extend a locally transitive chain?
2. Could "READ_ONCE(); if; smp_rmb()" at least be a locally transitive chain termination?

> So mixing ordering metaphors actually does make sense, and should be > entirely well-defined. > I think Paul does agree that smp_{r,w}mb() with appropriate memory operations around could pair with smp_store_release() or smp_load_acquire(). Hope I didn't misunderstand any of you or make you misunderstand each other. Regards, Boqun > There's likely less reason to do it the other way (ie > "smp_store_release()" on one side pairing with "LOAD_ONCE() + > smp_rmb()" on the other) since there likely isn't the same kind of > performance reason for that pairing. But even if we would never > necessarily want to do it, I think our memory ordering rules would be > *much* better for strongly stating that it has to work, than being > timid and trying to make the rules weak. > > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. > > Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds wrote: > > So we *absolutely* should say that *OF COURSE* these things work: > > - CPU A: > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > - CPU B: > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > initialized That's a bad example, btw. I shouldn't have made it be a "pointer", because then we get the whole address dependency chain ordering anyway. So instead of "ptr", read "state flag". It might just be an "int" that says "data has been initialized". So

	.. initialize memory ..
	smp_wmb();
	WRITE_ONCE(&is_initialized, 1);

should pair with

	if (smp_load_acquire(&is_initialized))
		... we can read and write the data, knowing it has been initialized ..

exactly because "smp_wmb()" (cheap write barrier) might be cheaper than "smp_store_release()" (expensive full barrier) and thus preferred. So mixing ordering metaphors actually does make sense, and should be entirely well-defined. There's likely less reason to do it the other way (ie "smp_store_release()" on one side pairing with "LOAD_ONCE() + smp_rmb()" on the other) since there likely isn't the same kind of performance reason for that pairing. But even if we would never necessarily want to do it, I think our memory ordering rules would be *much* better for strongly stating that it has to work, than being timid and trying to make the rules weak. Memory ordering is confusing enough as it is. We should not make people worry more than they already have to. Strong rules are good. Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Mon, Feb 1, 2016 at 10:44 PM, Paul E. McKenney wrote: > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: >> Hi Paul, >> If so, do we also need to take the following pairing into consideration? >> >> o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > It looks like we will be restricing smp_rmb() and smp_wmb() to pairwise > scenarios only. So no transitivity in any scenarios involving these > two primitives. NO! Stop this idiocy now, Paul. We are not trying to make the kernel memory ordering rules pander to shit. > But where there is uncertainty, we should -not- assume ordering. It is > easy to tighten the rules later, but hard to loosen them. Bullshit. The rules should absolutely be as tight as at all humanly possible, given real hardware constraints. It's also entirely wrong to say that "we can tighten the rules later". No way. Because before those rules get tightened, we'll see confused people who then try to "fix" code that doesn't want fixing, and adding new odd constructs. We already had that with the whole bogus "control dependency" shit. That crap came exactly from the fact that people thought it was a good idea to make weak rules. So you had better re-think the whole thing. I refuse to pander to the idiotic and wrongheaded "weak memory ordering is good" braindamage. Weak memory ordering is *not* good. It's shit. It's hard to think about, and people won't even realize that the assumptions they make (unconsciously!) may be wrong. So the memory ordering rules should be as strong as at all possible, and should be as intuitive as possible, and follow peoples expectations. And quite frankly, if some crazy shit-for-brains architecture gets its memory ordering wrong (like alpha did), then that crazy architecture should pay the price. There is absolutely _zero_ valid reason why smp_store_release should not pair fine with a smp_rmb() on the other side. Can you actually point to a relevant architecture that doesn't do that? 
I can't even imagine how you'd make that pairing not work. If the releasing store doesn't imply that it is ordered wrt previous stores, then it damn well isn't a "release". And if the other side does a "smp_rmb()", then the loads are ordered on the other side, so it had damn well better just work. How could it not work? So I can't even see how your "we won't guarantee" that wouldn't work. It sounds insane. But more importantly, your whole notion of "let's make things as weak as possible" is *wrong*. That is NOT AT ALL what we should strive for. We should strive for the strictest possible memory ordering that we can get away with, and have as few "undefined" cases as at all possible. So we *absolutely* should say that *OF COURSE* these things work: - CPU A: .. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); - CPU B: smp_load_acquire(ptr) - we can rely on things behind "ptr" being initialized and that the above mirror of that (ie smp_store_release paired with READ_ONCE+smp_rmb) also works. There are often reasons to prefer one model over the other, and sometimes the reasons might argue for mixed access models. For example, we might decide that the producer uses "smp_wmb()", because that is universally fairly fast, and avoids a full memory barrier on those architectures that don't really have "release" semantics natively. But at the same time the _consumer_ side might want a "smp_load_acquire()" simply because it requires subsequent loads and stores to be ordered if it sees that state (see for example our "sk_state_store/load()" cases in networking - right now those pair with release/acquire, but I suspect that "release" really could be a "smp_wmb() + WRITE_ONCE()" instead, which could be faster on some architectures).. Because I really don't think there is any sane reason why that kind of mixed access shouldn't work.
If those pairings don't work, there's something wrong with the architecture, and the architecture will need to do whatever barriers it needs to make it work in its store-release/load-acquire so that it pairs with smp_wmb/rmb. And if there against all sanity really is some crazy reason why that pairing can be problematic, then that insane reason needs to be extensively documented. Because that reason is so insane that it needs a big honking description, so that people are aware of it. But at no point should the logic be "let's leave it weakly defined". Because at that point the documentation is actively _worse_ than not having documentation at all, and just letting sanity prevail. Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > Hi Paul, > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > > > > > [ . . . ] > > > > > > > > > > For Linux in general, this is a question: How strict do we want to > > > > > > be > > > > > > about matching the type of write with the corresponding read? My > > > > > > default approach is to initially be quite strict and loosen as > > > > > > needed. > > > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > > > tight, but it makes a good example.) > > > > > > > > > > > > Thoughts? > > > > > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > > > where local transitivity isn't required. However, I don't think we can > > > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > > > already in things like the osq_lock (albeit without the address > > > > > dependency). > > > > > > > > Agreed. So in the most strict case that I can imagine anyone putting > > > > up with, we have the following pairings: > > > > > > I think we can group these up: > > > > > > Locally transitive: > > > > > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > > > > > Locally transitive chain termination: > > > > > > (i.e. these can't be used to extend a chain) > > > > Agreed. > > > > > > o smp_store_release() -> lockless_dereference() (???) 
> > > > o rcu_assign_pointer() -> rcu_dereference() > > > > o smp_store_release() -> READ_ONCE(); if > > Just want to make sure, this one is actually: > > o smp_store_release() -> READ_ONCE(); if ; > > right? Because control dependency only orders READ->WRITE. Yep! > If so, do we also need to take the following pairing into consideration? > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); It looks like we will be restricting smp_rmb() and smp_wmb() to pairwise scenarios only. So no transitivity in any scenarios involving these two primitives. > > I am OK with the first and last, but I believe that the middle one > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > case needs to be locally transitive. > > Hmm... I don't think we should differ rcu_dereference() and > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > lockless_dereference() right now, which means we used to think > rcu_dereference() and lockless_dereference() are interchangeable, right? They use the same instructions, if that is what you mean, so intuitively they should behave the same. I don't feel all that strongly either way. But where there is uncertainty, we should -not- assume ordering. It is easy to tighten the rules later, but hard to loosen them. That might change if tools appear that automatically analyze usage patterns in raw code, but we do not yet have such tools. Thanx, Paul > Besides, Will, what's the reason of having a locally transitive chain > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > not be locally transitive? > > Regards, > Boqun > > > > Globally transitive: > > > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally > > > > transitive) > > > > > > RCU: > > > > > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > > > (strange and wonderful properties) > > > > Agreed. 
> > > > > > Seem reasonable, or am I missing some? > > > > > > Looks alright to me. > > > > So I have some litmus tests to generate. ;-) > > > > Thanx, Paul > >
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Paul, On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > > > [ . . . ] > > > > > > > > For Linux in general, this is a question: How strict do we want to be > > > > > about matching the type of write with the corresponding read? My > > > > > default approach is to initially be quite strict and loosen as needed. > > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > > tight, but it makes a good example.) > > > > > > > > > > Thoughts? > > > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > > where local transitivity isn't required. However, I don't think we can > > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > > already in things like the osq_lock (albeit without the address > > > > dependency). > > > > > > Agreed. So in the most strict case that I can imagine anyone putting > > > up with, we have the following pairings: > > > > I think we can group these up: > > > > Locally transitive: > > > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > > > Locally transitive chain termination: > > > > (i.e. these can't be used to extend a chain) > > Agreed. > > > > o smp_store_release() -> lockless_dereference() (???) > > > o rcu_assign_pointer() -> rcu_dereference() > > > o smp_store_release() -> READ_ONCE(); if Just want to make sure, this one is actually: o smp_store_release() -> READ_ONCE(); if ; right? 
Because control dependency only orders READ->WRITE. If so, do we also need to take the following pairing into consideration? o smp_store_release() -> READ_ONCE(); if <ctrl-dep>; smp_rmb(); > > I am OK with the first and last, but I believe that the middle one > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > case needs to be locally transitive. > Hmm... I don't think we should differentiate rcu_dereference() and lockless_dereference(). One reason: list_for_each_entry_rcu() is using lockless_dereference() right now, which means we used to think rcu_dereference() and lockless_dereference() are interchangeable, right? Besides, Will, what's the reason for having a locally transitive chain termination? Because on some architectures RELEASE->DEPENDENCY pairs may not be locally transitive? Regards, Boqun > > Globally transitive: > > > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > > RCU: > > > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > > (strange and wonderful properties) > > Agreed. > > > > Seem reasonable, or am I missing some? > > > > Looks alright to me. > > So I have some litmus tests to generate. ;-) > > Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > [ . . . ] > > > > > > For Linux in general, this is a question: How strict do we want to be > > > > about matching the type of write with the corresponding read? My > > > > default approach is to initially be quite strict and loosen as needed. > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > tight, but it makes a good example.) > > > > > > > > Thoughts? > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > where local transitivity isn't required. However, I don't think we can > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > already in things like the osq_lock (albeit without the address > > > dependency). > > > > Agreed. So in the most strict case that I can imagine anyone putting > > up with, we have the following pairings: > > I think we can group these up: > > Locally transitive: > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > Locally transitive chain termination: > > (i.e. these can't be used to extend a chain) Agreed. > > o smp_store_release() -> lockless_dereference() (???) > > o rcu_assign_pointer() -> rcu_dereference() > > o smp_store_release() -> READ_ONCE(); if I am OK with the first and last, but I believe that the middle one has real use cases. So the rcu_assign_pointer() -> rcu_dereference() case needs to be locally transitive. 
> Globally transitive: > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > RCU: > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > (strange and wonderful properties) Agreed. > > Seem reasonable, or am I missing some? > > Looks alright to me. So I have some litmus tests to generate. ;-) Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > [ . . . ] > > > > For Linux in general, this is a question: How strict do we want to be > > > about matching the type of write with the corresponding read? My > > > default approach is to initially be quite strict and loosen as needed. > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > the write and rcu_dereference() for the read, as opposed to (say) > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > tight, but it makes a good example.) > > > > > > Thoughts? > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > rcu_dereference to be used as drop-in replacements for release/acquire > > where local transitivity isn't required. However, I don't think we can > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > already in things like the osq_lock (albeit without the address > > dependency). > > Agreed. So in the most strict case that I can imagine anyone putting > up with, we have the following pairings: I think we can group these up:

Locally transitive:

> o smp_store_release() -> smp_load_acquire() (locally transitive)

Locally transitive chain termination:
(i.e. these can't be used to extend a chain)

> o smp_store_release() -> lockless_dereference() (???)
> o rcu_assign_pointer() -> rcu_dereference()
> o smp_store_release() -> READ_ONCE(); if <ctrl-dep>

Globally transitive:

> o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive)
> o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive)

RCU:

> o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE()
>   (strange and wonderful properties)

> Seem reasonable, or am I missing some?

Looks alright to me.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: [ . . . ] > > For Linux in general, this is a question: How strict do we want to be > > about matching the type of write with the corresponding read? My > > default approach is to initially be quite strict and loosen as needed. > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > the write and rcu_dereference() for the read, as opposed to (say) > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > tight, but it makes a good example.) > > > > Thoughts? > > That sounds broadly sensible to me and allows rcu_assign_pointer and > rcu_dereference to be used as drop-in replacements for release/acquire > where local transitivity isn't required. However, I don't think we can > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > already in things like the osq_lock (albeit without the address > dependency). Agreed. So in the most strict case that I can imagine anyone putting up with, we have the following pairings: o smp_store_release() -> smp_load_acquire() (locally transitive) o smp_store_release() -> lockless_dereference() (???) o smp_store_release() -> READ_ONCE(); if o rcu_assign_pointer() -> rcu_dereference() o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() (strange and wonderful properties) Seem reasonable, or am I missing some? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Paul, On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > On Thu, Jan 28, 2016 at 09:57:19AM +, Will Deacon wrote: > > On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > > > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > Yes, sorry for the shorthand: > > > > - Each paragraph is a separate thread > > - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 > > - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) > > returns 1 > > - Everything is initially zero > > > > > > and I suppose a variant of that: > > > > > > > > > > > > Wx=1 > > > > WyRel=1 > > > > > > > > RyAcq=1 > > > > Wz=1 > > > > > > > > Rz=1 > > > > > > > > Rx=0 > > > > > > Agreed, this would be needed as well, along with the read-read and > > > read-write variants. I picked the write-read version (Will's first > > > test above) because write-read reordering is the most likely on > > > hardware that I am aware of. > > > > Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would > > it then be forbidden? > > On Power, yes. I would guess on ARM as well. Indeed. > For Linux in general, this is a question: How strict do we want to be > about matching the type of write with the corresponding read? My > default approach is to initially be quite strict and loosen as needed. > Here "quite strict" might mean requiring an rcu_assign_pointer() for > the write and rcu_dereference() for the read, as opposed to (say) > ACCESS_ONCE() for the read. (I am guessing that this would be too > tight, but it makes a good example.) > > Thoughts? That sounds broadly sensible to me and allows rcu_assign_pointer and rcu_dereference to be used as drop-in replacements for release/acquire where local transitivity isn't required. However, I don't think we can rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used already in things like the osq_lock (albeit without the address dependency). Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Jan 28, 2016 at 09:57:19AM +, Will Deacon wrote: > On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC > > > > > 16)? > > > > > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > > > to ensure globally transitive unlock->lock ordering should you decide > > > > > to relax your locking barriers. > > > > > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > > > cpu litmus test that goes with it. > > > > > > Sure, I was building up to that. I just wanted to make sure the basics > > > were there (program-order, so same CPU) before we go any further. It > > > sounds like they are, so that's promising. > > > > > > > Maciej, the tricky point is what, if any, effect the > > > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > > > replace with the RELEASE+ACQUIRE pair. > > > > > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > > > MIPS64r6 manual but failed to reach a conclusion on this. > > > > > > For the inter-thread case, Paul had a previous example along the lines > > > of: > > > > > > > > > Wx=1 > > > WyRel=1 > > > > > > RyAcq=1 > > > Rz=0 > > > > > > Wz=1 > > > smp_mb() > > > Rx=0 > > > > Each paragraph being a separate thread, correct? If so, agreed. 
> > Yes, sorry for the shorthand: > > - Each paragraph is a separate thread > - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 > - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) > returns 1 > - Everything is initially zero > > > > and I suppose a variant of that: > > > > > > > > > Wx=1 > > > WyRel=1 > > > > > > RyAcq=1 > > > Wz=1 > > > > > > Rz=1 > > > > > > Rx=0 > > > > Agreed, this would be needed as well, along with the read-read and > > read-write variants. I picked the write-read version (Will's first > > test above) because write-read reordering is the most likely on > > hardware that I am aware of. > > Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would > it then be forbidden? On Power, yes. I would guess on ARM as well. For Linux in general, this is a question: How strict do we want to be about matching the type of write with the corresponding read? My default approach is to initially be quite strict and loosen as needed. Here "quite strict" might mean requiring an rcu_assign_pointer() for the write and rcu_dereference() for the read, as opposed to (say) ACCESS_ONCE() for the read. (I am guessing that this would be too tight, but it makes a good example.) Thoughts? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > > to ensure globally transitive unlock->lock ordering should you decide > > > > to relax your locking barriers. > > > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > > cpu litmus test that goes with it. > > > > Sure, I was building up to that. I just wanted to make sure the basics > > were there (program-order, so same CPU) before we go any further. It > > sounds like they are, so that's promising. > > > > > Maciej, the tricky point is what, if any, effect the > > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > > replace with the RELEASE+ACQUIRE pair. > > > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > > MIPS64r6 manual but failed to reach a conclusion on this. > > > > For the inter-thread case, Paul had a previous example along the lines > > of: > > > > > > Wx=1 > > WyRel=1 > > > > RyAcq=1 > > Rz=0 > > > > Wz=1 > > smp_mb() > > Rx=0 > > Each paragraph being a separate thread, correct? If so, agreed. 
Yes, sorry for the shorthand: - Each paragraph is a separate thread - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) returns 1 - Everything is initially zero > > and I suppose a variant of that: > > > > > > Wx=1 > > WyRel=1 > > > > RyAcq=1 > > Wz=1 > > > > Rz=1 > > > > Rx=0 > > Agreed, this would be needed as well, along with the read-read and > read-write variants. I picked the write-read version (Will's first > test above) because write-read reordering is the most likely on > hardware that I am aware of. Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would it then be forbidden? Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Maciej, On Wed, Jan 27, 2016 at 12:41:29PM +, Maciej W. Rozycki wrote: > On Wed, 27 Jan 2016, Will Deacon wrote: > > > Overall I think it should be safe after all to use SYNC_RELEASE and > > > other > > > modern lightweight barriers unconditionally under the assumption that > > > architecture was meant to remain backward compatible. Even though it > > > might be possible someone would implement unusual semantics for the then > > > undefined `stype' values, I highly doubt it as it would be extra effort > > > and hardware logic space for no gain. We could try and reach > > > architecture > > > overseers to double-check whether the `stype' encodings, somewhat > > > irregularly distributed, were indeed defined in a manner so as not to > > > clash with values implementers chose to use before rev. 2.61 of the > > > architecture specification. > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > By my reading of architecture specifications it does. Specifically > SYNC_RELEASE (18) applies to older loads and stores, and newer stores, and > SYNC_ACQUIRE (17) applies to older loads, and newer loads and stores. So > the two combined ought to be the equivalent to SYNC_MB (16), which applies > to both older and newer loads and stores. Of course care has to be taken

Hmm.. so the following reordering couldn't happen?

Program order:

	LOAD A
	SYNC_RELEASE
	STORE B
	LOAD C
	SYNC_ACQUIRE
	LOAD D

First becomes:

	LOAD C		<- SYNC_RELEASE doesn't order newer loads.
	LOAD A
	SYNC_RELEASE
	STORE B
	SYNC_ACQUIRE
	LOAD D

And then becomes:

	LOAD C		<- SYNC_ACQUIRE still affects these loads.
	LOAD D		<- SYNC_RELEASE doesn't order newer loads.
	LOAD A
	SYNC_RELEASE
	STORE B
	SYNC_ACQUIRE

This doesn't mean that the SYNC instructions themselves can be reordered; it means that the reordering doesn't break SYNC_ACQUIRE's guarantee. I ask this because some architectures (e.g. PPC) allow this kind of reordering. 
Please see "ACQUIRING FUNCTIONS" in memory-barriers.txt for more information. Thank you ;-) Regards, Boqun > about what happens between SYNC_RELEASE and SYNC_ACQUIRE. This is still > more lightweight than classic SYNC (0). See the architecture documents, > e.g. the MIPS32 one[1] for details. > > References: > > [1] "MIPS Architecture For Programmers, Volume II-A: The MIPS32 > Instruction Set", MIPS Technologies, Inc., Document Number: MD00086, > Revision 5.04, December 11, 2013, Table 4.7 "Encodings of the > Bits[10:6] of the SYNC instruction; the SType Field", p. 305 > > HTH, > > Maciej
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > to ensure globally transitive unlock->lock ordering should you decide > > > to relax your locking barriers. > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > cpu litmus test that goes with it. > > Sure, I was building up to that. I just wanted to make sure the basics > were there (program-order, so same CPU) before we go any further. It > sounds like they are, so that's promising. > > > Maciej, the tricky point is what, if any, effect the > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > replace with the RELEASE+ACQUIRE pair. > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > MIPS64r6 manual but failed to reach a conclusion on this. > > For the inter-thread case, Paul had a previous example along the lines > of: > > > Wx=1 > WyRel=1 > > RyAcq=1 > Rz=0 > > Wz=1 > smp_mb() > Rx=0 Each paragraph being a separate thread, correct? If so, agreed. > and I suppose a variant of that: > > > Wx=1 > WyRel=1 > > RyAcq=1 > Wz=1 > > Rz=1 > > Rx=0 Agreed, this would be needed as well, along with the read-read and read-write variants. I picked the write-read version (Will's first test above) because write-read reordering is the most likely on hardware that I am aware of. Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > to ensure globally transitive unlock->lock ordering should you decide > > to relax your locking barriers. > > You know that is a tricky question. Maybe its easier if you give the 3 > cpu litmus test that goes with it. Sure, I was building up to that. I just wanted to make sure the basics were there (program-order, so same CPU) before we go any further. It sounds like they are, so that's promising. > Maciej, the tricky point is what, if any, effect the > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > the TRANSITIVITY section in Documentation/memory-barriers.txt and > replace with the RELEASE+ACQUIRE pair. > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > MIPS64r6 manual but failed to reach a conclusion on this. For the inter-thread case, Paul had a previous example along the lines of: Wx=1 WyRel=1 RyAcq=1 Rz=0 Wz=1 smp_mb() Rx=0 and I suppose a variant of that: Wx=1 WyRel=1 RyAcq=1 Wz=1 Rz=1 Rx=0 Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > to ensure globally transitive unlock->lock ordering should you decide > to relax your locking barriers. You know that is a tricky question. Maybe its easier if you give the 3 cpu litmus test that goes with it. Maciej, the tricky point is what, if any, effect the SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review the TRANSITIVITY section in Documentation/memory-barriers.txt and replace with the RELEASE+ACQUIRE pair. We've all (Will, Paul and me) had much 'fun' trying to decipher the MIPS64r6 manual but failed to reach a conclusion on this.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, 27 Jan 2016, Will Deacon wrote: > > Overall I think it should be safe after all to use SYNC_RELEASE and other > > modern lightweight barriers unconditionally under the assumption that > > architecture was meant to remain backward compatible. Even though it > > might be possible someone would implement unusual semantics for the then > > undefined `stype' values, I highly doubt it as it would be extra effort > > and hardware logic space for no gain. We could try and reach architecture > > overseers to double-check whether the `stype' encodings, somewhat > > irregularly distributed, were indeed defined in a manner so as not to > > clash with values implementers chose to use before rev. 2.61 of the > > architecture specification. > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? By my reading of architecture specifications it does. Specifically SYNC_RELEASE (18) applies to older loads and stores, and newer stores, and SYNC_ACQUIRE (17) applies to older loads, and newer loads and stores. So the two combined ought to be equivalent to SYNC_MB (16), which applies to both older and newer loads and stores. Of course care has to be taken about what happens between SYNC_RELEASE and SYNC_ACQUIRE. This is still more lightweight than classic SYNC (0). See the architecture documents, e.g. the MIPS32 one[1] for details. References: [1] "MIPS Architecture For Programmers, Volume II-A: The MIPS32 Instruction Set", MIPS Technologies, Inc., Document Number: MD00086, Revision 5.04, December 11, 2013, Table 4.7 "Encodings of the Bits[10:6] of the SYNC instruction; the SType Field", p. 305 HTH, Maciej
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Maciej, Thanks for digging all this up. On Wed, Jan 27, 2016 at 09:57:24AM +, Maciej W. Rozycki wrote: > On Thu, 12 Nov 2015, David Daney wrote: > > > > > Certainly we can load up the code with "SYNC" all over the place, but > > > > it will kill performance on SMP systems. So, my vote would be to make > > > > it as light weight as possible, but no lighter. That will mean > > > > inventing the proper barrier primitives. > > > > > > It seems to me that the proper barrier here is a "SYNC 18" aka > > > SYNC_RELEASE instruction, at least on CPUs that implement that variant. > > For the record, we've had "cooked" aliases in the toolchain for a short > while now -- since Sep 2010 or binutils 2.21 -- so for readability you can > actually use `sync_release' in your source code rather than obscure `sync > 18' (of course you could define a macro instead, but there's no need now), > and disassembly will show the "cooked" mnemonic too. > > Although Documentation/Changes still lists binutils 2.12 as the minimum, > so perhaps using macros is indeed the way to go now, at least for the time > being. > > > Yes, unfortunately very few CPUs implement that. It is an instruction that > > MIPS invented only recently, so older CPUs need a different solution. > > Hmm, it looks to me we might actually be safe, although as often the > situation seems more complicated than it had to be. [... trim ISA archaeology ...] > Overall I think it should be safe after all to use SYNC_RELEASE and other > modern lightweight barriers uncondtionally under the assumption that > architecture was meant to remain backward compatible. Even though it > might be possible someone would implement unusual semantics for the then > undefined `stype' values, I highly doubt it as it would be extra effort > and hardware logic space for no gain. 
We could try and reach architecture > overseers to double-check whether the `stype' encodings, somewhat > irregularly distributed, were indeed defined in a manner so as not to > clash with values implementers chose to use before rev. 2.61 of the > architecture specification. Do you know whether a SYNC 18 (RELEASE) followed in program order by a SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? If not, you may need to implement smp_mb__after_unlock_lock for RCU to ensure globally transitive unlock->lock ordering should you decide to relax your locking barriers. Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, 12 Nov 2015, David Daney wrote: > > > Certainly we can load up the code with "SYNC" all over the place, but > > > it will kill performance on SMP systems. So, my vote would be to make > > > it as light weight as possible, but no lighter. That will mean > > > inventing the proper barrier primitives. > > > > It seems to me that the proper barrier here is a "SYNC 18" aka > > SYNC_RELEASE instruction, at least on CPUs that implement that variant. For the record, we've had "cooked" aliases in the toolchain for a short while now -- since Sep 2010 or binutils 2.21 -- so for readability you can actually use `sync_release' in your source code rather than obscure `sync 18' (of course you could define a macro instead, but there's no need now), and disassembly will show the "cooked" mnemonic too. Although Documentation/Changes still lists binutils 2.12 as the minimum, so perhaps using macros is indeed the way to go now, at least for the time being. > Yes, unfortunately very few CPUs implement that. It is an instruction that > MIPS invented only recently, so older CPUs need a different solution. Hmm, it looks to me we might actually be safe, although as often the situation seems more complicated than it had to be. Conventional wisdom says that SYNC as the ultimate ordering barrier, aka SYNC 0, was added with the MIPS II ISA, with a provision to define less restrictive barriers in the future in a backward compatible manner, by the means of undefined (any non-zero at the time) barrier types defaulting to 0. Early references seem to have been lost in the mist of time, however a few legacy MIPS ISA documents remain, e.g. the MIPS IV ISA document says[1]: "The stype values 1-31 are reserved; they produce the same result as the value zero." making it clear that non-zero arguments will work as expected, albeit perhaps with a somewhat heavyweight effect. But there's sometimes no other way. This seems more ambiguous with earlier documentation available, e.g. 
the MIPS R4000 processor manual, which omits the mention of `stype' altogether and merely defines a single SYNC instruction encoding with all-zeros across bits 25:6 of the instruction word, among which `stype' normally lives[2]. This appears the same with other MIPS III processor documentation (e.g. IDT 79RV4700[3]). However I'm fairly sure all these simply did not bother decoding SYNC beyond the major and minor opcode, so again SYNC 0 semantics should be held across the more recently defined variants. I could actually try this sometime with an R4000 class processor. Modern MIPS architecture specifications started with the same definition as the MIPS IV ISA had, rev. 0.95 documents still stated[4][5]: "The stype values 1-31 are reserved; they produce the same result as the value zero." Unfortunately the requirement got weakened later on, rev. 1.00 architecture specifications now stated[6][7]: "The stype values 1-31 are reserved for future extensions to the architecture. A value of zero will always be defined such that it performs all defined synchronization operations. Non-zero values may be defined to remove some synchronization operations. As such, software should never use a non-zero value of the stype field, as this may inadvertently cause future failures if non-zero values remove synchronization operations." I think the intent was not to break backwards compatibility, and certainly anyone who looked at one of the earlier documents might have realised that implementing non-zero SYNC operations, that do not have a vendor-specific semantics, as aliases to SYNC 0 rather than NOP or RI triggers would be a good idea. However implementers may not have been able to infer that from reading the lone current revision of architecture documents. It was only with rev. 
2.60 of architecture specifications that, along with the new SYNC operations, the requirement for undefined SYNC operations to behave as SYNC 0 was put in the text back in an unambiguous form[8][9]: "A stype value of zero will always be defined such that it performs the most complete set of synchronization operations that are defined. This means stype zero always does a completion barrier that affects both loads and stores preceding the SYNC instruction and both loads and stores that are subsequent to the SYNC instruction. Non-zero values of stype may be defined by the architecture or specific implementations to perform synchronization behaviors that are less complete than that of stype zero. If an implementation does not use one of these non-zero values to define a different synchronization behavior, then that non-zero value of stype must act the same as stype zero completion barrier. This allows software written for an implementation with a lighter-weight barrier to work on another implementation which only implements the stype zero completion barrier." This definition has then been retained in the architecture specification throughout.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 11/12/2015 10:13 AM, Måns Rullgård wrote: David Daney writes: On 11/12/2015 04:31 AM, Peter Zijlstra wrote: Hi I think the MIPS arch_spin_unlock() is borken. spin_unlock() must have RELEASE semantics, these require that no LOADs nor STOREs leak out from the critical section. From what I know MIPS has a relaxed memory model which allows reads to pass stores, and as implemented arch_spin_unlock() only issues a wmb which doesn't order prior reads vs later stores. Therefore upgrade the wmb() to smp_mb(). (Also, why the unconditional wmb, as opposed to smp_wmb() ?) asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would imply that special handling for non-SMP is needed, when this is already only used for the SMP build case. Maybe-Signed-off-by: Peter Zijlstra (Intel) --- diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h index 40196bebe849..b2ca13f06152 100644 --- a/arch/mips/include/asm/spinlock.h +++ b/arch/mips/include/asm/spinlock.h @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) static inline void arch_spin_unlock(arch_spinlock_t *lock) { unsigned int serving_now = lock->h.serving_now + 1; - wmb(); + smp_mb(); That is too heavy. It implies a full MIPS "SYNC" operation which stalls execution until all previous writes are committed and globally visible. We really want just release semantics, and there is no standard named primitive that gives us that. For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: smp_wmb(); smp_rmb(); Which expands to exactly the same thing as wmb() because smp_rmb() expands to nothing. For CPUs that have out-of-order loads, smp_rmb() should expand to something lighter weight than "SYNC" Certainly we can load up the code with "SYNC" all over the place, but it will kill performance on SMP systems. So, my vote would be to make it as light weight as possible, but no lighter. That will mean inventing the proper barrier primitives. 
It seems to me that the proper barrier here is a "SYNC 18" aka SYNC_RELEASE instruction, at least on CPUs that implement that variant. Yes, unfortunately very few CPUs implement that. It is an instruction that MIPS invented only recently, so older CPUs need a different solution. David Daney -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
David Daney writes: > On 11/12/2015 04:31 AM, Peter Zijlstra wrote: >> Hi >> >> I think the MIPS arch_spin_unlock() is borken. >> >> spin_unlock() must have RELEASE semantics, these require that no LOADs >> nor STOREs leak out from the critical section. >> >> From what I know MIPS has a relaxed memory model which allows reads to >> pass stores, and as implemented arch_spin_unlock() only issues a wmb >> which doesn't order prior reads vs later stores. >> >> Therefore upgrade the wmb() to smp_mb(). >> >> (Also, why the unconditional wmb, as opposed to smp_wmb() ?) > > asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would > imply that special handling for non-SMP is needed, when this is > already only used for the SMP build case. > >> >> Maybe-Signed-off-by: Peter Zijlstra (Intel) >> --- >> diff --git a/arch/mips/include/asm/spinlock.h >> b/arch/mips/include/asm/spinlock.h >> index 40196bebe849..b2ca13f06152 100644 >> --- a/arch/mips/include/asm/spinlock.h >> +++ b/arch/mips/include/asm/spinlock.h >> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) >> static inline void arch_spin_unlock(arch_spinlock_t *lock) >> { >> unsigned int serving_now = lock->h.serving_now + 1; >> -wmb(); >> +smp_mb(); > > That is too heavy. > > It implies a full MIPS "SYNC" operation which stalls execution until > all previous writes are committed and globally visible. > > We really want just release semantics, and there is no standard named > primitive that gives us that. > > For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: > > smp_wmb(); > smp_rmb(); > > Which expands to exactly the same thing as wmb() because smp_rmb() > expands to nothing. > > For CPUs that have out-of-order loads, smp_rmb() should expand to > something lighter weight than "SYNC" > > Certainly we can load up the code with "SYNC" all over the place, but > it will kill performance on SMP systems. So, my vote would be to make > it as light weight as possible, but no lighter. 
That will mean > inventing the proper barrier primitives. It seems to me that the proper barrier here is a "SYNC 18" aka SYNC_RELEASE instruction, at least on CPUs that implement that variant. -- Måns Rullgård m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 09:46:53AM -0800, David Daney wrote: > For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: > > smp_wmb(); > smp_rmb(); > > Which expands to exactly the same thing as wmb() because smp_rmb() expands > to nothing. OK, so the current code isn't broken because for Cavium wmb is sufficient because rmb is a no-op, and for !Cavium wmb expands to SYNC. > You yourself seem to have added smp_store_release(), so we could even do: > > smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1); > > That would leave us to cook up a proper definition of smp_store_release(). That is indeed the better solution.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 11/12/2015 04:31 AM, Peter Zijlstra wrote: Hi I think the MIPS arch_spin_unlock() is borken. spin_unlock() must have RELEASE semantics, these require that no LOADs nor STOREs leak out from the critical section. From what I know MIPS has a relaxed memory model which allows reads to pass stores, and as implemented arch_spin_unlock() only issues a wmb which doesn't order prior reads vs later stores. Therefore upgrade the wmb() to smp_mb(). (Also, why the unconditional wmb, as opposed to smp_wmb() ?) asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would imply that special handling for non-SMP is needed, when this is already only used for the SMP build case. Maybe-Signed-off-by: Peter Zijlstra (Intel) --- diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h index 40196bebe849..b2ca13f06152 100644 --- a/arch/mips/include/asm/spinlock.h +++ b/arch/mips/include/asm/spinlock.h @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) static inline void arch_spin_unlock(arch_spinlock_t *lock) { unsigned int serving_now = lock->h.serving_now + 1; - wmb(); + smp_mb(); That is too heavy. It implies a full MIPS "SYNC" operation which stalls execution until all previous writes are committed and globally visible. We really want just release semantics, and there is no standard named primitive that gives us that. For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: smp_wmb(); smp_rmb(); Which expands to exactly the same thing as wmb() because smp_rmb() expands to nothing. For CPUs that have out-of-order loads, smp_rmb() should expand to something lighter weight than "SYNC" Certainly we can load up the code with "SYNC" all over the place, but it will kill performance on SMP systems. So, my vote would be to make it as light weight as possible, but no lighter. That will mean inventing the proper barrier primitives. 
You yourself seem to have added smp_store_release(), so we could even do:

	smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1);

That would leave us to cook up a proper definition of smp_store_release().

David Daney
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 02:50:00PM +, Måns Rullgård wrote:
> "Paul E. McKenney" writes:
>
> > On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> >> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
> >
> > One guess is that they want to order I/O accesses within the critical
> > section?
>
> Isn't that what mmiowb() is for?

Indeed it is. Perhaps they didn't trust the device drivers that they
care about to use it? Anyway, just my guess.

Just out of curiosity, what is your guess?

							Thanx, Paul

> --
> Måns Rullgård
> m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
"Paul E. McKenney" writes:

> On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
>> Hi
>>
>> I think the MIPS arch_spin_unlock() is broken.
>>
>> spin_unlock() must have RELEASE semantics; these require that no LOADs
>> nor STOREs leak out from the critical section.
>>
>> From what I know, MIPS has a relaxed memory model which allows reads to
>> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
>> which doesn't order prior reads vs. later stores.
>>
>> Therefore upgrade the wmb() to smp_mb().
>>
>> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
>
> One guess is that they want to order I/O accesses within the critical
> section?

Isn't that what mmiowb() is for?

--
Måns Rullgård
m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> Hi
>
> I think the MIPS arch_spin_unlock() is broken.
>
> spin_unlock() must have RELEASE semantics; these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know, MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
> which doesn't order prior reads vs. later stores.
>
> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)

One guess is that they want to order I/O accesses within the critical
section?

							Thanx, Paul

> Maybe-Signed-off-by: Peter Zijlstra (Intel)
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;
>  	nudge_writes();
>  }
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Peter Zijlstra writes:

> Hi
>
> I think the MIPS arch_spin_unlock() is broken.
>
> spin_unlock() must have RELEASE semantics; these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know, MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
> which doesn't order prior reads vs. later stores.

This is correct.

> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)

Good question. The current MIPS asm/barrier.h uses a plain SYNC
instruction for all kinds of barriers (except on Cavium Octeon), which
is a bit wasteful. A MIPS implementation can optionally support partial
barriers (load, store, acquire, release) which all behave like a full
barrier if not implemented, so those really ought to be used.

> Maybe-Signed-off-by: Peter Zijlstra (Intel)
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;
>  	nudge_writes();
>  }

All this weirdness was added in commit 500c2e1f:

    MIPS: Optimize spinlocks.

    The current locking mechanism uses an ll/sc sequence to release a
    spinlock.  This is slower than a wmb() followed by a store to unlock.

    The branching forward to .subsection 2 on sc failure slows down the
    contended case.  So we get rid of that part too.

    Since we are now working on naturally aligned u16 values, we can get
    rid of a masking operation as the LHU already does the right thing.
    The ANDI are reversed for better scheduling on multi-issue CPUs.

    On a 12 CPU 750MHz Octeon cn5750 this patch improves ipv4 UDP packet
    forwarding rates from 3.58*10^6 PPS to 3.99*10^6 PPS, or about 11%.

    Signed-off-by: David Daney
    To: linux-m...@linux-mips.org
    Patchwork: http://patchwork.linux-mips.org/patch/937/
    Signed-off-by: Ralf Baechle

--
Måns Rullgård
m...@mansr.com
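As a rough portable illustration of those partial-barrier flavours, the C11 fences can stand in for the kernel barrier names. This is an approximation only: each C11 fence here is at least as strong as the corresponding kernel barrier, the names are invented, and the mapping onto actual MIPS SYNC stype encodings is implementation-specific.

```c
#include <stdatomic.h>

/* Approximate userspace analogues of the kernel barriers discussed
 * above. On a MIPS core with the optional lightweight SYNC variants,
 * these could be lighter than a full SYNC; where a variant is not
 * implemented, the hardware falls back to full-barrier behaviour. */
static inline void my_smp_mb(void)  { atomic_thread_fence(memory_order_seq_cst); }
static inline void my_smp_wmb(void) { atomic_thread_fence(memory_order_release); }
static inline void my_smp_rmb(void) { atomic_thread_fence(memory_order_acquire); }
```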
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;

Also, you might want to consider using WRITE_ONCE() for that store.

>  	nudge_writes();

And that thing is creative; I've not seen any other arch issue barriers
to speed up the store-buffer flush. Does it really matter for MIPS?

>  }
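For context, WRITE_ONCE() boils down to a store through a volatile lvalue, which stops the compiler from tearing, fusing, or re-issuing the store. A minimal single-type sketch (the real kernel macro is type-generic; this name is invented for illustration):

```c
#include <assert.h>

/* Hypothetical single-type sketch of WRITE_ONCE(): the volatile cast
 * forces the compiler to emit exactly one u16-sized store, so the
 * serving_now update cannot be torn into byte stores or duplicated. */
#define MY_WRITE_ONCE_U16(x, val) \
    (*(volatile unsigned short *)&(x) = (unsigned short)(val))
```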
[RFC][PATCH] mips: Fix arch_spin_unlock()
Hi

I think the MIPS arch_spin_unlock() is broken.

spin_unlock() must have RELEASE semantics; these require that no LOADs
nor STOREs leak out from the critical section.

From what I know, MIPS has a relaxed memory model which allows reads to
pass stores, and as implemented arch_spin_unlock() only issues a wmb,
which doesn't order prior reads vs. later stores.

Therefore upgrade the wmb() to smp_mb().

(Also, why the unconditional wmb, as opposed to smp_wmb() ?)

Maybe-Signed-off-by: Peter Zijlstra (Intel)
---
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 40196bebe849..b2ca13f06152 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	unsigned int serving_now = lock->h.serving_now + 1;
-	wmb();
+	smp_mb();
 	lock->h.serving_now = (u16)serving_now;
 	nudge_writes();
 }
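To make the discussion concrete, here is a self-contained userspace C11 sketch of the same ticket-lock scheme, with the unlock done as a single release store rather than a wmb() plus plain store. This is illustration only: it is not the MIPS implementation, and all names are invented.

```c
#include <stdatomic.h>

/* Userspace C11 sketch of a ticket spinlock. lock() takes a ticket and
 * spins until serving_now reaches it; unlock() publishes the next
 * ticket with a release store, which orders all prior loads AND stores
 * in the critical section before the store -- the property wmb() alone
 * does not give on a relaxed-memory-model CPU. */
struct ticket_lock {
    _Atomic unsigned short ticket;      /* next ticket to hand out */
    _Atomic unsigned short serving_now; /* ticket currently served  */
};

static void ticket_lock(struct ticket_lock *l)
{
    unsigned short me = atomic_fetch_add_explicit(&l->ticket, 1,
                                                  memory_order_relaxed);
    /* Acquire on the load pairs with the release in ticket_unlock(). */
    while (atomic_load_explicit(&l->serving_now,
                                memory_order_acquire) != me)
        ; /* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
    unsigned short next = atomic_load_explicit(&l->serving_now,
                                               memory_order_relaxed) + 1;
    atomic_store_explicit(&l->serving_now, next, memory_order_release);
}
```

The unlock path is one relaxed load, an increment, and one release store: exactly the smp_store_release() shape David Daney suggested upthread.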