Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 09, 2016 at 12:23:58PM +0100, Ingo Molnar wrote:
> * Will Deacon wrote:
> > On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> > > On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > > > In fact I'd suggest to test this via a quick runtime hack like this in
> > > > rcupdate.h:
> > > >
> > > > 	extern int panic_timeout;
> > > >
> > > > 	...
> > > >
> > > > 	if (panic_timeout)
> > > > 		smp_load_acquire(p);
> > > > 	else
> > > > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> > > >
> > > > (or so)
> > >
> > > So the problem with this is that a LOAD LOAD sequence isn't an
> > > ordering hazard on ARM, so you're potentially at the mercy of the branch
> > > predictor as to whether you get an acquire. That's not to say it won't
> > > be discarded as soon as the conditional is resolved, but it could
> > > screw up the benchmarking.
> > >
> > > I'd be better off doing some runtime patching, but that's not something
> > > I can knock up in a couple of minutes (so I'll add it to my list).
> >
> > ... so I actually got that up and running, believe it or not. Filthy stuff.
>
> Wow!
>
> I tried to implement the simpler solution by hacking rcupdate.h, but got
> drowned in nasty circular header file dependencies and gave up...

I hacked linux/compiler.h, but got similar problems and ended up copying
what I need under a new namespace (posh way of saying I added _WILL to
everything until it built).

> If you are not overly embarrassed by posting hacky patches, mind posting
> your solution?

Ok, since you asked! I'm thoroughly ashamed of the hacks, but maybe it
will keep those spammy Microsoft recruiters at bay ;)

Note that the "trigger" is changing the loglevel, but you need to do this
by echoing to /proc/sysrq-trigger, otherwise the whole thing gets invoked
in irq context which ends badly. It's also heavily arm64-specific.
> > The good news is that you're right, and I'm now seeing ~1% difference
> > between the runs with ~0.3% noise for either of them. I still think
> > that's significant, but it's a lot more reassuring than 4%.
>
> hm, so for such marginal effects I think we could improve the testing
> method a bit: we could improve 'perf bench sched messaging' to allow
> 'steady state testing': to not exit+restart all the processes between
> test iterations, but to continuously measure and print out current
> performance figures.
>
> I.e. every 10 seconds it could print a decaying running average of
> current throughput.
>
> That way you could patch/unpatch the instructions without having to
> restart the tasks. If you still see an effect (in the numbers reported
> every 10 seconds), then that's a guaranteed result.
>
> [ We have such functionality in 'perf bench numa' (the
>   --show-convergence option), for similar reasons, to allow runtime
>   monitoring and tweaking of kernel parameters. ]

That sounds handy.
Will

--->8

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 8f271b83f910..1a1e353983be 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -30,8 +30,8 @@
 #define ARM64_HAS_LSE_ATOMICS		5
 #define ARM64_WORKAROUND_CAVIUM_23154	6
 #define ARM64_WORKAROUND_834220		7
-
-#define ARM64_NCAPS			8
+#define ARM64_RCU_USES_ACQUIRE		8
+#define ARM64_NCAPS			9
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/alternative.c b/arch/arm64/kernel/alternative.c
index d2ee1b21a10d..edbf188a8541 100644
--- a/arch/arm64/kernel/alternative.c
+++ b/arch/arm64/kernel/alternative.c
@@ -143,6 +143,66 @@ static int __apply_alternatives_multi_stop(void *unused)
 	return 0;
 }
 
+static void __apply_alternatives_rcu(void *alt_region)
+{
+	struct alt_instr *alt;
+	struct alt_region *region = alt_region;
+	u32 *origptr, *replptr;
+
+	for (alt = region->begin; alt < region->end; alt++) {
+		u32 insn, orig_insn;
+		int nr_inst;
+
+		if (alt->cpufeature != ARM64_RCU_USES_ACQUIRE)
+			continue;
+
+		BUG_ON(alt->alt_len != alt->orig_len);
+
+		origptr = ALT_ORIG_PTR(alt);
+		replptr = ALT_REPL_PTR(alt);
+		nr_inst = alt->alt_len / sizeof(insn);
+
+		BUG_ON(nr_inst != 1);
+
+		insn = le32_to_cpu(*replptr);
+		orig_insn = le32_to_cpu(*origptr);
+		*(origptr) = cpu_to_le32(insn);
+		*replptr = cpu_to_le32(orig_insn);
+
+		flush_icache_range((uintptr_t)origptr, (uintptr_t)(origptr + 1));
+		pr_info_ratelimited("%p: 0x%x => 0x%x\n", origptr, orig_insn, insn);
+	}
+}
+
+static int will_patched;
+
+static int __apply_alternatives
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
* Will Deacon wrote:

> On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> > On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > > In fact I'd suggest to test this via a quick runtime hack like this in
> > > rcupdate.h:
> > >
> > > 	extern int panic_timeout;
> > >
> > > 	...
> > >
> > > 	if (panic_timeout)
> > > 		smp_load_acquire(p);
> > > 	else
> > > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> > >
> > > (or so)
> >
> > So the problem with this is that a LOAD LOAD sequence isn't an
> > ordering hazard on ARM, so you're potentially at the mercy of the branch
> > predictor as to whether you get an acquire. That's not to say it won't
> > be discarded as soon as the conditional is resolved, but it could
> > screw up the benchmarking.
> >
> > I'd be better off doing some runtime patching, but that's not something
> > I can knock up in a couple of minutes (so I'll add it to my list).
>
> ... so I actually got that up and running, believe it or not. Filthy stuff.

Wow!

I tried to implement the simpler solution by hacking rcupdate.h, but got
drowned in nasty circular header file dependencies and gave up...

If you are not overly embarrassed by posting hacky patches, mind posting
your solution?

> The good news is that you're right, and I'm now seeing ~1% difference
> between the runs with ~0.3% noise for either of them. I still think
> that's significant, but it's a lot more reassuring than 4%.

hm, so for such marginal effects I think we could improve the testing
method a bit: we could improve 'perf bench sched messaging' to allow
'steady state testing': to not exit+restart all the processes between
test iterations, but to continuously measure and print out current
performance figures.

I.e. every 10 seconds it could print a decaying running average of
current throughput.

That way you could patch/unpatch the instructions without having to
restart the tasks. If you still see an effect (in the numbers reported
every 10 seconds), then that's a guaranteed result.

[ We have such functionality in 'perf bench numa' (the --show-convergence
  option), for similar reasons, to allow runtime monitoring and tweaking
  of kernel parameters. ]

Thanks,

	Ingo
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 11:55:57AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 11:30 AM, Will Deacon wrote:
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and
> > ran hackbench a few times on the nearest idle arm64 system. The results
> > were consistently ~4% slower using acquire for rcu_dereference.
>
> Ok, that's *much* more noticeable than I would have expected. I take
> it that load-acquire is really really slow on current arm64
> implementations.

See my reply to Ingo, but it seems a bunch of this was down to rebooting
the system between runs and hackbench being particularly susceptible to
that.

> Just out of interest, is store-release slow too? Because that should
> be easy to make fast.

There's a slight gotcha with arm64's store-release instruction in that
it's RCsc and therefore orders against a subsequent load-acquire. That's
not to say you can't make it fast, but it's potentially more involved
than posting a flag in a store buffer (or whatever you were envisaging :)

Measuring store-release is much more difficult, because you can't replace
it with a dependency or the like, only other barrier constructs.

Will
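[ For readers not steeped in the arm64 memory model: "RCsc" here means the
  store-release additionally orders against a *later* load-acquire, even to
  an unrelated location. A user-space C11 sketch of the instruction pair
  Will is describing (names and values are illustrative, not kernel code): ]

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int x, y;

	int main(void)
	{
		/* compiles to STLR on arm64 */
		atomic_store_explicit(&x, 1, memory_order_release);
		/* compiles to LDAR on arm64; the RCsc rule says this load
		 * may not be reordered before the STLR above, which is why
		 * a cheap "drain the store buffer later" implementation of
		 * store-release is not quite enough */
		int r = atomic_load_explicit(&y, memory_order_acquire);
		printf("x=%d r=%d\n", atomic_load(&x), r);
		return 0;
	}

[ An RCpc model would permit that reordering; arm64 deliberately does not. ]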
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Feb 03, 2016 at 01:32:10PM +, Will Deacon wrote:
> On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> > In fact I'd suggest to test this via a quick runtime hack like this in
> > rcupdate.h:
> >
> > 	extern int panic_timeout;
> >
> > 	...
> >
> > 	if (panic_timeout)
> > 		smp_load_acquire(p);
> > 	else
> > 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
> >
> > (or so)
>
> So the problem with this is that a LOAD LOAD sequence isn't an
> ordering hazard on ARM, so you're potentially at the mercy of the branch
> predictor as to whether you get an acquire. That's not to say it won't
> be discarded as soon as the conditional is resolved, but it could
> screw up the benchmarking.
>
> I'd be better off doing some runtime patching, but that's not something
> I can knock up in a couple of minutes (so I'll add it to my list).

... so I actually got that up and running, believe it or not. Filthy stuff.

The good news is that you're right, and I'm now seeing ~1% difference
between the runs with ~0.3% noise for either of them. I still think
that's significant, but it's a lot more reassuring than 4%.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> * Will Deacon wrote:
> > On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> > > >
> > > > Given that the vast majority of weakly ordered architectures respect
> > > > address dependencies, I would expect all of them to be hurt if they
> > > > were forced to use barrier instructions instead, even those where the
> > > > microarchitecture is fairly strongly ordered in practice.
> > >
> > > I do wonder if it would be all that noticeable, though. I don't think
> > > we've really had benchmarks.
> > >
> > > For example, most of the RCU list traversal shows up on x86 - where
> > > loads are already acquires. But they show up not because of that, but
> > > because a RCU list traversal is pretty much always going to take the
> > > cache miss.
> > >
> > > So it would actually be interesting to just try it - what happens to
> > > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > > change the rcu_dereference() to be a smp_load_acquire()?
> > >
> > > Because maybe nothing happens at all. I don't think we've ever tried it.
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and
> > ran hackbench a few times on the nearest idle arm64 system. The results
> > were consistently ~4% slower using acquire for rcu_dereference.
>
> Could you please double check that? The thing is that hackbench is a
> _notoriously_ unstable workload and very dependent on various small
> details such as kernel image layout and random per-bootup cache/memory
> layouts details.

Yup. FWIW, I did try across a few reboots to arrive at the 4% figure,
but it's not conclusive (as I said :).

> In fact I'd suggest to test this via a quick runtime hack like this in
> rcupdate.h:
>
> 	extern int panic_timeout;
>
> 	...
>
> 	if (panic_timeout)
> 		smp_load_acquire(p);
> 	else
> 		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);
>
> (or so)

So the problem with this is that a LOAD LOAD sequence isn't an
ordering hazard on ARM, so you're potentially at the mercy of the branch
predictor as to whether you get an acquire. That's not to say it won't
be discarded as soon as the conditional is resolved, but it could
screw up the benchmarking.

I'd be better off doing some runtime patching, but that's not something
I can knock up in a couple of minutes (so I'll add it to my list).

> and you could get specific numbers of noise estimations via:
>
> 	triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 1
>
> 	[...]
>
> 	 Performance counter stats for 'perf bench sched messaging -l 1' (10 runs):
>
> 	       4.616404309 seconds time elapsed	( +- 1.67% )
>
> note that even with a repeat count of 10 runs and a loop count 100 times
> larger than the hackbench default, the intrinsic noise of this workload
> was still 1.6% - and that does not include boot-to-boot systematic noise.

That's helpful, thanks.

Will
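[ Pulled out of the kernel headers into standalone C11, the hack and its
  problem look roughly like this (an illustrative sketch only: `flag`
  stands in for panic_timeout, and memory_order_consume stands in for
  lockless_dereference()): ]

	#include <stdatomic.h>
	#include <assert.h>
	#include <stdio.h>

	struct foo { int a; };
	static _Atomic(struct foo *) gp;
	static int flag;	/* stands in for panic_timeout */

	/* One path does a load-acquire, the other a dependency-ordered
	 * load. Will's point: the CPU can speculate past the branch, so
	 * which ordering you actually measure is at the mercy of the
	 * predictor until the conditional resolves. */
	static struct foo *cond_dereference(void)
	{
		if (flag)
			return atomic_load_explicit(&gp, memory_order_acquire);
		return atomic_load_explicit(&gp, memory_order_consume);
	}

	int main(void)
	{
		static struct foo f = { .a = 42 };

		atomic_store_explicit(&gp, &f, memory_order_release);
		flag = 1;
		assert(cond_dereference()->a == 42);	/* acquire path */
		flag = 0;
		assert(cond_dereference()->a == 42);	/* dependency path */
		printf("ok\n");
		return 0;
	}

[ Both paths are functionally correct; the whole argument is about which
  one the hardware executes while benchmarking, which a branch can't pin
  down - hence the runtime-patching approach instead. ]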
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
* Will Deacon wrote:

> On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> > >
> > > Given that the vast majority of weakly ordered architectures respect
> > > address dependencies, I would expect all of them to be hurt if they
> > > were forced to use barrier instructions instead, even those where the
> > > microarchitecture is fairly strongly ordered in practice.
> >
> > I do wonder if it would be all that noticeable, though. I don't think
> > we've really had benchmarks.
> >
> > For example, most of the RCU list traversal shows up on x86 - where
> > loads are already acquires. But they show up not because of that, but
> > because a RCU list traversal is pretty much always going to take the
> > cache miss.
> >
> > So it would actually be interesting to just try it - what happens to
> > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > change the rcu_dereference() to be a smp_load_acquire()?
> >
> > Because maybe nothing happens at all. I don't think we've ever tried it.
>
> FWIW, and this is by no means conclusive, I hacked that up quickly and
> ran hackbench a few times on the nearest idle arm64 system. The results
> were consistently ~4% slower using acquire for rcu_dereference.

Could you please double check that? The thing is that hackbench is a
_notoriously_ unstable workload and very dependent on various small
details such as kernel image layout and random per-bootup cache/memory
layouts details.

In fact I'd suggest to test this via a quick runtime hack like this in
rcupdate.h:

	extern int panic_timeout;

	...

	if (panic_timeout)
		smp_load_acquire(p);
	else
		typeof(*p) *p1 = (typeof(*p) *__force)lockless_dereference(p);

(or so)

and then you can start a loop of hackbench runs, and in another terminal
change the ordering primitive via:

	echo 1 > /proc/sys/kernel/panic		# smp_load_acquire()
	echo 0 > /proc/sys/kernel/panic		# smp_read_barrier_depends()

without having to reboot the kernel.

Also, instead of using hackbench which has a too short runtime that makes
it sensitive to scheduling micro-details, you could try the perf-bench
hackbench work-alike where the number of loops is parametric:

	triton:~/tip> perf bench sched messaging -l 1
	# Running 'sched/messaging' benchmark:
	# 20 sender and receiver processes per group
	# 10 groups == 400 processes run

	     Total time: 4.532 [sec]

and you could get specific numbers of noise estimations via:

	triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 1

	[...]

	 Performance counter stats for 'perf bench sched messaging -l 1' (10 runs):

	       4.616404309 seconds time elapsed	( +- 1.67% )

note that even with a repeat count of 10 runs and a loop count 100 times
larger than the hackbench default, the intrinsic noise of this workload
was still 1.6% - and that does not include boot-to-boot systematic noise.
It's very easy to get systemic noise with hackbench workloads and go down
the entirely wrong road.

Of course, the numbers might also confirm your 4% figure!

Thanks,

	Ingo
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 06:23:37PM +0100, Peter Zijlstra wrote:
>
> On 2 February 2016 07:44:33 CET, "Paul E. McKenney" wrote:
>
> > > If so, do we also need to take the following pairing into
> > > consideration?
> > >
> > > o	smp_store_release() -> READ_ONCE(); if ;smp_rmb();
>
> We rely on this with smp_cond_aquire() to extend the chain.

We do indeed! And it is not possible to use smp_load_acquire() given
this API because of the "!". Hmmm... If this one turns out to be
problematic, there are some ways of dealing with it.

But thank you for reminding me of it. I think, anyway. ;-)

							Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:56:14AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 4:02 AM, Paul E. McKenney wrote:
> >
> > The sorts of things I am really worried about are abominations like this
> > (and far worse):
>
> That one doesn't have any causal chain that I can see, so I agree that
> it's an abomination, but it also doesn't act as an argument.
>
> > r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???
>
> What do you see as the problem here? The above can happen in a
> strictly ordered situation: thread2 runs first (c == 2, r3 = 0), then
> thread3 runs (d = 1, a = 1) then thread0 runs (r1 = 1) and then
> thread1 starts running but the store to c doesn't complete (now r2 = 1).

Apologies, I should have added that the condition does not get evaluated
until all the dust settles. At that point both stores to c would have
completed, so that c == 1.

> So there's no reason for your case to not happen, but the real issue
> is that there is no causal relationship that your example describes,
> so it's not even interesting.

Because of the write-to-write relationship between thread1() and
thread2(), yes. And I am very glad that you find this one uninteresting,
because including it would make things -really- complicated.

> Causality breaking is what really screws with peoples minds. The
> reason transitivity is important (and why smp_read_barrier_depends()
> is so annoying) is because causal breaks make peoples minds twist in
> bad ways.

Agreed, which means it is very important that the various flavors of
release-acquire chains be transitive. In addition, smp_mb() is
transitive, as are synchronize_rcu() and friends.

> Sadly, memory orderings are very seldom described as honoring
> causality, and instead people have the crazy litmus tests.

Indeed, a memory model defined solely by litmus tests would qualify as
an exotic form of torture. What we do instead is use sets of litmus
tests as test cases for the prototype memory model under consideration.
It is all too easy to create a set of rules that look good and sound
good, but which mess something up. The litmus tests help catch these
sorts of errors.

							Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 11:30 AM, Will Deacon wrote:
>
> FWIW, and this is by no means conclusive, I hacked that up quickly and
> ran hackbench a few times on the nearest idle arm64 system. The results
> were consistently ~4% slower using acquire for rcu_dereference.

Ok, that's *much* more noticeable than I would have expected. I take
it that load-acquire is really really slow on current arm64
implementations.

That, btw, is one reason why I despise weak memory ordering. In most
cases I've ever seen, the hardware designers have said "barriers don't
much matter" and just made them crazy slow. My old alpha was the worst
of the worst.

It's shades of my least favorite x86 microarchitecture ever: netburst.
Make common cases fast, but make exceptional cases _so_ slow that it's
actually a noticeable pain.

Just out of interest, is store-release slow too? Because that should
be easy to make fast.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> >
> > Given that the vast majority of weakly ordered architectures respect
> > address dependencies, I would expect all of them to be hurt if they
> > were forced to use barrier instructions instead, even those where the
> > microarchitecture is fairly strongly ordered in practice.
>
> I do wonder if it would be all that noticeable, though. I don't think
> we've really had benchmarks.
>
> For example, most of the RCU list traversal shows up on x86 - where
> loads are already acquires. But they show up not because of that, but
> because a RCU list traversal is pretty much always going to take the
> cache miss.
>
> So it would actually be interesting to just try it - what happens to
> kernel-centric benchmarks (which are already fairly rare) on arm if we
> change the rcu_dereference() to be a smp_load_acquire()?
>
> Because maybe nothing happens at all. I don't think we've ever tried it.

FWIW, and this is by no means conclusive, I hacked that up quickly and
ran hackbench a few times on the nearest idle arm64 system. The results
were consistently ~4% slower using acquire for rcu_dereference.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
>
> Given that the vast majority of weakly ordered architectures respect
> address dependencies, I would expect all of them to be hurt if they
> were forced to use barrier instructions instead, even those where the
> microarchitecture is fairly strongly ordered in practice.

I do wonder if it would be all that noticeable, though. I don't think
we've really had benchmarks.

For example, most of the RCU list traversal shows up on x86 - where
loads are already acquires. But they show up not because of that, but
because a RCU list traversal is pretty much always going to take the
cache miss.

So it would actually be interesting to just try it - what happens to
kernel-centric benchmarks (which are already fairly rare) on arm if we
change the rcu_dereference() to be a smp_load_acquire()?

Because maybe nothing happens at all. I don't think we've ever tried it.

> As far as I understand it, the problems with "consume" have centred
> largely around compiler and specification issues, which we don't have
> with rcu_dereference (i.e. we ignore thin-air and use volatile
> casts/barrier() to keep the optimizer at bay).

Oh, I agree. The C++ consume orderings have been different from the
kernel worries.

But if it turns out that we have situations where we lose transitivity
because of rcu_dereference not being an acquire, then we have kernel
problems.

I do see that your later email said that the pointer dependency (which
we assume in rcu) should retain the transitivity, so maybe there is no
real reason to strengthen things.

But I _would_ argue that transitivity is so important (because of the
whole "individual orderings make sense and are causal, so the end result
must make sense and be causal") that if that were to break, then such an
architecture really should just strengthen the orderings.

Because the *worst* kinds of bugs are exactly the ones where the code
makes sense and works locally, but then the combination of two or three
pieces of code that are individually sensible ends up not working for
some crazy non-transitivity reason.

That really is not something that people can cope with. And maybe we
will one day have widely available automated ordering proofs that work
across the whole kernel, and we don't even need to worry about "this
breaks peoples minds", but I don't think we are there yet.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 4:02 AM, Paul E. McKenney wrote:
>
> The sorts of things I am really worried about are abominations like this
> (and far worse):

That one doesn't have any causal chain that I can see, so I agree that
it's an abomination, but it also doesn't act as an argument.

> r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???

What do you see as the problem here? The above can happen in a strictly
ordered situation: thread2 runs first (c == 2, r3 = 0), then thread3
runs (d = 1, a = 1) then thread0 runs (r1 = 1) and then thread1 starts
running but the store to c doesn't complete (now r2 = 1).

So there's no reason for your case to not happen, but the real issue is
that there is no causal relationship that your example describes, so
it's not even interesting.

Causality breaking is what really screws with peoples minds. The reason
transitivity is important (and why smp_read_barrier_depends() is so
annoying) is because causal breaks make peoples minds twist in bad ways.

Sadly, memory orderings are very seldom described as honoring causality,
and instead people have the crazy litmus tests.

                Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:30:26AM -0800, Linus Torvalds wrote:
> On Tue, Feb 2, 2016 at 1:34 AM, Boqun Feng wrote:
> >
> > Just to be clear, what Will, Paul and I are discussing here is about
> > local transitivity,
>
> I really don't think that changes the picture. Given that
>
> (a) we already mix ordering methods and there are good reasons for
> it, and I'd expect transitivity only makes that more likely
>
> (b) we expect transitivity from the individual ordering methods
>
> (c) I don't think that there are any relevant CPU's that violate this anyway
>
> I really think that not expecting that to hold for mixed accesses
> would be a complete disaster. It will confuse the hell out of people.
>
> And the basic argument really stands: we should make the memory
> ordering expectations as strong as we can, given the existing relevant
> architecture constraints (ie x86/arm/power).
>
> If that then means that some other architecture might need to add
> extra serialization that that architecture doesn't _want_ to add,
> tough luck. I absolutely hate the fact that alpha forced us to add
> that crazy read-depends barrier, and I want to discourage that a lot.
>
> In fact, I'd be willing to strengthen our existing orderings just in
> the name of sanity, and say that "rcu_dereference()" should just be an
> acquire, and say that if the architecture makes that more expensive,
> then who the hell cares? I have not been very happy with the "consume"
> memory ordering discussions for C++. Yes, it would hurt pre-lwsync
> power a bit, and it would hurt 32-bit arm, but enough that we should
> have the headache of the existing semantics?

Given that the vast majority of weakly ordered architectures respect
address dependencies, I would expect all of them to be hurt if they
were forced to use barrier instructions instead, even those where the
microarchitecture is fairly strongly ordered in practice.

Even load-acquire on ARMv8 has more work to do than a plain old address
dependency, so I'd be sad to see us upgrading rcu_dereference like this,
particularly when it's a relatively uncontentious, easy to understand
part of the kernel memory model.

As far as I understand it, the problems with "consume" have centred
largely around compiler and specification issues, which we don't have
with rcu_dereference (i.e. we ignore thin-air and use volatile
casts/barrier() to keep the optimizer at bay).

Will
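[ The distinction being weighed here, in user-space C11 terms (a sketch
  only - in the kernel the two readers are rcu_dereference() and
  smp_load_acquire(), and the names below are illustrative): ]

	#include <stdatomic.h>
	#include <stdio.h>

	struct foo { int a; };
	static _Atomic(struct foo *) gp;

	/* Address dependency: only accesses *through* the loaded pointer
	 * are ordered, which most weakly ordered CPUs provide for free.
	 * (Compilers today typically promote consume to acquire, but it
	 * models what rcu_dereference() relies on in hardware.) */
	static struct foo *read_dep(void)
	{
		return atomic_load_explicit(&gp, memory_order_consume);
	}

	/* Load-acquire: also orders *independent* later loads and stores,
	 * so on ARMv8 it needs LDAR rather than a plain load. */
	static struct foo *read_acq(void)
	{
		return atomic_load_explicit(&gp, memory_order_acquire);
	}

	int main(void)
	{
		static struct foo f = { .a = 1 };

		atomic_store_explicit(&gp, &f, memory_order_release);
		printf("dep=%d acq=%d\n", read_dep()->a, read_acq()->a);
		return 0;
	}

[ Both readers return the same pointer; the cost difference is entirely in
  how much *other* memory traffic each one forces the CPU to order. ]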
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 09:12:37AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote:
> > On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote:
> > > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote:
> > > > Well, the following ISA2 test is permitted on ARM:
> > > >
> > > > P0:
> > > > Wx=1
> > > > WyRel=1	// rcu_assign_pointer
> > > >
> > > > P1:
> > > > Ry=1	// rcu_dereference
> > >
> > > What if a dependency is added here? Same result?
> >
> > Right, that fixes it. So if we're only considering things like:
> >
> >   rcu_dereference
> >
> >   RELEASE
> >
> > then local transitivity should be preserved.
>
> Whew!!! ;-)
>
> > I think the same applies to , which seems to match your later
> > example.
>
> Could you please check?

I should've been more concrete here: it does indeed apply if you replace
the with a in the snippet above, I just hadn't got Boqun's example paged
in and didn't want to commit on the examples being identical.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 1:34 AM, Boqun Feng wrote:
>
> Just to be clear, what Will, Paul and I are discussing here is about
> local transitivity,

I really don't think that changes the picture.

Given that

 (a) we already mix ordering methods and there are good reasons for it,
     and I'd expect transitivity only makes that more likely

 (b) we expect transitivity from the individual ordering methods

 (c) I don't think that there are any relevant CPU's that violate this anyway

I really think that not expecting that to hold for mixed accesses would
be a complete disaster. It will confuse the hell out of people.

And the basic argument really stands: we should make the memory ordering
expectations as strong as we can, given the existing relevant
architecture constraints (ie x86/arm/power).

If that then means that some other architecture might need to add extra
serialization that that architecture doesn't _want_ to add, tough luck.
I absolutely hate the fact that alpha forced us to add that crazy
read-depends barrier, and I want to discourage that a lot.

In fact, I'd be willing to strengthen our existing orderings just in the
name of sanity, and say that "rcu_dereference()" should just be an
acquire, and say that if the architecture makes that more expensive,
then who the hell cares? I have not been very happy with the "consume"
memory ordering discussions for C++. Yes, it would hurt pre-lwsync power
a bit, and it would hurt 32-bit arm, but enough that we should have the
headache of the existing semantics?

So I think our current memory orderings are potentially too _weak_, and
we sure as hell shouldn't strive to weaken them further. I think doing
"smp_read_barrier_depends()" was a mistake, but it was a mistake driven
by the fact that our memory ordering _used_ to be barrier-centric.

If alpha were to have happened today, I would say that rather than have
smp_read_barrier_depends(), we should just say that anything that
requires it should use smp_load_acquire() (or rcu_dereference - which I
think could have the same semantics), and be done with it.

And if I would make that choice today, why isn't the right thing to just
get rid of that weak nasty thing, and convert the existing (few - there
really aren't that many) smp_read_barriers() away from that model.

See what I'm saying? We've been pandering to weak memory ordering
before. We may have had our reasons to do so, but I think it's a
mistake. It results in code that is hard to think about.

So the fact that people worry about "rcu_dereference()" really makes me
think that one is just too weak, and we should just bite the bullet and
say it's an acquire. I'd much rather take a few barriers than make our
programming model hard to understand.

                Linus
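[ The two shapes Linus is contrasting, side by side in a user-space C11
  sketch. The macros here are local stand-ins for the kernel ones, not
  the real implementations - in particular smp_read_barrier_depends() is
  a no-op on everything except Alpha, and is modelled below as an acquire
  fence purely for illustration: ]

	#include <stdatomic.h>
	#include <assert.h>
	#include <stdio.h>

	#define READ_ONCE(x) \
		atomic_load_explicit(&(x), memory_order_relaxed)
	#define smp_read_barrier_depends() \
		atomic_thread_fence(memory_order_acquire)

	struct foo { int a; };
	static _Atomic(struct foo *) gp;

	int main(void)
	{
		static struct foo f = { .a = 7 };
		struct foo *p;

		atomic_store_explicit(&gp, &f, memory_order_release);

		/* the barrier-centric, Alpha-era pattern: load, then the
		 * dependency barrier, then dereference */
		p = READ_ONCE(gp);
		smp_read_barrier_depends();
		assert(p->a == 7);

		/* the shape Linus prefers: one load-acquire, no separate
		 * barrier for readers to forget */
		p = atomic_load_explicit(&gp, memory_order_acquire);
		assert(p->a == 7);

		printf("ok\n");
		return 0;
	}

[ The argument isn't that the first form is wrong - it's that splitting
  the ordering into a separate barrier is exactly the kind of thing that
  is easy to omit and hard to review. ]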
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 2 February 2016 07:44:33 CET, "Paul E. McKenney" wrote:
>
> > If so, do we also need to take the following pairing into
> > consideration?
> >
> > o	smp_store_release() -> READ_ONCE(); if ;smp_rmb();

We rely on this with smp_cond_aquire() to extend the chain.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote: > On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote: > > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > > > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > > Locally transitive chain termination: > > > > > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > > > > > Agreed. > > > > > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > > > o smp_store_release() -> READ_ONCE(); if > > > > > > > > Just want to make sure, this one is actually: > > > > > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > > > > > right? Because control dependency only orders READ->WRITE. > > > > > > > > If so, do we also need to take the following pairing into consideration? > > > > > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > > > > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > > > case needs to be locally transitive. > > > > > > > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > > > lockless_dereference() right now, which means we used to think > > > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > > not be locally transitive? 
> > > > > > Well, the following ISA2 test is permitted on ARM: > > > > > > > > > P0: > > > Wx=1 > > > WyRel=1 // rcu_assign_pointer > > > > > > P1: > > > Ry=1 // rcu_dereference > > > > What if a dependency is added here? Same result? > > Right, that fixes it. So if we're only considering things like: > > rcu_dereference > > RELEASE > > then local transitivity should be preserved. Whew!!! ;-) > I think the same applies to <ctrl dependency>, which seems to match your later > example. Could you please check? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Ralf Baechle writes: > On Tue, Feb 02, 2016 at 02:54:06PM +, Måns Rullgård wrote: > >> We don't really have much choice given the reality of existing hardware. > > No, of course not - but I want us to discourage new weakly ordered > platforms as much as possible. Where do you draw the line though? You can't exactly go and impose some kind of global ordering in a multi-master system. -- Måns Rullgård
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 02:54:06PM +, Måns Rullgård wrote: > We don't really have much choice given the reality of existing hardware. No, of course not - but I want us to discourage new weakly ordered platforms as much as possible. Ralf
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Ralf Baechle writes: > On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > >> Memory ordering is confusing enough as it is. We should not make >> people worry more than they already have to. Strong rules are good. > > Confusing and the resulting bugs can be very hard to debug. > > One of the problems I've experienced is that Linux does support liberal > memory ordering, even as extreme as the Alpha. We don't really have much choice given the reality of existing hardware. -- Måns Rullgård
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. Confusing, and the resulting bugs can be very hard to debug. One of the problems I've experienced is that Linux does support liberal memory ordering, even as extreme as the Alpha. And every once in a while a hardware guy asks me if Linux supports their preferred variant of weak ordering, and answering honestly I have to say yes. Even if I advise against it in strong words, guess what will happen - another weakly ordered core implementation. To this point we've only fixed theoretical memory ordering issues on MIPS, but it's only a matter of time until we get bitten by a bug that does actual damage. SGI and a few others have managed to build large systems with significant memory latencies and still get decent performance with strong ordering. Ralf
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:20:25PM +, Will Deacon wrote: [...] > > > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > > not be locally transitive? > > > > > > Well, the following ISA2 test is permitted on ARM: > > > > > > > > > P0: > > > Wx=1 > > > WyRel=1 // rcu_assign_pointer > > > > > > P1: > > > Ry=1 // rcu_dereference > > > > What if a dependency is added here? Same result? > > Right, that fixes it. So if we're only considering things like: > > rcu_dereference > > RELEASE > > then local transitivity should be preserved. > > I think the same applies to <ctrl dependency>, which seems to match your later > example. > Thank you ;-) Now I understand why you want those pairings to be locally transitive chain terminations: they have more subtle requirements to extend locally transitive chains and slightly different behaviors on different architectures. It's better for us to put them aside until we figure out those subtle requirements and different behaviors. Regards, Boqun > Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 08:12:30PM +0800, Boqun Feng wrote: > On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > Locally transitive chain termination: > > > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > > > Agreed. > > > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > > o smp_store_release() -> READ_ONCE(); if > > > > > > Just want to make sure, this one is actually: > > > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > > > right? Because control dependency only orders READ->WRITE. > > > > > > If so, do we also need to take the following pairing into consideration? > > > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > > case needs to be locally transitive. > > > > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > > lockless_dereference() right now, which means we used to think > > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > > > Besides, Will, what's the reason of having a locally transitive chain > > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > > not be locally transitive? 
> > > > Well, the following ISA2 test is permitted on ARM: > > > > > > P0: > > Wx=1 > > WyRel=1 // rcu_assign_pointer > > > > P1: > > Ry=1 // rcu_dereference > > What if a dependency is added here? Right, that fixes it. So if we're only considering things like:

rcu_dereference
<dependency>
RELEASE

then local transitivity should be preserved. I think the same applies to <ctrl dependency>, which seems to match your later example. Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 11:45:59AM +, Will Deacon wrote: > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > Locally transitive chain termination: > > > > > > > > (i.e. these can't be used to extend a chain) > > > > > > Agreed. > > > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > > o smp_store_release() -> READ_ONCE(); if > > > > Just want to make sure, this one is actually: > > > > o smp_store_release() -> READ_ONCE(); if ; > > > > right? Because control dependency only orders READ->WRITE. > > > > If so, do we also need to take the following pairing into consideration? > > > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > > > > I am OK with the first and last, but I believe that the middle one > > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > > case needs to be locally transitive. > > > > > > > Hmm... I don't think we should differ rcu_dereference() and > > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > > lockless_dereference() right now, which means we used to think > > rcu_dereference() and lockless_dereference() are interchangeable, right? > > > > Besides, Will, what's the reason of having a locally transitive chain > > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > > not be locally transitive? > > Well, the following ISA2 test is permitted on ARM: > > > P0: > Wx=1 > WyRel=1 // rcu_assign_pointer > > P1: > Ry=1 // rcu_dereference What if a dependency is added here? Same result? 
> WzRel=1 // rcu_assign_pointer > > P2: > Rz=1 // rcu_dereference > > Rx=0 > > > Make one of the rcu_dereferences an ACQUIRE and the behaviour is > forbidden. > > Paul: what use-cases did you have in mind? > > Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds > wrote: > > > > So we *absolutely* should say that *OF COURSE* these things work: > > > > - CPU A: > > > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > > > - CPU B: > > > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > > initialized > > That's a bad example, btw. I shouldn't have made it be a "pointer", > because then we get the whole address dependency chain ordering > anyway. > > So instead of "ptr", read "state flag". It might just be an "int" that > says "data has been initialized". > > So > > .. initialize memory .. > smp_wmb(); > WRITE_ONCE(&is_initialized, 1); > > should pair with > > if (smp_load_acquire(&is_initialized)) > ... we can read and write the data, knowing it has been initialized .. > > exactly because "smp_wmb()" (cheap write barrier) might be cheaper > than "smp_store_release()" (expensive full barrier) and thus > preferred. > > So mixing ordering metaphors actually does make sense, and should be > entirely well-defined. I don't believe that anyone is arguing that this particular example should not work the way that you want it to. > There's likely less reason to do it the other way (ie > "smp_store_release()" on one side pairing with "LOAD_ONCE() + > smp_rmb()" on the other) since there likely isn't the same kind of > performance reason for that pairing. But even if we would never > necessarily want to do it, I think our memory ordering rules would be > *much* better for strongly stating that it has to work, than being > timid and trying to make the rules weak. > > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. 
The sorts of things I am really worried about are abominations like this (and far worse):

void thread0(void)
{
	r1 = smp_load_acquire(&a);
	smp_store_release(&b, 1);
}

void thread1(void)
{
	r2 = smp_load_acquire(&b);
	smp_store_release(&c, 1);
}

void thread2(void)
{
	WRITE_ONCE(c, 2);
	smp_mb();
	r3 = READ_ONCE(d);
}

void thread3(void)
{
	WRITE_ONCE(d, 1);
	smp_store_release(&a, 1);
}

r1 == 1 && r2 == 1 && c == 2 && r3 == 0 ???

I advise discouraging this sort of thing. But it is your kernel, so what is your preference?

Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > Locally transitive chain termination: > > > > > > (i.e. these can't be used to extend a chain) > > > > Agreed. > > > > > > o smp_store_release() -> lockless_dereference() (???) > > > > o rcu_assign_pointer() -> rcu_dereference() > > > > o smp_store_release() -> READ_ONCE(); if > > Just want to make sure, this one is actually: > > o smp_store_release() -> READ_ONCE(); if ; > > right? Because control dependency only orders READ->WRITE. > > If so, do we also need to take the following pairing into consideration? > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > > > > I am OK with the first and last, but I believe that the middle one > > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > > case needs to be locally transitive. > > > > Hmm... I don't think we should differ rcu_dereference() and > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > lockless_dereference() right now, which means we used to think > rcu_dereference() and lockless_dereference() are interchangeable, right? > > Besides, Will, what's the reason of having a locally transitive chain > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > not be locally transitive? Well, the following ISA2 test is permitted on ARM:

P0:
Wx=1
WyRel=1  // rcu_assign_pointer

P1:
Ry=1     // rcu_dereference
WzRel=1  // rcu_assign_pointer

P2:
Rz=1     // rcu_dereference
Rx=0

Make one of the rcu_dereferences an ACQUIRE and the behaviour is forbidden.

Paul: what use-cases did you have in mind?

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hello Linus, On Tue, Feb 02, 2016 at 12:19:04AM -0800, Linus Torvalds wrote: > On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds > wrote: > > > > So we *absolutely* should say that *OF COURSE* these things work: > > > > - CPU A: > > > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > > > - CPU B: > > > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > > initialized > > That's a bad example, btw. I shouldn't have made it be a "pointer", > because then we get the whole address dependency chain ordering > anyway. > > So instead of "ptr", read "state flag". It might just be an "int" that > says "data has been initialized". > > So > > .. initialize memory .. > smp_wmb(); > WRITE_ONCE(&is_initialized, 1); > > should pair with > > if (smp_load_acquire(&is_initialized)) > ... we can read and write the data, knowing it has been initialized .. > > exactly because "smp_wmb()" (cheap write barrier) might be cheaper > than "smp_store_release()" (expensive full barrier) and thus > preferred. > Just to be clear, what Will, Paul and I are discussing here is local transitivity, which refers to something like the following example (a, b and is_initialized are all initially zero):

P0:
WRITE_ONCE(a, 1);
smp_store_release(&is_initialized, 1);

P1:
r1 = smp_load_acquire(&is_initialized);
smp_store_release(&b, 1);

P2:
r2 = smp_load_acquire(&b);
r3 = READ_ONCE(a);

In this case, r1 == 1 && r2 == 1 && r3 == 0 cannot happen, because RELEASE+ACQUIRE pairs guarantee local transitivity. More on local transitivity: http://article.gmane.org/gmane.linux.kernel.virtualization/26856

And what I'm asking here is about something like the following example (a, b and is_initialized are all initially zero):

P0:
WRITE_ONCE(a, 1);
smp_store_release(&is_initialized, 1);

P1:
if (r1 = READ_ONCE(is_initialized))
	smp_store_release(&b, 1);

P2:
if (r2 = READ_ONCE(b)) {
	smp_rmb();
	r3 = READ_ONCE(a);
}

In this case, can r1 == 1 && r2 == 1 && r3 == 0 happen?
Please note this example is about two questions on local transitivity:

1. Could "READ_ONCE(); if" extend a locally transitive chain?
2. Could "READ_ONCE(); if; smp_rmb()" at least be a locally transitive chain termination?

> So mixing ordering metaphors actually does make sense, and should be > entirely well-defined. > I think Paul does agree that smp_{r,w}mb() with appropriate memory operations around could pair with smp_store_release() or smp_load_acquire(). Hope I didn't misunderstand any of you or make you misunderstand each other. Regards, Boqun > There's likely less reason to do it the other way (ie > "smp_store_release()" on one side pairing with "LOAD_ONCE() + > smp_rmb()" on the other) since there likely isn't the same kind of > performance reason for that pairing. But even if we would never > necessarily want to do it, I think our memory ordering rules would be > *much* better for strongly stating that it has to work, than being > timid and trying to make the rules weak. > > Memory ordering is confusing enough as it is. We should not make > people worry more than they already have to. Strong rules are good. > > Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 2, 2016 at 12:07 AM, Linus Torvalds wrote: > > So we *absolutely* should say that *OF COURSE* these things work: > > - CPU A: > >.. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); > > - CPU B: > >smp_load_acquire(ptr) - we can rely on things behind "ptr" being > initialized That's a bad example, btw. I shouldn't have made it be a "pointer", because then we get the whole address dependency chain ordering anyway. So instead of "ptr", read "state flag". It might just be an "int" that says "data has been initialized". So

	.. initialize memory ..
	smp_wmb();
	WRITE_ONCE(&is_initialized, 1);

should pair with

	if (smp_load_acquire(&is_initialized))
		... we can read and write the data, knowing it has been initialized ..

exactly because "smp_wmb()" (cheap write barrier) might be cheaper than "smp_store_release()" (expensive full barrier) and thus preferred. So mixing ordering metaphors actually does make sense, and should be entirely well-defined. There's likely less reason to do it the other way (ie "smp_store_release()" on one side pairing with "LOAD_ONCE() + smp_rmb()" on the other) since there likely isn't the same kind of performance reason for that pairing. But even if we would never necessarily want to do it, I think our memory ordering rules would be *much* better for strongly stating that it has to work, than being timid and trying to make the rules weak. Memory ordering is confusing enough as it is. We should not make people worry more than they already have to. Strong rules are good. Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Mon, Feb 1, 2016 at 10:44 PM, Paul E. McKenney wrote: > On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: >> Hi Paul, >> If so, do we also need to take the following pairing into consideration? >> >> o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); > > It looks like we will be restricing smp_rmb() and smp_wmb() to pairwise > scenarios only. So no transitivity in any scenarios involving these > two primitives. NO! Stop this idiocy now, Paul. We are not trying to make the kernel memory ordering rules pander to shit. > But where there is uncertainty, we should -not- assume ordering. It is > easy to tighten the rules later, but hard to loosen them. Bullshit. The rules should absolutely be as tight as at all humanly possible, given real hardware constraints. It's also entirely wrong to say that "we can tighten the rules later". No way. Because before those rules get tightened, we'll see confused people who then try to "fix" code that doesn't want fixing, and adding new odd constructs. We already had that with the whole bogus "control dependency" shit. That crap came exactly from the fact that people thought it was a good idea to make weak rules. So you had better re-think the whole thing. I refuse to pander to the idiotic and wrongheaded "weak memory ordering is good" braindamage. Weak memory ordering is *not* good. It's shit. It's hard to think about, and people won't even realize that the assumptions they make (unconsciously!) may be wrong. So the memory ordering rules should be as strong as at all possible, and should be as intuitive as possible, and follow peoples expectations. And quite frankly, if some crazy shit-for-brains architecture gets its memory ordering wrong (like alpha did), then that crazy architecture should pay the price. There is absolutely _zero_ valid reason why smp_store_release should not pair fine with a smp_rmb() on the other side. Can you actually point to a relevant architecture that doesn't do that? 
I can't even imagine how you'd make that pairing not work. If the releasing store doesn't imply that it is ordered wrt previous stores, then it damn well isn't a "release". And if the other side does a "smp_rmb()", then the loads are ordered on the other side, so it had damn well better just work. How could it not work? So I can't even see how your "we won't guarantee" that wouldn't work. It sounds insane. But more importantly, your whole notion of "let's make things as weak as possible" is *wrong*. That is NOT AT ALL what we should strive for. We should strive for the strictest possible memory ordering that we can get away with, and have as few "undefined" cases as at all possible. So we *absolutely* should say that *OF COURSE* these things work: - CPU A: .. initialize data structure -> smp_wmb() -> WRITE_ONCE(ptr); - CPU B: smp_load_acquire(ptr) - we can rely on things behind "ptr" being initialized and that the above mirror of that (ie smp_store_release paired with READ_ONCE+smp_rmb) also works. There are often reasons to prefer one model over the other, and sometimes the reasons might argue for mixed access models. For example, we might decide that the producer uses "smp_wmb()", because that is universally fairly fast, and avoids a full memory barrier on those architectures that don't really have "release" semantics natively. But at the same time the _consumer_ side might want a "smp_load_acquire()" simply because it requires subsequent loads and stores to be ordered if it sees that state (see for example our "sk_state_store/load()" cases in networking - right now those pair with release/acquire, but I suspect that "release" really could be a "smp_wmb() + WRITE_ONCE()" instead, which could be faster on some architectures).. Because I really don't think there is any sane reason why that kind of mixed access shouldn't work.
If those pairings don't work, there's something wrong with the architecture, and the architecture will need to do whatever barriers it needs to make it work in its store-release/load-acquire so that it pairs with smp_wmb/rmb. And if there against all sanity really is some crazy reason why that pairing can be problematic, then that insane reason needs to be extensively documented. Because that reason is so insane that it needs a big honking description, so that people are aware of it. But at no point should the logic be "let's leave it weakly defined". Because at that point the documentation is actively _worse_ than not having documentation at all, and just letting sanity prevail. Linus
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Tue, Feb 02, 2016 at 01:19:04PM +0800, Boqun Feng wrote: > Hi Paul, > > On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > > > > > [ . . . ] > > > > > > > > > > For Linux in general, this is a question: How strict do we want to > > > > > > be > > > > > > about matching the type of write with the corresponding read? My > > > > > > default approach is to initially be quite strict and loosen as > > > > > > needed. > > > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > > > tight, but it makes a good example.) > > > > > > > > > > > > Thoughts? > > > > > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > > > where local transitivity isn't required. However, I don't think we can > > > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > > > already in things like the osq_lock (albeit without the address > > > > > dependency). > > > > > > > > Agreed. So in the most strict case that I can imagine anyone putting > > > > up with, we have the following pairings: > > > > > > I think we can group these up: > > > > > > Locally transitive: > > > > > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > > > > > Locally transitive chain termination: > > > > > > (i.e. these can't be used to extend a chain) > > > > Agreed. > > > > > > o smp_store_release() -> lockless_dereference() (???) 
> > > > o rcu_assign_pointer() -> rcu_dereference() > > > > o smp_store_release() -> READ_ONCE(); if > > Just want to make sure, this one is actually: > > o smp_store_release() -> READ_ONCE(); if ; > > right? Because control dependency only orders READ->WRITE. Yep! > If so, do we also need to take the following pairing into consideration? > > o smp_store_release() -> READ_ONCE(); if ;smp_rmb(); It looks like we will be restricting smp_rmb() and smp_wmb() to pairwise scenarios only. So no transitivity in any scenarios involving these two primitives. > > I am OK with the first and last, but I believe that the middle one > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > case needs to be locally transitive. > > Hmm... I don't think we should differ rcu_dereference() and > lockless_dereference(). One reason: list_for_each_entry_rcu() are using > lockless_dereference() right now, which means we used to think > rcu_dereference() and lockless_dereference() are interchangeable, right? They use the same instructions, if that is what you mean, so intuitively they should behave the same. I don't feel all that strongly either way. But where there is uncertainty, we should -not- assume ordering. It is easy to tighten the rules later, but hard to loosen them. That might change if tools appear that automatically analyze usage patterns in raw code, but we do not yet have such tools. Thanx, Paul > Besides, Will, what's the reason of having a locally transitive chain > termination? Because on some architectures RELEASE->DEPENDENCY pairs may > not be locally transitive? > > Regards, > Boqun > > > > Globally transitive: > > > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally > > > > transitive) > > > > > > RCU: > > > > > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > > > (strange and wonderful properties) > > > > Agreed. 
> > > > > > Seem reasonable, or am I missing some? > > > > > > Looks alright to me. > > > > So I have some litmus tests to generate. ;-) > > > > Thanx, Paul > >
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Paul, On Mon, Feb 01, 2016 at 07:54:58PM -0800, Paul E. McKenney wrote: > On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > > > [ . . . ] > > > > > > > > For Linux in general, this is a question: How strict do we want to be > > > > > about matching the type of write with the corresponding read? My > > > > > default approach is to initially be quite strict and loosen as needed. > > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > > tight, but it makes a good example.) > > > > > > > > > > Thoughts? > > > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > > where local transitivity isn't required. However, I don't think we can > > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > > already in things like the osq_lock (albeit without the address > > > > dependency). > > > > > > Agreed. So in the most strict case that I can imagine anyone putting > > > up with, we have the following pairings: > > > > I think we can group these up: > > > > Locally transitive: > > > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > > > Locally transitive chain termination: > > > > (i.e. these can't be used to extend a chain) > > Agreed. > > > > o smp_store_release() -> lockless_dereference() (???) > > > o rcu_assign_pointer() -> rcu_dereference() > > > o smp_store_release() -> READ_ONCE(); if Just want to make sure, this one is actually: o smp_store_release() -> READ_ONCE(); if ; right? 
Because control dependency only orders READ->WRITE. If so, do we also need to take the following pairing into consideration? o smp_store_release() -> READ_ONCE(); if <ctrl-dep>; smp_rmb(); > > I am OK with the first and last, but I believe that the middle one > has real use cases. So the rcu_assign_pointer() -> rcu_dereference() > case needs to be locally transitive. > Hmm... I don't think we should differentiate rcu_dereference() and lockless_dereference(). One reason: list_for_each_entry_rcu() is using lockless_dereference() right now, which means we used to think rcu_dereference() and lockless_dereference() are interchangeable, right? Besides, Will, what's the reason for having a locally transitive chain termination? Because on some architectures RELEASE->DEPENDENCY pairs may not be locally transitive? Regards, Boqun > > Globally transitive: > > > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > > > RCU: > > > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > > (strange and wonderful properties) > > Agreed. > > > > Seem reasonable, or am I missing some? > > > > Looks alright to me. > > So I have some litmus tests to generate. ;-) > > Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Mon, Feb 01, 2016 at 01:56:22PM +, Will Deacon wrote: > On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > > > [ . . . ] > > > > > > For Linux in general, this is a question: How strict do we want to be > > > > about matching the type of write with the corresponding read? My > > > > default approach is to initially be quite strict and loosen as needed. > > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > > the write and rcu_dereference() for the read, as opposed to (say) > > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > > tight, but it makes a good example.) > > > > > > > > Thoughts? > > > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > > rcu_dereference to be used as drop-in replacements for release/acquire > > > where local transitivity isn't required. However, I don't think we can > > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > > already in things like the osq_lock (albeit without the address > > > dependency). > > > > Agreed. So in the most strict case that I can imagine anyone putting > > up with, we have the following pairings: > > I think we can group these up: > > Locally transitive: > > > o smp_store_release() -> smp_load_acquire() (locally transitive) > > Locally transitive chain termination: > > (i.e. these can't be used to extend a chain) Agreed. > > o smp_store_release() -> lockless_dereference() (???) > > o rcu_assign_pointer() -> rcu_dereference() > > o smp_store_release() -> READ_ONCE(); if I am OK with the first and last, but I believe that the middle one has real use cases. So the rcu_assign_pointer() -> rcu_dereference() case needs to be locally transitive. 
> Globally transitive: > > > o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) > > RCU: > > > o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() > > (strange and wonderful properties) Agreed. > > Seem reasonable, or am I missing some? > > Looks alright to me. So I have some litmus tests to generate. ;-) Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Fri, Jan 29, 2016 at 02:22:53AM -0800, Paul E. McKenney wrote: > On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > > [ . . . ] > > > > For Linux in general, this is a question: How strict do we want to be > > > about matching the type of write with the corresponding read? My > > > default approach is to initially be quite strict and loosen as needed. > > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > > the write and rcu_dereference() for the read, as opposed to (say) > > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > > tight, but it makes a good example.) > > > > > > Thoughts? > > > > That sounds broadly sensible to me and allows rcu_assign_pointer and > > rcu_dereference to be used as drop-in replacements for release/acquire > > where local transitivity isn't required. However, I don't think we can > > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > > already in things like the osq_lock (albeit without the address > > dependency). > > Agreed. So in the most strict case that I can imagine anyone putting > up with, we have the following pairings: I think we can group these up:

Locally transitive:

> o smp_store_release() -> smp_load_acquire() (locally transitive)

Locally transitive chain termination:
(i.e. these can't be used to extend a chain)

> o smp_store_release() -> lockless_dereference() (???)
> o rcu_assign_pointer() -> rcu_dereference()
> o smp_store_release() -> READ_ONCE(); if <ctrl-dep>

Globally transitive:

> o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive)
> o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive)

RCU:

> o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE()
>   (strange and wonderful properties)

> Seem reasonable, or am I missing some?

Looks alright to me.

Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Fri, Jan 29, 2016 at 09:59:59AM +, Will Deacon wrote: > On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: [ . . . ] > > For Linux in general, this is a question: How strict do we want to be > > about matching the type of write with the corresponding read? My > > default approach is to initially be quite strict and loosen as needed. > > Here "quite strict" might mean requiring an rcu_assign_pointer() for > > the write and rcu_dereference() for the read, as opposed to (say) > > ACCESS_ONCE() for the read. (I am guessing that this would be too > > tight, but it makes a good example.) > > > > Thoughts? > > That sounds broadly sensible to me and allows rcu_assign_pointer and > rcu_dereference to be used as drop-in replacements for release/acquire > where local transitivity isn't required. However, I don't think we can > rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used > already in things like the osq_lock (albeit without the address > dependency). Agreed. So in the most strict case that I can imagine anyone putting up with, we have the following pairings: o smp_store_release() -> smp_load_acquire() (locally transitive) o smp_store_release() -> lockless_dereference() (???) o smp_store_release() -> READ_ONCE(); if o rcu_assign_pointer() -> rcu_dereference() o smp_mb(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) o synchronize_rcu(); WRITE_ONCE() -> READ_ONCE(); (globally transitive) o synchronize_rcu(); WRITE_ONCE() -> rcu_read_lock(); READ_ONCE() (strange and wonderful properties) Seem reasonable, or am I missing some? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Paul, On Thu, Jan 28, 2016 at 02:31:31PM -0800, Paul E. McKenney wrote: > On Thu, Jan 28, 2016 at 09:57:19AM +, Will Deacon wrote: > > On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > > > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > Yes, sorry for the shorthand: > > > > - Each paragraph is a separate thread > > - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 > > - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) > > returns 1 > > - Everything is initially zero > > > > > > and I suppose a variant of that: > > > > > > > > > > > > Wx=1 > > > > WyRel=1 > > > > > > > > RyAcq=1 > > > > Wz=1 > > > > > > > > Rz=1 > > > > > > > > Rx=0 > > > > > > Agreed, this would be needed as well, along with the read-read and > > > read-write variants. I picked the write-read version (Will's first > > > test above) because write-read reordering is the most likely on > > > hardware that I am aware of. > > > > Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would > > it then be forbidden? > > On Power, yes. I would guess on ARM as well. Indeed. > For Linux in general, this is a question: How strict do we want to be > about matching the type of write with the corresponding read? My > default approach is to initially be quite strict and loosen as needed. > Here "quite strict" might mean requiring an rcu_assign_pointer() for > the write and rcu_dereference() for the read, as opposed to (say) > ACCESS_ONCE() for the read. (I am guessing that this would be too > tight, but it makes a good example.) > > Thoughts? That sounds broadly sensible to me and allows rcu_assign_pointer and rcu_dereference to be used as drop-in replacements for release/acquire where local transitivity isn't required. However, I don't think we can rule out READ_ONCE/WRITE_ONCE interactions as they seem to be used already in things like the osq_lock (albeit without the address dependency). Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Jan 28, 2016 at 09:57:19AM +, Will Deacon wrote: > On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC > > > > > 16)? > > > > > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > > > to ensure globally transitive unlock->lock ordering should you decide > > > > > to relax your locking barriers. > > > > > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > > > cpu litmus test that goes with it. > > > > > > Sure, I was building up to that. I just wanted to make sure the basics > > > were there (program-order, so same CPU) before we go any further. It > > > sounds like they are, so that's promising. > > > > > > > Maciej, the tricky point is what, if any, effect the > > > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > > > replace with the RELEASE+ACQUIRE pair. > > > > > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > > > MIPS64r6 manual but failed to reach a conclusion on this. > > > > > > For the inter-thread case, Paul had a previous example along the lines > > > of: > > > > > > > > > Wx=1 > > > WyRel=1 > > > > > > RyAcq=1 > > > Rz=0 > > > > > > Wz=1 > > > smp_mb() > > > Rx=0 > > > > Each paragraph being a separate thread, correct? If so, agreed. 
> > Yes, sorry for the shorthand: > > - Each paragraph is a separate thread > - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 > - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) > returns 1 > - Everything is initially zero > > > > and I suppose a variant of that: > > > > > > > > > Wx=1 > > > WyRel=1 > > > > > > RyAcq=1 > > > Wz=1 > > > > > > Rz=1 > > > > > > Rx=0 > > > > Agreed, this would be needed as well, along with the read-read and > > read-write variants. I picked the write-read version (Will's first > > test above) because write-read reordering is the most likely on > > hardware that I am aware of. > > Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would > it then be forbidden? On Power, yes. I would guess on ARM as well. For Linux in general, this is a question: How strict do we want to be about matching the type of write with the corresponding read? My default approach is to initially be quite strict and loosen as needed. Here "quite strict" might mean requiring an rcu_assign_pointer() for the write and rcu_dereference() for the read, as opposed to (say) ACCESS_ONCE() for the read. (I am guessing that this would be too tight, but it makes a good example.) Thoughts? Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:38:36PM -0800, Paul E. McKenney wrote: > On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > > to ensure globally transitive unlock->lock ordering should you decide > > > > to relax your locking barriers. > > > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > > cpu litmus test that goes with it. > > > > Sure, I was building up to that. I just wanted to make sure the basics > > were there (program-order, so same CPU) before we go any further. It > > sounds like they are, so that's promising. > > > > > Maciej, the tricky point is what, if any, effect the > > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > > replace with the RELEASE+ACQUIRE pair. > > > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > > MIPS64r6 manual but failed to reach a conclusion on this. > > > > For the inter-thread case, Paul had a previous example along the lines > > of: > > > > > > Wx=1 > > WyRel=1 > > > > RyAcq=1 > > Rz=0 > > > > Wz=1 > > smp_mb() > > Rx=0 > > Each paragraph being a separate thread, correct? If so, agreed. 
Yes, sorry for the shorthand: - Each paragraph is a separate thread - Wx=1 means WRITE_ONCE(x, 1), Rx=1 means READ_ONCE(x) returns 1 - WxRel means smp_store_release(x,1), RxAcq=1 means smp_load_acquire(x) returns 1 - Everything is initially zero > > and I suppose a variant of that: > > > > > > Wx=1 > > WyRel=1 > > > > RyAcq=1 > > Wz=1 > > > > Rz=1 > > > > Rx=0 > > Agreed, this would be needed as well, along with the read-read and > read-write variants. I picked the write-read version (Will's first > test above) because write-read reordering is the most likely on > hardware that I am aware of. Question: if you replaced "Wz=1" with "WzRel=1" in my second test, would it then be forbidden? Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Maciej, On Wed, Jan 27, 2016 at 12:41:29PM +, Maciej W. Rozycki wrote: > On Wed, 27 Jan 2016, Will Deacon wrote: > > > Overall I think it should be safe after all to use SYNC_RELEASE and > > > other > > > modern lightweight barriers unconditionally under the assumption that > > > architecture was meant to remain backward compatible. Even though it > > > might be possible someone would implement unusual semantics for the then > > > undefined `stype' values, I highly doubt it as it would be extra effort > > > and hardware logic space for no gain. We could try and reach > > > architecture > > > overseers to double-check whether the `stype' encodings, somewhat > > > irregularly distributed, were indeed defined in a manner so as not to > > > clash with values implementers chose to use before rev. 2.61 of the > > > architecture specification. > > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > By my reading of architecture specifications it does. Specifically > SYNC_RELEASE (18) applies to older loads and stores, and newer stores, and > SYNC_ACQUIRE (17) applies to older loads, and newer loads and stores. So > the two combined ought to be the equivalent to SYNC_MB (16), which applies > to both older and newer loads and stores. Of course care has to be taken

Hmm.. so the following reordering couldn't happen?

Program order:

	LOAD A
	SYNC_RELEASE
	STORE B
	LOAD C
	SYNC_ACQUIRE
	LOAD D

First becomes:

	LOAD C		<- SYNC_RELEASE doesn't order newer loads.
	LOAD A
	SYNC_RELEASE
	STORE B
	SYNC_ACQUIRE
	LOAD D

And then becomes:

	LOAD C		<- SYNC_ACQUIRE still affects these loads.
	LOAD D		<- SYNC_RELEASE doesn't order newer loads.
	LOAD A
	SYNC_RELEASE
	STORE B
	SYNC_ACQUIRE

This doesn't mean that the SYNC instructions themselves can be reordered; it means that the reordering doesn't break SYNC_ACQUIRE's guarantee. I ask this because some architectures (e.g. PPC) allow this kind of reordering. 
Please see "ACQUIRING FUNCTIONS" in memory-barriers.txt for more information. Thank you ;-) Regards, Boqun > about what happens between SYNC_RELEASE and SYNC_ACQUIRE. This is still > more lightweight than classic SYNC (0). See the architecture documents, > e.g. the MIPS32 one[1] for details. > > References: > > [1] "MIPS Architecture For Programmers, Volume II-A: The MIPS32 > Instruction Set", MIPS Technologies, Inc., Document Number: MD00086, > Revision 5.04, December 11, 2013, Table 4.7 "Encodings of the > Bits[10:6] of the SYNC instruction; the SType Field", p. 305 > > HTH, > > Maciej
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:21:58PM +, Will Deacon wrote: > On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > > to ensure globally transitive unlock->lock ordering should you decide > > > to relax your locking barriers. > > > > You know that is a tricky question. Maybe its easier if you give the 3 > > cpu litmus test that goes with it. > > Sure, I was building up to that. I just wanted to make sure the basics > were there (program-order, so same CPU) before we go any further. It > sounds like they are, so that's promising. > > > Maciej, the tricky point is what, if any, effect the > > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > > the TRANSITIVITY section in Documentation/memory-barriers.txt and > > replace with the RELEASE+ACQUIRE pair. > > > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > > MIPS64r6 manual but failed to reach a conclusion on this. > > For the inter-thread case, Paul had a previous example along the lines > of: > > > Wx=1 > WyRel=1 > > RyAcq=1 > Rz=0 > > Wz=1 > smp_mb() > Rx=0 Each paragraph being a separate thread, correct? If so, agreed. > and I suppose a variant of that: > > > Wx=1 > WyRel=1 > > RyAcq=1 > Wz=1 > > Rz=1 > > Rx=0 Agreed, this would be needed as well, along with the read-read and read-write variants. I picked the write-read version (Will's first test above) because write-read reordering is the most likely on hardware that I am aware of. Thanx, Paul
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 03:54:21PM +0100, Peter Zijlstra wrote: > On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > > to ensure globally transitive unlock->lock ordering should you decide > > to relax your locking barriers. > > You know that is a tricky question. Maybe its easier if you give the 3 > cpu litmus test that goes with it. Sure, I was building up to that. I just wanted to make sure the basics were there (program-order, so same CPU) before we go any further. It sounds like they are, so that's promising. > Maciej, the tricky point is what, if any, effect the > SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review > the TRANSITIVITY section in Documentation/memory-barriers.txt and > replace with the RELEASE+ACQUIRE pair. > > We've all (Will, Paul and me) had much 'fun' trying to decipher the > MIPS64r6 manual but failed to reach a conclusion on this. For the inter-thread case, Paul had a previous example along the lines of: Wx=1 WyRel=1 RyAcq=1 Rz=0 Wz=1 smp_mb() Rx=0 and I suppose a variant of that: Wx=1 WyRel=1 RyAcq=1 Wz=1 Rz=1 Rx=0 Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, Jan 27, 2016 at 11:43:48AM +, Will Deacon wrote: > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? > > If not, you may need to implement smp_mb__after_unlock_lock for RCU > to ensure globally transitive unlock->lock ordering should you decide > to relax your locking barriers. You know that is a tricky question. Maybe its easier if you give the 3 cpu litmus test that goes with it. Maciej, the tricky point is what, if any, effect the SYNC_RELEASE+SYNC_ACQUIRE pair has on an unrelated CPU. Please review the TRANSITIVITY section in Documentation/memory-barriers.txt and replace with the RELEASE+ACQUIRE pair. We've all (Will, Paul and me) had much 'fun' trying to decipher the MIPS64r6 manual but failed to reach a conclusion on this.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Wed, 27 Jan 2016, Will Deacon wrote: > > Overall I think it should be safe after all to use SYNC_RELEASE and other > > modern lightweight barriers unconditionally under the assumption that > > architecture was meant to remain backward compatible. Even though it > > might be possible someone would implement unusual semantics for the then > > undefined `stype' values, I highly doubt it as it would be extra effort > > and hardware logic space for no gain. We could try and reach architecture > > overseers to double-check whether the `stype' encodings, somewhat > > irregularly distributed, were indeed defined in a manner so as not to > > clash with values implementers chose to use before rev. 2.61 of the > > architecture specification. > > Do you know whether a SYNC 18 (RELEASE) followed in program order by a > SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? By my reading of architecture specifications it does. Specifically SYNC_RELEASE (18) applies to older loads and stores, and newer stores, and SYNC_ACQUIRE (17) applies to older loads, and newer loads and stores. So the two combined ought to be equivalent to SYNC_MB (16), which applies to both older and newer loads and stores. Of course care has to be taken about what happens between SYNC_RELEASE and SYNC_ACQUIRE. This is still more lightweight than classic SYNC (0). See the architecture documents, e.g. the MIPS32 one[1] for details. References: [1] "MIPS Architecture For Programmers, Volume II-A: The MIPS32 Instruction Set", MIPS Technologies, Inc., Document Number: MD00086, Revision 5.04, December 11, 2013, Table 4.7 "Encodings of the Bits[10:6] of the SYNC instruction; the SType Field", p. 305 HTH, Maciej
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Hi Maciej, Thanks for digging all this up. On Wed, Jan 27, 2016 at 09:57:24AM +, Maciej W. Rozycki wrote: > On Thu, 12 Nov 2015, David Daney wrote: > > > > > Certainly we can load up the code with "SYNC" all over the place, but > > > > it will kill performance on SMP systems. So, my vote would be to make > > > > it as light weight as possible, but no lighter. That will mean > > > > inventing the proper barrier primitives. > > > > > > It seems to me that the proper barrier here is a "SYNC 18" aka > > > SYNC_RELEASE instruction, at least on CPUs that implement that variant. > > For the record, we've had "cooked" aliases in the toolchain for a short > while now -- since Sep 2010 or binutils 2.21 -- so for readability you can > actually use `sync_release' in your source code rather than obscure `sync > 18' (of course you could define a macro instead, but there's no need now), > and disassembly will show the "cooked" mnemonic too. > > Although Documentation/Changes still lists binutils 2.12 as the minimum, > so perhaps using macros is indeed the way to go now, at least for the time > being. > > > Yes, unfortunately very few CPUs implement that. It is an instruction that > > MIPS invented only recently, so older CPUs need a different solution. > > Hmm, it looks to me we might actually be safe, although as often the > situation seems more complicated than it had to be. [... trim ISA archaeology ...] > Overall I think it should be safe after all to use SYNC_RELEASE and other > modern lightweight barriers uncondtionally under the assumption that > architecture was meant to remain backward compatible. Even though it > might be possible someone would implement unusual semantics for the then > undefined `stype' values, I highly doubt it as it would be extra effort > and hardware logic space for no gain. 
We could try and reach architecture > overseers to double-check whether the `stype' encodings, somewhat > irregularly distributed, were indeed defined in a manner so as not to > clash with values implementers chose to use before rev. 2.61 of the > architecture specification. Do you know whether a SYNC 18 (RELEASE) followed in program order by a SYNC 17 (ACQUIRE) creates a full barrier (i.e. something like SYNC 16)? If not, you may need to implement smp_mb__after_unlock_lock for RCU to ensure globally transitive unlock->lock ordering should you decide to relax your locking barriers. Will
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, 12 Nov 2015, David Daney wrote: > > > Certainly we can load up the code with "SYNC" all over the place, but > > > it will kill performance on SMP systems. So, my vote would be to make > > > it as light weight as possible, but no lighter. That will mean > > > inventing the proper barrier primitives. > > > > It seems to me that the proper barrier here is a "SYNC 18" aka > > SYNC_RELEASE instruction, at least on CPUs that implement that variant. For the record, we've had "cooked" aliases in the toolchain for a short while now -- since Sep 2010 or binutils 2.21 -- so for readability you can actually use `sync_release' in your source code rather than obscure `sync 18' (of course you could define a macro instead, but there's no need now), and disassembly will show the "cooked" mnemonic too. Although Documentation/Changes still lists binutils 2.12 as the minimum, so perhaps using macros is indeed the way to go now, at least for the time being. > Yes, unfortunately very few CPUs implement that. It is an instruction that > MIPS invented only recently, so older CPUs need a different solution. Hmm, it looks to me we might actually be safe, although as often the situation seems more complicated than it had to be. Conventional wisdom says that SYNC as the ultimate ordering barrier, aka SYNC 0, was added with the MIPS II ISA, with a provision to define less restrictive barriers in the future in a backward compatible manner, by the means of undefined (any non-zero at the time) barrier types defaulting to 0. Early references seem to have been lost in the mist of time, however a few legacy MIPS ISA documents remain, e.g. the MIPS IV ISA document says[1]: "The stype values 1-31 are reserved; they produce the same result as the value zero." making it clear that non-zero arguments will work as expected, albeit perhaps with a somewhat heavyweight effect. But there's sometimes no other way. This seems more ambiguous with earlier documentation available, e.g. 
the MIPS R4000 processor manual, which omits the mention of `stype' altogether and merely defines a single SYNC instruction encoding with all-zeros across bits 25:6 of the instruction word, among which `stype' normally lives[2]. This appears the same with other MIPS III processor documentation (e.g. IDT 79RV4700[3]). However I'm fairly sure all these simply did not bother decoding SYNC beyond the major and minor opcode, so again SYNC 0 semantics should be held across the more recently defined variants. I could actually try this sometime with an R4000 class processor. Modern MIPS architecture specifications started with the same definition as the MIPS IV ISA had, rev. 0.95 documents still stated[4][5]: "The stype values 1-31 are reserved; they produce the same result as the value zero." Unfortunately the requirement got weakened later on, rev. 1.00 architecture specifications now stated[6][7]: "The stype values 1-31 are reserved for future extensions to the architecture. A value of zero will always be defined such that it performs all defined synchronization operations. Non-zero values may be defined to remove some synchronization operations. As such, software should never use a non-zero value of the stype field, as this may inadvertently cause future failures if non-zero values remove synchronization operations." I think the intent was not to break backwards compatibility, and certainly anyone who looked at one of the earlier documents might have realised that implementing non-zero SYNC operations, that do not have a vendor-specific semantics, as aliases to SYNC 0 rather than NOP or RI triggers would be a good idea. However implementers may not have been able to infer that from reading the lone current revision of architecture documents. It was only with rev. 
2.60 of architecture specifications that, along with the new SYNC operations, the requirement for undefined SYNC operations to behave as SYNC 0 was put in the text back in an unambiguous form[8][9]: "A stype value of zero will always be defined such that it performs the most complete set of synchronization operations that are defined. This means stype zero always does a completion barrier that affects both loads and stores preceding the SYNC instruction and both loads and stores that are subsequent to the SYNC instruction. Non-zero values of stype may be defined by the architecture or specific implementations to perform synchronization behaviors that are less complete than that of stype zero. If an implementation does not use one of these non-zero values to define a different synchronization behavior, then that non-zero value of stype must act the same as stype zero completion barrier. This allows software written for an implementation with a lighter-weight barrier to work on another implementation which only implements the stype zero completion barrier." This definition has then been retained in the architecture specification throughout.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 11/12/2015 10:13 AM, Måns Rullgård wrote: David Daney writes: On 11/12/2015 04:31 AM, Peter Zijlstra wrote: Hi I think the MIPS arch_spin_unlock() is borken. spin_unlock() must have RELEASE semantics, these require that no LOADs nor STOREs leak out from the critical section. From what I know MIPS has a relaxed memory model which allows reads to pass stores, and as implemented arch_spin_unlock() only issues a wmb which doesn't order prior reads vs later stores. Therefore upgrade the wmb() to smp_mb(). (Also, why the unconditional wmb, as opposed to smp_wmb() ?) asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would imply that special handling for non-SMP is needed, when this is already only used for the SMP build case. Maybe-Signed-off-by: Peter Zijlstra (Intel) --- diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h index 40196bebe849..b2ca13f06152 100644 --- a/arch/mips/include/asm/spinlock.h +++ b/arch/mips/include/asm/spinlock.h @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) static inline void arch_spin_unlock(arch_spinlock_t *lock) { unsigned int serving_now = lock->h.serving_now + 1; - wmb(); + smp_mb(); That is too heavy. It implies a full MIPS "SYNC" operation which stalls execution until all previous writes are committed and globally visible. We really want just release semantics, and there is no standard named primitive that gives us that. For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: smp_wmb(); smp_rmb(); Which expands to exactly the same thing as wmb() because smp_rmb() expands to nothing. For CPUs that have out-of-order loads, smp_rmb() should expand to something lighter weight than "SYNC" Certainly we can load up the code with "SYNC" all over the place, but it will kill performance on SMP systems. So, my vote would be to make it as light weight as possible, but no lighter. That will mean inventing the proper barrier primitives. 
It seems to me that the proper barrier here is a "SYNC 18" aka SYNC_RELEASE instruction, at least on CPUs that implement that variant. Yes, unfortunately very few CPUs implement that. It is an instruction that MIPS invented only recently, so older CPUs need a different solution. David Daney -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
David Daney writes: > On 11/12/2015 04:31 AM, Peter Zijlstra wrote: >> Hi >> >> I think the MIPS arch_spin_unlock() is borken. >> >> spin_unlock() must have RELEASE semantics, these require that no LOADs >> nor STOREs leak out from the critical section. >> >> From what I know MIPS has a relaxed memory model which allows reads to >> pass stores, and as implemented arch_spin_unlock() only issues a wmb >> which doesn't order prior reads vs later stores. >> >> Therefore upgrade the wmb() to smp_mb(). >> >> (Also, why the unconditional wmb, as opposed to smp_wmb() ?) > > asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would > imply that special handling for non-SMP is needed, when this is > already only used for the SMP build case. > >> >> Maybe-Signed-off-by: Peter Zijlstra (Intel) >> --- >> diff --git a/arch/mips/include/asm/spinlock.h >> b/arch/mips/include/asm/spinlock.h >> index 40196bebe849..b2ca13f06152 100644 >> --- a/arch/mips/include/asm/spinlock.h >> +++ b/arch/mips/include/asm/spinlock.h >> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) >> static inline void arch_spin_unlock(arch_spinlock_t *lock) >> { >> unsigned int serving_now = lock->h.serving_now + 1; >> -wmb(); >> +smp_mb(); > > That is too heavy. > > It implies a full MIPS "SYNC" operation which stalls execution until > all previous writes are committed and globally visible. > > We really want just release semantics, and there is no standard named > primitive that gives us that. > > For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: > > smp_wmb(); > smp_rmb(); > > Which expands to exactly the same thing as wmb() because smp_rmb() > expands to nothing. > > For CPUs that have out-of-order loads, smp_rmb() should expand to > something lighter weight than "SYNC" > > Certainly we can load up the code with "SYNC" all over the place, but > it will kill performance on SMP systems. So, my vote would be to make > it as light weight as possible, but no lighter. 
That will mean > inventing the proper barrier primitives. It seems to me that the proper barrier here is a "SYNC 18" aka SYNC_RELEASE instruction, at least on CPUs that implement that variant. -- Måns Rullgård m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 09:46:53AM -0800, David Daney wrote: > For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: > > smp_wmb(); > smp_rmb(); > > Which expands to exactly the same thing as wmb() because smp_rmb() expands > to nothing. OK, so the current code isn't broken because for Cavium wmb is sufficient because rmb is a no-op, and for !Cavium wmb expands to SYNC. > You yourself seem to have added smp_store_release(), so we could even do: > > smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1); > > That would leave us to cook up a proper definition of smp_store_release(). That is indeed the better solution.
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On 11/12/2015 04:31 AM, Peter Zijlstra wrote: Hi I think the MIPS arch_spin_unlock() is borken. spin_unlock() must have RELEASE semantics, these require that no LOADs nor STOREs leak out from the critical section. From what I know MIPS has a relaxed memory model which allows reads to pass stores, and as implemented arch_spin_unlock() only issues a wmb which doesn't order prior reads vs later stores. Therefore upgrade the wmb() to smp_mb(). (Also, why the unconditional wmb, as opposed to smp_wmb() ?) asm/spinlock.h is only used for !CONFIG_SMP. So, smp_wmb() would imply that special handling for non-SMP is needed, when this is already only used for the SMP build case. Maybe-Signed-off-by: Peter Zijlstra (Intel) --- diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h index 40196bebe849..b2ca13f06152 100644 --- a/arch/mips/include/asm/spinlock.h +++ b/arch/mips/include/asm/spinlock.h @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock) static inline void arch_spin_unlock(arch_spinlock_t *lock) { unsigned int serving_now = lock->h.serving_now + 1; - wmb(); + smp_mb(); That is too heavy. It implies a full MIPS "SYNC" operation which stalls execution until all previous writes are committed and globally visible. We really want just release semantics, and there is no standard named primitive that gives us that. For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be: smp_wmb(); smp_rmb(); Which expands to exactly the same thing as wmb() because smp_rmb() expands to nothing. For CPUs that have out-of-order loads, smp_rmb() should expand to something lighter weight than "SYNC" Certainly we can load up the code with "SYNC" all over the place, but it will kill performance on SMP systems. So, my vote would be to make it as light weight as possible, but no lighter. That will mean inventing the proper barrier primitives. 
You yourself seem to have added smp_store_release(), so we could even do:

	smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1);

That would leave us to cook up a proper definition of smp_store_release().

David Daney
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 02:50:00PM +, Måns Rullgård wrote:
> "Paul E. McKenney" writes:
>
> > On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> >> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
> >
> > One guess is that they want to order I/O accesses within the critical
> > section?
>
> Isn't that what mmiowb() is for?

Indeed it is. Perhaps they didn't trust the device drivers that they
care about to use it? Anyway, just my guess.

Just out of curiosity, what is your guess?

							Thanx, Paul

> --
> Måns Rullgård
> m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
"Paul E. McKenney" writes:

> On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
>> Hi
>>
>> I think the MIPS arch_spin_unlock() is broken.
>>
>> spin_unlock() must have RELEASE semantics; these require that no LOADs
>> nor STOREs leak out from the critical section.
>>
>> From what I know, MIPS has a relaxed memory model which allows reads to
>> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
>> which doesn't order prior reads vs. later stores.
>>
>> Therefore upgrade the wmb() to smp_mb().
>>
>> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
>
> One guess is that they want to order I/O accesses within the critical
> section?

Isn't that what mmiowb() is for?

--
Måns Rullgård
m...@mansr.com
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> Hi
>
> I think the MIPS arch_spin_unlock() is broken.
>
> spin_unlock() must have RELEASE semantics; these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know, MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
> which doesn't order prior reads vs. later stores.
>
> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)

One guess is that they want to order I/O accesses within the critical
section?

							Thanx, Paul

> Maybe-Signed-off-by: Peter Zijlstra (Intel)
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;
>  	nudge_writes();
>  }
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Peter Zijlstra writes:

> Hi
>
> I think the MIPS arch_spin_unlock() is broken.
>
> spin_unlock() must have RELEASE semantics; these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know, MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb,
> which doesn't order prior reads vs. later stores.

This is correct.

> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)

Good question. The current MIPS asm/barrier.h uses a plain SYNC
instruction for all kinds of barriers (except on Cavium Octeon), which
is a bit wasteful. A MIPS implementation can optionally support partial
barriers (load, store, acquire, release) which all behave like a full
barrier if not implemented, so those really ought to be used.

> Maybe-Signed-off-by: Peter Zijlstra (Intel)
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;
>  	nudge_writes();
>  }

All this weirdness was added in commit 500c2e1f:

    MIPS: Optimize spinlocks.

    The current locking mechanism uses an ll/sc sequence to release a
    spinlock.  This is slower than a wmb() followed by a store to unlock.

    The branching forward to .subsection 2 on sc failure slows down the
    contended case.  So we get rid of that part too.

    Since we are now working on naturally aligned u16 values, we can get
    rid of a masking operation as the LHU already does the right thing.
    The ANDI are reversed for better scheduling on multi-issue CPUs.

    On a 12 CPU 750MHz Octeon cn5750 this patch improves ipv4 UDP packet
    forwarding rates from 3.58*10^6 PPS to 3.99*10^6 PPS, or about 11%.

    Signed-off-by: David Daney
    To: linux-m...@linux-mips.org
    Patchwork: http://patchwork.linux-mips.org/patch/937/
    Signed-off-by: Ralf Baechle

--
Måns Rullgård
m...@mansr.com
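As a rough portable illustration of those partial-barrier flavours, the C11 fences can stand in for the kernel barrier names. This is an approximation only: each C11 fence here is at least as strong as the corresponding kernel barrier, the names are invented, and the mapping onto actual MIPS SYNC stype encodings is implementation-specific.

```c
#include <stdatomic.h>

/* Approximate userspace analogues of the kernel barriers discussed
 * above. On a MIPS core with the optional lightweight SYNC variants,
 * these could be lighter than a full SYNC; where a variant is not
 * implemented, the hardware falls back to full-barrier behaviour. */
static inline void my_smp_mb(void)  { atomic_thread_fence(memory_order_seq_cst); }
static inline void my_smp_wmb(void) { atomic_thread_fence(memory_order_release); }
static inline void my_smp_rmb(void) { atomic_thread_fence(memory_order_acquire); }
```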
Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
>  	unsigned int serving_now = lock->h.serving_now + 1;
> -	wmb();
> +	smp_mb();
>  	lock->h.serving_now = (u16)serving_now;

Also, you might want to consider using WRITE_ONCE() for that store.

>  	nudge_writes();

And that thing is creative; I've not seen any other arch issue barriers
to speed up the store-buffer flush. Does it really matter for MIPS?

>  }
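For context, WRITE_ONCE() boils down to a store through a volatile lvalue, which stops the compiler from tearing, fusing, or re-issuing the store. A minimal single-type sketch (the real kernel macro is type-generic; this name is invented for illustration):

```c
#include <assert.h>

/* Hypothetical single-type sketch of WRITE_ONCE(): the volatile cast
 * forces the compiler to emit exactly one u16-sized store, so the
 * serving_now update cannot be torn into byte stores or duplicated. */
#define MY_WRITE_ONCE_U16(x, val) \
    (*(volatile unsigned short *)&(x) = (unsigned short)(val))
```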
[RFC][PATCH] mips: Fix arch_spin_unlock()
Hi

I think the MIPS arch_spin_unlock() is broken.

spin_unlock() must have RELEASE semantics; these require that no LOADs
nor STOREs leak out from the critical section.

From what I know, MIPS has a relaxed memory model which allows reads to
pass stores, and as implemented arch_spin_unlock() only issues a wmb,
which doesn't order prior reads vs. later stores.

Therefore upgrade the wmb() to smp_mb().

(Also, why the unconditional wmb, as opposed to smp_wmb() ?)

Maybe-Signed-off-by: Peter Zijlstra (Intel)
---
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 40196bebe849..b2ca13f06152 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	unsigned int serving_now = lock->h.serving_now + 1;
-	wmb();
+	smp_mb();
 	lock->h.serving_now = (u16)serving_now;
 	nudge_writes();
 }
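To make the discussion concrete, here is a self-contained userspace C11 sketch of the same ticket-lock scheme, with the unlock done as a single release store rather than a wmb() plus plain store. This is illustration only: it is not the MIPS implementation, and all names are invented.

```c
#include <stdatomic.h>

/* Userspace C11 sketch of a ticket spinlock. lock() takes a ticket and
 * spins until serving_now reaches it; unlock() publishes the next
 * ticket with a release store, which orders all prior loads AND stores
 * in the critical section before the store -- the property wmb() alone
 * does not give on a relaxed-memory-model CPU. */
struct ticket_lock {
    _Atomic unsigned short ticket;      /* next ticket to hand out */
    _Atomic unsigned short serving_now; /* ticket currently served  */
};

static void ticket_lock(struct ticket_lock *l)
{
    unsigned short me = atomic_fetch_add_explicit(&l->ticket, 1,
                                                  memory_order_relaxed);
    /* Acquire on the load pairs with the release in ticket_unlock(). */
    while (atomic_load_explicit(&l->serving_now,
                                memory_order_acquire) != me)
        ; /* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
    unsigned short next = atomic_load_explicit(&l->serving_now,
                                               memory_order_relaxed) + 1;
    atomic_store_explicit(&l->serving_now, next, memory_order_release);
}
```

The unlock path is one relaxed load, an increment, and one release store: exactly the smp_store_release() shape David Daney suggested upthread.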