On Mon, Nov 27, 2023 at 12:16:43PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 09:51:53PM -0500, Guo Ren wrote:
> > On Fri, Nov 24, 2023 at 11:15:19AM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > > > From: Christoph Müllner <christoph.muell...@vrull.eu>
> > > >
> > > > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > > > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > > > (and back). The active consistency model can therefore be switched on a
> > > > per-hart base and managed by the kernel on a per-process/thread base.
> > >
> > > You guys, computers are hartless, nobody told ya?
> > >
> > > > This patch implements basic Ssdtso support and adds a prctl API on top
> > > > so that user-space processes can switch to a stronger memory consistency
> > > > model (than the kernel was written for) at run-time.
> > > >
> > > > I am not sure if other architectures support switching the memory
> > > > consistency model at run-time, but designing the prctl API in an
> > > > arch-independent way allows reusing it in the future.
> > >
> > > IIRC some Sparc chips could do this, but I don't think anybody ever
> > > exposed this to userspace (or used it much).
> > >
> > > IA64 had planned to do this, except they messed it up and did it the
> > > wrong way around (strong first and then relax it later), which lead to
> > > the discovery that all existing software broke (d'uh).
> > >
> > > I think ARM64 approached this problem by adding the
> > > load-acquire/store-release instructions and for TSO based code,
> > > translate into those (eg. x86 -> arm64 transpilers).
> > Keeping global TSO order is easier and faster than mixing
> > acquire/release and regular load/store. That means when ssdtso is
> > enabled, the transpiler's load-acquire/store-release becomes regular
> > load/store. Some micro-arch hardwares could speed up the performance.
>
> Why is it faster? Because the release+acquire thing becomes RcSC instead
> of RcTSO? Surely that can be fixed with a weaker store-release variant
> or something?

"ld.acq + st.rel" can only get close to ideal RCtso, because tracking a
mix of "ld.acq + st.rel + ld + st" is more complex in the LSU than
handling plain "ld + st" under global TSO. That is why we want a global
TSO flag: it simplifies the micro-arch implementation, especially for
the small cores in a big-little system.
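
To make the two lowering strategies concrete, here is a rough C11 sketch of how a translator for TSO guest code might pick its access flavour. It is only an illustration under stated assumptions: GUEST_TSO_IN_HW, guest_load64() and guest_store64() are hypothetical names, not part of the patch or of any existing translator, and a real translator would emit machine instructions directly instead of going through C atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical build-time switch: 1 if the hart runs guest code in global TSO. */
#ifndef GUEST_TSO_IN_HW
#define GUEST_TSO_IN_HW 0
#endif

/* Load on behalf of TSO guest code. */
static inline uint64_t guest_load64(_Atomic uint64_t *p)
{
#if GUEST_TSO_IN_HW
    /* Hardware already orders the access: a plain load is enough. */
    uint64_t v = atomic_load_explicit(p, memory_order_relaxed);
    /* Compiler-only barrier so the C compiler does not reorder it. */
    atomic_signal_fence(memory_order_seq_cst);
    return v;
#else
    /* Weak hardware model: ask for a load-acquire. */
    return atomic_load_explicit(p, memory_order_acquire);
#endif
}

/* Store on behalf of TSO guest code. */
static inline void guest_store64(_Atomic uint64_t *p, uint64_t v)
{
#if GUEST_TSO_IN_HW
    /* Compiler-only barrier, then a plain store. */
    atomic_signal_fence(memory_order_seq_cst);
    atomic_store_explicit(p, v, memory_order_relaxed);
#else
    /* Weak hardware model: ask for a store-release. */
    atomic_store_explicit(p, v, memory_order_release);
#endif
}
```

With the flag set, every guest access becomes a regular load/store and the LSU only has to implement one ordering rule; without it, each guest access carries an acquire or release annotation that has to coexist with the ordinary accesses of the translator itself.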
>
> The problem I have with all of this is that you need to context switch
> this state and that you need to deal with exceptions, which must be
> written for the weak model but then end up running in the tso model --
> possibly slower than desired.

S-mode TSO is useless for the riscv Linux kernel, and this patch only
uses U-mode TSO. So the exception handlers and the whole kernel always
run in WMO. Two years ago we worried about things like io_uring, where
the userspace side runs in TSO but the kernel side runs in WMO. That
still seems fine: each side has its own implementation, and each one
ensures the ordering it needs. So there should be no problem with
io_uring communication between TSO and WMO. The only things we need to
prevent are:

 1. Do not let WMO code run in TSO mode, which is inefficient (as you
    mentioned).
 2. Do not let TSO code run in WMO mode, which is incorrect.

> If OTOH you only have a single model, everything becomes so much
> simpler. You just need to be able to express exactly what you want.

Ssdtso does no harm to the current WMO; it is just a tradeoff for the
micro-arch implementation. You can still use "ld + st" or
"ld.acq + st.rl", but they behave the same in the global TSO state.
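
As an illustration of the io_uring point, here is a minimal single-producer/single-consumer ring in C11, written for a weak memory model. It is not io_uring's actual layout or code; struct ring, ring_push(), ring_pop() and RING_SIZE are made-up names for this sketch. The side written for WMO carries explicit acquire/release annotations, and those orderings remain valid (just stronger than needed) if the peer happens to execute under TSO.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity, must be a power of two */

struct ring {
    _Atomic uint32_t head;      /* advanced only by the consumer */
    _Atomic uint32_t tail;      /* advanced only by the producer */
    uint32_t slots[RING_SIZE];
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, uint32_t v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the consumer's release of head. */
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false;

    r->slots[tail % RING_SIZE] = v;
    /* Release publishes the slot contents before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, uint32_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release of tail. */
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;

    *out = r->slots[head % RING_SIZE];
    /* Release hands the slot back to the producer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Each side only relies on its own annotations being honoured by the hart it runs on, which is why mixing a TSO producer with a WMO consumer (or the reverse) should not break the protocol.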