On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
> > Hi Paul, Will, Peter,
> >
> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
> > is able to break rcu_dereference. This seems to be taken care of by
> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
> >
> > In the liburcu user-space library, we have this comment near
> > rcu_dereference() in include/urcu/static/pointer.h:
> >
> >  * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative
> >  * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the
> >  * data read before the pointer read by speculating the value of the pointer.
> >  * Correct ordering is ensured because the pointer is read as a volatile access.
> >  * This acts as a global side-effect operation, which forbids reordering of
> >  * dependent memory operations. Note that such concern about dependency-breaking
> >  * optimizations will eventually be taken care of by the "memory_order_consume"
> >  * addition to forthcoming C++ standard.
> >
> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in
> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> >
> > Peter tells me the "memory_order_consume" is not something which can be used today.
> > Any information on its status at C/C++ standard levels and implementation-wise ?

Actually, you really can use memory_order_consume.  All current
implementations will compile it as if it was memory_order_acquire.
This will work correctly, but may be slower than you would like on ARM,
PowerPC, and so on.

On things like x86, the penalty is forgone optimizations, so less
of a problem there.

> > Pragmatically speaking, what should we change in liburcu to ensure we don't generate
> > broken code when LTO is enabled ? I suspect there are a few options here:
> >
> > 1) Fail to build if LTO is enabled,
> > 2) Generate slower code for rcu_dereference, either on all architectures or only
> >    on weakly-ordered architectures,
> > 3) Generate different code depending on whether LTO is enabled or not. AFAIU this would only
> >    work if every compile unit is aware that it will end up being optimized with LTO. Not sure
> >    how this could be done in the context of user-space.
> > 4) [ Insert better idea here. ]

Use memory_order_consume if LTO is enabled.  That will work now, and
might generate good code in some hoped-for future.

> > Thoughts ?
>
> Using memory_order_acquire is safe; and is basically what Will did for
> ARM64.
>
> The problematic transformations are possible even without LTO, although
> less likely due to less visibility, but everybody agrees they're
> possible and allowed.
>
> OTOH we do not have a positive sighting of it actually happening (I
> think), we're all just being cautious and not willing to debug the
> resulting wreckage if it does indeed happen.

And yes, you can also use memory_order_acquire.

							Thanx, Paul
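
For illustration only, here is a minimal C11 sketch of the two loads
discussed above: a memory_order_consume load and the memory_order_acquire
alternative, paired with a release store on the update side.  The demo_*
helpers, struct config, and shared_cfg are made-up names for this sketch
and are not liburcu API; it assumes a compiler that provides <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical payload type, for illustration only. */
struct config {
	int value;
};

/* Hypothetical RCU-protected pointer (zero-initialized, i.e. NULL). */
static _Atomic(struct config *) shared_cfg;

/*
 * Hypothetical stand-in for rcu_assign_pointer(): the release store
 * orders the initialization of *p before the pointer becomes visible.
 */
static inline void demo_rcu_assign_pointer(_Atomic(struct config *) *pp,
					   struct config *p)
{
	assert(pp != NULL);
	atomic_store_explicit(pp, p, memory_order_release);
}

/*
 * Hypothetical stand-in for rcu_dereference() using a consume load.
 * As noted above, current implementations compile this as if it were
 * memory_order_acquire.
 */
static inline struct config *
demo_rcu_dereference(_Atomic(struct config *) *pp)
{
	return atomic_load_explicit(pp, memory_order_consume);
}

/* The same load spelled with memory_order_acquire, which is also safe. */
static inline struct config *
demo_rcu_dereference_acquire(_Atomic(struct config *) *pp)
{
	return atomic_load_explicit(pp, memory_order_acquire);
}

/* Reader: fetch the pointer, then do the dependent data read. */
static int demo_read_value(void)
{
	struct config *c = demo_rcu_dereference(&shared_cfg);

	return c ? c->value : -1;	/* -1 means "nothing published yet" */
}
```

An updater would fill in a struct config and then call
demo_rcu_assign_pointer(&shared_cfg, &cfg); readers calling
demo_read_value() afterwards see the initialized fields.  Grace periods
and reclamation are left out of this sketch on purpose.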