Re: [patch] x86: improved memory barrier implementation
On Sat, Sep 29, 2007 at 09:07:30AM -0700, Linus Torvalds wrote:
>
> On Sat, 29 Sep 2007, Nick Piggin wrote:
> > >
> > > The non-temporal stores should be basically considered to be "IO",
> > > not any normal memory operation.
> >
> > Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> > RAM apparently can go out of order too, and they are being used in the
> > kernel for some things.
>
> I'm really saying that people to a first approximation should think "NT is
> an IO (DMA) thing". Whether cached or not. Exactly because they do not
> honor the normal memory ordering.
>
> It may be worth noting that "clflush" falls under that heading too - even
> if all the actual *writes* were done with totally normal writes, if
> anybody does a clflush instruction, that breaks the ordering, and that
> turns it to "DMA ordering" again - ie we're not talking about the normal
> SMP ordering rules at all.
>
> So all the spinlocks and all the smp_*mb() barriers have never really done
> *anything* for those things (in particular, "smp_wmb()" has *always*
> ignored them on i386!)

Yes, that's true. I'm just noting that in some places we do non-temporal
stores to cacheable RAM, and in that case we just have to remember to be
careful of memory ordering (especially in library functions like
copy_user_nocache, where we never know exactly what it will be used for,
even if it seems safe).

Speaking of which, would it be a good idea to put an sfence before and
after this function? It seems like the safe thing to do...

> > Likewise for rep stos, apparently.
>
> No. As far as I can tell, the fast string operations are unordered
> *within*themselves*, but not wrt the operations around them.
>
> In other words, you cannot depend on the ordering of stores *in* the
> memcpy() or memset() when it is implemented by "rep movs/stos" - but that
> is 100% equivalent to the fact that you cannot depend on the ordering even
> when it isn't - since the "memcpy()" library routine might be copying
> memory backwards for all you know!
>
> The Intel memory ordering paper doesn't talk about the fast string
> instructions (except to say that the rules it *does* speak about do not
> hold), but the regular IA manuals do say (for example):
>
>   "Code dependent upon sequential store ordering should not use the
>   string operations for the entire data structure to be stored. Data and
>   semaphores should be separated. Order dependent code should use a
>   discrete semaphore uniquely stored to after any string operations to
>   allow correctly ordered data to be seen by all processors."
>
> and note how it says you should just store to the semaphore. If you think
> about it, that semaphore will involve all the memory ordering requirements
> that we *already* depend on, so if a semaphore is sufficient to order the
> fast string instruction, then by definition using a spinlock around them
> must be the same thing!
>
> In other words, by Intel's architecture manual, fast string instructions
> cannot escape a "semaphore" - but that means that they cannot escape a
> spinlock either (since the two are exactly the same wrt memory ordering
> rules! In other words, whenever the Intel docs say "semaphore", think
> "mutual exclusion lock", not necessarily the kernel kind of "sleeping
> semaphore").
>
> But it might be good to have that explicitly mentioned in the IA memory
> ordering thing, so I'll ask the Intel people about that. However, I'd say
> that given our *current* documentation, string instructions may be
> *internally* out-of-order, but they would not escape a lock.

OK, it's good to hear that from you :) The one tiny doubt I have left is
that IIRC Intel sometimes talk about lock-prefixed operations as
"semaphore operations"? I don't suppose that would be the case here?
Anyway, if you can get them to put something in writing, that would be
grand.

> > But this means they are already at odds with spin_unlock, unless they
> > are enclosed with mfences everywhere they are used (of which I think
> > most are not). So this is an existing bug in the kernel.
>
> See above. I do not believe that it's an existing bug, but the basic point
> that the change to "smp_rmb()" doesn't change our existing rules is true.

OK, so all that's left is to restore the locked smp_rmb case for broken
ppro and resend the patch?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
> > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> > can also be a simple barrier on i386 too.
>
> The IDT Winchip can do SMP apparently.

From the WinChip 3 (which was the final WinChip) specs:

  "The IDT WinChip 3 processor also omits the software interface to the
  Intel-proprietary symmetric multiprocessing support: APIC. This bus
  function is omitted since the target market for the IDT WinChip 3
  processor is typical desktop systems (which do not support APIC
  multiprocessing)."

It didn't offer any alternative DIY-SMP either (or at least none that's
documented). Centaur only became SMP-capable with some of the C3
Nehemiahs a year or two back.

Dave

--
http://www.codemonkey.org.uk
Re: [patch] x86: improved memory barrier implementation
On Sat, 29 Sep 2007, Nick Piggin wrote:
> >
> > The non-temporal stores should be basically considered to be "IO", not any
> > normal memory operation.
>
> Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> RAM apparently can go out of order too, and they are being used in the
> kernel for some things.

I'm really saying that people to a first approximation should think "NT is
an IO (DMA) thing". Whether cached or not. Exactly because they do not
honor the normal memory ordering.

It may be worth noting that "clflush" falls under that heading too - even
if all the actual *writes* were done with totally normal writes, if
anybody does a clflush instruction, that breaks the ordering, and that
turns it to "DMA ordering" again - ie we're not talking about the normal
SMP ordering rules at all.

So all the spinlocks and all the smp_*mb() barriers have never really done
*anything* for those things (in particular, "smp_wmb()" has *always*
ignored them on i386!)

> Likewise for rep stos, apparently.

No. As far as I can tell, the fast string operations are unordered
*within*themselves*, but not wrt the operations around them.

In other words, you cannot depend on the ordering of stores *in* the
memcpy() or memset() when it is implemented by "rep movs/stos" - but that
is 100% equivalent to the fact that you cannot depend on the ordering even
when it isn't - since the "memcpy()" library routine might be copying
memory backwards for all you know!

The Intel memory ordering paper doesn't talk about the fast string
instructions (except to say that the rules it *does* speak about do not
hold), but the regular IA manuals do say (for example):

  "Code dependent upon sequential store ordering should not use the
  string operations for the entire data structure to be stored. Data and
  semaphores should be separated. Order dependent code should use a
  discrete semaphore uniquely stored to after any string operations to
  allow correctly ordered data to be seen by all processors."

and note how it says you should just store to the semaphore. If you think
about it, that semaphore will involve all the memory ordering requirements
that we *already* depend on, so if a semaphore is sufficient to order the
fast string instruction, then by definition using a spinlock around them
must be the same thing!

In other words, by Intel's architecture manual, fast string instructions
cannot escape a "semaphore" - but that means that they cannot escape a
spinlock either (since the two are exactly the same wrt memory ordering
rules! In other words, whenever the Intel docs say "semaphore", think
"mutual exclusion lock", not necessarily the kernel kind of "sleeping
semaphore").

But it might be good to have that explicitly mentioned in the IA memory
ordering thing, so I'll ask the Intel people about that. However, I'd say
that given our *current* documentation, string instructions may be
*internally* out-of-order, but they would not escape a lock.

> But this means they are already at odds with spin_unlock, unless they
> are enclosed with mfences everywhere they are used (of which I think
> most are not). So this is an existing bug in the kernel.

See above. I do not believe that it's an existing bug, but the basic point
that the change to "smp_rmb()" doesn't change our existing rules is true.

		Linus
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 06:18:31PM +0100, Alan Cox wrote:
> > on the broken ppro stores config option if you just tell me what should
> > be there (again, remember that my patch isn't actually changing anything
> > already there except for smp_rmb side).
>
> The PPro needs rmb to ensure a store doesn't go for a walk on the wild
> side and pass the read especially when we are dealing with device space.

smp_rmb (ie. the non-device-space version) needs to order stores too? OK,
that's definitely something we'll need a locked op for.

I can't see how it could possibly close all holes, though: the level at
which memory ordering dependencies operate in lockless code is way above
what the hardware can know about. For any two given code sequences

  store 1 -> X
  store 2 -> Y

and

  load Y
  load X

there may be a requirement to have wmb and rmb respectively for
correctness, or it may be perfectly correct without ordering. Unless the
issue is *purely* that stores and/or loads can erroneously execute out of
order, then adding magic to barrier primitives cannot fix all cases (and
the errata certainly looked spookier than simple out-of-order problems).

> The rest of the stuff is a little vague which makes me nervous when
> considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
> the moment the PPro is effectively treated as an out of order cpu and so
> shouldn't hit anything occasionally that a PPC wouldn't hit every time.

Well, it's all a little vague to me ;) (maybe you were privy to info that
you aren't allowed to talk about...). IIRC the errata I could find
mentioned concurrent stores to a single variable as being problematic, and
nothing about memory ordering, or loads, at all. One difference from a
normal weakly ordered machine is that the normal one wouldn't require rmb
to order loads vs stores, for example.

Anyway, after all that blowing of smoke (I'm just genuinely curious), I
have no problems with any fixups for ppro, and defer to your greater
understanding of the issues. However, all that can go under the config
variable we have for that purpose. Let me know what you need? Is it just a
matter of putting the dummy lock'ed op back in smp_rmb? Are you sure you
don't also need smp_wmb to order stores? I can cook up a patch to do
either.

> > > - and for modern processors its still not remotely clear your patch is
> > > correct because of NT stores.
> >
> > No. We already _have_ to rely on this model for barriers
>
> Well Linus has dealt with the question of NT stores for us now...

He actually didn't, quite ;) (see other post)

> Given this isn't an issue on 64bit I'm inclined to argue that only 64bit
> behaviour should be changed at this point.

That's the easy way out, but I think it creates more problems down the
line. It diverges the code base and the testing base. I think it is pretty
simple (and serves as good documentation) to extend the use of the ppro
ifdef to do what we need here, rather than implicitly rely on the barriers
being stronger than required for non-broken implementations.
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 09:15:06AM -0700, Linus Torvalds wrote:
>
> On Fri, 28 Sep 2007, Alan Cox wrote:
> >
> > However
> > - You've not shown the patch has any performance gain
>
> It would be nice to see this.

Actually, in a userspace test I have (which actually does enough work to
trigger out-of-order operations on my Core 2, but is otherwise pretty
trivial), lfence takes 13 cycles and sfence takes 40 (neither of which
actually solves the problem of load vs store ordering, but at least they
might be operating on a slightly utilised memory subsystem rather than the
stupidest possible microbenchmark). The dummy lock op takes around 75
cycles (of course, the Core 2 would always use the fences, but older CPUs
will not, and will be worse at the lock op too, probably).

I suppose these could take significantly longer if there are uncached
memory operations and such (I wasn't doing any significant amount of IO)
-- I can't be sure, though.

So it isn't much, but it could be helpful. If the code is important enough
to go without locks and instead use complex barriers, it might easily be
worth saving these kinds of cycles in.

> > - You've probably broken Pentium Pro
>
> Probably not a big deal, but yeah, we should have that broken-ppro option.

Will add; I'll ask Alan to specify what he'd like to see there.

> > - and for modern processors its still not remotely clear your patch is
> > correct because of NT stores.
>
> This one I disagree with. The *old* code doesn't honor NT stores *either*.
>
> The fact is, if you use NT stores and then depend on ordering, it has
> nothing what-so-ever to do with spinlocks or smp_[rw]mb. You need to use
> the DMA barriers (ie the non-smp ones).
>
> The non-temporal stores should be basically considered to be "IO", not any
> normal memory operation.

Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
RAM apparently can go out of order too, and they are being used in the
kernel for some things. Likewise for rep stos, apparently.

But this means they are already at odds with spin_unlock, unless they are
enclosed with mfences everywhere they are used (of which I think most are
not). So this is an existing bug in the kernel.

So again the question comes up -- do we promote these kinds of stores to
be regular x86 citizens, keep the strong memory barriers as they are, and
eat 40 cycles with an sfence before each spin_unlock store; or do we fix
the few users of non-temporal stores and continue with the model we've
always had, where stores are in-order? Or I guess the implicit option is
to do nothing until some poor bastard has the pleasure of having to debug
some problem.

Anyway, just keep in mind that this patch is not making any changes which
are not already fundamentally broken. Sure, it might happen to cause more
actual individual cases to break, but if they just happened to be using
real locking instead of explicit barriers, they would be broken anyway,
right? (IOW, any new breakage is already conceptually broken, even if OK
in practice due to the overstrictness of our current barriers.)
Re: [patch] x86: improved memory barrier implementation
> on the broken ppro stores config option if you just tell me what should
> be there (again, remember that my patch isn't actually changing anything
> already there except for smp_rmb side).

The PPro needs rmb to ensure a store doesn't go for a walk on the wild
side and pass the read, especially when we are dealing with device space.

The rest of the stuff is a little vague, which makes me nervous when
considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
the moment, the PPro is effectively treated as an out-of-order cpu and so
shouldn't hit anything occasionally that a PPC wouldn't hit every time.

> > - and for modern processors its still not remotely clear your patch is
> > correct because of NT stores.
>
> No. We already _have_ to rely on this model for barriers

Well, Linus has dealt with the question of NT stores for us now...

Given this isn't an issue on 64bit, I'm inclined to argue that only 64bit
behaviour should be changed at this point.

Alan
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
> > The only alternative is to assume a weak memory model, and add the required
> > barriers to spin_unlock -- something that has been explicitly avoided, but
>
> We have the barriers in spin_unlock already for Pentium Pro and IDT
> Winchip systems. The Winchip explicitly supports out of order store (and
> was about 10-20% faster with it set this way), the PPro has some annoying
> little store ordering bugs which is why we xchgb out of spin_unlock in
> those cases.

And we still have the correct smp_wmb barrier for OOOSTORE systems. And we
never did any smp_wmb tricks for broken ppros in barrier code. IOW, my
patch is a complete noop for both (it changes smp_rmb to be a noop, of
course, but AFAICS that shouldn't be relevant for either problem -- and if
it was, then you need to add special ifdefs for them rather than make
everyone else eat proverbial shit, like we do for spin_unlock).

> > The complaint that broken old ppros get more broken seems to be bogus as
> > well: wmb on i386 never catered to the old ppro to start with. The errata
> > seem to
>
> There are only a couple of awkward cases with the PPro and we carefully
> work around them. The other half of the work is the flush_write_buffers.

Note that for IO, rmb/wmb/mb remain as they are.

> > Buggy ppros: I have no interest in, given the vague errata and apparently
> > unfixable problems with lockless code; anybody is welcome to try adding
> > whatever they like to help them work, but again please don't confuse that
> > with having anything to do with this patch...
>
> PPro was fixed, made to work and brutally tested on quad PPro. There are
> cases that wouldn't work (PPro + AGP which never happens) and there are
> cases that need specific handling (PPro + Voodoo card and some other 3D
> bits) which are handled currently in the X server. The rest is fine.

Aside from the fact that we currently don't do anything for ppro in
smp_wmb *anyway*, I really don't think adding stuff there can fix it: you
can fix critical sections, because you can be sure that all stores to the
same memory regions are strictly fenced inside the locks, except for the
ops on the locks themselves. For those stores, you obviously can work
around the problems by using lock ops.

However, for lockless code, it is entirely possible to do conflicting
unlocked stores to the same location. Memory barriers are never going to
stop this, because they do not provide any synchronisation. They may make
some things less likely, or make some actual patterns effectively safe...

Anyway, I'm not arguing against adding "stuff" here for it. I repeat: that
isn't a concern of this patch.

> > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> > can also be a simple barrier on i386 too.
>
> The IDT Winchip can do SMP apparently. Nobody has ever seen an SMP
> winchip board so I don't think it matters and even if one existed we
> wouldn't support it anyway

OK, so in theory we could make smp_wmb just a simple barrier; however, the
code for ooostore there is under a config option, so there is no harm in
keeping it around really.

> However
> - You've not shown the patch has any performance gain

Can I theorise that it will reduce icache misses and be done with it? ;)

> - You've probably broken Pentium Pro

Doubt it. It would be really interesting to have an _exact_ trace of a
breakage that isn't fundamentally broken already. I suspect that
information will never come out of Intel.

But again, if you think we need some stronger stuff in smp_wmb, I can do a
patch that depends on the broken ppro stores config option if you just
tell me what should be there (again, remember that my patch isn't actually
changing anything already there except for the smp_rmb side).

> - and for modern processors its still not remotely clear your patch is
> correct because of NT stores.

No. We already _have_ to rely on this model for barriers, because we have
a lock-less spin_unlock. We can either: put explicit mfences around all
stores that aren't executed in-order WRT other stores; or make spin_unlock
use the lock prefix. We have pretty explicitly decided to do the former,
and the fact that subsequent patches to implement these exotic stores have
broken the memory ordering model does in no way invalidate this patch,
IMO.

> So NAK

Thanks :) Very nice feedback (I wasn't having a gentle dig at you,
honest!)
Re: [patch] x86: improved memory barrier implementation
On Fri, 28 Sep 2007, Alan Cox wrote:
>
> However
> - You've not shown the patch has any performance gain

It would be nice to see this.

> - You've probably broken Pentium Pro

Probably not a big deal, but yeah, we should have that broken-ppro option.

> - and for modern processors its still not remotely clear your patch is
> correct because of NT stores.

This one I disagree with. The *old* code doesn't honor NT stores *either*.

The fact is, if you use NT stores and then depend on ordering, it has
nothing what-so-ever to do with spinlocks or smp_[rw]mb. You need to use
the DMA barriers (ie the non-smp ones).

The non-temporal stores should be basically considered to be "IO", not any
normal memory operation.

		Linus
Re: [patch] x86: improved memory barrier implementation
> The only alternative is to assume a weak memory model, and add the required
> barriers to spin_unlock -- something that has been explicitly avoided, but

We have the barriers in spin_unlock already for Pentium Pro and IDT
Winchip systems. The Winchip explicitly supports out-of-order store (and
was about 10-20% faster with it set this way); the PPro has some annoying
little store ordering bugs, which is why we xchgb out of spin_unlock in
those cases.

> The complaint that broken old ppros get more broken seems to be bogus as
> well: wmb on i386 never catered to the old ppro to start with. The errata
> seem to

There are only a couple of awkward cases with the PPro and we carefully
work around them. The other half of the work is the flush_write_buffers.

> Buggy ppros: I have no interest in, given the vague errata and apparently
> unfixable problems with lockless code; anybody is welcome to try adding
> whatever they like to help them work, but again please don't confuse that
> with having anything to do with this patch...

PPro was fixed, made to work and brutally tested on quad PPro. There are
cases that wouldn't work (PPro + AGP, which never happens) and there are
cases that need specific handling (PPro + Voodoo card and some other 3D
bits) which are handled currently in the X server. The rest is fine.

> Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> can also be a simple barrier on i386 too.

The IDT Winchip can do SMP apparently. Nobody has ever seen an SMP Winchip
board, so I don't think it matters, and even if one existed we wouldn't
support it anyway.

However:
- You've not shown the patch has any performance gain
- You've probably broken Pentium Pro
- and for modern processors its still not remotely clear your patch is
  correct because of NT stores.

So NAK

Alan