Re: [patch] x86: improved memory barrier implementation
On Sat, Sep 29, 2007 at 09:07:30AM -0700, Linus Torvalds wrote:
>
> On Sat, 29 Sep 2007, Nick Piggin wrote:
> > >
> > > The non-temporal stores should be basically considered to be "IO",
> > > not any normal memory operation.
> >
> > Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> > RAM apparently can go out of order too, and they are being used in the
> > kernel for some things.
>
> I'm really saying that people to a first approximation should think "NT is
> an IO (DMA) thing". Whether cached or not. Exactly because they do not
> honor the normal memory ordering.
>
> It may be worth noting that "clflush" falls under that heading too - even
> if all the actual *writes* were done with totally normal writes, if
> anybody does a clflush instruction, that breaks the ordering, and that
> turns it to "DMA ordering" again - ie we're not talking about the normal
> SMP ordering rules at all.
>
> So all the spinlocks and all the smp_*mb() barriers have never really done
> *anything* for those things (in particular, "smp_wmb()" has *always*
> ignored them on i386!)

Yes, that's true. I'm just noting that in some places we do non-temporal
stores to cacheable RAM, and in that case we just have to remember to be
careful of memory ordering (especially in library functions like
copy_user_nocache, where we never know exactly what it will be used for,
even if it seems safe).

Speaking of which, would it be a good idea to put an sfence before and
after this function? It seems like the safe thing to do...

> > Likewise for rep stos, apparently.
>
> No. As far as I can tell, the fast string operations are unordered
> *within*themselves*, but not wrt the operations around them.
>
> In other words, you cannot depend on the ordering of stores *in* the
> memcpy() or memset() when it is implemented by "rep movs/stos" - but that
> is 100% equivalent to the fact that you cannot depend on the ordering even
> when it isn't - since the "memcpy()" library routine might be copying
> memory backwards for all you know!
>
> The Intel memory ordering paper doesn't talk about the fast string
> instructions (except to say that the rules it *does* speak about do not
> hold), but the regular IA manuals do say (for example):
>
>   "Code dependent upon sequential store ordering should not use the
>   string operations for the entire data structure to be stored. Data and
>   semaphores should be separated. Order dependent code should use a
>   discrete semaphore uniquely stored to after any string operations to
>   allow correctly ordered data to be seen by all processors."
>
> and note how it says you should just store to the semaphore. If you think
> about it, that semaphore will involve all the memory ordering requirements
> that we *already* depend on, so if a semaphore is sufficient to order the
> fast string instruction, then by definition using a spinlock around them
> must be the same thing!
>
> In other words, by Intel's architecture manual, fast string instructions
> cannot escape a "semaphore" - but that means that they cannot escape a
> spinlock either (since the two are exactly the same wrt memory ordering
> rules! In other words, whenever the Intel docs say "semaphore", think
> "mutual exclusion lock", not necessarily the kernel kind of "sleeping
> semaphore").
>
> But it might be good to have that explicitly mentioned in the IA memory
> ordering thing, so I'll ask the Intel people about that. However, I'd say
> that given our *current* documentation, string instructions may be
> *internally* out-of-order, but they would not escape a lock.

OK, it's good to hear that from you :) The one tiny doubt I have left is
that IIRC Intel sometimes talk about lock-prefixed operations as
"semaphore operations"? I don't suppose that would be the case here?
Anyway, if you can get them to put something in writing, that would be
grand.

> > But this means they are already at odds with spin_unlock, unless they
> > are enclosed with mfences everywhere they are used (of which I think
> > most are not). So this is an existing bug in the kernel.
>
> See above. I do not believe that it's an existing bug, but the basic point
> that the change to "smp_rmb()" doesn't change our existing rules is true.

OK, so all that's left is to restore the locked smp_rmb case for broken
ppro and resend the patch?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
> > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> > can also be a simple barrier on i386 too.
>
> The IDT Winchip can do SMP apparently.

From the WinChip 3 (which was the final WinChip) specs:

  "The IDT WinChip 3 processor also omits the software interface to the
  Intel-proprietary symmetric multiprocessing support: APIC. This bus
  function is omitted since the target market for the IDT WinChip 3
  processor is typical desktop systems (which do not support APIC
  multiprocessing)."

It didn't offer any alternative DIY-SMP either (or at least none that's
documented). Centaur only became SMP-capable with some of the C3
Nehemiahs a year or two back.

Dave

--
http://www.codemonkey.org.uk
Re: [patch] x86: improved memory barrier implementation
On Sat, 29 Sep 2007, Nick Piggin wrote:
> >
> > The non-temporal stores should be basically considered to be "IO", not any
> > normal memory operation.
>
> Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> RAM apparently can go out of order too, and they are being used in the
> kernel for some things.

I'm really saying that people to a first approximation should think "NT is
an IO (DMA) thing". Whether cached or not. Exactly because they do not
honor the normal memory ordering.

It may be worth noting that "clflush" falls under that heading too - even
if all the actual *writes* were done with totally normal writes, if
anybody does a clflush instruction, that breaks the ordering, and that
turns it to "DMA ordering" again - ie we're not talking about the normal
SMP ordering rules at all.

So all the spinlocks and all the smp_*mb() barriers have never really done
*anything* for those things (in particular, "smp_wmb()" has *always*
ignored them on i386!)

> Likewise for rep stos, apparently.

No. As far as I can tell, the fast string operations are unordered
*within*themselves*, but not wrt the operations around them.

In other words, you cannot depend on the ordering of stores *in* the
memcpy() or memset() when it is implemented by "rep movs/stos" - but that
is 100% equivalent to the fact that you cannot depend on the ordering even
when it isn't - since the "memcpy()" library routine might be copying
memory backwards for all you know!

The Intel memory ordering paper doesn't talk about the fast string
instructions (except to say that the rules it *does* speak about do not
hold), but the regular IA manuals do say (for example):

  "Code dependent upon sequential store ordering should not use the
  string operations for the entire data structure to be stored. Data and
  semaphores should be separated. Order dependent code should use a
  discrete semaphore uniquely stored to after any string operations to
  allow correctly ordered data to be seen by all processors."

and note how it says you should just store to the semaphore. If you think
about it, that semaphore will involve all the memory ordering requirements
that we *already* depend on, so if a semaphore is sufficient to order the
fast string instruction, then by definition using a spinlock around them
must be the same thing!

In other words, by Intel's architecture manual, fast string instructions
cannot escape a "semaphore" - but that means that they cannot escape a
spinlock either (since the two are exactly the same wrt memory ordering
rules! In other words, whenever the Intel docs say "semaphore", think
"mutual exclusion lock", not necessarily the kernel kind of "sleeping
semaphore").

But it might be good to have that explicitly mentioned in the IA memory
ordering thing, so I'll ask the Intel people about that. However, I'd say
that given our *current* documentation, string instructions may be
*internally* out-of-order, but they would not escape a lock.

> But this means they are already at odds with spin_unlock, unless they
> are enclosed with mfences everywhere they are used (of which I think
> most are not). So this is an existing bug in the kernel.

See above. I do not believe that it's an existing bug, but the basic point
that the change to "smp_rmb()" doesn't change our existing rules is true.

		Linus
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 06:18:31PM +0100, Alan Cox wrote:
> > on the broken ppro stores config option if you just tell me what should
> > be there (again, remember that my patch isn't actually changing anything
> > already there except for smp_rmb side).
>
> The PPro needs rmb to ensure a store doesn't go for a walk on the wild
> side and pass the read especially when we are dealing with device space.

smp_rmb (ie. the non-device-space version) needs to order stores too? OK,
that's definitely something we'll need a locked op for.

I can't see how it could possibly close all holes, though: the level at
which memory ordering dependencies operate in lockless code is way above
what the hardware can know about. For any two given code sequences

  store 1 -> X
  store 2 -> Y

and

  load Y
  load X

there may be a requirement to have wmb and rmb respectively for
correctness, or it may be perfectly correct without ordering. Unless the
issue is *purely* that stores and/or loads can erroneously execute out of
order, then adding magic to barrier primitives cannot fix all cases (and
the errata certainly looked spookier than simple out-of-order problems).

> The rest of the stuff is a little vague which makes me nervous when
> considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
> the moment the PPro is effectively treated as an out of order cpu and so
> shouldn't hit anything occasionally that a PPC wouldn't hit every time.

Well, it's all a little vague to me ;) (maybe you were privy to info that
you aren't allowed to talk about...). IIRC the errata I could find
mentioned concurrent stores to a single variable as being problematic, and
nothing about memory ordering, or loads, at all. One difference from a
normal weakly ordered machine is that the normal one wouldn't require rmb
to order loads vs stores, for example.

Anyway, after all that blowing of smoke (I'm just genuinely curious), I
have no problems with any fixups for ppro, and defer to your greater
understanding of the issues. However, all that can go under the config
variable we have for that purpose. Let me know what you need? Is it just a
matter of putting the dummy lock'ed op back in smp_rmb? Are you sure you
don't also need smp_wmb to order stores? I can cook up a patch to do
either.

> > > - and for modern processors its still not remotely clear your patch is
> > > correct because of NT stores.
> >
> > No. We already _have_ to rely on this model for barriers
>
> Well Linus has dealt with the question of NT stores for us now...

He actually didn't, quite ;) (see other post)

> Given this isn't an issue on 64bit I'm inclined to argue that only 64bit
> behaviour should be changed at this point.

That's the easy way out, but I think it creates more problems down the
line. It diverges the code base and the testing base. I think it is pretty
simple (and serves as good documentation) to extend the use of the ppro
ifdef to do what we need here, rather than implicitly rely on the barriers
being stronger than required for non-broken implementations.
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 09:15:06AM -0700, Linus Torvalds wrote:
>
> On Fri, 28 Sep 2007, Alan Cox wrote:
> >
> > However
> > - You've not shown the patch has any performance gain
>
> It would be nice to see this.

Actually, in a userspace test I have (which actually does enough work to
trigger out-of-order operations on my Core 2, but is otherwise pretty
trivial), lfence takes 13 cycles and sfence takes 40 (neither of which
actually solves the problem of load vs store ordering, but at least they
might be operating on a slightly utilised memory subsystem rather than the
stupidest possible microbenchmark). The dummy lock op takes around 75
cycles (of course, the Core 2 would always use the fences, but older CPUs
will not, and will be worse at the lock op too, probably).

I suppose these could take significantly longer if there are uncached
memory operations and such (I wasn't doing any significant amount of IO)
-- I can't be sure, though.

So it isn't much, but it could be helpful. If the code is important enough
to go without locks and instead use complex barriers, it might easily be
worth saving these kinds of cycles in.

> > - You've probably broken Pentium Pro
>
> Probably not a big deal, but yeah, we should have that broken-ppro option.

Will add; I'll ask Alan to specify what he'd like to see there.

> > - and for modern processors its still not remotely clear your patch is
> > correct because of NT stores.
>
> This one I disagree with. The *old* code doesn't honor NT stores *either*.
>
> The fact is, if you use NT stores and then depend on ordering, it has
> nothing what-so-ever to do with spinlocks or smp_[rw]mb. You need to use
> the DMA barriers (ie the non-smp ones).
>
> The non-temporal stores should be basically considered to be "IO", not any
> normal memory operation.

Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
RAM apparently can go out of order too, and they are being used in the
kernel for some things. Likewise for rep stos, apparently.

But this means they are already at odds with spin_unlock, unless they are
enclosed with mfences everywhere they are used (of which I think most are
not). So this is an existing bug in the kernel.

So again the question comes up -- do we promote these kinds of stores to
be regular x86 citizens, keep the strong memory barriers as they are, and
eat 40 cycles with an sfence before each spin_unlock store; or do we fix
the few users of non-temporal stores and continue with the model we've
always had, where stores are in-order? Or I guess the implicit option is
to do nothing until some poor bastard has the pleasure of having to debug
some problem.

Anyway, just keep in mind that this patch is not making any changes which
are not already fundamentally broken. Sure, it might happen to cause more
actual individual cases to break, but if they just happened to be using
real locking instead of explicit barriers, they would be broken anyway,
right? (IOW, any new breakage is already conceptually broken, even if OK
in practice due to the overstrictness of our current barriers.)
Re: [patch] x86: improved memory barrier implementation
> on the broken ppro stores config option if you just tell me what should
> be there (again, remember that my patch isn't actually changing anything
> already there except for smp_rmb side).

The PPro needs rmb to ensure a store doesn't go for a walk on the wild
side and pass the read, especially when we are dealing with device space.

The rest of the stuff is a little vague, which makes me nervous when
considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
the moment, the PPro is effectively treated as an out-of-order cpu and so
shouldn't hit anything occasionally that a PPC wouldn't hit every time.

> > - and for modern processors its still not remotely clear your patch is
> > correct because of NT stores.
>
> No. We already _have_ to rely on this model for barriers

Well, Linus has dealt with the question of NT stores for us now...

Given this isn't an issue on 64bit, I'm inclined to argue that only 64bit
behaviour should be changed at this point.

Alan
Re: [patch] x86: improved memory barrier implementation
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
> > The only alternative is to assume a weak memory model, and add the required
> > barriers to spin_unlock -- something that has been explicitly avoided, but
>
> We have the barriers in spin_unlock already for Pentium Pro and IDT
> Winchip systems. The Winchip explicitly supports out of order store (and
> was about 10-20% faster with it set this way), the PPro has some annoying
> little store ordering bugs which is why we xchgb out of spin_unlock in
> those cases.

And we still have the correct smp_wmb barrier for OOOSTORE systems. And we
never did any smp_wmb tricks for broken ppros in barrier code. IOW, my
patch is a complete noop for both (it changes smp_rmb to be a noop, of
course, but AFAICS that shouldn't be relevant for either problem -- and if
it was, then you need to add special ifdefs for them rather than make
everyone else eat proverbial shit, like we do for spin_unlock).

> > The complaint that broken old ppros get more broken seems to be bogus as
> > well: wmb on i386 never catered to the old ppro to start with. The errata
> > seem to
>
> There are only a couple of awkward cases with the PPro and we carefully
> work around them. The other half of the work is the flush_write_buffers.

Note that for IO, rmb/wmb/mb remain as they are.

> > Buggy ppros: I have no interest in, given the vague errata and apparently
> > unfixable problems with lockless code; anybody is welcome to try adding
> > whatever they like to help them work, but again please don't confuse that
> > with having anything to do with this patch...
>
> PPro was fixed, made to work and brutally tested on quad PPro. There are
> cases that wouldn't work (PPro + AGP which never happens) and there are
> cases that need specific handling (PPro + Voodoo card and some other 3D
> bits) which are handled currently in the X server. The rest is fine.

Aside from the fact that we currently don't do anything for ppro in
smp_wmb *anyway*, I really don't think adding stuff there can fix it: you
can fix critical sections, because you can be sure that all stores to the
same memory regions are strictly fenced inside the locks, except for the
ops on the locks themselves. For those stores, you obviously can work
around the problems by using lock ops.

However, for lockless code, it is entirely possible to do conflicting
unlocked stores to the same location. Memory barriers are never going to
stop this, because they do not provide any synchronisation. They may make
some things less likely, or make some actual patterns effectively safe...

Anyway, I'm not arguing against adding "stuff" here for it. I repeat: that
isn't a concern of this patch.

> > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> > can also be a simple barrier on i386 too.
>
> The IDT Winchip can do SMP apparently. Nobody has ever seen an SMP
> winchip board so I don't think it matters and even if one existed we
> wouldn't support it anyway

OK, so in theory we could make smp_wmb just a simple barrier; however, the
code for ooostore there is under a config option, so there is no harm in
keeping it around really.

> However
> - You've not shown the patch has any performance gain

Can I theorise that it will reduce icache misses and be done with it? ;)

> - You've probably broken Pentium Pro

Doubt it. It would be really interesting to have an _exact_ trace of a
breakage that isn't fundamentally broken already. I suspect that
information will never come out of Intel.

But again, if you think we need some stronger stuff in smp_wmb, I can do a
patch that depends on the broken ppro stores config option if you just
tell me what should be there (again, remember that my patch isn't actually
changing anything already there except for the smp_rmb side).

> - and for modern processors its still not remotely clear your patch is
> correct because of NT stores.

No. We already _have_ to rely on this model for barriers, because we have
a lock-less spin_unlock. We can either: put explicit mfences around all
stores that aren't executed in-order WRT other stores; or make spin_unlock
use the lock prefix. We have pretty explicitly decided to do the former,
and the fact that subsequent patches to implement these exotic stores have
broken the memory ordering model does in no way invalidate this patch,
IMO.

> So NAK

Thanks :) Very nice feedback (I wasn't having a gentle dig at you,
honest!)
Re: [patch] x86: improved memory barrier implementation
On Fri, 28 Sep 2007, Alan Cox wrote:
>
> However
> - You've not shown the patch has any performance gain

It would be nice to see this.

> - You've probably broken Pentium Pro

Probably not a big deal, but yeah, we should have that broken-ppro option.

> - and for modern processors its still not remotely clear your patch is
> correct because of NT stores.

This one I disagree with. The *old* code doesn't honor NT stores *either*.

The fact is, if you use NT stores and then depend on ordering, it has
nothing what-so-ever to do with spinlocks or smp_[rw]mb. You need to use
the DMA barriers (ie the non-smp ones).

The non-temporal stores should be basically considered to be "IO", not any
normal memory operation.

		Linus
Re: [patch] x86: improved memory barrier implementation
> The only alternative is to assume a weak memory model, and add the required
> barriers to spin_unlock -- something that has been explicitly avoided, but

We have the barriers in spin_unlock already for Pentium Pro and IDT
Winchip systems. The Winchip explicitly supports out-of-order store (and
was about 10-20% faster with it set this way); the PPro has some annoying
little store ordering bugs, which is why we xchgb out of spin_unlock in
those cases.

> The complaint that broken old ppros get more broken seems to be bogus as
> well: wmb on i386 never catered to the old ppro to start with. The errata
> seem to

There are only a couple of awkward cases with the PPro and we carefully
work around them. The other half of the work is the flush_write_buffers.

> Buggy ppros: I have no interest in, given the vague errata and apparently
> unfixable problems with lockless code; anybody is welcome to try adding
> whatever they like to help them work, but again please don't confuse that
> with having anything to do with this patch...

PPro was fixed, made to work and brutally tested on quad PPro. There are
cases that wouldn't work (PPro + AGP, which never happens) and there are
cases that need specific handling (PPro + Voodoo card and some other 3D
bits) which are handled currently in the X server. The rest is fine.

> Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> can also be a simple barrier on i386 too.

The IDT Winchip can do SMP apparently. Nobody has ever seen an SMP Winchip
board, so I don't think it matters, and even if one existed we wouldn't
support it anyway.

However:
- You've not shown the patch has any performance gain
- You've probably broken Pentium Pro
- and for modern processors its still not remotely clear your patch is
  correct because of NT stores.

So NAK

Alan